PORT2
Document Number: Deliverable D4.3
Project ref.: IST-2001-34460
Project Acronym: MEANING
Project full title: Developing Multilingual Web-scale Language Technologies
Project URL: http://www.lsi.upc.es/~nlp/meaning/meaning.html
Availability: Public
Authors: Jordi Atserias (UPC), Montse Cuadros (UPC), Eva Naqui (UPC), German Rigau (UPV/EHU)
INFORMATION SOCIETY TECHNOLOGIES
WP4-Deliverable D4.3 Version: FINAL Port2
Project ref.: IST-2001-34460
Project Acronym: MEANING
Project full title: Developing Multilingual Web-scale Language Technologies
Security (Distribution level): Public
Contractual date of delivery: February 2004
Actual date of delivery: March 16, 2005
Document Number: Deliverable D4.3
Type: Report
Status & version: v FINAL
Number of pages: 62
WP contributing to the deliverable: WP4
WPTask responsible: German Rigau
Authors: Jordi Atserias (UPC), Montse Cuadros (UPC), Eva Naqui (UPC), German Rigau (UPV/EHU)
Other contributors:
Reviewer:
EC Project Officer: Evangelia Markidou
Keywords: Multilingual Central Repository, EuroWordNet, WordNet
Abstract: This document describes the third version of the Multilingual Central Repository (Mcr2) and the third Porting process (PORT2). We describe the knowledge uploaded and integrated into Mcr2, including a brief description of a general Upload/Porting architecture. Finally, we provide a full description of the third Porting process. The current version of the MCR integrates 1.642.389 unique semantic relations between concepts (ILI-records). This is one order of magnitude more than the Princeton wordnet (138.091 unique semantic relations in WN1.6). Furthermore, the current MCR has also been enriched with 466.972 semantic properties. In fact, the resulting Mcr2 is the largest and richest multilingual lexical-knowledge base ever built. In that way, the Mcr produced by Meaning is going to constitute the natural multilingual large-scale linguistic resource for a number of semantic processes that need large amounts of linguistic knowledge to be effective tools.
IST-2001-34460 - MEANING - Developing Multilingual Web-scale Language Technologies
1 Introduction
This document describes Mcr2, the third version of the Meaning Multilingual Central Repository. The Multilingual Central Repository (Mcr) acts as a multilingual interface for integrating and distributing all the knowledge acquired by Meaning.
1.1 EuroWordNet architecture
The Mcr follows the model proposed by the EuroWordNet project. EuroWordNet [Vossen, 1998] is a multilingual lexical database with wordnets for several European languages, which are structured as the Princeton WordNet [Fellbaum, 1998].
The Princeton WordNet contains information about nouns, verbs, adjectives and adverbs in English and is organized around the notion of a synset. A synset is a set of words with the same part-of-speech that can be interchanged in a certain context. For example, <car, auto, automobile, machine, motorcar> form a synset because they can be used to refer to the same concept. A synset is often further described by a gloss: "4-wheeled; usually propelled by an internal combustion engine". Finally, synsets can be related to each other by semantic relations, such as hyponymy (between specific and more general concepts), meronymy (between parts and wholes), cause, etc.
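The synset notion described above can be sketched as a small data structure. This is only an illustration: the class name, the field names and the identifiers such as "car-n-1" are assumptions made for the example and do not reflect the Mcr database schema or the Princeton file format.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of same-part-of-speech words that can express one concept."""
    synset_id: str          # hypothetical identifier, not a real WordNet offset
    pos: str                # part of speech: "n", "v", "a" or "r"
    words: list[str]
    gloss: str = ""
    # relation name -> ids of the related synsets
    relations: dict[str, list[str]] = field(default_factory=dict)

    def relate(self, relation: str, target: Synset) -> None:
        """Record a semantic relation from this synset to `target`."""
        self.relations.setdefault(relation, []).append(target.synset_id)


car = Synset(
    "car-n-1", "n",
    ["car", "auto", "automobile", "machine", "motorcar"],
    gloss="4-wheeled; usually propelled by an internal combustion engine",
)
# Illustrative, more general concept linked through a hypernym relation.
motor_vehicle = Synset("motor_vehicle-n-1", "n", ["motor vehicle"])
car.relate("hypernym", motor_vehicle)
```

Relations are stored by identifier rather than by object reference so that a synset can point to concepts that are loaded later or that live in another wordnet.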
Figure 1 gives a schematic presentation of the EuroWordNet architecture. In the middle, the language-independent structures are given: the Ili, a Domain Ontology and a Top Concept Ontology. The Ili consists of a list of so-called Ili-records which are related to word-meanings in the local wordnets, (possibly) to one or more Top Concepts and (possibly) to domains.
Some language-independent structuring of the Ili is nevertheless provided by two separate ontologies, which may be linked to Ili records:
• the Top Concept ontology, which is a hierarchy of language-independent concepts, reflecting important semantic distinctions, e.g. Object and Substance, Location, Dynamic and Static;
• a hierarchy of domain labels, which are knowledge structures grouping meanings in terms of topics or scripts, e.g. Traffic, Road-Traffic, Air-Traffic, Sports, Hospital, Restaurant;
Both the Ontological properties and the Domain Labels can be transferred via the equivalence relations of the Ili-records to the local wordnet meanings, as is illustrated in Figure 1. The Top Concepts Location and Dynamic are for example directly linked to the Ili-record drive and therefore indirectly also apply to all language-specific concepts related to this Ili-record. Via the local wordnet relations, the Top Concept can be further inherited by all other related language-specific concepts.
The main purpose of the Top Ontology is to provide a common framework for the most important concepts in all the wordnets. It consists of 63 basic semantic distinctions that classify a set of 1.601 Ili-records representing the most important concepts in the different wordnets¹. The classification was verified by the different EuroWordNet partners, so that it holds for all the language-specific wordnets. In section 3.4.2, we will further describe the Top Ontology used in Meaning.
Figure 1: EuroWordNet architecture
The Domain Hierarchy groups concepts in a different way, based on scripts rather than classification. For instance, it groups together nouns that are not hierarchically related, such as hospital, doctor and operation, together with concepts from other parts of speech, such as to operate. This is a powerful tool for controlling the ambiguity problem in Natural Language Processing. In section 3.4.3, we will further describe the Domain hierarchy used in Meaning.
1.2 Meaning
Meaning works with five wordnets corresponding to five European languages (Basque, Catalan, English, Italian and Spanish). The Mcr acts as the sense inventory for nouns, verbs, adjectives and adverbs for all the languages involved in the project. All these languages realise the meaning in different ways and Meaning will benefit from that because these wordnets have been constructed following the model proposed by the EuroWordNet project. That is, the wordnets are linked to an Inter-Lingual-Index (Ili). Via this index, the languages are interconnected so that it is possible to go from the words in one language to similar words in any other connected language. The Ili is a set of meanings, mainly taken from Princeton WordNet. The only purpose of the Ili is to mediate between the synsets of the local wordnets. Each synset in the local wordnets has at least one equivalence relation with a record in this Ili, either directly or indirectly via other related synsets. Language-specific synsets linked to the same Ili-record should thus be equivalent across the languages.
¹ These represent the current set of Base Concepts based on WordNet 1.6. The original set from EuroWordNet, based on WordNet 1.5, had 1.030 Base Concepts.
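The way the Ili mediates between local wordnets can be illustrated with a minimal lookup. The table layout, the record identifier "ili-0001", the language codes and the word lists are all invented for this sketch; the real Mcr stores this information in a database.

```python
# language code -> {Ili record id -> variants of the synset linked to it}
# (toy data; identifiers and contents are assumptions for illustration)
LOCAL_WORDNETS = {
    "en": {"ili-0001": ["car", "auto", "automobile"]},
    "es": {"ili-0001": ["coche", "automóvil"]},
    "ca": {"ili-0001": ["cotxe", "automòbil"]},
}


def equivalents(word, source, target):
    """Return target-language words whose synset shares an Ili record with `word`."""
    found = []
    for ili_id, variants in LOCAL_WORDNETS[source].items():
        if word in variants:
            # Same Ili record in the target wordnet -> equivalent synset.
            found.extend(LOCAL_WORDNETS[target].get(ili_id, []))
    return found


spanish = equivalents("car", "en", "es")
```

Because every wordnet only points at the Ili, adding a new language requires no pairwise links to the existing ones.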
The development of Meaning is organized in three consecutive cycles. Figure 2 summarises the Meaning data flow. Each Meaning development cycle consists of:
• WP6 (WSD): Word Sense Disambiguation systems (WSD0, WSD1, WSD2) using the local wordnets and the enriched knowledge ported from the Multilingual Central Repository.
• WP5 (Acquisition): Local acquisition of knowledge using specially designed tools and resources, corpus and wordnets (ACQ0, ACQ1, ACQ2).
• WP4 (Knowledge Integration): Uploading the acquired knowledge from each language into the Multilingual Central Repository and porting to the local wordnets (PORT0, PORT1, PORT2).
Meaning will have three consecutive processes for uploading and porting the knowledge acquired from each language to the respective local wordnets: PORT0, PORT1, PORT2. The knowledge acquired locally will be uploaded and ported across the rest of the languages via the EuroWordNet Ili, maintaining the compatibility among them. The knowledge acquired from each language during the three cycles will be consistently uploaded into the Mcr, guaranteeing the integrity of all the data produced by the project. After each Meaning cycle, all knowledge acquired and integrated into the Mcr will then be distributed across the local wordnets.
In that way, the Ili structure (including the Top Ontology and the Domain Hierarchy) will act as a natural backbone to transfer the different knowledge acquired from each local wordnet to the rest of the wordnets.
Meaning has developed the Mcr to maintain compatibility between wordnets of different languages and versions, past and new. The Ili should itself be connected to newer versions of WordNet or extensions of the Ili [Sofia et al., 2002a; Sofia et al., 2002b] using the technology for the automatic alignment of different large-scale and complex semantic networks [Daude et al., 1999; Daude et al., 2000; Daude et al., 2001] (see also Working Paper WP4.3 Making wordnets compatible).
Figure 2: MEANING data flow
1.3 Knowledge Integration
The third version of Mcr, Mcr2, integrates five local wordnets (including five versions of the English Princeton WordNet and the eXtended WordNet), the Suggested Upper Merged Ontology, the EuroWordNet Top Ontology, WordNet Domains, and large collections of Selectional Preferences and Instances (see Working Papers WP3.3, WP4.1 and WP4.4 and deliverables D2.1 and D2.2 for a complete description of these knowledge resources). In order to carry out this integration process, several tasks have been performed:
1. the uploading,
2. the integration and finally,
3. the porting of all this knowledge to the local wordnets.
The first two tasks have been extensively described in Working Paper WP4.4 Upload 1. Here we provide a summary.
1.4 Uploading Process
To upload all this different knowledge correctly into a single multilingual repository, a very complex and delicate process must be performed. Once the first part is finished, namely uploading the data released by the different partners (just checking errors and inconsistencies), a more complex second part must be performed. This second part consists of the correct integration of every piece of information into the Mcr, that is, linking all this knowledge correctly to the Ili. It involves a complex cross-checking validation process and usually a complex expansion/inference of large amounts of semantic properties and relations through the semantic structure.
Working Papers WP4.2 Upload 0 and WP4.4 Upload 1 explain in detail the two uploading processes performed in the previous two Meaning cycles. Working Paper WP4.6 Upload 2 describes the last uploading process, performed in the third Meaning cycle.
The next sections also describe other new resources which have been uploaded in this third round, e.g. Improved WordNet Domains (2nd release) [Magnini and Cavaglia, 2000], Base Concepts (2nd release) [Vossen, 1998], EuroWordNet Top Concept Ontology (2nd release) [Vossen, 1998], VerbNet [Kipper et al., 2000].
1.5 Integration Process
Once all this data is correctly uploaded into the Mcr, two different processes have been devised: realisation and generalisation. Both processes seem very promising. However, in the third round of Meaning we only performed the realisation of the new version of the Top Concept Ontology (see Working Paper WP4.7 for further details). Both processes require further investigation and exploitation.
1.5.1 Realisation
Once all this data is uploaded into the Mcr, it is possible to perform a full expansion process of the Top Ontology properties through the nominal and verbal hierarchies.
Some of the selectional preferences acquired from SemCor and BNC can also be inherited through the nominal part of the hierarchy. This process also involves a heavy computational effort. In fact, some other WordNet relations can be derived in the same way (for instance, meronym relations, etc.).
By integrating this knowledge, we are making explicit all knowledge contained in the Mcr. It would be interesting to implement some capabilities to mechanise this process. All inferred relations and knowledge can be rebuilt several times during integration without losing information or consistency.
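A minimal sketch of such a top-down expansion is given below, assuming an acyclic hyponymy hierarchy held in memory. The function name, the input layout and the four-node hierarchy are assumptions for illustration, not the procedure actually implemented for the Mcr.

```python
def realise(hyponyms, assigned):
    """Propagate the properties of each synset to all of its hyponyms.

    hyponyms: synset -> list of direct hyponyms (assumed acyclic)
    assigned: synset -> set of properties assigned directly to it
    """
    expanded = {synset: set(props) for synset, props in assigned.items()}

    def visit(synset, inherited):
        props = expanded.setdefault(synset, set())
        props |= inherited                      # add what the ancestors carry
        for child in hyponyms.get(synset, []):
            visit(child, props)

    # Roots are synsets that never appear as somebody's hyponym.
    children = {c for cs in hyponyms.values() for c in cs}
    for root in set(hyponyms) - children:
        visit(root, set())
    return expanded


# Toy, simplified hierarchy with invented direct assignments.
hyponyms = {"object": ["artifact"], "artifact": ["building"], "building": ["church"]}
assigned = {"object": {"Object"}, "artifact": {"Artifact"}, "building": {"Building"}}
expanded = realise(hyponyms, assigned)
```

Running the expansion again on its own output yields the same result, which is one way to read the remark that inferred knowledge can be rebuilt several times without losing information.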
1.5.2 Generalisation
A similar process can be devised in order to expand the knowledge in the Mcr. In this case, rather than expanding top-down the knowledge and properties represented in the Mcr, a bottom-up generalisation mechanism can be performed, in which different knowledge and properties can collapse onto particular Base Concepts and ontological nodes.
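One possible reading of this bottom-up mechanism is sketched below. The source does not specify how properties collapse, so the choice of the nearest Base Concept on the hypernym chain, the single-hypernym assumption and all names and data are hypothetical.

```python
from collections import Counter, defaultdict


def generalise(hypernym, base_concepts, properties):
    """Collapse synset properties onto the nearest Base Concept above each synset.

    hypernym:      synset -> its single direct hypernym (simplifying assumption)
    base_concepts: set of synsets acting as collection points
    properties:    synset -> list of properties observed for it
    """
    collapsed = defaultdict(Counter)
    for synset, props in properties.items():
        node = synset
        # Climb until a Base Concept is reached or the chain ends.
        while node is not None and node not in base_concepts:
            node = hypernym.get(node)
        if node is not None:
            collapsed[node].update(props)
    return collapsed


# Invented toy data.
hypernym = {"chapel": "church", "cathedral": "church", "church": "building"}
properties = {"chapel": ["Religion"], "cathedral": ["Religion", "Tourism"]}
collapsed = generalise(hypernym, {"building"}, properties)
```

Counting how often each property arrives at a node keeps the evidence needed to decide later whether the generalisation is well supported.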
1.6 Porting Process
With all these different types of knowledge and properties completely expanded and covering the whole Mcr, a new set of inference mechanisms can be devised in order to infer further relations and knowledge. For instance, new relations can be generated when detecting particular semantic patterns occurring for some synsets having certain ontological properties, for particular Domains, etc. That is, new relations can be generated by combining different methods and knowledge, for instance when several relations derived in the integration process have confidence scores greater than certain thresholds.
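The threshold idea can be sketched as a simple filter over candidate relations. The tuple layout, the default threshold of 0.8, the requirement of two supporting methods and the sample data are all assumptions; the source names no concrete values.

```python
from collections import defaultdict


def confident_relations(candidates, threshold=0.8, min_support=2):
    """Keep relations proposed with a score above `threshold` by enough methods.

    candidates: iterable of (source, relation, target, method, score)
    """
    support = defaultdict(set)
    for source, relation, target, method, score in candidates:
        if score > threshold:
            support[(source, relation, target)].add(method)
    return sorted(
        triple for triple, methods in support.items()
        if len(methods) >= min_support
    )


# Invented candidates: two hypothetical methods score the same relations.
candidates = [
    ("church_1", "has_domain", "religion", "method_a", 0.90),
    ("church_1", "has_domain", "religion", "method_b", 0.85),
    ("church_2", "has_domain", "religion", "method_a", 0.95),
    ("church_2", "has_domain", "religion", "method_b", 0.40),
]
accepted = confident_relations(candidates)
```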
However, even without this new inference tool (i.e. without having inferred extra knowledge), in this porting process all the knowledge integrated into the Mcr can be ported (distributed) to the local wordnets. That is, this process finishes by producing export XML files for all local wordnets.
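As an illustration of the final export step, the sketch below serialises a few synsets to XML with Python's standard library. The element and attribute names (wordnet, synset, variant, property, lang, ili) are invented for the example; the actual Meaning export format is not described in this section.

```python
import xml.etree.ElementTree as ET


def export_wordnet(language, synsets):
    """Serialise the synsets of one local wordnet to an XML string.

    The element and attribute names are hypothetical, not the Mcr format.
    """
    root = ET.Element("wordnet", {"lang": language})
    for synset in synsets:
        node = ET.SubElement(root, "synset", {"id": synset["id"], "ili": synset["ili"]})
        for variant in synset["variants"]:
            ET.SubElement(node, "variant").text = variant
        for prop in synset.get("properties", []):
            ET.SubElement(node, "property").text = prop
    return ET.tostring(root, encoding="unicode")


xml_text = export_wordnet("es", [{
    "id": "spa-1",
    "ili": "ili-0001",
    "variants": ["coche", "automóvil"],
    "properties": ["Artifact"],
}])
```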
Thus, the current Mcr software includes system modules for:
• Uploading the data acquired from one language to the Mcr.
• Porting the knowledge stored into the Mcr to the local wordnets.
• Checking the integrity of the data stored in the Mcr.
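One integrity check that follows directly from the architecture is that local synsets should be anchored to existing Ili records. The sketch below only inspects direct links and uses invented identifiers; since the text allows indirect anchoring via related synsets, a reported synset is a candidate for review rather than a definite error.

```python
def check_integrity(ili_records, local_synsets):
    """Return (synset id, problem) pairs for synsets not anchored to the Ili.

    ili_records:   set of known Ili record ids
    local_synsets: synset id -> list of Ili record ids it is linked to
    """
    problems = []
    for synset_id, ili_ids in local_synsets.items():
        if not ili_ids:
            problems.append((synset_id, "no ILI link"))
        for ili_id in ili_ids:
            if ili_id not in ili_records:
                problems.append((synset_id, f"unknown ILI record {ili_id}"))
    return problems


# Invented identifiers for illustration.
problems = check_integrity(
    {"ili-0001", "ili-0002"},
    {"spa-1": ["ili-0001"], "spa-2": [], "spa-3": ["ili-9999"]},
)
```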
The fact that word senses will be linked to concepts in the Mcr will allow for the appropriate representation and storage of the acquired knowledge. In that way, the Mcr produced by Meaning is going to constitute the natural multilingual large-scale linguistic resource for a number of semantic processes that need large amounts of linguistic knowledge to be effective tools (e.g. Web ontologies).
After this introduction to the Mcr and the knowledge integration process, Section 2 gives a general overview of the software components of the Mcr2. Section 3 summarises all knowledge uploaded into the third release of the Mcr. Section 4 summarizes the porting process and Section 5 provides some final figures of this process. Section 6 describes some plans to continue enriching the current Mcr with further knowledge. Section 7 illustrates the current content of the Mcr by providing some examples. Finally, Section 8 provides some concluding remarks.
This section provides a brief summary of the different software components which are part of the Mcr. Deliverable D4.1 includes the Database Design. See also Working Paper WP4.1 Basic Design of the Multilingual Central Repository for further details. The current status of the software components is summarized in the next sections: basically, the Web Interface, the different APIs and the Import/Export and Statistical facilities. A complete description is provided in Working Paper WP4.8 MCR software.
2.1 WEI: a Web Interface to access the Mcr
The Mcr database has been implemented using MySQL. The Mcr provides a web interface to the database based on the Web EuroWordNet Interface (WEI) [Benítez et al., 1998]. The interface provides consulting and editing facilities for the data included in the Mcr. Meaning has provided a new release to access the Mcr database.
The basic aim of this tool is to provide flexible access for editing and consulting the Mcr. WEI provides the user with all the lexico-semantic information contained in all uploaded WordNets: English (versions 1.5, 1.6, 1.7, 1.7.1, 2.0), Italian, Basque, Catalan and Spanish, etc.
Figure 3: Consulting WEI
WEI allows the user to consult the Mcr using a powerful but very intuitive user interface. WEI provides facilities for flexible querying of the Mcr. First, the user selects how to enter the Mcr by providing a word, a variant or a synset of any wordnet uploaded into the Mcr. Then, the user must choose one of the wordnets to navigate through some of its semantic relations. Finally, the user selects which information to obtain as the result of the consultation, and from which wordnets. Figure 3 shows the typical consulting interface of WEI, accessing through the English WordNet 1.6 variant house 1 the content of all its hypernyms in the Basque, Spanish, Italian and English 2.0 WordNets, as well as all the information associated to the ILI (SUMO, Domains, Top Concept Ontology, Semantic File, Base Concept).
2.2 APIs
Three different APIs have been developed. First, a SOAP API allows any remote user to interact with the Mcr; the aim of this API is to make the Mcr as widely accessible as possible. Next, MCRQuery is an extension of the wnQuery Perl API for the Mcr, which allows Princeton wordnet users to migrate easily to the Mcr and thus make their applications multilingual by means of a general API. Finally, a fast C++ API targets high-performance software.
The MCRQuery module allows us to easily adapt packages developed for the official Princeton wordnets, e.g. the Perl WordNet similarity package, to work with the Mcr (instead of WordNet files). WordNet similarity is a set of Perl modules that implement the semantic relatedness measures described by Leacock and Chodorow [Leacock and Chodorow, 1998], Jiang and Conrath [Jiang and Conrath, 1997], Resnik [Resnik, 1995], Lin [Lin, 1998b], Hirst and St-Onge [Hirst and St-Onge, 1998], Wu and Palmer [Wu and Palmer, 1994], the adapted gloss overlap measure by Banerjee and Pedersen [Banerjee and Pedersen, 2003], and a measure based on context vectors by Patwardhan [Patwardhan, 2003].
MCRQuery is also used as a common abstraction to access the Mcr or WordNet for other software development inside the project, e.g. the ExRetriever tools [Fernandez et al., 2004] (see WP5.14 Experiment 5.F: Sense Examples (3rd round)).
2.3 Import/Export Facilities
It is not necessary to maintain the defined format of EuroWordNet, as long as a standardized format in XML is agreed with other projects involved in the development of wordnets to keep all data compatible. Meaning will set an XML standard format for wordnet data and will provide methods for integrating data from other current standards: Princeton format [Fellbaum, 1998], EuroWordNet format [Vossen, 1998], VisDic format [Sofia et al., 2002a; Sofia et al., 2002b].
2.4 Advanced Analysis module
Advanced facilities will also be provided to explore the data and to analyse/mine the multilingual relations. This module will focus especially on the multilingual comparison of the data (i.e. including facilities to make cross-lingual queries easy).
After PORT0, the first porting process performed in the first cycle, Mcr0 included the following large-scale resources:
• ILI
– Aligned to WordNet 1.6 [Fellbaum, 1998]
– EuroWordNet Base Concepts [Vossen, 1998]
– EuroWordNet Top Concept Ontology [Vossen, 1998]
– WordNet Domains version 070501 [Magnini and Cavaglia, 2000]
• Local wordnets
– English WordNet 1.5, 1.6, 1.7.1 [Fellbaum, 1998]
– Basque wordnet [Agirre et al., 2002]
– Italian wordnet [Pianta et al., 2002]
– Catalan wordnet [Benítez et al., 1998]
– Spanish wordnet [Atserias et al., 1997]
• Large collections of semantic preferences
– Acquired from SemCor [Agirre and Martinez, 2001; Agirre and Martinez, 2002]
– Acquired from BNC [McCarthy, 2001]
• Instances
– Named Entities [Alfonseca and Manandhar, 2002]
See deliverable D2.1 (Basic Design of architecture and methodologies) and Working Paper WP4.2 (Upload 0) for an extended summary of each of these components and their uploading process. See deliverable D4.1 (PORT0) for a detailed report of the final figures after the first porting process.
For the second release of the Mcr, we planned to upload several new large-scale semantic resources into the Mcr (see deliverable D2.2 Basic Design of architecture and methodologies (2nd round) for further details):
• Suggested Upper Merged Ontology (Sumo) [Niles and Pease, 2001]
• eXtended WordNet [Mihalcea and Moldovan, 2001]
• WordNet 2.0 [Fellbaum, 1998]
• Improved Selectional Preferences acquired from BNC [McCarthy, 2001]
• Direct dependencies from Parsed SemCor [Agirre and Martinez, 2001]
• Named Entities from Sumo [Niles and Pease, 2001]
• Named Entities from MultiWordNet [Pianta et al., 2002]
The resulting Meaning Mcr1 included:
• Ili
– Aligned to WordNet 1.6 [Fellbaum, 1998]
– EuroWordNet Base Concepts [Vossen, 1998]
– EuroWordNet Top Concept Ontology [Vossen, 1998]
– WordNet Domains version 070501 [Magnini and Cavaglia, 2000]
– Suggested Upper Merged Ontology (Sumo) [Niles and Pease, 2001]
• Local wordnets
– English WordNet 1.5, 1.6, 1.7, 1.7.1, 2.0 [Fellbaum, 1998]
– eXtended WordNet 1.7 [Mihalcea and Moldovan, 2001]
– Basque wordnet [Agirre et al., 2002]
– Italian wordnet [Pianta et al., 2002]
– Catalan wordnet [Benítez et al., 1998]
– Spanish wordnet [Atserias et al., 1997]
• Large collections of semantic preferences
– Direct dependencies from Parsed SemCor [Agirre and Martinez, 2001]
– Acquired from SemCor [Agirre and Martinez, 2001; Agirre and Martinez, 2002]
– Acquired from BNC (2nd release) [McCarthy, 2001]
• Large collections of Sense Examples
– SemCor
• Instances
– Named Entities [Alfonseca and Manandhar, 2002]
– Named Entities [Niles and Pease, 2001]
– Named Entities [Pianta et al., 2002]
See deliverable D2.2 (Basic Design of architecture and methodologies) and Working Paper WP4.4 (Upload 1) for an extended summary of each of these components and their uploading process. See deliverable D4.2 (PORT1) for a detailed report of the final figures after the second porting process.
3.3 Content of Mcr2
The next sections also describe other new resources which have been uploaded in this third round. In this Meaning cycle we mainly uploaded and integrated new releases of previously uploaded resources (e.g. Improved WordNet Domains (2nd release) [Magnini and Cavaglia, 2000], Base Concepts (2nd release) [Vossen, 1998], EuroWordNet Top Concept Ontology (2nd release) [Vossen, 1998]). However, we have also integrated a new large-scale resource: VerbNet [Kipper et al., 2000].
Since the first version of the Mcr, we decided to integrate into the Mcr only conceptual knowledge (semantic information relating or attached to synsets). This decision had several implications. For instance, not all the large-scale knowledge acquired by WP5 ACQ has been uploaded and ported (e.g. subcategorization frequencies, topic signatures, terminology and domain information). This knowledge is maintained in the local wordnets.
After PORT2, the final content of Mcr2 should include:
• ILI
– Aligned to WordNet 1.6 [Fellbaum, 1998]
– Base Concepts (2nd release) [Vossen, 1998]
– Top Concept Ontology (2nd release) [Vossen, 1998]
– MultiwordNet WordNet Domains (2nd release) [Magnini and Cavaglia, 2000]
– Suggested Upper Merged Ontology (Sumo) [Niles and Pease, 2001]
• Local wordnets
– English WordNet 1.5, 1.6, 1.7, 1.7.1, 2.0 [Fellbaum, 1998]
– eXtended WordNet [Mihalcea and Moldovan, 2001]
– Basque wordnet [Agirre et al., 2002]
– Italian wordnet [Pianta et al., 2002]
– Catalan wordnet [Benítez et al., 1998]
– Spanish wordnet [Atserias et al., 2004b]
• Large collections of semantic preferences
– Direct dependencies from Parsed SemCor [Agirre and Martinez, 2001]
– Acquired from SemCor [Agirre and Martinez, 2001; Agirre and Martinez, 2002]
– Acquired from BNC (2nd release) [McCarthy, 2001]
• Predicate structure
– VerbNet [Kipper et al., 2000]
• Instances
– Named Entities [Alfonseca and Manandhar, 2002]
– Named Entities [Niles and Pease, 2001]
– Named Entities [Pianta et al., 2002]
See deliverable D2.3 (Basic Design of architecture and methodologies) and Working Paper WP4.6 (Upload 2) for an extended summary of each of these components and their uploading process.
The next sections provide a short summary of each of the above Mcr components.
3.4 Meaning Inter-Lingual-Index
As in previous cycles, Meaning uses Princeton WordNet 1.6 as Ili for Mcr2. This decision minimizes the effect of porting errors when the knowledge acquired from one language is ported to other wordnets. Initially, most of the knowledge acquired has been derived from WordNet 1.6 (selectional preferences from SemCor and BNC), and the Italian WordNet and the WordNet Domains, both developed at IRST, use WordNet 1.6 as Ili [Pianta et al., 2002; Magnini and Cavaglia, 2000].
However, the Ili for the Spanish, Catalan and Basque wordnets was WordNet 1.5 [Atserias et al., 1997; Benítez et al., 1998]. Meaning applied the technology for accurately mapping wordnet versions¹², but some hundreds of links received multiple choices (see Working Paper WP4.3 Making wordnets compatible for further details).
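The situation in which a link receives multiple choices can be made concrete with a small helper that separates one-to-one mappings from ambiguous ones. The identifiers and the input layout are invented; this does not reproduce the mapping technology referred to above, only the bookkeeping on its output.

```python
def split_mapping(candidates):
    """Separate one-to-one links from links that received multiple choices.

    candidates: old synset id -> list of candidate new synset ids
    """
    unique, ambiguous = {}, {}
    for old_id, targets in candidates.items():
        if len(targets) == 1:
            unique[old_id] = targets[0]
        elif len(targets) > 1:
            ambiguous[old_id] = list(targets)   # needs manual or further automatic review
    return unique, ambiguous


# Invented identifiers standing in for WordNet 1.5 and 1.6 synsets.
unique, ambiguous = split_mapping({
    "wn15-a": ["wn16-x"],
    "wn15-b": ["wn16-y", "wn16-z"],
    "wn15-c": [],
})
```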
To solve this version gap, and in order to minimize side effects with respect to other European initiatives (Balkanet, EuroTerm, etc.) and wordnet developments around the Global WordNet Association, Meaning provided a revised version of the automatic mapping between WordNet 1.5 and WordNet 1.6.
As the new versions of the Spanish, Catalan and Basque wordnets connected to WordNet 1.6 were not error free, we suggested performing a complete revision of the variant-to-synset connections. This revision was performed for Spanish and reported in [Atserias et al., 2004b].
However, further research is also needed to automatically locate dubious mapping areas (see Working Paper WP4.3 Making wordnets compatible for further details).
3.4.1 EuroWordNet Base Concepts
The overall design of the EuroWordNet database made it possible to develop the local wordnets relatively independently while guaranteeing a high level of compatibility. Nevertheless, some specific measures were taken to enlarge the compatibility of the different resources:
1. The definition of a common set of so-called Base Concepts that was used as a starting point by all the sites to develop the cores of the wordnets. Base Concepts are meanings that play a major role in the wordnets.
2. The classification of the Base Concepts in terms of the Top Ontology.
The main characteristic of the Base Concepts is their importance in wordnets. According to our pragmatic point of view, a concept is important if it is widely used, either directly or as a reference for other widely used concepts. Importance is thus reflected in the ability of a concept to function as an anchor to attach other concepts. This anchoring capability was defined in terms of three operational criteria that can be automatically applied to the available resources:
• the number of relations (general or limited to hyponymy).
• high position of the concept in a hierarchy
• being widely used by several languages
The procedure for selecting the EuroWordNet Base Concepts and the Top Ontology is discussed in [Vossen, 1998]. The final set of common Base Concepts from WordNet 1.5 has also been mapped to WordNet 1.6. We have provided each wordnet developer with a list of the Base Concepts not yet covered.
¹² http://www.lsi.upc.edu/~nlp/tools/mapping.html
We also plan to compare our results not only with our set of Base Concepts (coming from the initial set of EuroWordNet), but also with the set of Base Concepts produced by the BalkaNet Project and the pre-released Base Concepts from Princeton WordNet.
We are also planning to provide an automatically constructed set of Base Concepts. This will require finding formal criteria to detect the synsets which best represent the most important concepts of WordNet. These criteria could be based on:
• Frequency count of the synset in Semcor
• Number of descendants
• Conceptual Density
• Changes of Top Concept Ontology/Domain/SUMO properties of adjacent concepts
Table 1 shows, as an example, the possible Base Concepts that could appropriately represent the different senses of the noun "Church". Ascending through the hypernym chain for each sense, we can locate the local maxima using different criteria, for instance the number of relations of each synset or its number of occurrences in SemCor. For church 1 the occurrence criterion would select Christianity 2, organisation 2 and group 1, while the relation criterion would select faith 3 and organisation 2. For the second sense of church, church 2, the occurrence criterion would select church 2, construction 3 and object 1, while the relation criterion would select church 2 and building 1. Finally, for church 3 the occurrence criterion would select service 3 and activity 1, while the relation criterion would select religious ceremony 1 and activity 1. Obviously, different criteria will select a different set of Base Concepts. However, it is important to notice that both criteria produce a very similar set of Base Concepts (in all cases, only one level of difference).
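The search for local maxima along a hypernym chain can be sketched as follows. The shortened chain and the relation counts are invented for illustration (the values of Table 1 are not reproduced here), and treating the ends of the chain as having no neighbour on one side is an assumption.

```python
def local_maxima(chain, score):
    """Return the synsets whose score exceeds that of both neighbours on the chain.

    chain: synsets ordered from the word sense up towards the top
    score: synset -> value of the chosen criterion (relations, occurrences, ...)
    """
    values = [score[synset] for synset in chain]
    picked = []
    for i, synset in enumerate(chain):
        below = values[i - 1] if i > 0 else float("-inf")
        above = values[i + 1] if i + 1 < len(values) else float("-inf")
        if values[i] > below and values[i] > above:
            picked.append(synset)
    return picked


# Illustrative chain for one sense, with made-up relation counts.
chain = ["church_2", "place_of_worship_1", "building_1", "construction_3"]
relations = {"church_2": 5, "place_of_worship_1": 2, "building_1": 9, "construction_3": 4}
selected = local_maxima(chain, relations)
```

Swapping the score dictionary for SemCor occurrence counts gives the occurrence criterion without changing the function.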
The new set of Base Concepts provided by this method should also be compared with those currently uploaded into the Mcr (or coming from Balkanet or Princeton). Moreover, we must remark that the Base Concepts are the synsets that are used to assign the Top Concept Ontology properties. For instance, following SUMO, church 1 is a ReligiousOrganization+ (it also has the Top Concept Ontology properties Human+, Group+ and Function+¹³ and the WordNet Domain Religion); church 3 is a ReligiousProcess+ (having the Top Concept Ontology properties agentive+, Cause+, Dynamic+ and Purpose= and also the WordNet Domain Religion); however, church 2 is a Building+, not ReligiousBuilding+ (although this synset has the Top Concept Ontology properties Artifact+, Building+, Object+ and also the WordNet Domain Religion).
We also suggest further investigation of the possibility of automatically obtaining a new set of Base Concepts, of attaching new ontological properties to them and, finally, of characterizing the current ontological properties uploaded into the Mcr as semantic roles for predicates.
In EuroWordNet, the Base Concepts were classified by the Top Ontology using 63 semantic distinctions. This ontology, which functions as a common framework for all the wordnets, is briefly described in the next section.
¹³ As SUMO does, = stands for an assigned label and + for an inherited label.
The EuroWordNet Top Ontology consists of 63 higher-level concepts, excluding the top. Following [Lyons, 1977], EuroWordNet distinguished three types of entities at the first level:
• 1stOrderEntity Any concrete entity (publicly) perceivable by the senses and located at any point in time, in a three-dimensional space, e.g.: vehicle, animal, substance, object.
• 2ndOrderEntity Any Static Situation (property, relation) or Dynamic Situation, which cannot be grasped, heard, seen or felt as an independent physical thing. They can be located in time and occur or take place rather than exist, e.g.: happen, be, have, begin, end, cause, result, continue, occur.
• 3rdOrderEntity Any unobservable proposition which exists independently of time and space. They can be true or false rather than real. They can be asserted or denied, remembered or forgotten, e.g.: idea, thought, information, theory, plan.
The purpose of the EuroWordNet Top Concept Ontology was to enforce more uniformity and compatibility across the different wordnet developments. However, the EuroWordNet project performed a complete validation of the consistency of the Top Concept Ontology only for the Base Concepts.
Although the classification of WordNet is not always consistent with the Top Concept Ontology, we performed an automatic expansion of the Top Concept properties assigned to the Base Concepts. That is, we enriched the complete Ili structure with features coming from the Base Concepts by inheriting the Top Concept features following the hyponymy relationship (see Working Paper WP4.5 Towards de MEANING Top Ontology).
Assuming (as the builders of Sumo and WordNet Domains have done) that the ontological properties have been correctly assigned to particular synsets and that WordNet defines coherent ontological subsumption chains across taxonomies, an automatic process can consistently inherit all the properties through the whole WordNet hierarchy, no matter which ontology they come from.
In Meaning we have performed an automatic expansion of the Top Concept Ontology properties assigned to the Base Concepts. That is, we enriched the complete Ili structure with features coming from the Bc by inheriting the Top Concept features following the hyponymy relationship.
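A minimal sketch of such a top–down expansion is given below, assuming the hierarchy is available as a mapping from each synset to its direct hyponyms. The function name, the data layout and the tiny hierarchy are hypothetical and only serve to illustrate inheritance along hyponymy links; they do not reproduce the Mcr implementation.

```python
from collections import deque
from typing import Dict, List, Set


def expand_features(hyponyms: Dict[str, List[str]],
                    assigned: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Propagate hand-assigned features from each synset to all its hyponyms."""
    features = {synset: set(values) for synset, values in assigned.items()}
    queue = deque(assigned)
    while queue:
        synset = queue.popleft()
        for child in hyponyms.get(synset, []):
            inherited = features.setdefault(child, set())
            before = len(inherited)
            inherited |= features[synset]
            # Revisit the child only if it gained something new to pass on.
            if len(inherited) > before:
                queue.append(child)
    return features


# Hypothetical fragment of a hierarchy with hand-assigned properties.
hyponyms = {"group_1": ["organisation_2"], "organisation_2": ["church_1"]}
assigned = {"group_1": {"Group"}, "organisation_2": {"Human", "Function"}}
expanded = expand_features(hyponyms, assigned)
```

Because a synset is re-queued only when its feature set grows, the loop also terminates on hierarchies where a synset is reachable through several ancestors.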
This way, once the properties are exported to the Ili and inherited through the whole WordNet hierarchy, every concept in a WordNet is assigned a set of semantic features, as in the following example.
As the classification of WordNet is not always consistent with the Top Concept Ontology, incompatibilities between properties impeded the fully automatic top–down propagation of the Top Concept Ontology properties. The resulting semi-automatic process left a number of synsets with incompatible information. Specifically:
• Sticking to the Top Concept Ontology and according to the set of incompatibilities, some Top Concept Ontology properties assigned by hand appeared to be incompatible with (a) inherited information, (b) information assigned via equivalence to the Semantic Files (Lexicographical Files from WordNet) and/or even (c) other Top Concept Ontology properties assigned by hand.
• Top Concept Ontology properties, either original or inherited, are suspected to be incompatible with other ontologies currently uploaded into the Mcr.
By examining a subset of synsets, we realised that there are at least the following main sources of errors:
• Erroneous hand-made Top Concept Ontology mappings
• Erroneous statements of equivalence between Top Concept Ontology properties and Semantic Files
• Erroneous ISA links in WordNet, which cause erroneous inheritance
• Multiple inheritance within WordNet, which can cause incompatibilities in the inheritance of properties [Guarino and Welty, 2000]
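Detecting such incompatibilities amounts to checking the property set of each synset against a table of incompatible pairs. The sketch below assumes a small, hypothetical table: the Artifact/Natural and 3rdOrderEntity/Dynamic pairs follow the kinds of clash discussed in this document, but the actual set of incompatibilities used for the Top Concept Ontology is not reproduced here.

```python
from itertools import combinations
from typing import List, Set, Tuple

# Hypothetical incompatibility table, for illustration only.
INCOMPATIBLE = {
    frozenset({"3rdOrderEntity", "Dynamic"}),
    frozenset({"Artifact", "Natural"}),
}


def find_conflicts(properties: Set[str]) -> List[Tuple[str, str]]:
    """Return every pair of properties of a synset listed as incompatible."""
    return [pair for pair in combinations(sorted(properties), 2)
            if frozenset(pair) in INCOMPATIBLE]


conflicts = find_conflicts({"3rdOrderEntity", "Dynamic", "Purpose"})
```

Running the check separately on hand-assigned and on inherited properties would help to tell apart the error sources listed above.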
The following example shows incompatible information, where a 3rdOrderEntity cannot coexist with properties attributable only to Events:
The initial EuroWordNet design included a Domain ontology. However, only the Computer Domain was included in the EuroWordNet database.
The information brought by Domain Labels is complementary to what is already in WordNet. First of all, a Domain Label may include synsets of different syntactic categories: for instance, MEDICINE groups together senses of nouns, such as doctor and hospital, and of verbs, such as to operate.
Second, a Domain Label may also contain senses from different WordNet subhierarchies (i.e. deriving from different unique beginners or from different lexicographer files). For example, SPORT contains senses such as athlete, deriving from life form; game equipment, from physical object; sport, from act; and playing field, from location.
Meaning uses WordNet Domains [Magnini and Cavaglia, 2000], which were partially derived from the Dewey Decimal Classification 14. WordNet Domains is a hierarchy of 165 Domain Labels associated to WordNet 1.6 synsets.
3.4.4 Suggested Upper Merged Ontology (Sumo)
Sumo 15 [Niles and Pease, 2001] is being created as part of the IEEE Standard Upper Ontology Working Group. The goal of this Working Group is to develop a standard upper ontology that will promote data interoperability, information search and retrieval, automated inference, and natural language processing. SUMO provides definitions for general purpose terms and is the result of merging different free upper ontologies (e.g. Sowa's upper ontology, Allen's temporal axioms, Guarino's formal mereotopology, etc.). There is a complete set of mappings from WordNet 1.6 synsets to Sumo: nouns, verbs, adjectives, and adverbs.
Sumo consists of a set of concepts, relations, and axioms that formalize an upper ontology. An upper ontology is limited to concepts that are meta, generic, abstract or philosophical, and hence are general enough to address (at a high level) a broad range of domain areas. Concepts specific to particular domains are not included in the upper ontology, but such an ontology does provide a structure upon which ontologies for specific domains (e.g. medicine, finance, engineering, etc.) can be constructed.
The current version of Sumo consists of 1,019 terms (all of them connected to WordNet 1.6 synsets), 4,181 axioms and 822 rules.
We think that further investigation is needed to compare Sumo and the EuroWordNet Top Concept Ontology. For instance, the typology of processes in Sumo was inspired by Beth Levin's well-received work "Verb Classes and Alternations". Among other things, this work attempts to classify over 3,000 English verbs into 48 "semantically coherent verb classes". Some of the verb classes relate to static predicates in the ontology rather than to processes, and some classes are syntactically motivated, e.g. the class of verbs that take predicative complements.
Currently, only the SUMO labels and the SUMO ontology hyperonym relations are loaded into the Mcr. We also performed a preliminary cross–checking process between the Top Concept Ontology expansion, the Domain ontology and the SUMO ontology (see Working Papers WP4.5, Towards de MEANING Top Ontology and WP4.7, The MEANING Top Ontology for further details).
3.5 Local wordnets
In Working Paper WP3.3 we describe extensively the initial coverage of the Meaning wordnets before the first uploading. In Working Papers WP4.2 Upload 0, WP4.4 Upload 1 and WP4.6 Upload 2, we describe extensively the current coverage of the Meaning wordnets after uploading the different local wordnets in the different cycles.
New versions of the local wordnets for Spanish, Catalan, Basque and Italian have been integrated into Mcr2. The figures for the number of synsets, variants and relations are similar to those of the previous versions initially integrated into Mcr0. During the last cycle, all local wordnets have been improved and enriched.
3.5.1 Uploading Princeton WordNets
The current version of the Mcr contains most of the information represented in the Princeton WordNets.
The main changes made when uploading the Princeton WordNets to the Mcr are:
• Satellite Adjectives are coded as adjectives (i.e. Part of Speech s is converted to a).
• WordNet Verbal Frames are not loaded.
Uploading local wordnets that are not based on the Ili from WordNet 1.6 is complex because, between different wordnet versions, synsets can be split (1:N), joined (N:1), added (0:1) or deleted (1:0). Thus, even if we perform a manual check of these connections, the information inside the synsets should be modified accordingly for the remaining cases of split or joined synsets. At the moment, no manual checking of the mappings is planned for Princeton WordNets 1.7, 1.7.1 and 2.0.
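The four cases can be told apart mechanically once a mapping between two versions is given as a list of synset pairs. The sketch below assumes such a pair list, with `None` standing for a missing counterpart; the identifiers and the function are hypothetical and do not describe the Meaning mapping tool.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def classify_mappings(pairs: List[Pair]) -> Dict[Pair, str]:
    """Label each (old synset, new synset) pair with its mapping type."""
    targets_of = defaultdict(set)
    sources_of = defaultdict(set)
    for src, tgt in pairs:
        if src is not None and tgt is not None:
            targets_of[src].add(tgt)
            sources_of[tgt].add(src)
    labels = {}
    for src, tgt in pairs:
        if src is None:
            labels[(src, tgt)] = "added (0:1)"
        elif tgt is None:
            labels[(src, tgt)] = "deleted (1:0)"
        elif len(targets_of[src]) > 1:
            labels[(src, tgt)] = "split (1:N)"
        elif len(sources_of[tgt]) > 1:
            labels[(src, tgt)] = "joined (N:1)"
        else:
            labels[(src, tgt)] = "unchanged (1:1)"
    return labels


# Hypothetical synset identifiers for two wordnet versions.
pairs = [("a1", "b1"), ("a2", "b2"), ("a2", "b3"), ("a4", "b4"),
         ("a5", "b4"), (None, "b6"), ("a7", None)]
labels = classify_mappings(pairs)
```

Only the split and joined pairs would need their synset contents revised, which is where manual checking is most useful.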
Mcr2 contains five versions of the Princeton WordNet and the first version of the eXtended WordNet (aligned to WN1.7). Table 2 shows the overall figures for all these WordNets.
3.5.2 WordNet 2.0
WordNet 2.0 deserves special attention. It includes more than 42,000 new links between nouns and verbs that are morphologically related, a topical organization for many areas that classifies synsets by category, region, or usage, gloss and synset fixes, and new terminology, mostly in the terrorism domain.
In this version, the Princeton team has added links for derivational morphology between nouns and verbs. Furthermore, some synsets have also been organized into topical domains. Domains are always noun synsets; however, synsets from every syntactic category can be connected to them. Each domain is further classified as a category, region, or usage.
In order to upload WordNet 2.0, it has been necessary to represent its new relation types (category, region, usage, related to) and their inverses (category term, region term, usage term) in the Mcr.
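Storing each new relation together with its inverse can be sketched as below. The relation names follow the list above, written with underscores; treating related to as its own inverse is an assumption (the text names no separate inverse for it), and the triple layout is hypothetical.

```python
from typing import Set, Tuple

Relation = Tuple[str, str, str]  # (source synset, relation type, target synset)

# Relation types and inverses as listed in the text; related_to is assumed symmetric.
INVERSE = {"category": "category_term", "region": "region_term",
           "usage": "usage_term", "related_to": "related_to"}
INVERSE.update({inverse: name for name, inverse in list(INVERSE.items())})


def with_inverses(relations: Set[Relation]) -> Set[Relation]:
    """Return the relations plus the inverse of each one."""
    closed = set(relations)
    for source, name, target in relations:
        closed.add((target, INVERSE[name], source))
    return closed


closed = with_inverses({("play#3", "category", "music#1"),
                        ("playing#1", "related_to", "play#3")})
```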
3.5.3 eXtended WordNet
In the eXtended WordNet 16 [Mihalcea and Moldovan, 2001] the WordNet glosses are syntactically parsed and transformed into logic forms, and the content words are semantically disambiguated. The key idea of the eXtended WordNet project is to exploit the rich information contained in the definitional glosses, which is now used primarily by humans to identify the meaning of words correctly. In the first released version of the eXtended WordNet, XWN 0.1, the glosses of WordNet 1.7 are parsed and transformed into logic forms, and the senses of the words are disambiguated. Being derived from an automatic process, the disambiguated words in the glosses have been assigned a confidence label indicating the quality of the annotation (gold, silver or normal). The quality of the relations derived from XWN has been taken into account when uploading XWN into Mcr2: first, by associating different confidence scores to the relations according to their quality (gold 1, silver 0.6 and normal 0.3); second, by associating different acquisition methods to the relations (xg, xs and xn respectively).
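The correspondence between quality labels, confidence scores and acquisition methods can be written as a small lookup table. The scores and method codes are those given above; the record layout and the function name are hypothetical.

```python
from typing import Dict, Tuple, Union

# Quality label -> (confidence score, acquisition method), as stated in the text.
XWN_QUALITY: Dict[str, Tuple[float, str]] = {
    "gold": (1.0, "xg"),
    "silver": (0.6, "xs"),
    "normal": (0.3, "xn"),
}


def annotate_relation(source: str, target: str,
                      quality: str) -> Dict[str, Union[str, float]]:
    """Build a relation record carrying the confidence and method for its quality."""
    confidence, method = XWN_QUALITY[quality]
    return {"source": source, "target": target,
            "confidence": confidence, "method": method}


record = annotate_relation("synset_a", "synset_b", "silver")
```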
In order to upload the eXtended WordNet coherently into the Mcr, we also needed to upload WordNet 1.7 (integrated in the second cycle, Mcr1) and build a new mapping between WordNet 1.6 and WordNet 1.7 17.
We think that further investigation is also needed with respect to these resources, for instance, trying to derive automatically disambiguated semantic relations between synset glosses [Gangemi et al., 2003].
We discovered several problems when uploading eXtended WordNet 1.0, for instance the non-standard normalization of the variants (capital letters, the use of the character – for compound words instead of white space, etc.). After applying a set of case-insensitive, space-substitution heuristics, 1,755 non-existent senses remain. Sometimes the lemma exists in WordNet with a different PoS, e.g. enrolling exists only as a verb in WordNet 1.7, not as a noun.
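A sketch of this kind of normalization is shown below. It assumes that the compound separator is the underscore, which is a guess since the character is not legible in the text, so it is exposed as a parameter. Part of Speech is ignored for brevity, and the function names are hypothetical.

```python
from typing import Iterable, List


def normalise_variant(variant: str, separator: str = "_") -> str:
    """Lower-case a variant and replace the compound separator by single spaces."""
    return " ".join(variant.lower().replace(separator, " ").split())


def missing_variants(variants: Iterable[str], lexicon: Iterable[str]) -> List[str]:
    """Return the variants that have no match in the lexicon after normalization."""
    known = {normalise_variant(entry) for entry in lexicon}
    return [v for v in variants if normalise_variant(v) not in known]


# Hypothetical data: one variant differs only in case and separator.
unmatched = missing_variants(["Musical_Instrument", "enrolling"],
                             ["musical instrument"])
```

A fuller version would compare (lemma, PoS) pairs, so that cases such as enrolling are reported as a PoS mismatch rather than a missing lemma.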
As further work, we can consider uploading and integrating the newest version (2.0.1) of the eXtended WordNet, which is aligned to Princeton WordNet 2.0. We are also studying the possibility of uploading our own versions of the eXtended WordNet in the near future (see Working Paper WP6.14 Experiment 6.L: Disambiguating WN Glosses).
3.6 Large collections of semantic preferences
Three large sets of selectional preferences were already uploaded into Mcr0 (see Deliverable D4.1 PORT0).
A total of 958,377 weighted Selectional Preferences (SPs), obtained from three different corpora and using different approaches, have been uploaded into the Mcr.
The first set [McCarthy, 2001] of weighted SPs has been obtained by computing probability distributions over the Wn1.6 noun hierarchy derived from the result of parsing the BNC. This set totals 707,618 semantic relations. Part of these relations correspond to role–agent–bnc (115,542) and role–patient–bnc (95,065) in tables 9 and 10 respectively. The rest (497,011 relations) have been integrated into the Mcr as simple ROLE relations.
The second set [Agirre and Martinez, 2002] has been obtained from generalizations of grammatical relations extracted from SemCor. This set totals 203,546 semantic relations. These relations correspond to role–agent–semcor (69,840) and role–patient–semcor (110,102) in tables 9 and 10 respectively.
The third set of Selectional Preferences also comes from SemCor, which has been parsed using a new version of Minipar [Lin, 1998a]. All the subject and object syntactic dependencies between head synsets can be captured and uploaded into the Mcr. This resource allows direct comparisons between word instances and the generalized Selectional Preferences (captured from SemCor and BNC). Unlike the other two sets, the syntactic dependencies captured by this process are not generalised. This set totals 23,609 semantic relations corresponding to direct subject and object relations coming from the Minipar output. These relations correspond to role–agent–semcor2 (10,196) and role–patient–semcor2 (13,408) in tables 9 and 10 respectively.
The SPs were included in the Mcr as ROLE noun–verb relations 18. Although we can distinguish subjects and objects, all of them have been included as a more general ROLE relation.
3.7 Instances
The Mcr2 contains three different sources of named entities and instances:
• 6,961 Named Entities from the work of [Alfonseca and Manandhar, 2002]
• 5,561 Named Entities from Sumo [Niles and Pease, 2001]
• 4,097 Named Entities from MultiWordNet [Pianta et al., 2002]
For future versions of the Mcr, we suggest providing a new ontology of Named Entities to support and cover the formal criteria followed by the three approaches. This initiative would be very useful when comparing Named Entities derived using different language processors.
18 In EuroWordNet, INVOLVED and ROLE relationships are defined symmetrically.
3.8 VerbNet
VerbNet 19 [Kipper et al., 2000] is a verb lexicon with syntactic and semantic information for English verbs, using Levin verb classes to systematically construct lexical entries. For each syntactic frame in a verb class, there is a set of semantic predicates associated with it. Many of these semantic components are cross-linguistic. The lexical items in each language form natural groupings based on the presence or absence of semantic components and the ability to occur or not occur within particular syntactic frames. The English entries are mapped directly onto English WordNet senses. We hope that this new resource will provide further structure and consistency to the selectional preferences acquired automatically.
19 http://www.cis.upenn.edu/old/verbnet/home.html
This section summarises PORT2, which includes the uploading, integrating and porting processes.
4.1 Uploading process
Working Paper WP4.6 Upload 2 provides an extended report of the third upload process. In this round, we again uploaded the new versions available for the Spanish, Catalan and Basque wordnets. Apart from these, new resources have been uploaded in this third round, e.g. Improved WordNet Domains (2nd release) [Magnini and Cavaglia, 2000], Base Concepts (2nd release) [Vossen, 1998], EuroWordNet Top Concept Ontology (2nd release) [Atserias et al., 2004a] and VerbNet [Kipper et al., 2000].
4.2 Integration Process
In Mcr2 we have integrated five different versions of the Princeton WordNet. These WordNets are quite similar, and Meaning has the technology to map wordnet versions accurately, making them compatible and thus allowing the reuse of many other valuable semantic resources 20. Nevertheless, the Mcr2 has some hundreds of links with multiple choices that require manual verification (see also Working Paper WP4.3 Making wordnets compatible).
In the second cycle we studied the impact of transforming the other WordNet versions to WordNet 1.6 [Atserias et al., 2004b]. We need to explore new techniques to detect the most problematic cases automatically.
This section also provides a partial description of all the knowledge uploaded and integrated into the Mcr2 (see Working Paper WP4.6 Upload 2 for an extended report). This provides some preliminary information useful for the third PORTing. As the overlap between wordnets is a crucial issue when porting the knowledge acquired from one language to another, we provide some figures regarding the number of common synsets (local wordnets are far from the coverage of the Princeton English WordNet), common relations and different views of all the knowledge uploaded (WordNet Domains, WordNet Semantic Files, Base Concepts, EuroWordNet Top Ontology) to measure the qualitative overlap of the local wordnets.
4.2.1 Cross–checking
We think that further investigation is also needed with respect to the resources currently integrated into the Mcr, for instance, trying to derive automatically disambiguated relations between synset glosses. In fact, after uploading all these new resources to the Mcr, a new set of complex integration and porting processes must be studied. This will allow designing sophisticated strategies and metarules for subsequent portings. Obviously, by integrating all these large–scale semantic resources into a single platform, a complete cross-checking study can be performed. For instance, we can improve both the Sumo labels and the WordNet Domains by simply merging and comparing them.
20 http://www.lsi.upc.edu/~nlp/tools/mapping.html
Synset      Word   SUMO        Domain
00536235n   blow   Breathing   anatomy
00005052v   blow   Breathing   medicine
To illustrate how we can detect errors and inconsistencies between different types of knowledge, the example in table 3 shows that, systematically, the nouns corresponding to the Sumo process Breathing have been labelled with the ANATOMY domain, some verbs with MEDICINE and some adjectives with FACTOTUM, when in fact all these senses correspond to different Parts of Speech of the same Breathing concept.
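This kind of cross-check can be sketched by grouping synsets by their Sumo label and reporting the labels that are spread over more than one WordNet Domain. The two rows are those of table 3; the function name and the tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Row = Tuple[str, str, str, str]  # (synset, word, SUMO label, WordNet Domain)


def domains_by_sumo(rows: List[Row]) -> Dict[str, Set[str]]:
    """Return the SUMO labels whose synsets carry more than one domain."""
    grouped = defaultdict(set)
    for _synset, _word, sumo, domain in rows:
        grouped[sumo].add(domain)
    return {sumo: domains for sumo, domains in grouped.items() if len(domains) > 1}


rows = [("00536235n", "blow", "Breathing", "anatomy"),
        ("00005052v", "blow", "Breathing", "medicine")]
suspicious = domains_by_sumo(rows)
```

Not every label reported this way is an error, so the output is better read as a list of candidates for manual revision.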
In order to illustrate the kind of problems we need to face when merging all these semantic resources into a single, common platform, consider the example shown in figure 4. The act playing#1, which is a kind of musical performance#1, is connected by derivational morphological relations to three senses of the verb play. The verb play#3 is connected by a domain relation to the noun music#1, and the verb play#7 is connected to music#1 and music#3. However, play#6, which is also related to the musical domain, is not connected by a domain relation to either music#1 or music#3. All three senses of the verb play have the WN Domain MUSIC label and the Sumo music label. However, each verb sense of play behaves differently with respect to category relations. Should the noun playing also be connected by a category relation to both music#1 and music#3? Should this connection be made explicit? Regarding WN Domain labels, why do the musical senses of the verb play and the noun music not also have the FREE TIME label, as the noun act playing does? With respect to Sumo, why do they have different types? Furthermore, since the eXtended WordNet is the result of an automatic process, it also contains wrong disambiguations (play#4, belonging to the THEATRE domain). We think that having all these different sources of knowledge uploaded and integrated into the same framework will allow us to correct all these misleading inconsistencies systematically.
[Figure 4: Example of noun playing. The diagram links the noun synsets playing (00093905n) and musical performance (00092967n), the verb senses play#3, play#7, play#6 and play#4, and the noun senses music#1 and music#3 through RELATED-TO, CATEGORY and GLOSS relations; each node shows its gloss, WordNet Domain and SUMO label.]
On the other hand, regarding the top–down expansion of the properties of the EuroWordNet Top Ontology through WordNet (see Working Papers WP4.5 and WP4.7), problematic cases can be detected by cross-checking the different resources in the Mcr.
There are attributes of the Top Concept Ontology that cannot be inferred top–down from the hand-made assignments. Another way of automatically enriching a wordnet with more attributes is to use the semantic file of the synset. For example, the synset 10960967-n first half only has the attribute Part, but its semantic file is noun.time. Thus, the associated Top Concept Ontology property Time could be added (note also that its Sumo label is TimeInterval+).
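The enrichment step can be sketched as a lookup from semantic file to Top Concept Ontology property. Only the noun.time to Time correspondence is taken from the example above; the table name and the function are hypothetical, and a real table would need one entry per semantic file with a known equivalence.

```python
from typing import Dict, Set

# Semantic file -> Top Concept Ontology property.
# Only this entry comes from the text; the table is otherwise left open.
SEMANTIC_FILE_TO_TCO: Dict[str, str] = {"noun.time": "Time"}


def enrich(properties: Set[str], semantic_file: str) -> Set[str]:
    """Add the property implied by the semantic file, if one is known."""
    enriched = set(properties)
    extra = SEMANTIC_FILE_TO_TCO.get(semantic_file)
    if extra is not None:
        enriched.add(extra)
    return enriched


# Synset 10960967-n first half: attribute Part, semantic file noun.time.
first_half = enrich({"Part"}, "noun.time")
```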
The problems/inconsistencies found can be classified into:
• WordNet hierarchy
The classification of Wn is not always consistent with the Top Concept ontology
– Animal vs. Plant
00911639n phytoplankton 1 (SUMO.Plant+) and its direct descendant 00911809n planktonic algae 1 (SUMO.Alga).
– Substance (Liquid, Solid, Gas) vs. Object
For instance, body part 1 is an Object. However, some of its descendants have incompatible properties:
∗ Liquid 04195761n 105 liquid body substance 1 bodily fluid 1 body fluid 1
∗ Substance 4086329n 117 body substance 1 the substance of the body
∗ Solid 06672286n covering 1 natural covering 1 cover 5 any covering for the body or a body part
• Cross-checking resources with different granularities
For instance, the division between Human–Creature–Animal–Hominid:
– Human vs Animal: All the Hominids are considered Animal by the semantic file, but Human by the Top Concept Ontology (SUMO Hominid+)
– Human vs Creature: All the creatures (mainly the descendants of imaginary being 1 imaginary creature 1) are classified as Human by the semantic file.
• Multiple inheritance: piece of leather 1
WordNet is not a tree, and a synset can have more than one direct ancestor. Thus, it can inherit attributes from its multiple ancestors. Figure 5 shows the complicated multiple inheritance of piece of leather (at the top), which inherits Living (which is obviously incorrect) and Natural Attribute (which is questionable) from skin, but also Part from its other ancestors. Multiple inheritance can also bring up a new type of problem. Figure 6 shows another example where multiple inheritance leads to incompatible inherited attributes: Artifact from one ancestor and Natural from another, organic compound 1.
Tables 4, 5, 6 and 7 show the overlap for nouns, verbs, adjectives and adverbs between each wordnet pair.
At the synset level, the noun overlap is quite high and homogeneous between wordnet pairs. The maximum overlap occurs between English and Spanish (38,023) and the lowest between Italian and Catalan (16,360).
For verbs, at the synset level, the overlap is also quite high but less uniform between wordnet pairs. The maximum overlap also occurs between English and Spanish (8,830) and the lowest between Italian and Basque (1,977).
At the synset level, the adjective overlap is not high because some wordnets provide poor coverage of adjectives. While Spanish provides a good overlap with English (the maximum overlap, with 14,667 synsets), the Basque wordnet only provides some hundreds of adjectives.
At the synset level, the adverb overlap is not high because some wordnets provide poor coverage of adverbs. While Italian provides a good overlap with English (the maximum overlap, with 1,093 synsets), the Catalan and Basque wordnets do not provide adverbs at all.
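Since all wordnets are linked to the same Ili, the overlap between two wordnets can be computed as the size of the intersection of their sets of Ili records. The sketch below uses hypothetical record identifiers and tiny sets; it illustrates the computation only and does not reproduce the figures of tables 4 to 7.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def pairwise_overlap(wordnets: Dict[str, Set[str]]) -> Dict[Tuple[str, str], int]:
    """Count the Ili records shared by each pair of wordnets."""
    return {(a, b): len(wordnets[a] & wordnets[b])
            for a, b in combinations(sorted(wordnets), 2)}


# Hypothetical Ili record identifiers covered by each wordnet.
wordnets = {"en": {"i1", "i2", "i3", "i4"},
            "es": {"i1", "i2", "i3"},
            "it": {"i2", "i4"}}
overlap = pairwise_overlap(wordnets)
```

Running the same function on the synsets of a single Part of Speech would give one table per category.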
4.2.3 Coverage of semantic relations
In this section we describe the results of some basic comparisons between all the wordnets currently integrated into the Mcr. This also provides some preliminary information useful for the third PORTing. In particular, we compare the current coverage of the different semantic relations across all wordnets.
Table 8 summarises the overlapping relations between all wordnets. The local wordnets developed following the EuroWordNet framework (Basque, Spanish and Catalan) share the same amount of relations. Thus, we use Spanish as the model to compare with the Italian WordNet and the English Princeton WordNet. We can also see that the wordnets derived from EuroWordNet represent much richer information (they present a large variety of semantic relations) than those derived from WordNet.
Working Paper WP4.6 Upload 2 provides further analysis of other knowledge uploaded and integrated into Mcr2 (mainly Domains, Semantic Files, Top Ontology, SUMO, etc.).
4.2.4 Realisation
During the first round we performed a Realization Process, expanding all the Top Concept Ontology properties following the WordNet hierarchies. We are now producing a new and consistent version of the Top Concept Ontology (see Working Papers WP4.5 Towards de MEANING Top Ontology and WP4.7 The MEANING Top Ontology). No further Realization was planned for this round.
Once a new version of the Base Concepts and the associated Top Concept Ontology properties was available, the Mcr performed a full expansion process through the nominal and verbal hierarchies.
Some of the selectional preferences acquired from SemCor and BNC can also be inherited through the nominal part of the hierarchy. This process also involves a heavy computational effort. In fact, some other WordNet relations can be derived in the same way (for instance, meronym relations, etc.).
By expanding this knowledge, we make explicit all the knowledge contained in the Mcr. The Mcr should consider implementing some capabilities to mechanise this process. All inferred relations and knowledge should be rebuilt several times during the integration process without losing information or consistency.
We started to work with the version integrated in Mcr1, which had 2,696 TCO features expanded by inheritance to 253,003 features. At this moment, we have reached the figure of 2,756 hand-coded features, which expand to 276,384. Moreover, 52 blocking points have been set.
Comparing both versions:
1. Both versions share
2,676 hand-coded features (corresponding to 1,013 different synsets)
51,043 expanded features (corresponding to 36,289 different synsets)
2. Differences
The initial version had 201,960 expanded features, belonging to 75,052 synsets, which are no longer present. The new version has 225,341 new expanded features, belonging to 75,295 synsets.
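This comparison is a plain set comparison between the expanded features of the two versions, each feature being a (synset, property) pair. The sketch below uses hypothetical toy data; the two closing checks only confirm that the reported totals are consistent with each other (253,003 − 51,043 = 201,960 and 276,384 − 51,043 = 225,341).

```python
from typing import Dict, Set, Tuple

Feature = Tuple[str, str]  # (synset, Top Concept Ontology property)


def compare_versions(old: Set[Feature], new: Set[Feature]) -> Dict[str, Set[Feature]]:
    """Split the features into shared, dropped (old only) and added (new only)."""
    return {"shared": old & new, "dropped": old - new, "added": new - old}


# Hypothetical expanded features of two versions.
old = {("s1", "Object"), ("s2", "Object"), ("s3", "Living")}
new = {("s1", "Object"), ("s2", "Artifact"), ("s3", "Living"), ("s4", "Part")}
diff = compare_versions(old, new)

# Consistency of the figures reported in the text.
assert 253_003 - 51_043 == 201_960
assert 276_384 - 51_043 == 225_341
```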
See Working Papers WP4.1, WP4.5 and WP4.7 for further details of the Top Concept Ontology currently integrated into the Mcr.
4.2.5 Generalisation
A similar process can be devised in order to expand the knowledge in the Mcr. In this case, rather than expanding the knowledge and properties represented in the Mcr top–down, a full bottom–up generalisation mechanism can be performed, in which different knowledge and properties can collapse onto particular Base Concepts and ontological properties. We suggest further analysis of this possibility. However, no further Generalization was planned for this round.
4.3 Porting Process
With all these different types of knowledge and properties completely expanded through the whole Mcr, a new set of inference mechanisms can be devised in order to infer further relations and knowledge. For instance, new relations can be generated when detecting particular semantic patterns occurring for synsets having certain ontological properties, for particular Domains, etc. That is, new relations can be generated by combining different methods and knowledge, for instance when several relations derived in the integration process have confidence scores greater than certain thresholds. We also suggest further analysis of this possibility.
As we have greatly improved the quality of the local wordnets and their associated information, we decided to redo the porting process from scratch, keeping track of the source of the information.
In PORT1 and PORT2 we decided to incorporate the new knowledge encoded in the new versions of the Princeton WordNet (1.7, 1.7.1, 2.0 and eXtended WordNet). Thus, we ported the new relation types (e.g. usage) to WordNet 1.6. However, we decided not to port the old relation types already present in version 1.6 (e.g. has holo made of, has hyponym, etc.), as they would probably be inconsistent with the current content aligned to WordNet 1.6.
Without inferring extra knowledge in this porting process, all the knowledge integrated into the Mcr has been ported (distributed) to the local wordnets. That is, this process ends by producing XML export files for all local wordnets.
Tables 9 and 10 summarise the main results before the whole porting process (UPLOAD2) and after the porting process (PORT2). All wordnets gained some kind of new knowledge coming from other wordnets by means of the third porting process. A direct result of the upload/integration/porting effort is that all the information associated to the Ilis is automatically ported to the other wordnets. Thus, WordNet Domains are now available to the rest of the local wordnets, the EuroWordNet Top Ontology is also available for the Italian WordNet and for English WordNet 1.6, and the SUMO labels have been ported to Catalan, Italian and Spanish. Moreover, local relations can be ported to the rest of the wordnets. Thus, the Italian and English wordnets can be enriched with the whole new set of relations coming from EuroWordNet. In turn, the Basque, Catalan, Italian and Spanish wordnets can be extensively enriched with the large number of relations coming from newer versions of WordNet, the eXtended WordNet and the selectional preferences acquired from English.
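Porting information attached to the Ili comes down to copying the labels of each Ili record to every local synset linked to it. The sketch below assumes that a local wordnet is given as a mapping from local synsets to Ili records; all identifiers and the function name are hypothetical, and the XML export step is not shown.

```python
from typing import Dict, Set


def port_labels(ili_labels: Dict[str, Set[str]],
                wordnet: Dict[str, str]) -> Dict[str, Set[str]]:
    """Copy the labels of each Ili record to the local synsets linked to it.

    `wordnet` maps a local synset identifier to its Ili record identifier.
    """
    return {synset: set(ili_labels[ili])
            for synset, ili in wordnet.items() if ili in ili_labels}


# Hypothetical Ili records with labels, and a local wordnet linked to them.
ili_labels = {"ili-1": {"Religion"}, "ili-2": {"Music"}}
spanish = {"spa-iglesia-1": "ili-1", "spa-tocar-3": "ili-2", "spa-otro-1": "ili-9"}
ported = port_labels(ili_labels, spanish)
```

Synsets whose Ili record carries no labels are simply left out of the result, which mirrors the fact that a wordnet can only gain knowledge for the synsets it shares with the others.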
In these tables, we do not consider hypo/hypernym relations. links stands for totalnumber of Domains or Top Ontology labels ported (one synset could have more than onelabel). Selectional Preferences have been included in the database as from noun to verbrelations (ROLE) instead of relations from verb to noun (INVOLVED) 21. Although wecan distinguish subjects and objects in the database, all of them have been included asa more general ROLE relation. Role–agent–semcor stands for those subject selectionalpreferences acquired from SemCor. Role–patient-semcor stands for object selectional pref-erences acquired from SemCor. Role–agent–semcor2 stands for those subject selectionalpreferences acquired from directly parsing SemCor. Role–patient-semcor2 stands for ob-ject selectional preferences acquired from directly parsing SemCor. Role–agent–bnc standsfor those subject selectional preferences acquired from the British National Corpus, andRole–patient–bnc for those typical objects acquired from BNC. In fact, some of them mayoverlap. There are other 497.011 more general ROLE relations not included into the tables.We need to investigate new inference facilities to enhance the porting process as suggestedbefore.
Thus, for English, the current Mcr totals more than one and a half million non-hierarchical (non hypo/hypernym) relations. In contrast, WordNet 2.0 has 108,484 non hypo/hypernym relations 22.
In that way, the Mcr produced by Meaning is going to constitute the natural multilingual large-scale linguistic resource for a number of semantic processes that need large amounts of linguistic knowledge to be effective tools (e.g. Web ontologies). The fact that word senses will be linked to concepts in Mcr will allow for the appropriate representation
21 INVOLVED and ROLE relations are symmetric
22 Inverse relations are counted just once
IST-2001-34460 - MEANING - Developing Multilingual Web-scale Language Technologies
and storage of the acquired knowledge.
Mcr2 now integrates into the same EuroWordNet framework (using a new version of Base Concepts, the Top Ontology and the WordNet Domains) five local wordnets (with four English WordNet versions) with hundreds of thousands of new semantic relations, instances and properties fully expanded. All wordnets gained some kind of new knowledge coming from other wordnets by means of this porting process. In fact, the resulting Mcr2 is the largest and richest multilingual lexical knowledge base ever built.
Although this is the final version of the Mcr in Meaning, our plan is to continue exploring and improving this multilingual resource in several ways.
6.1 Further Uploading
6.1.1 Improved Selectional Preferences acquired from BNC
The application and study of these sets of SPs seem to indicate that both methodologies suffer from an overly high level of generalization. In Mcr2 we uploaded new sets of verb-specific Selectional Preferences obtained from SemCor using a new methodology based on protomodels (see Working Paper WP5.8 Experiment 5.G: Selectional Preferences (2nd round)).
The Tree Cut Models (tcms) that we used in the first round of Meaning acquisition for learning selectional preferences from unannotated text often suffered from an overly high level of generalisation, that is, classes which are very high in the WordNet hierarchy are used to represent the preferences. Table 11 shows the volumes of the data uploaded. Table 12 shows an example of the kind of information obtained for the noun church. Not only can the list of verbs associated directly with a sense (direct) be retrieved, but also the list associated with the n-th hypernym of the synset (hyper-n).
We are investigating three possibilities to acquire more specific, accurate and intuitive models:
Weighting TCMs: a proposal by [Wagner, 2002] to introduce a weighting factor to counter the effect of data size on the model.
WSD on input data: the use of automatic WSD on the training data to counter the effect of polysemy.
Protomodels: new selectional preference models which aim to cover only a portion of the data, where that portion can be disambiguated and where the disambiguation is performed using a ratio of types in a class, rather than tokens.
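The direct and hyper-n retrieval mentioned above for the noun church can be sketched as a walk up the hypernym chain. This is a minimal sketch with assumptions: the function name `verbs_by_level`, the synset identifiers and the verbs are invented for illustration, and each synset is assumed to have at most one hypernym, which is a simplification of WordNet.

```python
def verbs_by_level(synset, hypernym_of, prefs, max_levels=3):
    """Return [(label, verbs)] for the synset itself ('direct') and its n-th hypernyms ('hyper-n')."""
    out = []
    current, level = synset, 0
    while current is not None and level <= max_levels:
        label = "direct" if level == 0 else f"hyper-{level}"
        out.append((label, prefs.get(current, [])))
        current = hypernym_of.get(current)  # None once the top of the chain is reached
        level += 1
    return out

# Hypothetical hierarchy fragment and preferences (not taken from Table 12).
hypernym_of = {"church_1": "building_1", "building_1": "structure_1"}
prefs = {"church_1": ["attend"], "structure_1": ["build"]}

levels = verbs_by_level("church_1", hypernym_of, prefs)
```

Keeping the level label next to each verb list makes it possible to tell how much generalisation stands behind a given preference.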
From ACQ1 we obtained a large set of sense examples acquired automatically from the web (see Working Paper WP5.5 Experiment 5.H a): Publicly available topic signatures for all WordNet nominal senses). These examples have been obtained by querying Google. For each word sense in WordNet, a program builds a complex query including sets of monosemous synonymous relatives. Using this approach, large collections of text can be obtained, amounting to hundreds of examples per word sense. Using this large-scale resource we generated Topic Signatures 23 [Agirre and de Lacalle, 2004] for every word sense in WordNet.
In fact, we released this publicly available resource, which comprises both automatically extracted examples for all WordNet 1.6 noun senses and topic signatures built from those examples. We gathered around 700 sentences for each noun in WordNet. When the monosemous relatives are used to build a sense corpus for polysemous words, they comprise an average of around 3,500 sentences per word sense. The size of the topic signatures thus constructed is around 4,500 words per word sense.
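The query construction from monosemous relatives can be illustrated roughly as follows. The source only says that a "complex query" is built, so the OR syntax, the quoting, the function name `build_query` and the sense counts below are all assumptions made for the sake of the example.

```python
def build_query(relatives, sense_count):
    """Join the monosemous relatives of a word sense into a single OR query."""
    # A relative is monosemous when the lexicon lists exactly one sense for it.
    monosemous = [w for w in relatives if sense_count.get(w, 0) == 1]
    return " OR ".join(f'"{w}"' for w in monosemous)

# Relatives are synonyms of one sense of "horse"; the sense counts are invented.
relatives = ["heroin", "horse", "scag"]
sense_count = {"heroin": 1, "horse": 6, "scag": 1}

query = build_query(relatives, sense_count)
```

Polysemous relatives such as horse are left out because text retrieved with them could belong to any of their senses.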
Table 13 presents the Topic Signatures (list of words and associated weights) for sense 6 of horse <heroin, diacetyl morphine, H, horse, junk, scag, shit, smack>: "a morphine derivative".
However, as we mentioned for the improved Selectional Preferences, these Topic Signatures should be disambiguated before full integration into the Mcr.
6.1.3 Non subject/object Selectional Preferences
In the three past Meaning cycles we decided to incorporate only subject/object Selectional Preferences acquired from different corpora and techniques. However, the English parsers used for this task already produce other very valuable dependencies. Our plan is to upload and integrate the rest of the Selectional Preferences captured.
6.1.4 Large collections of Sense Examples
We can also integrate into the Mcr all the sense examples appearing in SemCor. Currently, WordNet glosses contain some usage examples of the described concept. We can also incorporate as usage examples the sentences corresponding to all the sense occurrences in SemCor.
6.2 Further Integration
First, we suggest for future rounds a manual validation of the Top Concept Ontology and a new expansion (Realization) of the properties.
We also suggest a full expansion (Realization) through the nominal part of the hierarchy of the selectional preferences acquired from SemCor and BNC (and possibly other implicit semantic knowledge currently available in WordNet such as meronymy information).
23 http://ixa.si.ehu.es/Ixa/resources/sensecorpus
We also suggest further investigation into performing a full bottom-up expansion (Generalization), rather than merely expanding top-down the knowledge and properties represented in the Mcr. In this case, different knowledge and properties can collapse on particular Base Concepts, Semantic Files, Domains and/or ontological nodes.
6.3 Porting Process
The consortium also needs to investigate a new set of inference mechanisms in order to infer further relations and knowledge inside the Mcr. For instance, new relations can be generated when detecting particular semantic patterns occurring for some synsets having certain ontological properties, for particular Domains, etc. That is, new relations can be generated when combining different methods and knowledge, for instance when several relations derived in the integration process have confidence scores greater than certain thresholds.
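One possible reading of the threshold idea is sketched below. It is purely illustrative: the source does not define the inference mechanism, so the function `infer_relations`, the threshold values, the minimum-support rule and the candidate relations are all assumptions.

```python
def infer_relations(candidates, threshold=0.5, min_support=2):
    """Keep a candidate relation when at least `min_support` sources score it above `threshold`."""
    inferred = []
    for relation, scores in candidates.items():
        support = sum(1 for s in scores if s > threshold)
        if support >= min_support:
            inferred.append(relation)
    return sorted(inferred)

# Hypothetical candidates: (source synset, relation, target synset) -> scores from different methods.
candidates = {
    ("hospital_1", "ROLE", "treat_1"): [0.9, 0.7, 0.2],
    ("hospital_1", "ROLE", "sing_1"): [0.6, 0.1],
}

accepted = infer_relations(candidates)
```

Requiring agreement between several methods trades recall for precision, which seems appropriate when the inferred relations are then ported to every local wordnet.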
However, without this new inference tool (i.e. without having inferred extra knowledge), in this porting process all the knowledge integrated into the Mcr will be ported to the local wordnets.
Mcr2 now integrates into the same EuroWordNet framework (using a new version of Base Concepts, the Top Concept Ontology and the WordNet Domains) five local wordnets (with five English WordNet versions) with hundreds of thousands of new semantic relations, instances and properties fully expanded. In fact, the resulting Mcr2, with more than 1,6 million relations, will be one of the largest and richest multilingual lexical knowledge bases ever built.
When all this knowledge is uploaded coherently into the Mcr, a full range of new possibilities appears for improving both Acquisition and WSD (and other Semantic Processes). We will illustrate these new capabilities with two simple examples.
7.1 The ”Vaso” Example
The Spanish noun vaso has three possible senses. The first one is connected to the same Ili as the English synset <drinking glass glass>. This Ili record, belonging to the Semantic File ARTIFACT, has no specific WordNet Domain (FACTOTUM). However, the Top Concept Ontology provides further clues about its meaning: it has the properties Form-Object, Origin-Artifact, Function-Container and Function-Instrument. The Sumo type for this synset is also Artifact. Valuable information also comes from the disambiguated glosses included in the eXtended WordNet. This gloss has two 'silver' words 24 (glass, container) and three 'normal' words (the rest). For instance, hold#VBG#8 corresponds to "contain or hold; have within: "The jar carries wine"; "The canteen holds fresh water"; "This can contains water"". The reverse relation rgloss can be used to explore in which definitions vaso 1 is used. In this case, there are 36 relations (most of them to nouns, but also three to verb senses and two to adjective senses). Further, from the Selectional Preferences acquired from SemCor, we know that the typical things that somebody does with this kind of vaso are, for instance, the corresponding equivalent translations to Spanish of <polish, shine, smooth, smoothen> or <beautify, embellish, prettify>. After parsing SemCor with Minipar [Lin, 1998a], we also included in the Mcr those synsets appearing as subjects and direct objects, that is, without performing any kind of generalization. Obviously, this information is similar to the generalized classes provided by [Agirre and Martinez, 2002]. In this case, the subjects and direct objects captured when parsing SemCor are the corresponding equivalent translations of <pass, hand, reach, pass on, turn over, give> or <offer, proffer>. We also included the improved proto-models acquired from BNC. In this case, the object proto-models are for instance "lift" and "roll". WordNet 2.0 also provides a new morphological derivational relation: to glass#v#4 "put in a glass container". Finally, we must add that this also holds for the rest of the languages connected.
vaso_1 02755829-n
SF: 06-NOUN.ARTIFACT
DOMAIN: FACTOTUM
SUMO: &%Artifact+
TO: 1stOrderEntity-Form-Object
TO: 1stOrderEntity-Origin-Artifact
24 High confidence
(ANATOMY). The EuroWordNet Top Ontology, in this case, has the properties Form-Substance-Solid, Origin-Natural-Living, Composition-Part and Function-Container. The Sumo label provides the properties and axioms assigned to BodyVessel. This gloss has two 'gold' words 25 (tube and circulate) and one 'silver' word (body fluid), and the last word is monosemous. From the Selectional Preferences acquired from SemCor, we know that the typical events applied to this kind of vaso are, for instance, the corresponding equivalent translations to Spanish of <inject, shoot> or <administer, dispense>. Observing the rgloss relation, we can see that this sense is related to the verb strangulate 1 or to the nouns vascular system 1 and blood vessel 1. In total there are 34 relations (most of them to nouns, but also 7 to verbs and 3 to adjective concepts). In this case, the subjects and direct objects captured when parsing SemCor are the corresponding equivalent translations of <follow, travel along> and <be, occur>, and the proto-models are for instance "open" and "show". In this case, there are no new relations coming from WordNet 2.0. As before, we must add that this knowledge can also be ported to the rest of the languages connected.
vaso_2 04195626-n
SF: 08-NOUN.BODY
DOMAIN: ANATOMY
SUMO: &%BodyVessel+
TO: 1stOrderEntity-Form-Substance-Solid
TO: 1stOrderEntity-Origin-Natural-Living
TO: 1stOrderEntity-Composition-Part
TO: 1stOrderEntity-Function-Container
EN: vessel vas
IT: vaso dotto canale
BA: hodi baso
CA: vas
04195626-n vessel vas:
GLOSS: a tube in which a body fluid circulates
eXtended WordNet:
GLOSS: a tube#NN#4 in which a body_fluid#NN#1 circulate#VBZ#4
04195626 01590833 0.0002 furnish provide render supply
04195626 01612822 0.0001 act move
04195626 01775973 0.0000 be
Proto-Classes:
open dobj 04195626 0.0006462453
show subj 04195626 0.0001756852
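The selectional preference lines in the record above have a regular shape that is easy to read programmatically. The sketch below assumes the columns are noun synset offset, verb synset offset, weight and verb lemmas; this interpretation is inferred from the records in this section rather than stated by the source, and the names `Preference` and `parse_preference` are hypothetical.

```python
from typing import List, NamedTuple

class Preference(NamedTuple):
    noun_offset: str   # assumed: offset of the noun synset
    verb_offset: str   # assumed: offset of the verb synset
    weight: float
    verbs: List[str]   # lemmas of the verb synset

def parse_preference(line):
    """Split one whitespace-separated preference line into its assumed fields."""
    noun, verb, weight, *lemmas = line.split()
    return Preference(noun, verb, float(weight), lemmas)

pref = parse_preference("04195626 01590833 0.0002 furnish provide render supply")
```

Offsets are kept as strings so that leading zeros survive, which matters when they are later matched against identifiers such as 04195626-n.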
The last sense of vaso is the equivalent translation of <glassful, glass>. This Ili record belongs to the Semantic File QUANTITY and has been assigned a different WordNet Domain (FACTOTUM-NUMBER). The Top Concept Ontology, in this case, has the properties Composition-Part, SituationType-Static and SituationComponent-Quantity. The Sumo label provides the properties and axioms assigned to ConstantQuantity. This gloss has only one 'silver' word from the eXtended WordNet (quantity). The other two are labelled 'normal'. From the Selectional Preferences acquired from SemCor, we know that the typical events applied to this kind of vaso are, for instance, the corresponding equivalent translations to Spanish of <drink, imbibe> or <consume, have, ingest, take, take in>. Similar information appears for the parsed SemCor, and no direct proto-models have been acquired. In this case, there are no new relations coming from WordNet 2.0. As before, we must add that this knowledge can also be ported to the rest of the languages connected.
vaso_3 09914390-n
GLOSS: the quantity#NN#1 a glass#NN#2 will hold#VB#1
DOBJ SemCor:
09914390 00795711 0.0026 drink imbibe
09914390 01530096 0.0009 accept have take
09914390 00786286 0.0009 consume have ingest take take_in
09914390 01513874 0.0001 acquire get
DOBJ Semcor No generalization:
09914390 00795711 drink imbibe
09914390 01530096 accept have take
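The eXtended WordNet annotations shown in these records follow a lemma#POS#sense pattern, which can be unpacked with a small regular expression. This is a sketch under the assumption that every annotated token has exactly that shape; the names `TOKEN` and `parse_gloss` are hypothetical and do not belong to any eXtended WordNet tool.

```python
import re

# lemma, part-of-speech tag and sense number separated by '#', e.g. quantity#NN#1
TOKEN = re.compile(r"^(?P<lemma>[^#\s]+)#(?P<pos>[A-Z]+)#(?P<sense>\d+)$")

def parse_gloss(gloss):
    """Split an annotated gloss into (lemma, pos, sense) triples; plain words get None fields."""
    parsed = []
    for tok in gloss.split():
        m = TOKEN.match(tok)
        if m:
            parsed.append((m["lemma"], m["pos"], int(m["sense"])))
        else:
            parsed.append((tok, None, None))
    return parsed

gloss = "the quantity#NN#1 a glass#NN#2 will hold#VB#1"
annotated = [t for t in parse_gloss(gloss) if t[1] is not None]
```

Multiword lemmas joined by an underscore, such as body_fluid#NN#1, are matched by the same pattern.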
As we can see, we can consistently add a large set of explicit knowledge about each sense of vaso that can be used to better differentiate and characterize their particular meanings. We expect to devise appropriate ways to exploit this unique resource in the next rounds.
7.2 The ”Pasta” Example
We will continue illustrating the current content of the Mcr, after porting, with another simple example: the Spanish noun pasta.
The word pasta (see Tables 14 and 15) illustrates how all the different classification schemes uploaded into the Mcr (Semantic File, WordNet Domain, Top Concept Ontology, etc.) are consistent and make clear semantic distinctions between the money sense (pasta 6), the general/chemistry sense (pasta 7) and the food senses (all the rest). The food senses of pasta can now be further differentiated by means of explicit Top Concept Ontology properties. All the food senses are descendants of substance 1 and food 1 and inherit the Top Concept attributes Substance and Comestible respectively.
Table 14: Substance and money senses for the Spanish word pasta
Selectional Preferences can also help to distinguish between senses, e.g. only the money sense has the following preferences as object: 1.44 01576902-v {raise#4}, 0.45 01518840-v {take in#5, collect#2}, 0.23 01565625-v {earn#2, garner#1} or 0.12 01564908-v {clear#15, take in#10, make#10, gain#8, realize#4, pull in#2, bring in#2, earn#1}.
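How such preferences might separate senses can be sketched with a simple lookup. The weights for the money sense come from the sentence above and those for pasta#n#5 from Table 16; reducing each verb synset to a single lemma, the sense keys and the function `best_sense` are simplifying assumptions, and the two weight sets come from different acquisition runs, so comparing them directly is only illustrative.

```python
def best_sense(verb, prefs):
    """Return the sense whose object preferences give `verb` the highest weight, or None."""
    scored = {sense: weights[verb] for sense, weights in prefs.items() if verb in weights}
    return max(scored, key=scored.get) if scored else None

prefs = {
    "pasta_6": {"raise": 1.44, "collect": 0.45, "earn": 0.23},  # money sense
    "pasta_5": {"mix": 0.0044, "eat": 0.0006},                  # a food sense
}

money = best_sense("earn", prefs)
food = best_sense("mix", prefs)
unknown = best_sense("sing", prefs)
```

A verb that is attested for only one sense already acts as a strong cue, which is the situation the money sense illustrates.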
Table 16 presents the new selectional preferences acquired for the Spanish word pasta, that is, the prototypical verbs associated with each English equivalent translation or with their hypernyms.
We can also investigate new inference facilities to enhance the integration process. After full expansion (Realization) of the Ewn Top Concept ontology properties, we will perform a full expansion through the noun part of the hierarchy of the selectional preferences acquired from SemCor and BNC (and possibly other implicit semantic knowledge currently available in Wn such as meronymy information).
We plan further investigation into performing a full bottom-up expansion (Generalization), rather than merely expanding knowledge and properties top-down. In this case, different knowledge and properties can collapse on particular Base Concepts, Semantic Files, Domains and/or Top Concepts in order to become, automatically, possible semantic roles for predicates.
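A bottom-up expansion of this kind could look roughly like the following sketch, in which properties observed on specific synsets are accumulated on all their ancestors. It is not the planned Mcr procedure: the function `generalize`, the simplified hierarchy (one hypernym per synset) and the toy properties are assumptions.

```python
from collections import Counter, defaultdict

def generalize(hypernym_of, properties):
    """Propagate each synset's properties upward, counting occurrences at every ancestor."""
    collected = defaultdict(Counter)
    for synset, props in properties.items():
        node = hypernym_of.get(synset)
        seen = set()  # guards against accidental cycles in the hierarchy
        while node is not None and node not in seen:
            seen.add(node)
            collected[node].update(props)
            node = hypernym_of.get(node)
    return collected

# Simplified fragment: the food senses of pasta under food 1 and substance 1.
hypernym_of = {"pasta_3": "food_1", "pasta_5": "food_1", "food_1": "substance_1"}
properties = {"pasta_3": ["eat"], "pasta_5": ["eat", "mix"]}

collected = generalize(hypernym_of, properties)
```

The counts at each ancestor show where knowledge collapses: a property shared by many descendants is a natural candidate for promotion to that node.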
Top Concept ontology: Part-composition-1stOrderEntity
pasta#n#4 05886080-n spread#5, paste#3
gloss: a tasty mixture to be spread on bread or crackers
pasta#n#1 05671312-n pastry#1, pastry dough#1
gloss: a dough of flour and water and shortening
pasta#n#3 05739733-n pasta#1, alimentary paste#1
gloss: shaped and dried dough made from flour and water & sometimes egg
pasta#n#5 05889686-n dough#1
gloss: a dough of flour and water and shortenings
Top Concept ontology: Artifact-Origin-1stOrderEntity, Group-Composition-1stOrderEntity
pasta#n#2 05671439-n pie crust#1, pie shell#1
gloss: pastry used to hold pie fillings
Table 15: Food senses for the Spanish word pasta
pany 0,0017 serve 0,0016 hate 0,0013 prepare 0,0013 pass 0,0007 do 0,0005 keep 0,0005 include 0,0004 love 0,0004 like 0,0003 hold 0,0002 make 0,0001 produce 0,0001
pasta#n#5 hyper-1 105909338 divide 0,0127 wrap 0,0063 pack 0,0045 mix 0,0044 press 0,0025 check 0,0013 pass 0,0007 add 0,0006 eat 0,0006 make 0,0005 prevent 0,0004 remove 0,0004 produce 0,0002 leave 0,0001 like 0,0001
Table 16: New Selectional Preferences for Food senses of “pasta”
Finally, we will conclude this set of examples with the information currently integrated into the MCR for sense 1 of hospital.
Figure 7 shows the set of properties (Domains, Semantic File, SUMO and Top Concept Ontology) associated with each disambiguated concept of the synset gloss (using eXtended WordNet). This definition describes a hospital (sense 1) as a health facility (sense 1) where patients (sense 1) receive (sense 2) treatment (sense 1). Having all the words from the gloss correctly disambiguated inside the Mcr provides an open range of possibilities to be explored.
The domains of hospital 1 are building industry, medicine and town planning; the Semantic File is artifact; the SUMO label is StationaryArtifact; finally, the Top Concept Ontology properties, inherited from the hypernym chain, are Artifact, Building and Object. The domains of health facility 1 are building industry, medicine and town planning; the Semantic File is artifact; the SUMO label is Building; finally, the Top Concept Ontology properties, inherited from the hypernym chain, are Artifact, Building and Object. The domain of patient 1 is medicine; the Semantic File is person; the SUMO label is Patient; finally, the Top Concept Ontology properties, inherited from the hypernym chain, are Function, Human, Living and Object. receive 2 has no explicit domain assigned. However, the Semantic File is change, the SUMO label is Getting, and the Top Concept Ontology properties are Dynamic and Experience. Finally, the domain of treatment 1 is medicine; the Semantic File is act; the SUMO label is TherapeuticProcess; and the Top Concept Ontology properties are Agentive, Cause, Condition, Dynamic, Purpose, Social and UnboundedEvent. Furthermore, receive 2 and treatment 1 are both Base Concepts.
In fact, hospital 1 is the place where the event of a patient 1 receiving (receive 2) treatment 1 occurs. In other words, do patients get therapeutic processes in all health facilities? What further inferencing capabilities can be derived from the knowledge currently integrated into the Mcr?
We want to stress that all this knowledge is also available to the rest of the languages included in the MCR. That is, we can automatically derive partial translations of the English glosses into the rest of the languages integrated into the MCR. For instance, in Table 17 we present the translations into Spanish and Italian of all the content words of the hospital 1 gloss appearing in the MCR. We want to stress again that they also share the same properties and relations.
This knowledge could be used to derive complete and more accurate translations of the English glosses (see Working Paper WP3.9 Statistical Machine Translation of WordNet glosses for further details of this process).
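The partial gloss translation described above amounts to a lookup of each disambiguated content word in the target wordnet. The sketch below uses the equivalences listed in Table 17; the data layout, the language codes and the function `translate_content_words` are assumptions, and NOLEX is represented as an empty list.

```python
EQUIVALENTS = {
    "hospital": {"es": ["hospital"], "it": ["ospedale"]},
    "health facility": {"es": ["centro medico", "centro de salud"], "it": []},  # NOLEX in Italian
    "patient": {"es": ["paciente"], "it": ["paziente"]},
    "receive": {"es": ["obtener"], "it": ["ottenere"]},
    "treatment": {"es": ["tratamiento", "terapia"], "it": ["cura", "terapia"]},
}

def translate_content_words(words, lang, equivalents=EQUIVALENTS):
    """Pair each gloss content word with its equivalents in `lang` (empty list if none)."""
    return [(w, equivalents.get(w, {}).get(lang, [])) for w in words]

gloss_words = ["health facility", "patient", "receive", "treatment"]
spanish = translate_content_words(gloss_words, "es")
italian = translate_content_words(gloss_words, "it")
```

The result is only a bag of candidate equivalents per word; producing a fluent translated gloss still requires the kind of machine translation step mentioned above.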
For instance, the inherent knowledge structure presented in this definition matches appropriately with the script-like frame structures of FrameNet [Baker et al., 1997]. Table 18 presents the corresponding slots for the frame Treatment.n. In particular, the hospital 1 example seems to fit correctly with several slots of this frame: core slot Patient (patient), core slot Treatment (treatment) and peripheral slot Place (hospital). We should devise ways to integrate FrameNet-like structures in future versions of the MCR, maybe taking
English          Spanish                         Italian
hospital         hospital                        ospedale
health facility  centro medico, centro de salud  NOLEX
patient          paciente                        paziente
receive          obtener                         ottenere
treatment        tratamiento, terapia            cura, terapia
Table 17: hospital 1 gloss translation equivalences for Spanish and Italian
Frame elements  Type
Affliction      Core
Body part       Core
Degree          Peripheral
Duration        Extra-Thematic
Healer          Core
Manner          Peripheral
Medication      Core
Motivation      Extra-Thematic
Patient         Core
Place           Peripheral
Purpose         Extra-Thematic
Time            Peripheral
Treatment       Core
Table 18: Treatment.n frame
This document described the third version of the Multilingual Central Repository (Mcr2) and the third Porting process (PORT2). We described the knowledge uploaded and integrated into Mcr2, including a brief description of a general Upload/Porting architecture. Finally, we provided a full description of the third Porting process.
The current version of the Mcr integrates wordnets from five different languages. The final version of the Mcr contains 1,642,389 unique semantic relations between concepts (ILI-records). This is one order of magnitude larger than the Princeton wordnet (138,091 unique semantic relations in WN1.6). Table 19 summarizes the main sources of the semantic relations integrated into Mcr2.
source                                          #relations
Acquired from Princeton WN1.6                   138.091
Selectional Preferences acquired from SemCor    203.546
Selectional Preferences acquired from BNC       707.618
New relations acquired from Princeton WN2.0     42.212
Gold relations from eXtended WN                 17.185
Silver relations from eXtended WN               239.249
Normal relations from eXtended WN               294.488
Total                                           1.642.389
Table 19: Main sources of semantic relations
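As a quick consistency check, the figures in Table 19 can be added up and compared with the WN1.6 baseline. The snippet below is only an arithmetic illustration; the dictionary keys are shortened labels chosen here, not identifiers used in the Mcr.

```python
# Relation counts per source, copied from Table 19.
relations = {
    "Princeton WN1.6": 138_091,
    "SP from SemCor": 203_546,
    "SP from BNC": 707_618,
    "Princeton WN2.0 (new)": 42_212,
    "eXtended WN gold": 17_185,
    "eXtended WN silver": 239_249,
    "eXtended WN normal": 294_488,
}

total = sum(relations.values())
# How many times larger the Mcr is than the WN1.6 relation set alone.
ratio = total / relations["Princeton WN1.6"]
```

The sum reproduces the stated total of 1,642,389 relations, and the ratio of roughly twelve to one is what the text refers to as one order of magnitude.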
Furthermore, the current Mcr has also been enriched with 466,972 semantic properties coming from different sources. Table 20 summarizes the main sources of the semantic properties integrated into Mcr2.
In fact, the resulting Mcr2 is the largest and richest multilingual lexical knowledge base ever built. In that way, the Mcr produced by Meaning is going to constitute the natural multilingual large-scale linguistic resource for a number of semantic processes that need large amounts of linguistic knowledge to be effective tools.
[Agirre and de Lacalle, 2004] Eneko Agirre and Oier Lopez de Lacalle. Publicly available topic signatures for all wordnet nominal senses. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal, 2004.
[Agirre and Martinez, 2001] E. Agirre and D. Martinez. Learning class-to-class selectional preferences. In Proceedings of CoNLL01, Toulouse, France, 2001.
[Agirre and Martinez, 2002] E. Agirre and D. Martinez. Integrating selectional preferences in wordnet. In Proceedings of the first International WordNet Conference in Mysore, India, 21-25 January 2002.
[Agirre et al., 2002] E. Agirre, O. Ansa, X. Arregi, J.M. Arriola, A. Diaz de Ilarraza, E. Pociello, and L. Uria. Methodological issues in the building of the basque wordnet: quantitative and qualitative analysis. In Proceedings of the first International WordNet Conference in Mysore, India, 21-25 January 2002.
[Alfonseca and Manandhar, 2002] E. Alfonseca and S. Manandhar. An unsupervised method for general named entity recognition and automated concept discovery. In Proceedings of the 1st International Conference on General WordNet, Mysore, India, 2002.
[Atserias et al., 1997] J. Atserias, S. Climent, X. Farreres, G. Rigau, and H. Rodríguez. Combining multiple methods for the automatic construction of multilingual wordnets. In Proceedings of RANLP'97, pages 143–149, Bulgaria, 1997.
[Atserias et al., 2004a] J. Atserias, S. Climent, and G. Rigau. Towards the meaning top ontology: Sources of ontological meaning. In 4th International Conference on Language Resources and Evaluation (LREC), 2004.
[Atserias et al., 2004b] J. Atserias, G. Rigau, and L. Villarejo. Spanish wordnet 1.6: Porting the spanish wordnet across princeton versions. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, 2004.
[Baker et al., 1997] C. Baker, C. Fillmore, and J. Lowe. The berkeley framenet project. In COLING/ACL'98, Montreal, Canada, 1997.
[Banerjee and Pedersen, 2003] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of 18th International Joint Conference on Artificial Intelligence (IJCAI'03), Acapulco, Mexico, 2003.
[Benítez et al., 1998] L. Benítez, S. Cervell, G. Escudero, M. Lopez, G. Rigau, and M. Taule. Methods and tools for building the catalan wordnet. In Proceedings of the ELRA Workshop on Language Resources for European Minority Languages, First International Conference on Language Resources & Evaluation, Granada, Spain, 1998.
[Daude et al., 1999] J. Daude, L. Padro, and G. Rigau. Mapping Multilingual Hierarchies Using Relaxation Labeling. In Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC'99), Maryland, US, 1999.
[Daude et al., 2000] J. Daude, L. Padro, and G. Rigau. Mapping WordNets Using Structural Information. In Proceedings of 38th annual meeting of the Association for Computational Linguistics (ACL'2000), Hong Kong, 2000.
[Daude et al., 2001] J. Daude, L. Padro, and G. Rigau. A complete wn1.5 to wn1.6 mapping. In Proceedings of NAACL Workshop "WordNet and Other Lexical Resources: Applications, Extensions and Customizations", Pittsburgh, PA, United States, 2001.
[Fellbaum, 1998] C. Fellbaum, editor. WordNet. An Electronic Lexical Database. The MIT Press, 1998.
[Fernandez et al., 2004] J. Fernandez, M. Castillo, G. Rigau, J. Atserias, and J. Turmo. Automatic acquisition of sense examples using exretriever. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, 2004.
[Gangemi et al., 2003] A. Gangemi, R. Navigli, and P. Velardi. Axiomatizing wordnet glosses in the ontowordnet project. In Proceedings of 2nd International Semantic Web Conference Workshop on Human Language Technology for the Semantic Web and Web Services, Sanibel Island, Florida, 2003.
[Guarino and Welty, 2000] Nicola Guarino and Christopher A. Welty. A formal ontology of properties. In Proceedings of ECAI'2000 Workshop on Knowledge Acquisition, Modeling and Management, pages 97–112, 2000.
[Hirst and St-Onge, 1998] G. Hirst and D. St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. In WordNet: An Electronic Lexical Database and Some of its Applications, Editor C. Fellbaum. MIT Press, 1998.
[Jiang and Conrath, 1997] J. Jiang and D. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, Taiwan, 1997.
[Kipper et al., 2000] K. Kipper, H. Trang Dang, and M. Palmer. Class-based construction of a verb lexicon. In AAAI-2000 Seventeenth National Conference on Artificial Intelligence, 2000.
[Leacock and Chodorow, 1998] C. Leacock and M. Chodorow. Combining local context and wordnet similarity for word sense identification. In WordNet: An Electronic Lexical Database and Some of its Applications, Editor C. Fellbaum. MIT Press, 1998.
[Lin, 1998a] D. Lin. Automatic retrieval and clustering of similar words. In Proceedings of COLING-ACL'1998, Montreal, Canada, 1998.
[Lin, 1998b] D. Lin. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning (ICML'98), Madison, Wisconsin, USA, 1998.
[Lyons, 1977] J. Lyons, editor. Semantics 1. Cambridge University Press, Cambridge, UK, 1977.
[Magnini and Cavaglia, 2000] B. Magnini and G. Cavaglia. Integrating subject field codes into wordnet. In Proceedings of the Second International Conference on Language Resources and Evaluation LREC'2000, Athens, Greece, 2000.
[McCarthy, 2001] D. McCarthy. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. PhD thesis, University of Sussex, 2001.
[Mihalcea and Moldovan, 2001] R. Mihalcea and D. Moldovan. Extended wordnet: Progress report. In Proceedings of NAACL Workshop on WordNet and Other Lexical Resources, Pittsburgh, PA, 2001.
[Niles and Pease, 2001] I. Niles and A. Pease. Towards a standard upper ontology. In Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), pages 17–19. Chris Welty and Barry Smith, eds, 2001.
[Patwardhan, 2003] S. Patwardhan. Incorporating dictionary and corpus information into a context vector measure of semantic relatedness. Master's thesis, University of Minnesota, Duluth, 2003.
[Pianta et al., 2002] E. Pianta, L. Bentivogli, and C. Girardi. Multiwordnet: developing an aligned multilingual database. In First International Conference on Global WordNet, Mysore, India, 2002.
[Resnik, 1995] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of 14th International Joint Conference on Artificial Intelligence (IJCAI'95), Montreal, Canada, 1995.
[Sofia et al., 2002a] S. Sofia, N. Alexandros, H. Jeroen, S. Maximiliano, and C. Dimitris. Extending the eurowordnet with domain-specific terminology using an expand model approach. In Proceedings of the 1st Global WordNet Association conference, Mysore, India, 2002.
[Sofia et al., 2002b] S. Sofia, O. Kemal, P. Karel, C. Dimitris, C. Dan, T. Dan, K. Svetla, T. George, D. Dominique, and G. Maria. Balkanet: A multilingual semantic network for the balkan languages. In Proceedings of the 1st Global WordNet Association conference, 2002.
[Vossen, 1998] P. Vossen, editor. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, 1998.
[Wagner, 2002] A. Wagner. Learning thematic role relations for wordnets. In Proceedings of ESSLLI-2002 Workshop on Machine Learning Approaches in Computational Linguistics, Trento, Italy, 2002.
[Wu and Palmer, 1994] Z. Wu and M. Palmer. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL'94), Las Cruces, New Mexico, 1994.