A Framework for Semantic Mapping between Thesauri E. Francesconi, S. Faro, E. Marinai ITTIG-CNR – Institute of Legal Information Theory and Techniques Italian National Research Council ICEGOV 08 - Cairo, Egypt, December 1 st –4 th 2008 Francesconi, Faro, Marinai A Framework for Semantic Mapping between Thesauri
Page 1: A framework for semantic mapping between thesauri

A Framework for Semantic Mapping between Thesauri

E. Francesconi, S. Faro, E. Marinai

ITTIG-CNR – Institute of Legal Information Theory and Techniques
Italian National Research Council

ICEGOV 08 - Cairo, Egypt, December 1st–4th 2008

Francesconi, Faro, Marinai A Framework for Semantic Mapping between Thesauri

Page 2: A framework for semantic mapping between thesauri

Overview

Introduction to the Thesaurus Interoperability problem

Overview of thesaurus mapping modalities

Our formal characterization of the thesaurus mapping problem

Interoperability workflow and standards

– Thesaurus mapping algorithms implementation

– The “Gold Standard” data set and the THALEN application

Thesaurus interoperability assessment

Experimental results and conclusions


Page 3: A framework for semantic mapping between thesauri

Rationale

Problem of accessing heterogeneous data sources in a distributed environment;

Terminological resources (thesauri) or ontologies can guarantee a better quality in document indexing and retrieval;

Cross-collection retrieval:

– providing queries from a single interface using a specific thesaurus as support (where available), and retrieving pertinent documents from different collections.

Quality of retrieval in single collections

– linked to availability of specific thesauri

Quality of retrieval in cross-collections

– linked to interoperability among thesauri


Page 4: A framework for semantic mapping between thesauri

Interoperability among Thesauri

Using a particular thesaurus for querying a collection

Mapping this thesaurus:

– to thesauri in other languages
– to more specialized vocabularies
– to different versions of the thesaurus

to obtain retrieval from different collections that is coherent with the original query


Page 5: A framework for semantic mapping between thesauri

Interoperability among Thesauri: the case study

EUROVOC: the main EU thesaurus, covering issues of specific and common interest for the EU and its Member States

ECLAS: the European Commission Central Libraries thesaurus

GEMET: GEneral Multilingual Environmental Thesaurus

UNESCO Thesaurus: developed by the United Nations Educational, Scientific and Cultural Organisation

European Training Thesaurus (ETT): a thesaurus supporting the indexing and retrieval of vocational education and training documentation in the European Union


Page 6: A framework for semantic mapping between thesauri

Thesaurus Mapping (TM)

Definition

The process of identifying terms, concepts and hierarchical relationships that are approximately equivalent between thesauri

The problem thus shifts to the definition of concept equivalence


Page 7: A framework for semantic mapping between thesauri

Concept equivalence

Definition (Instance-based equivalence)

Two concepts are deemed to be equivalent if they are associated with, or classify, the same set of objects

Definition (Schema-based equivalence)

Two concepts are deemed to be equivalent if there exists a similarity among their features


Page 8: A framework for semantic mapping between thesauri

Identification of the problem characteristics

Thesaurus mapping for the project case study is a problem of term alignment, where only schema information is available

The task is to measure the conceptual / semantic similarity between a term (simple or complex) in the source thesaurus and candidate terms in a target thesaurus (Schema-based matching)


Page 9: A framework for semantic mapping between thesauri

Our proposal for Thesaurus Mapping formal characterization

We have proposed to characterize the problem of Thesaurus Mapping (TM) as a problem of Information Retrieval (IR)

In IR the aim is to find the documents, in a document collection, that best match the semantics of a query

Similarly, in TM the aim is to find the terms, in a term collection (target thesaurus), that best match the semantics of a given term in a source thesaurus

TM ⇐⇒ IR
Term in source thesaurus ⇐⇒ Query
Term in target thesaurus ⇐⇒ Document


Page 10: A framework for semantic mapping between thesauri

Our TM formal characterization

Definition

We propose to characterize TM as a 4-tuple [D, Q, F, R(q_i, d_j)] where:

– D is the set of possible representations (logical views) of a term in a target thesaurus (in IR, the documents in a collection)
– Q is the set of possible representations (logical views) of a term in a source thesaurus (in IR, the queries to be matched with the documents of the collection)
– F is the framework of term representations in source and target thesauri
– R(q_i, d_j) is the ranking function, which associates a real number with a pair (q_i, d_j), where q_i ∈ Q and d_j ∈ D, giving an order of relevance to the terms d_j in a target thesaurus with respect to a term q_i of the source thesaurus
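The definition above can be read as a retrieval loop: a source term plays the role of the query, each target term plays the role of a document, and R orders the targets. The following is a minimal Python sketch under stated assumptions, not the authors' implementation. The names `rank_candidates` and `token_overlap` are hypothetical, the example terms are invented, and the toy ranking function (Jaccard overlap of tokens) merely stands in for the ranking functions the presentation proposes.

```python
from typing import Callable, List, Tuple

# R(q_i, d_j): any function scoring a source-term view against a target-term view.
RankingFunction = Callable[[str, str], float]


def token_overlap(q: str, d: str) -> float:
    """Toy ranking function (an assumption): Jaccard overlap of lowercase tokens."""
    q_tokens, d_tokens = set(q.lower().split()), set(d.lower().split())
    if not q_tokens and not d_tokens:
        return 0.0
    return len(q_tokens & d_tokens) / len(q_tokens | d_tokens)


def rank_candidates(
    q: str, targets: List[str], r: RankingFunction
) -> List[Tuple[str, float]]:
    """Order target-thesaurus terms (the 'documents') by relevance to q (the 'query')."""
    scored = [(d, r(q, d)) for d in targets]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented example terms, for illustration only.
ranking = rank_candidates(
    "water pollution",
    ["air pollution", "water pollution", "fisheries"],
    token_overlap,
)
```

Keeping the ranking function as a parameter mirrors the role of R in the 4-tuple: the same loop works for any logical view as long as a matching similarity measure is supplied.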


Page 11: A framework for semantic mapping between thesauri

Isomorphism between TM and IR

TM ⇐⇒ IR


Page 12: A framework for semantic mapping between thesauri

Term lexical manifestation and semantics

Different terms can be used to identify the same concept:

– in the same language (e.g. ‘pollution’, ‘contamination’, ‘discharge of pollutants’);
– in different languages (e.g. EUROVOC EN term ‘water’ and IT term ‘acqua’)

TM should aim at matching term meanings (the semantics of the terms) rather than formal (lexical) manifestations

Hypothesis

The more terms are semantically characterized, the more the system will be able to match them according to their meanings


Page 13: A framework for semantic mapping between thesauri

The proposed Logical Views of terms in source (Q) and target (D) thesauri

Representation of term semantics

The semantics of a term is conveyed by:

1 its morphological characteristics
2 the context in which the term is used
3 the relations with other terms

We have proposed to represent the semantics of a term in a thesaurus by:

1 its Lexical Manifestation: strings (pre-processed strings)
2 its Lexical Context: vector of weighted/binary terms (the term itself and other related terms)
3 its Lexical Network: graph of terms (nodes are terms and labeled edges are relations)


Page 14: A framework for semantic mapping between thesauri

A Lexical Manifestation

(Stemmed variation)

Parliamentary committees → Parliament$ committee$


Page 15: A framework for semantic mapping between thesauri

A Lexical Context

A Lexical Context is a vector d = [w_1, ..., w_|T|] of binary/weighted terms, where |T| is the size of the target thesaurus vocabulary T
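As a rough illustration, a binary Lexical Context can be built by marking which vocabulary entries occur in the term itself or in its related terms, and two such vectors can then be compared with cosine similarity, the measure the presentation pairs with this view. The sketch below is a minimal Python example under assumptions: the function names, the four-word vocabulary and the example terms are all hypothetical, and the weighting scheme for the non-binary case is not specified in the source, so only the binary case is shown.

```python
import math
from typing import List


def binary_context(
    term_tokens: List[str], related_tokens: List[str], vocabulary: List[str]
) -> List[float]:
    """One component per vocabulary entry: 1.0 if it occurs in the term or its related terms."""
    present = set(term_tokens) | set(related_tokens)
    return [1.0 if entry in present else 0.0 for entry in vocabulary]


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vocabulary and terms, for illustration only.
vocabulary = ["pollution", "water", "air", "contamination"]
source_vec = binary_context(["water", "pollution"], ["contamination"], vocabulary)
target_vec = binary_context(["pollution"], ["water", "air"], vocabulary)
similarity = cosine_similarity(source_vec, target_vec)
```

Because the vector has one component per vocabulary entry, its length grows with the thesaurus, which is one plausible reason richer views become expensive on large vocabularies.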


Page 16: A framework for semantic mapping between thesauri

A Lexical Network
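A Lexical Network, described earlier as a graph whose nodes are terms and whose labeled edges are relations, can be held in a very small data structure. The Python sketch below is only one possible encoding, assuming edges are stored as (term, relation, neighbour) triples; the function name `build_network`, the relation labels and the example terms are illustrative and not taken from the presentation.

```python
from typing import Dict, List, Set, Tuple

# An edge is (source term, relation label, target term).
Edge = Tuple[str, str, str]


def build_network(
    term: str, relations: Dict[str, List[str]]
) -> Tuple[Set[str], Set[Edge]]:
    """Build the node set and labeled edge set around one thesaurus term."""
    nodes: Set[str] = {term}
    edges: Set[Edge] = set()
    for label, neighbours in relations.items():
        for neighbour in neighbours:
            nodes.add(neighbour)
            edges.add((term, label, neighbour))
    return nodes, edges


# Invented example: one term with a broader term and two related terms.
nodes, edges = build_network(
    "water pollution",
    {"broader": ["pollution"], "related": ["water", "waste water"]},
)
```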


Page 17: A framework for semantic mapping between thesauri

The proposed Ranking Functions (R)

1 Lexical Manifestation: Levenshtein Distance/Similarity (normalized minimum number of operations (insertion, deletion or substitution of a single character) needed to transform one string into another).

2 Lexical Context: Cosine Distance/Similarity

3 Lexical Network: Graph Edit Distance/Similarity
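The first ranking function in the list can be sketched in a few lines of Python. The distance is the standard dynamic-programming Levenshtein distance; the normalization is an assumption, since the slide only says "normalized": here the distance is divided by the length of the longer string and subtracted from 1 to obtain a similarity in [0, 1].

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,        # deletion
                    current[j - 1] + 1,     # insertion
                    previous[j - 1] + cost  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

For example, `levenshtein_similarity("committee", "committees")` differs by one insertion over ten characters, giving 0.9.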


Page 18: A framework for semantic mapping between thesauri

Standards for Interoperability Environment

The interoperability environment is based on RDF standards for thesaurus description and mapping:

– SKOS Core

– SKOS Mapping (exactMatch, partial match (broadMatch, narrowMatch))


Page 19: A framework for semantic mapping between thesauri

Interoperability Workflow


Page 20: A framework for semantic mapping between thesauri

Workflow

1 SKOS Core transformation of each thesaurus (XSLT technologies)

2 Thesaurus term pre-processing

3 Thesaurus term representation
– Lexical Manifestation
– Lexical Context
– Lexical Network

4 Thesaurus term candidate selection for mapping
– Levenshtein Distance/Similarity (for the Lexical Manifestation)
– Cosine Distance/Similarity (for the Lexical Context)
– Graph Edit Distance/Similarity (for the Lexical Network)

5 Ranking among candidate terms and mapping implementation
– if sim < T1 ⇒ No Match
– if T1 < sim < T2 ⇒ partial match (broadMatch or narrowMatch)
– if T2 < sim ⇒ exactMatch

6 Representation of the semantics of mapping in SKOS Mapping
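Steps 5 and 6 of the workflow can be sketched as a threshold rule followed by serialization of the chosen relation. The Python below is a hedged illustration only. The slide leaves the cases sim = T1 and sim = T2 unspecified, so assigning them to the higher class is an assumption; the slide also does not say how a partial match is resolved into broadMatch or narrowMatch, so the sketch returns a generic "partial" label. The namespace is that of the current SKOS Recommendation, which may differ from the SKOS Mapping vocabulary used in 2008, and the `example.org` URIs are placeholders.

```python
from typing import Optional

# Namespace of the current SKOS Recommendation (an assumption for this sketch).
SKOS = "http://www.w3.org/2004/02/skos/core#"


def classify_match(sim: float, t1: float, t2: float) -> Optional[str]:
    """Map a similarity score to None (no match), "partial" or "exactMatch".

    Boundary values are assigned to the higher class; the source does not
    specify them.
    """
    if not t1 < t2:
        raise ValueError("expected t1 < t2")
    if sim < t1:
        return None
    if sim < t2:
        return "partial"
    return "exactMatch"


def to_ntriple(source_uri: str, relation: str, target_uri: str) -> str:
    """Serialize one mapping as an N-Triples line, e.g. relation='exactMatch'."""
    return f"<{source_uri}> <{SKOS}{relation}> <{target_uri}> ."


# Placeholder URIs and thresholds, for illustration only.
label = classify_match(0.95, t1=0.5, t2=0.9)
triple = to_ntriple("http://example.org/src/water", label, "http://example.org/tgt/water")
```

In the setting the presentation describes, a "partial" result would still have to be turned into broadMatch or narrowMatch, for instance by an expert during validation.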

Page 21: A framework for semantic mapping between thesauri

Interoperability assessment through thesaurus mapping “Gold Standard”


Page 22: A framework for semantic mapping between thesauri

“Gold Standard”

The Gold Standard is a ground truth of thesaurus term mapping examples

In other words, it is the ideal set of expected correct mappings, against which the system predictions are compared

It is aimed at:

tuning heuristics (performance convergence)

evaluating the performances of automatic mapping algorithms


Page 23: A framework for semantic mapping between thesauri

“Gold Standard” creation

The “Gold Standard” is composed of thesaurus descriptors, using English as pivot language

Mapping relations are described using SKOS Mapping (exactMatch, broadMatch and narrowMatch relations)

According to the project requirements, mappings between EUROVOC (pivot) and other thesauri have been established


Page 24: A framework for semantic mapping between thesauri

Harmonizing criteria for “Gold Standard” creation

English as pivot language

exactMatch: a concept in EUROVOC corresponds exactly to one or more concepts in a target thesaurus according to the expert judgment

broadMatch/narrowMatch: for each broad/narrow match that can be established, the narrowest/broadest (the more/less specific) of all the broad/narrow matches has to be chosen: Complete and Optimal mapping [Liang and Sini, 2006] [Doerr, 2001]


Page 25: A framework for semantic mapping between thesauri

The chosen solution for “Gold Standard” implementation

Specific application for user-friendly access to thesauri and simple functionalities for thesauri alignment

The application for THesauri ALigning ENvironment (THALEN) is based on an MS Access relational database


Page 26: A framework for semantic mapping between thesauri

Main functionalities of the THesauri ALigning ENvironment (THALEN)

login/logout

thesaurus loading

parallel view of two thesauri

search modalities:

– term browsing
– term searching (using specific properties (descriptors, fields, etc.) or full text searching)

mapping (term selection, choice of mapping relations, etc.)

summary of established term mappings

exporting RDF SKOS mapping relations

Side-use of the application: human validation of the automatic mapping


Page 27: A framework for semantic mapping between thesauri

A THALEN screenshot (thesauri parallel view and mapping)


Page 28: A framework for semantic mapping between thesauri

The “Gold Standard” data set

Number of “Gold Standard” relations: 624

These include 346 exactMatch relations


Page 29: A framework for semantic mapping between thesauri

Interoperability Assessment


Page 30: A framework for semantic mapping between thesauri

Interoperability Assessment

Assessment on the “Gold Standard” data set

Automatic mapping as support for the activities of an editorial staff of experts, cooperating in the identification of matching concepts

The system Recall has been assessed, since the automatic mapping is intended to identify matching concepts within the system predictions, to be validated by humans.
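Recall here is the fraction of Gold Standard mappings that also appear among the system predictions. A minimal Python sketch follows, with mappings stored as (source term, relation, target term) triples. Reading "untyped" recall as recall that ignores the relation type, and "typed" recall as recall restricted to one relation such as exactMatch, is an interpretation on my part rather than something the slides define; the function names and the example data are invented.

```python
from typing import Set, Tuple

# A mapping is (source term, relation, target term).
Mapping = Tuple[str, str, str]


def untyped_recall(predicted: Set[Mapping], gold: Set[Mapping]) -> float:
    """Share of gold (source, target) pairs that were predicted with any relation."""
    gold_pairs = {(s, t) for s, _, t in gold}
    predicted_pairs = {(s, t) for s, _, t in predicted}
    if not gold_pairs:
        raise ValueError("gold standard is empty")
    return len(gold_pairs & predicted_pairs) / len(gold_pairs)


def typed_recall(predicted: Set[Mapping], gold: Set[Mapping], relation: str) -> float:
    """Share of gold mappings of one relation type predicted with that same type."""
    gold_typed = {m for m in gold if m[1] == relation}
    if not gold_typed:
        raise ValueError(f"no gold mappings of type {relation}")
    return len(gold_typed & predicted) / len(gold_typed)


# Invented data, for illustration only.
gold = {
    ("water", "exactMatch", "water"),
    ("pollution", "exactMatch", "contamination"),
    ("vocational training", "broadMatch", "training"),
    ("air quality", "narrowMatch", "urban air quality"),
}
predicted = {
    ("water", "exactMatch", "water"),
    ("vocational training", "exactMatch", "training"),
    ("air quality", "exactMatch", "air"),
}
```

Measuring recall rather than precision fits the setting described above: experts validate the predictions, so a missed mapping costs more than a spurious candidate that a human can discard.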


Page 31: A framework for semantic mapping between thesauri

Experimental Results

The proposed logical views for thesaurus terms and ranking functions outperformed simple string matching

Best results for each thesaurus pair

For EUROVOC vs. {ETT, ECLAS, GEMET}:
– Lexical Manifestation logical view and Levenshtein Similarity ranking function
– untypedMatch Recall = 66.2%, exactMatch Recall = 82.3%

For EUROVOC vs. UNESCO Thesaurus:
– Lexical Network logical view and Conceptual Similarity ranking function
– untypedMatch Recall = 73.7%, exactMatch Recall = 80.8%


Page 32: A framework for semantic mapping between thesauri

Conclusions

We have presented a methodological framework and a specific implementation of schema-based thesaurus mapping

Term logical views and related ranking functions for matching have been proposed and tested.

The Lexical Manifestation logical view and Levenshtein Similarity ranking function produced the best results in most cases.

More complex descriptions (Lexical Contexts, Lexical Networks) suffer from problems of computational tractability

Different feature selection criteria can be tested to reduce:

– the computational complexity

– the variability of the similarity measures


Page 33: A framework for semantic mapping between thesauri

Doerr, M. (2001). Semantic problems of thesaurus mapping. Journal of Digital Information, 1(8).

Liang, A. C. and Sini, M. (2006). Mapping AGROVOC and the Chinese Agricultural Thesaurus: Definitions, tools, procedures. New Review of Hypermedia and Multimedia, 12(1):51–62.
