A Framework for Semantic Mapping between Thesauri E. Francesconi, S. Faro, E. Marinai ITTIG-CNR – Institute of Legal Information Theory and Techniques Italian National Research Council ICEGOV 08 - Cairo, Egypt, December 1st–4th 2008 Francesconi, Faro, Marinai A Framework for Semantic Mapping between Thesauri
A Framework for Semantic Mapping between Thesauri
E. Francesconi, S. Faro, E. Marinai
ITTIG-CNR – Institute of Legal Information Theory and Techniques
Italian National Research Council
ICEGOV 08 - Cairo, Egypt, December 1st–4th 2008
Overview
Introduction to the Thesaurus Interoperability problem
Overview of thesaurus mapping modalities
Our formal characterization of the thesaurus mapping problem
Interoperability workflow and standards
– Thesaurus Mapping algorithms implementation
– The “Gold Standard” data set and the THALEN application
Thesaurus interoperability assessment
Experimental results and conclusions
Rationale
Problem of accessing heterogeneous data sources in a distributed environment;
Terminological resources (thesauri) or ontologies can guarantee a better quality in document indexing and retrieval;
Cross-collections retrieval:
– providing queries from a single interface using a specific thesaurus as support (where available), and retrieving pertinent documents from different collections.
Quality of retrieval in single collections
– linked to availability of specific thesauri
Quality of retrieval in cross-collections
– linked to interoperability among thesauri
Interoperability among Thesauri
Using a particular thesaurus for querying a collection
Mapping this thesaurus:
– to thesauri in other languages
– to more specialized vocabularies
– to different versions of the thesaurus
to obtain a retrieval from different collections which is coherent with the original query
Interoperability among Thesauri: the case study
EUROVOC: the main EU thesaurus, covering issues of specific and common interest for the EU and its Member States
ECLAS: the European Commission Central Libraries thesaurus
GEMET: GEneral Multilingual Environmental Thesaurus
UNESCO Thesaurus: developed by the United Nations Educational, Scientific and Cultural Organisation
European Training Thesaurus (ETT): a thesaurus supporting the indexing and retrieval of vocational education and training documentation in the European Union
Thesaurus Mapping (TM)
Definition
The process of identifying terms, concepts and hierarchical relationships that are approximately equivalent between thesauri
The problem thus shifts to the definition of concept equivalence
Concept equivalence
Definition (Instance-based equivalence)
Two concepts are deemed to be equivalent if they are associated with, or classify, the same set of objects
Definition (Schema-based equivalence)
Two concepts are deemed to be equivalent if there exists a similarity among their features
Identification of the problem characteristics
Thesaurus mapping for the project case study is a problem of term alignment, where only schema information is available
The task is to measure the conceptual / semantic similarity between a term (simple or complex) in the source thesaurus and candidate terms in a target thesaurus (schema-based matching)
Our proposal for Thesaurus Mapping formal characterization
We have proposed to characterize the problem of Thesaurus Mapping (TM) as a problem of Information Retrieval (IR)
In IR the aim is to find the documents, in a document collection, that best match the semantics of a query
Similarly, in TM the aim is to find the terms, in a term collection (target thesaurus), that best match the semantics of a given term in a source thesaurus
TM ⇐⇒ IR
Term in source thesaurus ⇐⇒ Query
Term in target thesaurus ⇐⇒ Document
Our TM formal characterization
Definition
We propose to characterize TM as a 4-tuple [D, Q, F, R(q_i, d_j)] where:
D is the set of possible representations (logical views) of a term in a target thesaurus (in IR, the documents in a collection)
Q is the set of possible representations (logical views) of a term in a source thesaurus (in IR, the queries to be matched with documents of the collection)
F is the framework of term representations in source and target thesauri
R(q_i, d_j) is the ranking function, which associates a real number with a pair (q_i, d_j), where q_i ∈ Q and d_j ∈ D, giving an order of relevance to the terms d_j of a target thesaurus with respect to a term q_i of the source thesaurus
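Read as a retrieval loop, the definition says: take one source term as the query, score every target term with R, and sort. The following is a minimal sketch of that loop only. The helper names `rank_targets` and `token_overlap` are hypothetical, and the word-overlap score is a toy stand-in for R, not one of the ranking functions proposed in the slides.

```python
from typing import Callable, List, Tuple


def rank_targets(
    query: str,
    targets: List[str],
    ranking: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every target term d_j against the source term q_i, best first."""
    scored = [(d, ranking(query, d)) for d in targets]
    # sorted() is stable, so equally scored targets keep their input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def token_overlap(q: str, d: str) -> float:
    """Toy stand-in for R(q_i, d_j): Jaccard overlap of lower-cased word sets."""
    q_tokens, d_tokens = set(q.lower().split()), set(d.lower().split())
    if not q_tokens or not d_tokens:
        return 0.0
    return len(q_tokens & d_tokens) / len(q_tokens | d_tokens)


# Hypothetical target terms, used only to exercise the loop.
ranked = rank_targets(
    "water pollution",
    ["air pollution", "water pollution", "water"],
    token_overlap,
)
```

Passing the ranking function as a parameter mirrors the definition: D, Q and F fix how terms are represented, while R can be swapped without touching the loop.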
Isomorphism between TM and IR
TM ⇐⇒ IR
Term lexical manifestation and semantics
Different terms can be used to identify the same concept
– in the same language (e.g. ‘pollution’, ‘contamination’, ‘discharge of pollutants’);
– in different languages (e.g. EUROVOC EN term ‘water’ and IT term ‘acqua’)
TM should aim at matching term meanings (the semantics of the terms) rather than formal (lexical) manifestations
Hypothesis
The more richly terms are semantically characterized, the better the system will be able to match them according to their meanings
The proposed Logical Views of terms in source (Q) and target (D) thesauri
Representation of term semantics
The semantics of a term is conveyed by:
1 its morphological characteristics
2 the context in which the term is used
3 the relations with other terms
We have proposed to represent the semantics of a term in a thesaurus by:
1 its Lexical Manifestation: strings (pre-processed strings)
2 its Lexical Context: vector of weighted/binary terms (the term itself and other related terms)
3 its Lexical Network: graph of terms (nodes are terms and labeled edges are relations)
A Lexical Manifestation
(Stemmed variation)
Parliamentary committees → Parliament$ committee$
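The slides do not say which stemmer produced this form. Purely as an illustration of the idea, the sketch below strips a couple of hard-coded suffixes and marks the cut with `$`. The suffix list and the helper names `toy_stem` and `lexical_manifestation` are assumptions chosen to reproduce the example above, not the project's pre-processing.

```python
# Illustrative suffix list only; a real system would use a proper stemmer.
SUFFIXES = ("ary", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix and mark the truncation with '$'."""
    lower = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters of stem to avoid mangling short words.
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + "$"
    return word


def lexical_manifestation(term: str) -> str:
    """Apply the toy stemmer to each whitespace-separated word of a term."""
    return " ".join(toy_stem(word) for word in term.split())


stemmed = lexical_manifestation("Parliamentary committees")
```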
A Lexical Context
A Lexical Context is a vector d = [w_1, ..., w_|T|] of binary/weighted terms, where |T| is the size of the target thesaurus vocabulary T
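A binary Lexical Context can be sketched as one slot per vocabulary word, set to 1 when that word occurs in the term or in its related terms. The example pairs it with cosine similarity, the ranking function the slides associate with this view. The function names, the toy vocabulary and the related terms are hypothetical, and weighting schemes other than binary are not shown.

```python
import math
from typing import Iterable, List, Sequence


def binary_context_vector(context_terms: Iterable[str], vocabulary: List[str]) -> List[int]:
    """Return [w_1, ..., w_|T|] with w_k = 1 if vocabulary[k] occurs in the context."""
    present = {word.lower() for term in context_terms for word in term.split()}
    return [1 if word in present else 0 for word in vocabulary]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical lower-cased vocabulary and contexts (term plus related terms).
vocabulary = ["air", "pollution", "water", "waste"]
source_vec = binary_context_vector(["water pollution", "water"], vocabulary)
target_vec = binary_context_vector(["pollution", "air pollution"], vocabulary)
similarity = cosine_similarity(source_vec, target_vec)
```

Here the two contexts share only the word "pollution", so the similarity is one half.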
A Lexical Network
The proposed Ranking Functions (R)
1 Lexical Manifestation: Levenshtein Distance/Similarity (normalized minimum number of operations (insertion, deletion or substitution of a single character) needed to transform one string into another).
2 Lexical Context: Cosine Distance/Similarity
3 Lexical Network: Graph Edit Distance/Similarity
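For the first of these ranking functions, a minimal sketch of Levenshtein distance and a similarity derived from it follows. The slides say the distance is normalized but not how; dividing by the length of the longer string is an assumption made here, as are the function names.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 when every character must change.

    Assumption: the distance is normalized by the longer string's length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

Keeping only two rows of the dynamic-programming table keeps memory proportional to the length of one string, which matters when one source term is compared against every term of a target thesaurus.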
Standards for Interoperability Environment
The interoperability environment is based on RDF standards for thesaurus description and mapping:
– SKOS Core
– SKOS Mapping (exactMatch, partial match (broadMatch, narrowMatch))
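To make the mapping relations concrete, the sketch below writes one mapping as an N-Triples line using plain string formatting. Two assumptions are made loudly: the namespace is that of the later W3C SKOS Reference, where exactMatch, broadMatch and narrowMatch live alongside SKOS Core, rather than the earlier SKOS Mapping draft the slides refer to; and the concept URIs under `example.org` are hypothetical placeholders, not real EUROVOC or GEMET identifiers.

```python
# Assumption: namespace of the W3C SKOS Reference, not the earlier mapping draft.
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Only the relations named in the slides are accepted.
ALLOWED_RELATIONS = {"exactMatch", "broadMatch", "narrowMatch"}


def mapping_triple(source_uri: str, relation: str, target_uri: str) -> str:
    """Serialize one concept mapping as a single N-Triples statement."""
    if relation not in ALLOWED_RELATIONS:
        raise ValueError(f"unsupported mapping relation: {relation}")
    return f"<{source_uri}> <{SKOS}{relation}> <{target_uri}> ."


# Hypothetical concept URIs, for illustration only.
triple = mapping_triple(
    "http://example.org/eurovoc/water",
    "exactMatch",
    "http://example.org/gemet/water",
)
```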
Interoperability Workflow
Workflow
1 SKOS Core transformation of each thesaurus (XSLT technologies)
2 Thesaurus term pre-processing
3 Thesaurus term representation
4 Thesaurus term candidate selection for mapping
– Levenshtein Distance/Similarity (for the Lexical Manifestation)
– Cosine Distance/Similarity (for the Lexical Context)
– Graph Edit Distance/Similarity (for the Lexical Network)
5 Ranking among candidate terms and mapping implementation
– if sim < T1 ⇒ No Match
– if T1 < sim < T2 ⇒ partial match (broadMatch or narrowMatch)
– if T2 < sim ⇒ exactMatch
6 Representation of the semantics of mapping in SKOS Mapping
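The thresholding in step 5 can be sketched as a small function. The slides leave two points open, so both are assumptions here: a similarity exactly equal to T1 or T2 is assigned to the higher band, and partial matches are returned as a single label because the slides do not state how broadMatch is told apart from narrowMatch. The threshold values in the usage lines are made up.

```python
from typing import Optional


def classify_match(sim: float, t1: float, t2: float) -> Optional[str]:
    """Map a similarity score to a mapping type using thresholds T1 < T2.

    Assumption: boundary values fall into the higher band (sim == T1 is a
    partial match, sim == T2 is an exact match).
    """
    if not t1 < t2:
        raise ValueError("expected T1 < T2")
    if sim < t1:
        return None  # No Match
    if sim < t2:
        return "partialMatch"  # broadMatch or narrowMatch, decided elsewhere
    return "exactMatch"


# Hypothetical thresholds, chosen only for the example.
T1, T2 = 0.5, 0.9
labels = [classify_match(s, T1, T2) for s in (0.2, 0.7, 0.95)]
```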
“Gold Standard”
The Gold Standard is a ground truth of thesaurus term mapping examples
In other words, it is the ideal set of expected correct mappings, against which the system predictions are compared
It is aimed at:
tuning heuristics (performance convergence)
evaluating the performance of automatic mapping algorithms
“Gold Standard” creation
The “Gold Standard” is composed of thesaurus descriptors, using English as pivot language
Mapping relations are described using SKOS Mapping (exactMatch, broadMatch and narrowMatch relations)
According to the project requirements, mappings between EUROVOC (pivot) and the other thesauri have been established
Harmonizing criteria for “Gold Standard” creation
English as pivot language
exactMatch: a concept in EUROVOC corresponds exactly to one or more concepts in a target thesaurus according to expert judgment
broadMatch/narrowMatch: for each broad/narrow match that can be established, the narrowest/broadest (the most/least specific) of all the broad/narrow matches has to be chosen: Complete and Optimal mapping [Liang and Sini, 2006] [Doerr, 2001]
The chosen solution for “Gold Standard” implementation
A dedicated application providing user-friendly access to thesauri and simple functionalities for thesaurus alignment
The application, THesauri ALigning ENvironment (THALEN), is based on an MS Access relational database
Main functionalities of the THesauri ALigning ENvironment (THALEN)
login/logout
thesaurus loading
parallel view of two thesauri
search modalities:
– term browsing
– term searching (using specific properties (descriptors, fields, etc.) or full text searching)
mapping (term selection, choice of mapping relations, etc.)
summary of established term mappings
exporting RDF SKOS mapping relations
Side-use of the application: human validation of the automatic mapping
A THALEN screenshot (thesauri parallel view and mapping)
The “Gold Standard” data set
Number of “Gold Standard” relations: 624
of which 346 are exactMatch relations
Interoperability Assessment
Interoperability Assessment
Assessment on the “Gold Standard” data set
Automatic mapping as support for the activities of an editorial staff of experts, cooperating in the identification of matching concepts
The system Recall has been assessed, since the automatic mapping is meant to surface matching concepts among the system predictions, which are then validated by humans.
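Recall against a gold standard is the share of gold mappings that the system also proposes. The sketch below computes it twice, once ignoring the relation type and once restricted to exactMatch, on the reading that an "untyped" match counts a source/target pair regardless of its relation. That reading, the helper names and the term identifiers are all assumptions for illustration; the data is a toy set, not the project's gold standard.

```python
from typing import Set, Tuple

Mapping = Tuple[str, str, str]  # (source term, relation, target term)


def recall(predicted: set, gold: set) -> float:
    """Fraction of gold items that also appear among the predictions."""
    if not gold:
        raise ValueError("gold standard must not be empty")
    return len(predicted & gold) / len(gold)


def untyped(mappings: Set[Mapping]) -> set:
    """Drop the relation type, keeping only (source, target) pairs."""
    return {(src, tgt) for src, _rel, tgt in mappings}


def of_type(mappings: Set[Mapping], relation: str) -> Set[Mapping]:
    """Keep only the mappings that carry the given relation type."""
    return {m for m in mappings if m[1] == relation}


# Toy data with hypothetical term identifiers.
gold = {
    ("e1", "exactMatch", "t1"),
    ("e2", "broadMatch", "t2"),
    ("e3", "exactMatch", "t3"),
    ("e4", "narrowMatch", "t4"),
}
predicted = {
    ("e1", "exactMatch", "t1"),
    ("e2", "exactMatch", "t2"),   # right pair, wrong relation type
    ("e3", "broadMatch", "t3"),   # right pair, wrong relation type
    ("e5", "exactMatch", "t9"),   # not in the gold standard
}

untyped_recall = recall(untyped(predicted), untyped(gold))
exact_recall = recall(of_type(predicted, "exactMatch"), of_type(gold, "exactMatch"))
```

In this toy set three of the four gold pairs are found when types are ignored, while only one of the two gold exactMatch relations is predicted with the right type.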
Experimental Results
The proposed logical views for thesaurus terms and ranking functions outperformed simple string matching
Best results for each thesaurus pair
For EUROVOC vs. {ETT, ECLAS, GEMET}:
– Lexical Manifestation logical view and Levenshtein Similarity ranking function (untypedMatch Recall = 66.2%, exactMatch Recall = 82.3%)
For EUROVOC vs. UNESCO Thesaurus:
– Lexical Network logical view and Conceptual Similarity ranking function (untypedMatch Recall = 73.7%, exactMatch Recall = 80.8%)
Conclusions
We have presented a methodological framework and a specific implementation of schema-based thesaurus mapping
Term logical views and related ranking functions for matching have been proposed and tested.
The Lexical Manifestation logical view and Levenshtein Similarity ranking function produced the best results in most cases.
More complex descriptions (Lexical Contexts, Lexical Networks) suffer from problems of computational tractability
Different feature selection criteria can be tested to reduce:
– the computational complexity
– the variability of the similarity measures
Doerr, M. (2001). Semantic problems of thesaurus mapping. Journal of Digital Information, 1(8).
Liang, A. C. and Sini, M. (2006). Mapping AGROVOC and the Chinese Agricultural Thesaurus: Definitions, tools, procedures. New Review of Hypermedia and Multimedia, 12(1):51–62.