Ontology Learning and Semantic Annotation: a necessary symbiosis
Emiliano Giovannetti, Simone Marchi, Simonetta Montemagni, Roberto Bartolini
ILC-CNR, Pisa, Italy
LREC 2008, Marrakech, Morocco
Dec 28, 2015
The knowledge acquisition paradox
• Technologies in the area of knowledge management and information access are confronted with a typical acquisition paradox:
– access to content requires understanding the linguistic structures representing it in text at a level of considerable detail
– processing linguistic structures at the depth needed for content understanding presupposes that a considerable amount of domain knowledge is already in place
The knowledge acquisition paradox
[Diagram: a corpus, an ontology (as a formal representation of domain knowledge) and semantically annotated text, linked in a circle: advanced linguistic annotation needs an ontology, and ontology learning needs a linguistically annotated corpus.]
Turning a vicious circle into a “virtuous circle”
[Diagram: Dynamic Content Structuring, a cycle between Text (implicit knowledge) and Structured content (explicit knowledge), with two labelled processes: Knowledge Extraction and Linguistic annotation.]
Turning a vicious circle into a “virtuous circle”: a first step
[Diagram: the Dynamic Content Structuring cycle between Text (implicit knowledge) and Structured content (explicit knowledge), with these labels: syntactic parsing; terminology extraction and structuring; ontology.]
Turning a vicious circle into a “virtuous circle”: a first step
[Diagram: the Dynamic Content Structuring cycle between Text (implicit knowledge) and Structured content (explicit knowledge), with these labels: syntactic parsing; domain entity and relation extraction; ontology-driven semantic annotation.]
A case study: semantic annotation of product catalogues
The challenge:
• product descriptions appear as semi-structured texts, also including portions of running text
• product catalogues do not contain continuous and linguistically sound text (typically, nominal descriptions)
• this task requires the combination of different types of evidence and techniques
The system for ontology-based semantic annotation of product catalogues
[Diagram: the system takes an input catalogue and an ontology and relies on a set of NLP modules (Tokenizer, Morpho Analyzer, Chunker, Dependency Parser). It has two components:
– ontology learning component: Product catalogues Terminology Processor
– semantic annotation component: Product catalogues Italian Semantic Annotator, which produces the semantic annotation of product descriptions]
Example of annotated output:
<entity data_id="26"> <name>SANELA</name></entity>
<entity data_id="33"> <part>fodera</part></entity>
<entity data_id="34"> <material>cotone</material></entity>
The Product catalogues Terminology Processor (PTP) for Ontology Learning
• Customised version of T2K (Text-to-Knowledge), a hybrid system combining linguistic technologies and statistical techniques
• Stages: Term Extraction, Semantic Structuring
• Domain terminology (examples): Leg, Door, Top, Frame, Element, Shelf, Part, Sliding door, Cover, Support, Drawer
PTP: semantic structuring – identification of relations
• Horizontal relations: identified on the basis of dynamic, distributionally based similarity measures
• Vertical relations: identified on the basis of head-sharing (e.g. “sliding door” is a kind of “door”)
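The head-sharing criterion for vertical relations can be illustrated with a minimal sketch. It is not the T2K implementation: the function names are hypothetical, terms are assumed to be plain whitespace-separated strings, and the head is assumed to be the first token for Italian terms ("vetro temprato") and the last token for English ones ("sliding door").

```python
def head_of(term: str, head_first: bool = True) -> str:
    """Return the assumed head of a multi-word term (hypothetical helper)."""
    tokens = term.lower().split()
    if not tokens:
        return ""
    # Assumption: Italian terms are head-initial, English terms head-final.
    return tokens[0] if head_first else tokens[-1]


def vertical_relations(terms, head_first=True):
    """Propose (term, "is_a", head) when the head is itself an extracted term."""
    vocabulary = {t.lower() for t in terms}
    relations = []
    for term in sorted(vocabulary):
        head = head_of(term, head_first)
        if head != term and head in vocabulary:
            relations.append((term, "is_a", head))
    return relations


italian = vertical_relations(
    ["vetro", "vetro temprato", "schienale", "schienale regolabile"]
)
english = vertical_relations(
    ["door", "sliding door", "steel", "stainless steel"], head_first=False
)
```

A multi-word term whose head was not extracted as a term on its own receives no parent in this sketch; a real system would need a policy for that case.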
First step: semantic structuring - clustering
• definition of root concepts: colour, material
• definition of sub-concepts, each linked to its root concept by is_a:
– colour: bianco (white), beige, scuro (dark), grigio (grey), blu (blue), rosso (red)
– material: acciaio (steel), pino (pine), betulla (birch), alluminio (aluminium), rovere (oak), plastica (plastic), faggio (beech), vetro (glass)
PTP: the final ontology
[Diagram: three root concepts with their isa hierarchies:
– material: steel (stainless steel), wood (solid wood)
– part: door (sliding door), base
– colour: blue (light blue)
Properties: hasPartColour, hasPartMaterial]
Semantic annotation: the approach (pattern matching + NLP)
• pattern matching: used to isolate individual product descriptions within the textual flow and to identify their basic building blocks
• ontology-driven NLP: for each identified product, the NL description is processed by a battery of NLP tools in charge of identifying relevant entities (e.g. color, material, parts of a given product) and the relations holding between them (e.g. part_of, color_of)
Product catalogues Italian Semantic Annotator (PISA): ontology-driven semantic annotation
[Diagram: PISA takes the input catalogue and the ontology as input and comprises a RegExp Manager and an NLP Manager, the latter connected to the NLP tools.]
Domain entities: product, part, name, id, type, category, material, color, price, height, width, depth, weight, diameter
Relations between identified entities:
– part_of ( product part )
– name_of ( (product | series) name )
– id_of ( product id )
– type_of ( product type )
– category_of ( (product | series | part) category )
– made_of ( (product | series | part) material )
– color_of ( (product | series | part) color )
– price_of ( product price )
– height_of ( product height )
– width_of ( product width )
– depth_of ( product depth )
– weight_of ( product weight )
– diameter_of ( product diameter )
PISA: semantic annotation - pattern matching
([A-Z]{3,}\s)+(.+)?(€[\d,\/\spz]+\.)([\w|\s|\.]+)(Cm\s\d{1,3}.\d{1,3}\.)(\d{3}\.\d{3}\.\d{2})
The capture groups correspond, in order, to: name, type, price, description, dimensions, product id.
The description is to be processed by the NLP manager to extract entities and relations about parts, materials, colours, etc.
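As an illustration of how such a pattern splits a catalogue entry into its building blocks, here is a minimal Python sketch. It reuses the regular expression from the slide with two adaptations that are assumptions of this sketch: named groups are added, and the repeated name group is wrapped so that the whole name is captured. The sample entry is hypothetical (only SANELA, fodera and cotone come from the slides; the type, price, dimensions and product id are invented for the example), and `split_product` is not a PISA function.

```python
import re

# The slide's pattern with named groups added (an adaptation, see above).
PRODUCT_PATTERN = re.compile(
    r"(?P<name>(?:[A-Z]{3,}\s)+)"
    r"(?P<type>.+)?"
    r"(?P<price>€[\d,\/\spz]+\.)"
    r"(?P<description>[\w|\s|\.]+)"
    r"(?P<dimensions>Cm\s\d{1,3}.\d{1,3}\.)"
    r"(?P<product_id>\d{3}\.\d{3}\.\d{2})"
)


def split_product(text):
    """Return the building blocks of one product entry, or None if no match."""
    match = PRODUCT_PATTERN.search(text)
    if match is None:
        return None
    # The optional type group may be None; surrounding whitespace is trimmed.
    return {key: (value or "").strip() for key, value in match.groupdict().items()}


# Hypothetical catalogue entry, invented for this example.
sample = "SANELA Fodera per cuscino €7,99/pz. Fodera in cotone. Cm 50x50.123.456.78"
fields = split_product(sample)
```

In this sketch only `fields["description"]` would be passed on to the linguistic analysis; the other fields are already structured by the pattern itself.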
PISA: ontology for semantic annotation (entity recognition)
[Diagram: ontology fragment with its isa hierarchies:
– material: glass (tempered glass), wood (solid wood)
– part: door (sliding door), base
– product: table
Properties: hasPart, hasPartMaterial]
Chunker output:
[ [ CC: N_C] [ AGR: @FP] [ POTGOV: ANTA#S@FP]]
[ [ CC: P_C] [ AGR: @MS] [ PREP: IN#E] [ POTGOV: VETRO__TEMPRARE|TEMPRATO#S@MS]]
[ [ CC: PUNC_C] [ PUNCTYPE: .#@]]{. }
PISA: ontology for semantic annotation (relation extraction)
“Sedia in plastica con schienale regolabile” (plastic chair with adjustable back)
[Diagram: ontology fragment with the concepts materiale, parte and prodotto, the terms vetro, vetro temprato, plastica, schienale, base, bevel edged plate and sedia linked by isa, and the property hasPart.]
Where to attach “schienale regolabile”: to “sedia” or to “plastica”?
• “sedia” is a kind of “prodotto”
• “plastica” is a kind of “materiale”
• “schienale regolabile” is a kind of “parte”
There is no property linking a Material to a Part, but there is one linking a Product to a Part, so the correct interpretation is that “schienale regolabile” is a part of “sedia”.
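The attachment decision for “schienale regolabile” can be sketched as a lookup over ontology properties. This is a hedged toy version, not the PISA code: the dictionaries `IS_A` and `PROPERTIES` and the function names are hypothetical, and the domain and range given to hasPartMaterial (Part to Material) are an assumption of this sketch.

```python
# Toy ontology fragment taken from the example in the slides.
IS_A = {
    "sedia": "prodotto",
    "plastica": "materiale",
    "schienale": "parte",
    "schienale regolabile": "schienale",
}

# (domain concept, range concept) -> property name.
PROPERTIES = {
    ("prodotto", "parte"): "hasPart",
    # Assumed direction: a part has a material, not the other way round.
    ("parte", "materiale"): "hasPartMaterial",
}


def root_concept(term):
    """Follow isa links upwards until a concept without a parent is reached."""
    seen = set()
    while term in IS_A and term not in seen:
        seen.add(term)
        term = IS_A[term]
    return term


def licensed_attachments(dependent, candidate_heads):
    """Keep only the attachments for which the ontology defines a property."""
    dependent_root = root_concept(dependent)
    result = []
    for head in candidate_heads:
        prop = PROPERTIES.get((root_concept(head), dependent_root))
        if prop is not None:
            result.append((head, prop, dependent))
    return result


attachments = licensed_attachments("schienale regolabile", ["sedia", "plastica"])
```

Because no property has materiale as its domain and parte as its range, “plastica” is discarded and only the attachment to “sedia” via hasPart survives.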
Evaluation of acquired results
• A preliminary evaluation was carried out:
– “task-based” evaluation of the ontology learning component, provided in terms of correctness in supporting semantic annotation
– evaluation of the semantic annotation component: a “gold-standard” reference corpus was created by randomly extracting and manually annotating about 100 IKEA products
Measures:
\[ \mathrm{PRE} = \frac{\mathrm{COR} + 0.5 \cdot \mathrm{PAR}}{\mathrm{ACT}} \qquad \mathrm{REC} = \frac{\mathrm{COR} + 0.5 \cdot \mathrm{PAR}}{\mathrm{POS}} \qquad F = \frac{2 \cdot \mathrm{PRE} \cdot \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}} \]
where COR = number of correct annotations; PAR = number of partially correct annotations; ACT = total number of annotations (correct + incorrect + partially correct); POS = total number of annotations in the gold standard (correct + partially correct + missing).
Sem. annotation                        precision   recall   F-measure
pattern matching                       0.99        0.94     0.96
ontology-driven linguistic analysis    0.89        0.70     0.78
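The three measures can be written out as a short Python sketch: precision and recall count a partially correct annotation as half a correct one, and the F-measure is the harmonic mean of the two. The function names and the annotation counts used below are hypothetical; only the precision and recall values passed to `f_measure` at the end come from the results table.

```python
def precision(correct, partial, actual):
    """(COR + 0.5 * PAR) / ACT, with ACT = annotations produced by the system."""
    return (correct + 0.5 * partial) / actual


def recall(correct, partial, possible):
    """(COR + 0.5 * PAR) / POS, with POS = annotations in the gold standard."""
    return (correct + 0.5 * partial) / possible


def f_measure(pre, rec):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if pre + rec == 0:
        return 0.0
    return 2 * pre * rec / (pre + rec)


# Hypothetical counts, for illustration only.
pre = precision(correct=8, partial=2, actual=10)
rec = recall(correct=8, partial=2, possible=12)
f = f_measure(pre, rec)

# F-measure recomputed from the reported precision and recall values.
f_pattern_matching = round(f_measure(0.99, 0.94), 2)
f_linguistic_analysis = round(f_measure(0.89, 0.70), 2)
```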
Further directions of research
– system portability to other product catalogues:
• “Zanotta” furniture catalogue: a subset of 30 product descriptions was extracted and manually annotated as a “gold-standard” reference
Sem. annotation                        precision   recall   F-measure
pattern matching                       1           0.86     0.92
ontology-driven linguistic analysis    1           0.50     0.66
• product catalogues in other domains
– application of the methodology to other domains and to non-structured (free) corpora
– more steps towards the triggering of the “virtuous circle”:
• next step: exploiting the results obtained from the semantic annotation to enrich the ontology