May 10, 2015
The DERI Reading Group
Ontology-based information extraction: An Overview & Survey
(2010, Wimalasuriya and Dou)
Tobias Wunner, UNLP Group
Copyright 2010 Digital Enterprise Research Institute. All rights reserved, Paul Buitelaar
Definition - Motivation
a) Create content for the Semantic Web
convert existing websites into ontologies
b) Improve quality of existing ontologies
Test criterion: OBIE task
OBIE good => ontology good
Overview
Access to information…
Ontology-based Information Extraction (OBIE):
“A system that processes unstructured or semi-structured natural language text guided by an ontology and presents the output in an ontology.”
Overview
ESWC dogfood OBIE-related topics (new!)
Problem – two scenarios
1. Text only: extract conceptualization and instances
   Conceptualization (extracted from text): County; "building with café and football table" is-a Building
   Instances: Galway, DERI building
   conceptualization can be too specific / generic
   wrong conceptualization
2. Domain ontology & text: extract instances only
   Conceptualization by domain ontology: City, Building (relation: located in)
   Instances: Galway, DERI building
   less generic but more semantically stable
Definition – key characteristics
a) Process unstructured / semi-structured text
b) “guided” by an ontology
c) Present output in ontology
[Diagram: Text source, Information extractor, Ontology; the extractor is guided by the ontology]
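The three characteristics can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the `ONTOLOGY` dictionary, its labels, the `locatedIn` relation and its pattern are hypothetical and not taken from the surveyed systems. The ontology guides the extraction (only its classes and relations are looked for), and the output is expressed as triples over that ontology.

```python
import re

# Hypothetical domain ontology: classes with known labels, plus one relation
# whose domain and range restrict which pairs may be extracted.
ONTOLOGY = {
    "classes": {"City": ["Galway"], "Building": ["DERI building"]},
    "relations": {
        "locatedIn": {"domain": "Building", "range": "City", "pattern": r"located in"}
    },
}


def extract_triples(text, ontology):
    """Return (subject, predicate, object) triples found in the text."""
    triples = []
    found = {}  # class name -> labels of that class mentioned in the text
    for cls, labels in ontology["classes"].items():
        found[cls] = [label for label in labels if label in text]
        triples += [(label, "rdf:type", cls) for label in found[cls]]
    for rel, spec in ontology["relations"].items():
        # Only pairs that respect the relation's domain and range are tried.
        for subj in found[spec["domain"]]:
            for obj in found[spec["range"]]:
                regex = re.escape(subj) + r".*?" + spec["pattern"] + r".*?" + re.escape(obj)
                if re.search(regex, text):
                    triples.append((subj, rel, obj))
    return triples


triples = extract_triples("The DERI building is located in Galway.", ONTOLOGY)
```

With the example sentence, the sketch yields type assertions for both instances and one `locatedIn` triple; a real system would replace the substring lookup with one of the extractor types listed under Methods.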
Definition – ontology learning or population?
Ontology population ⊂ OBIE
“OBIE is Open information extraction”
(Etzioni)
alternative: semantics given by ontology!
extractors can be inside / outside ontology
Methods
Information extractors
1. Linguistic rules
2. Gazetteer lists
3. Classification (classical / structure-aware)
4. Partial parse trees
5. Structured data analyzers
6. Web querying
Linguistic Rules - Methods
Regular expressions
  <COMPANY> .* revenue <Number> <currency>
  “Tesco’s revenue in 2009 was 3.4 billion GBP.”
Extraction ontologies
  combination of ontology and lexicon (Mädche, Embley, Buitelaar)
  manual construction
  High precision
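The rule `<COMPANY> .* revenue <Number> <currency>` could be written as a concrete regular expression roughly as follows. This is a sketch under assumptions: the company alternatives, the scale words and the currency codes are hypothetical stand-ins for whatever the placeholders expand to in a real system.

```python
import re

# Hypothetical expansions of the <COMPANY>, <Number> and <currency> slots.
REVENUE_RULE = re.compile(
    r"(?P<company>Tesco|SAP|Siemens|Vestas)\b.*?\brevenue\b.*?"
    r"(?P<number>\d+(?:\.\d+)?)\s*(?P<scale>million|billion)?\s*"
    r"(?P<currency>GBP|EUR|USD)\b"
)


def extract_revenue(sentence):
    """Return the slots matched by the revenue rule, or None if it does not fire."""
    match = REVENUE_RULE.search(sentence)
    if match is None:
        return None
    return {
        "company": match.group("company"),
        "amount": float(match.group("number")),
        "scale": match.group("scale"),
        "currency": match.group("currency"),
    }


fact = extract_revenue("Tesco’s revenue in 2009 was 3.4 billion GBP.")
```

Because a currency code must follow the number, the year 2009 is skipped and the amount slot is filled with 3.4. This illustrates why hand-written rules tend to be precise but narrow.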
2. Gazetteer lists
Phrases / words instead of patterns
Named-Entity Recognition
Requirements:
1) Specify what is being extracted
2) Specify sources and avoid manual creation
Gazetteer Methods
industry: Semantic Web, Software, Energy, Supermarket, …
Example text:
“The software giant SAP…”
“Tesco a UK supermarket …”
“Siemens energy revenue…”
“… wind energy company Vestas”
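A gazetteer lookup for the industry example might look like the sketch below. The function name and the case-insensitive whole-word matching are assumptions made for illustration; the entries are the ones shown on the slide.

```python
import re

# Gazetteer entries for the concept "industry", as listed on the slide.
INDUSTRY_GAZETTEER = ["Semantic Web", "Software", "Energy", "Supermarket"]


def find_mentions(text, gazetteer):
    """Return (start offset, surface form, gazetteer entry) for each match."""
    mentions = []
    for entry in gazetteer:
        pattern = r"\b" + re.escape(entry) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            mentions.append((match.start(), match.group(0), entry))
    return sorted(mentions)


mentions = find_mentions("The software giant SAP", INDUSTRY_GAZETTEER)
```

The lookup specifies what is being extracted (industry mentions) through the list itself, which is why the second requirement, obtaining such lists from existing sources rather than by hand, matters in practice.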
3. Classification techniques
Break down the IE task into a set of binary tasks
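One common way to obtain binary tasks is a one-vs-rest decomposition, sketched below. The slide does not say which decomposition the surveyed systems use, so this choice and the function name are assumptions.

```python
def one_vs_rest_labels(labels):
    """Turn one multi-class labelling into one binary labelling per class."""
    classes = sorted(set(labels))
    return {cls: [1 if label == cls else 0 for label in labels] for cls in classes}


# Hypothetical gold labels for four text mentions.
binary_tasks = one_vs_rest_labels(["City", "Country", "Industry", "City"])
```

Each resulting list can be used to train a separate binary classifier that answers "is this mention an instance of the class or not".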
Classification Methods
Classical: features (POS, semantic tag) → Classifier → classes c1, c2, …, cn
[Diagram: taxonomy with Location (Country, City) and Industry (SW, Energy); instances Galway, Germany, Ireland, Munich, Claddagh, DERI, Siemens, GE, CITEC, Tesco]
Classical: misclassification does not consider structure! (equal cost 1/6)
Structure aware: W1,6 = 3
Classifier should consider taxonomy structure!
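The difference between the two cost models can be sketched as follows. This is an illustration under assumptions: the root node `Thing` is added here to connect the two branches, and path length in the taxonomy is only one possible structure-aware weight, not necessarily the definition behind W1,6 on the slide.

```python
# Taxonomy from the diagram, stored as child -> parent.
# "Thing" is a hypothetical root added to join the two branches.
PARENT = {
    "Location": "Thing",
    "Industry": "Thing",
    "Country": "Location",
    "City": "Location",
    "SW": "Industry",
    "Energy": "Industry",
}


def ancestors(cls):
    """Return the path from a class up to the root, starting with the class."""
    path = [cls]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def classical_cost(true_cls, predicted_cls):
    """Flat cost: every misclassification is equally bad."""
    return 0 if true_cls == predicted_cls else 1


def tree_distance(true_cls, predicted_cls):
    """Structure-aware cost: number of taxonomy edges between the classes."""
    steps_up = {node: i for i, node in enumerate(ancestors(predicted_cls))}
    for i, node in enumerate(ancestors(true_cls)):
        if node in steps_up:
            return i + steps_up[node]
    raise ValueError("classes are not in the same taxonomy")


near_miss = tree_distance("City", "Country")
far_miss = tree_distance("City", "Energy")
```

Under the flat cost both errors count the same, whereas the tree distance penalises confusing a City with an Energy company more than confusing it with a Country.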
Other methods
4. Partial parse trees: TACITUS, SMES, LTAG
5. Analyze structured data: Wikipedia Infoboxes
6. Web querying: C-PANKOW
“Towards the self-annotating web”
Technologies used in implementation
Shallow NLP (GATE, sProUT, StanfordNLP)
  POS, sentence splitting, regular expression
Semantic lexicons (WordNet, GermaNet)
  synonym, meronym, hypernym
Semantic Annotation (OCAT, iDocument, PIMO)
Missing
  Terminological tools (UMLS, bio terminologies)
  Thesauri, translation memory
Data sets & evaluation
Data sets (corpora)
1) Message Understanding Conference (MUC-7)
2) Automatic Content Extraction (ACE)
=> more on classical IR, IE, NLP tracks
=> no data set with given semantics (ontology)
Evaluation
Precision & recall
Only used for population task
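For the population task, precision and recall can be computed over sets of extracted instance assertions. The sketch below assumes exact-match scoring of (instance, class) pairs; the example pairs are made up for illustration and are not results from any data set.

```python
def precision_recall(extracted, gold):
    """Exact-match precision and recall over (instance, class) pairs."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical system output and gold standard.
extracted = {("Galway", "City"), ("DERI building", "Building"), ("Tesco", "City")}
gold = {("Galway", "City"), ("DERI building", "Building"),
        ("Munich", "City"), ("Germany", "Country")}
precision, recall = precision_recall(extracted, gold)
```

Here two of three extracted pairs are correct (precision 2/3) and two of four gold pairs are found (recall 1/2).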
Recent Open IE argument
Con: Weikum, From Information to Knowledge - Harvest Web Resources for IE [3]
  Disambiguation: NL relations are not well defined (well defined arguments)
Pro: Weld, Using Wiki to Bootstrap Open IE [4]
  Relation targeted: learn extractor per relation -> lower recall
  Structural targeted: general extraction engine -> lower precision
Conclusion and Outlook
No established / agreed methods yet
Is OBIE also ontology learning?
Data sets
Methods for best extractors
Semantic Web contribution? e.g. Gazetteers from DBpedia
Cross-lingual OBIE -> CLOBIE
References
[1] Wimalasuriya, Dou, Ontology-based Information Extraction: An Introduction and a Survey of Current Approaches, Journal of Information Science, June 2010
[2] Buitelaar et al., Towards linguistically grounded ontologies, ESWC, Springer, 200
[3] Weikum et al., From Information to Knowledge – Harvesting Entities and Relationships from Web Sources, Principles of Database Systems (PODS), 2010
[4] Weld et al., Using Wikipedia to bootstrap open information extraction, SIGMOD Record, 2008