Ontology-based information extraction in the DERI Reading Group

The DERI Reading Group

Ontology-based information extraction: An Overview & Survey

(2010, Wimalasuriya and Dou)

Tobias Wunner, UNLP Group

Copyright 2010 Digital Enterprise Research Institute. All rights reserved, Paul Buitelaar

Definition - Motivation

a) Create content for the Semantic Web

convert existing websites into ontologies

b) Improve quality of existing ontologies

Test criterion: OBIE task

OBIE good => ontology good

Overview

Access to information…

Ontology-based Information Extraction (OBIE):

“A system that processes unstructured or semi-structured natural language text guided by an ontology and presents the output in an ontology.”

Overview

ESWC dogfood OBIE-related topics

Problem – two scenarios

1. Text only: Extract conceptualization and instances

[Diagram: text (T) -> 1. conceptualization: "County building with café and football table" is-a "Building"; 2. instances: "Galway DERI building"]

conceptualization can be too specific / generic

wrong conceptualization

2. Domain ontology & text: extract instances only

[Diagram: text (T) -> conceptualization by domain ontology: "Building" located in "City"; 2. instances]

less generic but more semantically stable
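
The second scenario (conceptualization fixed by a domain ontology, only instances extracted from text) could be sketched roughly as follows. This is a minimal illustration, not a method from the survey: the ONTOLOGY dictionary, the LOCATED_IN pattern, the populate function and the example sentence are all hypothetical.

```python
import re

# Hypothetical domain ontology: two classes and one property,
# mirroring the "Building located in City" diagram.
ONTOLOGY = {
    "classes": {"Building", "City"},
    "properties": {"locatedIn": ("Building", "City")},  # property -> (domain, range)
}

# Hypothetical lexicalisation of the locatedIn property. An optional
# leading article is skipped so it does not end up in the instance name.
LOCATED_IN = re.compile(
    r"(?:The |the )?(?P<subj>(?:[A-Z]\w* )+building) "
    r"(?:is )?located in (?P<obj>[A-Z]\w+)"
)

def populate(text):
    """Return instance triples; classes and property come from the ontology."""
    domain, range_ = ONTOLOGY["properties"]["locatedIn"]
    triples = []
    for m in LOCATED_IN.finditer(text):
        subj, obj = m.group("subj"), m.group("obj")
        triples.append((subj, "rdf:type", domain))
        triples.append((obj, "rdf:type", range_))
        triples.append((subj, "locatedIn", obj))
    return triples

triples = populate("The DERI building is located in Galway.")
```

The extractor never invents a class: it can only create instances of what the ontology already defines, which is the sense in which this scenario is less generic but more stable.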

Definition – key characteristics

a) Process unstructured / semi-structured text

b) “guided” by an ontology

c) Present output in ontology

[Diagram: TextSource -> InformationExtractor -> Ontology; the InformationExtractor is guided by the Ontology]

Definition – ontology learning or population?

Ontology population ⊂ OBIE

“OBIE is Open information extraction”

(Etzioni)

alternative: semantics given by ontology!

extractors can be inside / outside ontology

Methods

Information extractors

1. Linguistic rules

2. Gazetteer lists

3. Classification (classical / structure-aware)

4. Partial parse trees

5. Structured data analyzers

6. Web querying

Linguistic Rules - Methods

Regular expressions: <COMPANY> .* revenue <Number> <currency>

“Tesco’s revenue in 2009 was 3.4 billion GBP.”
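
As a rough sketch, the rule above can be written as an ordinary regular expression. The COMPANIES and CURRENCIES lists, the RULE pattern and the extract_revenue function are assumptions made for illustration; the slide does not say how the <COMPANY>, <Number> and <currency> slots are filled.

```python
import re

# Hypothetical fillers for the <COMPANY> and <currency> slots.
COMPANIES = ["Tesco", "Siemens", "SAP", "Vestas"]
CURRENCIES = ["GBP", "EUR", "USD"]

# <COMPANY> .* revenue <Number> <currency>
RULE = re.compile(
    r"(?P<company>" + "|".join(map(re.escape, COMPANIES)) + r")"
    r".*?\brevenue\b.*?"
    r"(?P<number>\d+(?:\.\d+)?(?:\s+(?:million|billion))?)\s+"
    r"(?P<currency>" + "|".join(CURRENCIES) + r")"
)

def extract_revenue(sentence):
    """Return the matched slots as a dict, or None if the rule does not fire."""
    m = RULE.search(sentence)
    return m.groupdict() if m else None

result = extract_revenue("Tesco's revenue in 2009 was 3.4 billion GBP.")
```

The year 2009 is skipped because the number slot must be followed directly by a currency, which shows why hand-written rules of this kind tend to be precise but narrow.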

Extraction ontologies: combination of ontology and lexicon (Mädche, Embley, Buitelaar)

manual construction

high precision

2. Gazetteer lists

Phrases / words instead of patterns

Named-Entity Recognition

Requirements:

1) Specify what is being extracted

2) Specify sources and avoid manual creation

Gazetteer Methods

[Diagram: gazetteer list for "industry": Semantic Web, Software, Energy, Supermarket, …]

The software giant SAP …

Tesco a UK supermarket …

Siemens energy revenue …

… wind energy company Vestas
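
A gazetteer lookup of this kind might look like the sketch below. The INDUSTRY_GAZETTEER mapping, the class names on its right-hand side and the annotate function are hypothetical; a real gazetteer would be derived from an external source rather than typed in by hand.

```python
import re

# Hypothetical gazetteer: surface phrase -> ontology class under "industry".
INDUSTRY_GAZETTEER = {
    "semantic web": "SemanticWeb",
    "software": "Software",
    "energy": "Energy",
    "supermarket": "Supermarket",
}

def annotate(text, gazetteer):
    """Return (matched phrase, class, start offset) for every whole-word hit."""
    hits = []
    for phrase, cls in gazetteer.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.group(0), cls, m.start()))
    # Report hits in reading order.
    return sorted(hits, key=lambda h: h[2])

hits = annotate(
    "The software giant SAP and the wind energy company Vestas",
    INDUSTRY_GAZETTEER,
)
```

Unlike a linguistic rule, the lookup matches fixed phrases only; it says which industry is mentioned but not which company it belongs to.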

3. Classification techniques

Break down the IE task into a set of binary classification tasks

Classification Methods

[Diagram: features (pos, sem, Tag) -> Classifier -> classes c1, c2, .., cn]

Classical

[Diagram: instances Galway, Germany, DERI, Siemens, GE, Ireland, Munich, CITEC, Tesco, Claddagh assigned to flat classes]

misclassification does not consider structure! (equal cost 1/6)

Structure aware

[Diagram: same instances placed in a taxonomy with the nodes Location, Industry, Country, City, SW, Energy; W1,6 = 3]

Classifier should consider taxonomy structure!
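
The difference between the two cost models can be illustrated with a small sketch. The tree shape below (Country and City under Location, SW and Energy under Industry, plus a hypothetical root "Thing") is an assumption read off the node labels, and path length is only one possible structure-aware cost; the weights on the slide may have been defined differently.

```python
# Assumed taxonomy as child -> parent; the root "Thing" is hypothetical.
PARENT = {
    "Location": "Thing", "Industry": "Thing",
    "Country": "Location", "City": "Location",
    "SW": "Industry", "Energy": "Industry",
}

def path_to_root(cls):
    """List of classes from cls up to the root, inclusive."""
    path = [cls]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def flat_cost(predicted, gold):
    """Classical view: every misclassification costs the same."""
    return 0 if predicted == gold else 1

def taxonomy_cost(predicted, gold):
    """Structure-aware view: number of tree edges between the two classes."""
    p, g = path_to_root(predicted), path_to_root(gold)
    common = set(p) & set(g)
    # Steps from each class up to the lowest common ancestor.
    up_p = next(i for i, c in enumerate(p) if c in common)
    up_g = next(i for i, c in enumerate(g) if c in common)
    return up_p + up_g

near_miss = taxonomy_cost("City", "Country")  # siblings under Location
far_miss = taxonomy_cost("City", "Energy")    # different branches
```

Under the flat cost both errors count the same, while the path-based cost penalises confusing a city with an energy company more than confusing a city with a country.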

4. Partial parse trees: TACITUS, SMES, LTAG

5. Analyze structured data: Wikipedia Infoboxes

6. Web querying: C-PANKOW, “Towards the Self-Annotating Web”

Other methods

Technologies used in implementation

Shallow NLP (GATE, SProUT, StanfordNLP)

POS, sentence splitting, regular expressions

Semantic lexicons (WordNet, GermaNet)

synonym, meronym, hypernym

Semantic Annotation (OCAT, iDocument, PIMO)

Missing: Terminological tools (UMLS, bio terminologies)

Thesauri, translation memory

Data sets & evaluation

Data sets (corpora)

1) Message Understanding Conference (MUC-7)

2) Automatic Content Extraction (ACE)

=> more on classical IR, IE, NLP tracks

=> no data set with given semantics (ontology)

Evaluation

Precision & recall

Only used for population task
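
For the population task, precision and recall can be computed over sets of extracted instance assertions, as in this minimal sketch. The gold and extracted sets are invented for illustration and do not come from MUC-7 or ACE.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted assertions against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical (instance, class) assertions.
gold = {
    ("DERI", "Organisation"),
    ("Galway", "City"),
    ("Germany", "Country"),
    ("Siemens", "Company"),
}
extracted = {
    ("DERI", "Organisation"),
    ("Galway", "City"),
    ("Germany", "City"),  # wrong class, counts against precision
}

p, r = precision_recall(extracted, gold)
```

Here two of the three extracted assertions are correct (precision 2/3) and two of the four gold assertions are found (recall 1/2).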

Recent Open IE argument

Con: Weikum, From Information to Knowledge - Harvest Web Resources for IE

Disambiguation: NL relations are not well defined (well defined arguments)

Pro: Weld, Using Wiki to Bootstrap Open IE

Relation targeted: learn extractor per relation -> lower recall

Structural targeted: general extraction engine -> lower precision

Conclusion and Outlook

No established / agreed methods yet

Is OBIE also ontology learning?

Data sets

Methods for best extractors

Semantic Web contribution? e.g. Gazetteers from DBpedia

Cross-lingual OBIE -> CLOBIE

References

[1] Wimalasuriya, Dou, Ontology-based Information Extraction: An Introduction and a Survey of Current Approaches, Journal of Information Science, June 2010

[2] Buitelaar et al., Towards Linguistically Grounded Ontologies, ESWC, Springer, 200

[3] Weikum et al., From Information to Knowledge – Harvesting Entities and Relationships from Web Sources, Principles of Database Systems (PODS), 2010

[4] Weld et al., Using Wikipedia to Bootstrap Open Information Extraction, SIGMOD Record, 2008

http://ix.cs.uoregon.edu/~dou/research/papers/jis09.pdf

http://osm.cs.byu.edu/CS652s09/papers/BuitelaarCimiano...LinguisticallyGroundedOntologies.eswc09.pdf

http://www.mpi-sb.mpg.de/~weikum/pods2010-weikum&theobald.pdf

http://www.cs.washington.edu/homes/weld/papers/weld-sigmod-rec08.pdf
