The DERI Reading Group Ontology-based information extraction: An Overview & Survey (2010, Wimalasuriya and Dou) Tobias Wunner, UNLP Group Copyright 2010 Digital Enterprise Research Institute. All rights

Ontology-based information extraction in the DERI Reading Group

May 10, 2015

Tobias Wunner

The DERI Reading Group (10.11.2010)

http://www.deri.ie/teaching/reading-groups/archive/
Transcript
Page 1: Ontology-based information extraction in the DERI Reading Group

The DERI Reading Group

Ontology-based information extraction: An Overview & Survey

(2010, Wimalasuriya and Dou)

Tobias Wunner, UNLP Group

Copyright 2010 Digital Enterprise Research Institute. All rights reserved, Paul Buitelaar

Page 2: Ontology-based information extraction in the DERI Reading Group

Definition - Motivation

a) Create content for the Semantic Web

convert existing websites into ontologies

b) Improve quality of existing ontologies

Test criterion: OBIE task

OBIE good => ontology good

Page 3: Ontology-based information extraction in the DERI Reading Group

Overview

Access to information…

Page 4: Ontology-based information extraction in the DERI Reading Group

Overview

Access to information…

Ontology-based Information Extraction (OBIE):

“A system that processes unstructured or semi-structured natural language text guided by an ontology and presents the output in an ontology.”

Page 5: Ontology-based information extraction in the DERI Reading Group

Overview

ESWC dogfood OBIE-related topics

New!

Page 6: Ontology-based information extraction in the DERI Reading Group

Problem – two scenarios

1. Text only: Extract conceptualization and instances

[Diagram: text T → 1. conceptualization (County; “building with café and football table” is-a Building) → 2. instances (Galway, DERI building)]

Page 7: Ontology-based information extraction in the DERI Reading Group

Problem – two scenarios

1. Text only: Extract conceptualization and instances

[Diagram: text T → 1. conceptualization (County; “building with café and football table” is-a Building) → 2. instances (Galway, DERI building)]

conceptualization can be too specific / generic

wrong conceptualization

Page 8: Ontology-based information extraction in the DERI Reading Group

Problem – two scenarios

2. Domain ontology & text: extract instances only

[Diagram: text T; conceptualization by domain ontology (Building located in City) → 2. instances (Galway, DERI building)]

Page 9: Ontology-based information extraction in the DERI Reading Group

Problem – two scenarios

2. Domain ontology & text: extract instances only

[Diagram: text T; conceptualization by domain ontology (Building located in City) → 2. instances (Galway, DERI building)]

less generic but semantically more stable
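
Scenario 2 can be sketched in a few lines: the domain ontology fixes the concepts (City, Building) and the relation (located in), and the extractor only fills in instances. The tiny `ONTOLOGY`, the `LEXICON` of known names and the regular expression are illustrative assumptions, not taken from any surveyed system.

```python
import re

# Hypothetical domain ontology: two concepts and one relation (domain, range).
ONTOLOGY = {
    "classes": {"City", "Building"},
    "relations": {"locatedIn": ("Building", "City")},
}

# Hypothetical lexicon mapping surface forms to ontology classes.
LEXICON = {"Galway": "City", "DERI building": "Building"}


def extract_instances(text):
    """Return (instance, class) pairs for lexicon entries found in the text."""
    return [(name, cls) for name, cls in LEXICON.items() if name in text]


def extract_located_in(text):
    """Return locatedIn triples whose arguments respect domain and range."""
    domain, rng = ONTOLOGY["relations"]["locatedIn"]
    triples = []
    for match in re.finditer(r"(?:[Tt]he )?(.+?) is located in (\w+)", text):
        subj, obj = match.group(1), match.group(2)
        # The ontology decides which argument types are acceptable.
        if LEXICON.get(subj) == domain and LEXICON.get(obj) == rng:
            triples.append((subj, "locatedIn", obj))
    return triples


print(extract_located_in("The DERI building is located in Galway."))
```

Because the concepts come from the ontology, a sentence such as "Galway is located in Ireland." yields no triple here: its subject is a City, not a Building.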

Page 10: Ontology-based information extraction in the DERI Reading Group

Definition – key characteristics

a) Process unstructured / semi-structured text

b) “guided” by an ontology

c) Present output in ontology

[Diagram: Text source → Information extractor → Ontology; the extractor is guided by the ontology]
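
The three characteristics can be illustrated with a minimal sketch: text goes in, an extractor consults the ontology, and the results are written back as instances of ontology classes. The class names mirror the diagram; the cue-word heuristic and the `cues` field are assumptions made only to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # Hypothetical: class name -> cue words that signal an instance of it.
    cues: dict
    # Extraction output is stored here: class name -> set of instance names.
    instances: dict = field(default_factory=dict)

    def add_instance(self, cls, name):
        self.instances.setdefault(cls, set()).add(name)


class InformationExtractor:
    def __init__(self, ontology):
        self.ontology = ontology  # the extractor is "guided by" this ontology

    def process(self, text):
        tokens = text.replace(",", " ").replace(".", " ").split()
        for prev, word in zip(tokens, tokens[1:]):
            for cls, cue_words in self.ontology.cues.items():
                # A capitalised token right after a cue word counts as an instance.
                if prev.lower() in cue_words and word[0].isupper():
                    self.ontology.add_instance(cls, word)
        return self.ontology


onto = Ontology(cues={"Company": {"giant", "company"}, "City": {"city"}})
InformationExtractor(onto).process(
    "The software giant SAP opened an office in the city Galway."
)
print(onto.instances)
```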

Page 11: Ontology-based information extraction in the DERI Reading Group

Definition – ontology learning or population?

Ontology population ⊂ OBIE

“OBIE is Open information extraction”

(Etzioni)

alternative: semantics given by ontology!

extractors can be inside / outside ontology

[Diagram: Text source → Information extractor → Ontology; the extractor is guided by the ontology]

Page 12: Ontology-based information extraction in the DERI Reading Group

Methods

Information extractors

1. Linguistic rules

2. Gazetteer lists

3. Classification (classical / structure-aware)

4. Partial parse trees

5. Structured data analyzers

6. Web querying

Page 13: Ontology-based information extraction in the DERI Reading Group

Linguistic Rules - Methods

Regular expressions: <COMPANY> .* revenue <Number> <currency>

“Tesco’s revenue in 2009 was 3.4 billion GBP.”

Extraction ontologies: combination of ontology and lexicon (Mädche, Embley, Buitelaar)

manual construction

High precision
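
The regular-expression rule on this slide can be sketched as follows. The `COMPANIES` and `CURRENCIES` lists stand in for the <COMPANY> and <currency> slots and are hypothetical; a real system would take them from the ontology or a lexicon.

```python
import re

COMPANIES = ["Tesco", "Siemens", "SAP"]  # hypothetical <COMPANY> fillers
CURRENCIES = ["GBP", "EUR", "USD"]       # hypothetical <currency> fillers

# <COMPANY> .* revenue <Number> <currency>
REVENUE_RULE = re.compile(
    r"(?P<company>" + "|".join(map(re.escape, COMPANIES)) + r")"
    r".*?revenue.*?"
    r"(?P<number>\d+(?:\.\d+)?(?:\s(?:million|billion))?)\s"
    r"(?P<currency>" + "|".join(CURRENCIES) + r")"
)


def extract_revenue(sentence):
    """Return the matched slots as a dict, or None if the rule does not fire."""
    match = REVENUE_RULE.search(sentence)
    return match.groupdict() if match else None


print(extract_revenue("Tesco’s revenue in 2009 was 3.4 billion GBP."))
# {'company': 'Tesco', 'number': '3.4 billion', 'currency': 'GBP'}
```

The year 2009 is skipped because it is not followed by a currency, which is the kind of precision a hand-written rule buys at the cost of manual construction.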

Page 14: Ontology-based information extraction in the DERI Reading Group

Gazetteer Methods

2. Gazetteer lists

Phrases / words instead of patterns

Named-Entity Recognition

Requirements:

1) Specify what is being extracted

2) Specify sources and avoid manual creation

Example gazetteer (industry): Semantic Web, Software, Energy, Supermarket, …

The software giant SAP …

Tesco, a UK supermarket …

Siemens energy revenue …

… wind energy company Vestas
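
A gazetteer lookup for the industry example might look like the sketch below. The `GAZETTEER` entries are hand-written here purely for illustration; requirement 2 on the slide suggests they would normally be derived from an existing source rather than typed in.

```python
# Hypothetical gazetteer: surface form -> industry concept in the ontology.
GAZETTEER = {
    "SAP": "Software",
    "Tesco": "Supermarket",
    "Siemens": "Energy",
    "Vestas": "Energy",
}


def annotate(text):
    """Return (entity, industry) pairs for gazetteer entries found as whole tokens."""
    tokens = [token.strip(".,;:!?…") for token in text.split()]
    return [(token, GAZETTEER[token]) for token in tokens if token in GAZETTEER]


print(annotate("The software giant SAP …"))
print(annotate("… wind energy company Vestas"))
```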

Page 15: Ontology-based information extraction in the DERI Reading Group

Classification Methods

3. Classification techniques

Break down the IE task into a set of binary tasks

[Diagram: features (pos, sem, Tag) → Classifier → classes c1, c2, …, cn]
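
Breaking the task into binary decisions can be sketched as a one-vs-rest setup: one scorer per class, each answering "is it this class or not?". The scoring rule (difference of feature frequencies), the feature names and the training data are all assumptions chosen for brevity; the slide does not prescribe a particular classifier.

```python
from collections import defaultdict

# Hypothetical training data: (feature set, class label).
TRAIN = [
    ({"pos=NNP", "prev=in"}, "City"),
    ({"pos=NNP", "prev=in", "suffix=way"}, "City"),
    ({"pos=NNP", "next=revenue"}, "Company"),
    ({"pos=NNP", "prev=giant"}, "Company"),
]


def train_binary(examples, target):
    """One binary task: weight = freq(feature | target) - freq(feature | rest)."""
    pos = [feats for feats, label in examples if label == target]
    neg = [feats for feats, label in examples if label != target]
    weights = defaultdict(float)
    for feats in pos:
        for f in feats:
            weights[f] += 1.0 / len(pos)
    for feats in neg:
        for f in feats:
            weights[f] -= 1.0 / len(neg)
    return dict(weights)


def train_one_vs_rest(examples):
    """Train one binary scorer per class label."""
    labels = sorted({label for _, label in examples})
    return {label: train_binary(examples, label) for label in labels}


def classify(models, feats):
    """Pick the class whose binary scorer gives the highest score."""
    scores = {
        label: sum(weights.get(f, 0.0) for f in feats)
        for label, weights in models.items()
    }
    return max(scores, key=scores.get)


models = train_one_vs_rest(TRAIN)
print(classify(models, {"pos=NNP", "prev=in"}))
```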

Page 16: Ontology-based information extraction in the DERI Reading Group

Classification Methods

Classical

[Diagram: taxonomy with Location (Country, City) and Industry (SW, Energy); instances: Galway, Germany, DERI, Siemens, GE, Ireland, Munich, CITEC, Tesco, Claddagh]

Misclassification does not consider structure! (equal cost 1/6)

Page 17: Ontology-based information extraction in the DERI Reading Group

Classification Methods

Structure aware

W1,6 = 3

[Diagram: instances Galway, Germany, Siemens, GE, Ireland, Munich, CITEC, Tesco, Claddagh, DERI]

Classifier should consider taxonomy structure!
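
The difference between a flat and a structure-aware cost can be sketched on the Location / Industry taxonomy from the previous slide. The slide does not say how its weights (such as W1,6 = 3) are derived; the sketch assumes, purely for illustration, that the cost is the number of taxonomy edges between the predicted and the true class. The root `Thing` is hypothetical.

```python
# Child -> parent links; "Thing" is a hypothetical root added for the sketch.
PARENT = {
    "Location": "Thing",
    "Industry": "Thing",
    "Country": "Location",
    "City": "Location",
    "SW": "Industry",
    "Energy": "Industry",
}


def ancestors(cls):
    """Return the path from a class up to the root, starting with the class itself."""
    path = [cls]
    while cls in PARENT:
        cls = PARENT[cls]
        path.append(cls)
    return path


def flat_cost(predicted, actual):
    """Classical view: every misclassification costs the same."""
    return 0 if predicted == actual else 1


def tree_distance(predicted, actual):
    """Structure-aware view (assumed): number of edges between the two classes."""
    path_p, path_a = ancestors(predicted), ancestors(actual)
    common = next(c for c in path_p if c in path_a)  # lowest common ancestor
    return path_p.index(common) + path_a.index(common)


print(flat_cost("City", "Country"), tree_distance("City", "Country"))  # 1 2
print(flat_cost("City", "Energy"), tree_distance("City", "Energy"))    # 1 4
```

Under this assumed cost, confusing a City with a Country is a smaller error than confusing it with an Energy company, which the flat cost cannot express.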

Page 18: Ontology-based information extraction in the DERI Reading Group

Other methods

4. Partial parse trees: TACITUS, SMES, LTAG

5. Analyze structured data: Wikipedia Infoboxes

6. Web querying: C-PANKOW; “Towards the self annotating web”
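
As an illustration of method 5, the sketch below turns the `| key = value` lines of a simple infobox into attribute-value pairs. The `SAMPLE` wikitext is made up for the example, and the parser deliberately ignores nested templates and multi-line values, which real infoboxes often contain.

```python
import re

# Made-up wikitext for illustration only.
SAMPLE = """{{Infobox company
| name = Tesco
| industry = [[Retail]]
| headquarters = [[Welwyn Garden City|Welwyn]], England
}}"""


def parse_infobox(wikitext):
    """Parse '| key = value' lines of a simple, non-nested infobox into a dict."""
    fields = {}
    for line in wikitext.splitlines():
        match = re.match(r"\s*\|\s*([^=|]+?)\s*=\s*(.*?)\s*$", line)
        if match:
            # Replace [[target]] and [[target|label]] with the visible text.
            value = re.sub(
                r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", match.group(2)
            )
            fields[match.group(1)] = value
    return fields


print(parse_infobox(SAMPLE))
```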

Page 19: Ontology-based information extraction in the DERI Reading Group

Technologies used in implementation

Shallow NLP (GATE, SProUT, StanfordNLP): POS, sentence splitting, regular expression

Semantic lexicons (WordNet, GermaNet): synonym, meronym, hypernym

Semantic Annotation (OCAT, iDocument, PIMO)

Missing: terminological tools (UMLS, bio terminologies), thesauri, translation memory

Page 20: Ontology-based information extraction in the DERI Reading Group

Data sets & evaluation

Data sets (corpora)

1) Message Understanding Conference (MUC-7)

2) Automatic Content Extraction (ACE)

=> more on classical IR, IE, NLP tracks

=> no data set with given semantics (ontology)

Evaluation

Precision & recall

Only used for population task
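
For the population task, precision and recall compare the set of extracted instances with a gold standard. A minimal sketch, with made-up `gold` and `extracted` sets:

```python
def precision_recall(extracted, gold):
    """Precision = correct / extracted, recall = correct / gold."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Made-up (instance, class) pairs for illustration.
gold = {("Galway", "City"), ("DERI building", "Building"),
        ("Tesco", "Company"), ("SAP", "Company")}
extracted = {("Galway", "City"), ("Tesco", "Company"), ("DERI", "City")}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```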

Page 21: Ontology-based information extraction in the DERI Reading Group

Recent Open IE argument

Con: Weikum, From Information to Knowledge – Harvest Web Resources for IE

Disambiguation: NL relations are not well defined (well defined arguments)

Pro: Weld, Using Wiki to Bootstrap Open IE

Relation targeted: learn extractor per relation -> lower recall

Structural targeted: general extraction engine -> lower precision

Page 22: Ontology-based information extraction in the DERI Reading Group

Conclusion and Outlook

No established / agreed methods yet

Is OBIE also ontology learning?

Data sets

Methods for best extractors

Semantic Web contribution? e.g. Gazetteers from DBpedia

Cross-lingual OBIE -> CLOBIE

Page 23: Ontology-based information extraction in the DERI Reading Group

References

[1] Wimalasuriya, Dou, Ontology-based Information Extraction: An Introduction and a Survey of Current Approaches, Journal of Information Science, June 2010

[2] Buitelaar et al., Towards linguistically grounded ontologies, ESWC, Springer, 200

[3] Weikum et al., From Information to Knowledge – Harvesting Entities and Relationships from Web Sources, Principles of Database Systems (PODS), 2010

[4] Weld et al., Using Wikipedia to bootstrap open information extraction, SIGMOD Record, 2008