Ontology-Based Information Extraction and Structuring Stephen W. Liddle School of Accountancy and Information Systems Brigham Young University Douglas M. Campbell, David W. Embley, and Randy D. Smith Research funded in part by Faneuil Research and Novell, Inc. Copyright 1998

Dec 20, 2015


Ontology-Based Information Extraction and Structuring

Stephen W. Liddle†

School of Accountancy and Information Systems

Brigham Young University

Douglas M. Campbell, David W. Embley,‡ and Randy D. Smith

Research funded in part by †Faneuil Research and ‡Novell, Inc.

Copyright 1998


Motivation

Database-style queries are effective
– Find red cars, 1993 or newer, < $5,000
• Select * From Car Where Color = 'red' And Year >= 1993 And Price < 5000

Web is not a database
– Uses keyword search
– Retrieves documents, not records
– Assuming we have a range operator:
• "red" and (1993 to 1998) and (1 to 5000)
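The database-style query above can be tried out directly. The following is a minimal sketch, assuming a hypothetical in-memory `Car` table with made-up rows (none of this data comes from the talk); it runs the slide's query with Python's standard `sqlite3` module.

```python
import sqlite3

# Hypothetical sample rows, invented for illustration only.
SAMPLE_CARS = [
    ("Chevrolet", "red", 1997, 11995),
    ("Ford", "red", 1994, 4500),
    ("Honda", "blue", 1995, 4200),
    ("Dodge", "red", 1990, 1500),
]


def find_red_cars(rows):
    """Return (Make, Year, Price) for red cars, 1993 or newer, under $5,000."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Car (Make TEXT, Color TEXT, Year INTEGER, Price INTEGER)"
    )
    conn.executemany("INSERT INTO Car VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT Make, Year, Price FROM Car "
        "WHERE Color = 'red' AND Year >= 1993 AND Price < 5000"
    ).fetchall()
    conn.close()
    return result


red_cars = find_red_cars(SAMPLE_CARS)
print(red_cars)
```

A keyword search for "red", "1993", and "5000" cannot express the comparisons on Year and Price, which is the gap the rest of the talk addresses.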


Solutions

Web query languages
Wait for XML to emerge
– Interoperation/Standards?
– XML query language?
Wrappers
– Hand-written or semi-automatically generated parsers
– Specific to source site, subject to change


Our Approach

Automatic wrapper generation
Based on application ontology
– Augmented conceptual model
– Defines constants, keywords, their relationships
Best for:
– Narrow ontological breadth
– Data-rich documents


Car-Ad Ontology

Object-Relationship Model + Data Frames

Graphical

[Figure: graphical ontology diagram. Object sets: Car, Year, Make, Model, Mileage, Price, Feature, PhoneNr, Extension. Relationship sets are labeled "has" and "is for", with participation constraints 0..1, 0..*, and 1..*.]

Textual

Car [0:1] has Year [1:*];
Year {regexp[2]: "\d{2} : \b'\d{2}\b", … };
Car [0:1] has Make [1:*];
Make {regexp[10]: "\bchev\b", "\bchevy\b", … };
Car [0:1] has Model [1:*];
Model {…};
Car [0:1] has Mileage [1:*];
Mileage {regexp[8]: "\b[1-9]\d{1,2}k", "[1-9]\d?,\d{3} : [^\$\d][1-9]\d?,\d{3}[^\d]" }
        {context: "\bmiles\b", "\bmi\.", "\bmi\b"};
Car [0:*] has Feature [1:*];
Feature {regexp[20]:
    -- Colors
    "\baqua\s+metallic\b", "\bbeige\b", …
    -- Transmission
    "(5|6)\s*spd\b", "auto : \bauto(\.|,)",
    -- Accessories
    "\broof\s+rack\b", "\bspoiler\b", …
...

(See Figures 2 & 3 of Paper)
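Several data-frame entries have the form "pattern : context". One plausible reading, which is an assumption here and not something the slide states, is that the left side is the constant to extract and the right side is the surrounding text it must appear in. The sketch below uses Python's `re` module to compare the bare Mileage pattern with its context version; the capture group and the sample text are illustrative additions.

```python
import re

# Bare pattern from the Mileage data frame.
MILEAGE_BARE = re.compile(r"[1-9]\d?,\d{3}")

# Context pattern from the same data frame. The parentheses are added here
# (an assumption) to mark which part of the context is the extracted constant.
MILEAGE_IN_CONTEXT = re.compile(r"[^\$\d]([1-9]\d?,\d{3})[^\d]")

# Shortened, hypothetical ad text.
ad = "only 7,000 miles on her. Asking only $11,995."

bare_hits = MILEAGE_BARE.findall(ad)
context_hits = MILEAGE_IN_CONTEXT.findall(ad)

print("bare:   ", bare_hits)
print("context:", context_hits)
```

Under this reading, the context rejects a number that directly follows a dollar sign, so a price is not picked up by this particular Mileage pattern.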


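Fixed Processes

[Diagram: the Application Ontology feeds the Ontology Parser, which produces the Constant/Keyword Matching Rules, the List of Objects, Relationships, and Constraints, and the Database Scheme. The Constant/Keyword Recognizer applies the matching rules to the Unstructured Document and produces the Data-Record Table. The Database-Instance Generator combines the Data-Record Table with the List of Objects, Relationships, and Constraints and the Database Scheme to produce the Populated Database.]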

(See Figure 1 of Paper)


Ontology Parser

[Diagram: Application Ontology → Ontology Parser → Constant/Keyword Matching Rules; List of Objects, Relationships, and Constraints; Database Scheme]

Constant/Keyword Matching Rules:

Make : \bchev\b
…
KEYWORD(Mileage) : \bmiles\b
KEYWORD(Mileage) : \bmi\.
...

Database Scheme:

create table Car ( Car integer, Year varchar(2), … );
create table CarFeature ( Car integer, Feature varchar(10) );
...

List of Objects, Relationships, and Constraints:

Object: Car;
...
Car: Year [0:1];
Car: Make [0:1];
…
CarFeature: Car [0:*] has Feature [1:*];
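The scheme in the example suggests a simple mapping: attributes with at most one value per car become columns of `Car`, while a many-valued attribute such as Feature gets its own table. The sketch below illustrates that mapping only; the function name, the input format, and the rule itself are assumptions inferred from the example, not the actual parser.

```python
def build_scheme(entity, attributes):
    """Generate create-table statements for one entity.

    attributes: list of (name, max_cardinality, sql_type), where
    max_cardinality is "1" (at most one value) or "*" (many values).
    """
    columns = [f"{entity} integer"]
    extra_tables = []
    for name, max_cardinality, sql_type in attributes:
        if max_cardinality == "1":
            # Single-valued attribute: a column in the main table.
            columns.append(f"{name} {sql_type}")
        else:
            # Many-valued attribute: a separate table keyed by the entity.
            extra_tables.append(
                f"create table {entity}{name} "
                f"( {entity} integer, {name} {sql_type} );"
            )
    main_table = f"create table {entity} ( " + ", ".join(columns) + " );"
    return [main_table] + extra_tables


scheme = build_scheme(
    "Car",
    [("Year", "1", "varchar(2)"), ("Feature", "*", "varchar(10)")],
)
for statement in scheme:
    print(statement)
```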


Constant/Keyword Recognizer

[Diagram: Unstructured Document + Constant/Keyword Matching Rules → Constant/Keyword Recognizer → Data-Record Table]

Unstructured Document:

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800

Data-Record Table:

Descriptor/String/Position(start/end)
Year|97|1|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|108|114
PhoneNr|566-3800|146|153
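A recognizer of this kind can be sketched as a loop over regular expressions that records each match with its descriptor and 1-based inclusive start/end positions. The rules below are hypothetical and heavily simplified (the real data frames hold many more patterns), and the positions this sketch reports depend on the exact whitespace of the input string, so they need not equal the numbers on the slide.

```python
import re

# Hypothetical, simplified matching rules: (descriptor, pattern).
RULES = [
    ("Year", r"(?<=')\d{2}\b"),
    ("Make", r"\bchev\b"),
    ("Mileage", r"\b[1-9]\d?,\d{3}\b"),
    ("KEYWORD(Mileage)", r"\bmiles\b"),
    ("Price", r"(?<=\$)[1-9]\d?,\d{3}\b"),
    ("PhoneNr", r"\b\d{3}-\d{4}\b"),
]

AD = (
    "'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. "
    "Previous owner heart broken! Asking only $11,995. #1415. "
    "JERRY SEINER MIDVALE, 566-3800"
)


def recognize(text, rules=RULES):
    """Return rows (descriptor, string, start, end), positions 1-based inclusive."""
    table = []
    for descriptor, pattern in rules:
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            table.append(
                (descriptor, match.group(0), match.start() + 1, match.end())
            )
    # Order rows by where they occur in the document.
    return sorted(table, key=lambda row: (row[2], row[3]))


table = recognize(AD)
for row in table:
    print("|".join(str(field) for field in row))
```

Note that the simplified Mileage rule also matches 11,995, so the table holds competing candidates that a later step has to resolve.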


Database-Instance Generator

[Diagram: Data-Record Table + List of Objects, Relationships, and Constraints + Database Scheme → Database-Instance Generator → Populated Database]

insert into Car values(1001, '97', 'CHEV', 'Cavalier', '7,000', '11,995', '566-3800')
insert into CarFeature values(1001, 'Red')
insert into CarFeature values(1001, '5 spd')
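Once one value has been chosen for each single-valued attribute, emitting the statements is mostly string assembly. The sketch below is illustrative only: the function name and argument layout are assumptions, and it builds SQL text by plain formatting to mirror the slide, whereas production code would normally use parameterized queries.

```python
def make_inserts(car_id, single_valued, features):
    """Build insert statements for one car ad.

    single_valued: values for the Car table columns, in column order.
    features: values for the many-valued Feature attribute.
    """
    values = ", ".join(f"'{value}'" for value in single_valued)
    statements = [f"insert into Car values({car_id}, {values})"]
    for feature in features:
        statements.append(
            f"insert into CarFeature values({car_id}, '{feature}')"
        )
    return statements


statements = make_inserts(
    1001,
    ["97", "CHEV", "Cavalier", "7,000", "11,995", "566-3800"],
    ["Red", "5 spd"],
)
for statement in statements:
    print(statement)
```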


Heuristics

Keyword proximity
Subsumed and overlapping constants
Functional relationships
Nonfunctional relationships
First occurrence without constraint violation


Keyword Proximity

Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
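In the table for this slide, both 7,000 and 11,995 are Mileage candidates, and the keyword "miles" sits right after 7,000. A minimal sketch of a proximity rule follows, assuming the distance is measured between the nearest edges of the two spans; the function names and that distance measure are illustrative choices, not the authors' implementation.

```python
def gap(span_a, span_b):
    """Distance between the nearest edges of two (start, end) spans; 0 if they overlap."""
    (a_start, a_end), (b_start, b_end) = span_a, span_b
    if a_end < b_start:
        return b_start - a_end
    if b_end < a_start:
        return a_start - b_end
    return 0


def nearest_to_keyword(candidates, keyword_spans):
    """Pick the (string, start, end) candidate closest to any keyword span."""
    return min(
        candidates,
        key=lambda c: min(gap((c[1], c[2]), k) for k in keyword_spans),
    )


# Values copied from the Data-Record Table on this slide.
mileage_candidates = [("7,000", 37, 41), ("11,995", 101, 106)]
mileage_keywords = [(43, 47)]

chosen = nearest_to_keyword(mileage_candidates, mileage_keywords)
print(chosen)
```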


Subsumed/Overlapping Constants

Make|CHEV|5|8
Make|CHEVROLET|5|13
Model|Cavalier|15|22
Feature|Red|25|27
Feature|5 spd|30|34
Mileage|7,000|42|46
KEYWORD(Mileage)|miles|48|52
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147

'97 CHEVROLET Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
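Here CHEV (positions 5 to 8) lies inside CHEVROLET (positions 5 to 13). One way to handle this, assumed here for illustration, is to discard any match whose span is contained in a strictly longer match; matches with identical spans, such as the two readings of 11,995, are left for other heuristics.

```python
def drop_subsumed(records):
    """Remove rows whose span lies inside a strictly longer span.

    records: list of (descriptor, string, start, end) rows.
    """
    kept = []
    for descriptor, text, start, end in records:
        subsumed = any(
            other_start <= start
            and end <= other_end
            and (other_end - other_start) > (end - start)
            for _, _, other_start, other_end in records
        )
        if not subsumed:
            kept.append((descriptor, text, start, end))
    return kept


# Rows copied from the Data-Record Table on this slide.
records = [
    ("Make", "CHEV", 5, 8),
    ("Make", "CHEVROLET", 5, 13),
    ("Model", "Cavalier", 15, 22),
    ("Price", "11,995", 101, 106),
    ("Mileage", "11,995", 101, 106),
]

filtered = drop_subsumed(records)
for row in filtered:
    print("|".join(str(field) for field in row))
```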


Functional Relationships

Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800


Nonfunctional Relationships

Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800


First Occurrence without Constraint Violation

Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147
PhoneNr|566-3802|149|156

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800, 566-3802
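The functional, nonfunctional, and first-occurrence heuristics can be illustrated together. The sketch below assumes a simplified rule: a descriptor with a maximum cardinality keeps only its earliest occurrences up to that limit, and a descriptor without one keeps every value. Treating PhoneNr as limited to one value per car is an assumption made for this example.

```python
def resolve(records, max_cardinality):
    """Group values by descriptor in document order and apply cardinality limits.

    records: list of (descriptor, string, start, end) rows.
    max_cardinality: dict descriptor -> int limit; a missing entry means "many".
    """
    by_descriptor = {}
    for descriptor, text, _start, _end in sorted(records, key=lambda r: r[2]):
        by_descriptor.setdefault(descriptor, []).append(text)

    result = {}
    for descriptor, values in by_descriptor.items():
        limit = max_cardinality.get(descriptor)
        # No limit: keep every value (nonfunctional relationship).
        # With a limit: keep the first occurrences that do not violate it.
        result[descriptor] = values if limit is None else values[:limit]
    return result


# Rows copied from the Data-Record Table on this slide.
records = [
    ("Feature", "Red", 20, 22),
    ("Feature", "5 spd", 25, 29),
    ("PhoneNr", "566-3800", 140, 147),
    ("PhoneNr", "566-3802", 149, 156),
]

chosen = resolve(records, {"PhoneNr": 1})
print(chosen)
```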


Recall & Precision

$$\text{Recall} = \frac{C}{N}$$

(of facts available, how many did we find?)

$$\text{Precision} = \frac{C}{C + I}$$

(of facts retrieved, how many were relevant?)

N = number of facts in source
C = number of facts declared correctly
I = number of facts declared incorrectly
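As a small worked example of these two ratios, the function below computes both from the three counts. The counts passed in are made up for illustration and are not results from the talk.

```python
def recall_and_precision(n_in_source, n_correct, n_incorrect):
    """Return (recall, precision) from N, C, and I as defined on this slide."""
    recall = n_correct / n_in_source
    precision = n_correct / (n_correct + n_incorrect)
    return recall, precision


# Hypothetical counts: 50 facts in the source, 40 declared correctly,
# 8 declared incorrectly.
recall, precision = recall_and_precision(
    n_in_source=50, n_correct=40, n_incorrect=8
)
print(f"recall = {recall:.2%}, precision = {precision:.2%}")
```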


Experimental Results
Salt Lake Tribune

Tuning set: 100
Test set: 116

             Recall %   Precision %
Year         100        100
Make         97         100
Model        82         100
Mileage      90         100
Price        100        100
PhoneNr      94         100
Extension    50         100
Feature      91         99

(See Table 1 of Paper)


Trouble Spots

Unbounded sets
– missed: MERC, Town Car, 98 Royale
– could use lexicon of makes and models
Unspecified variation in lexical patterns
– missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)
– could adjust lexical patterns
Misidentification of attributes
– classified AUTO in AUTO SALES as automatic transmission
– could adjust exceptions in lexical patterns
Typographical errors
– “Chrystler”, “DODG ENeon”, “I-15566-2441”
– could look for spelling variations and common typos


Contributions

Fully automatic technique for wrapper generation

Uses syntactic, not semantic constant-recognition techniques

Adapts readily to different unstructured document formats

Good precision & recall ratios

Implemented (Perl, C++, Lex/Yacc, Java)


Limitations

Works best for data-rich documents, narrow ontological domains

Ontology creation is still manual
– Domain expert
– Trained in our conceptual model & tools


Future Work

Graphical ontology editor
Improve automatic record-boundary recognition
– Make suitable for broader domains (obituaries, university catalog, etc.)
Improve heuristics
– Use a declarative language
– Employ more of OSM’s rich constraints


Future Work (cont.)

Add operations to data frames
– General constraints
– Canonical representations
– Inferred information
Develop ontology libraries
Finish porting to 100% Java
Incorporate learning/feedback
Ontology-enabled agents


Our Web Site

I have a demo on my laptop
Can download from our Web site
BYU Data Extraction Group

http://osm7.cs.byu.edu/deg

(See Reference 13 of Paper)


Other Domains

Job Listings
Obituaries
University Course Catalogs


Job Listings Results
Los Angeles Times

Tuning set: 50
Test set: 50

           Recall %   Precision %
Degree     100        100
Skill      74         100
Contact    100        100
Email      91         83
Fax        91         100
Voice      79         92

(See Table 2 of Paper)


Obituaries Results
Salt Lake Tribune

Tuning set: ~40
Test set: 38

                     Recall %   Precision %
Deceased Name        100        100
Age                  91         95
Birth Date           100        97
Death Date           94         100
Funeral Date         92         100
Funeral Address      96         96
Funeral Time         97         100
Interment Address    100        100
Viewing              93         96
Viewing Date         70         100
Viewing Address      76         100
Beginning Time       88         100
Ending Time          90         100
Relationship         81         93
Relative Name        88         71

(See our forthcoming ER’98 paper for details.)


Obituaries Results
Arizona Daily Star

Tuning set: ~40
Test set: 90

                     Recall %   Precision %
Deceased Name        100        100
Age                  86         98
Birth Date           96         96
Death Date           84         99
Funeral Date         96         93
Funeral Address      82         82
Funeral Time         92         87
Interment Address    100        100
Viewing              97         100
Viewing Date         100        100
Viewing Address      95         100
Beginning Time       93         96
Ending Time          95         100
Relationship         92         97
Relative Name        95         74

(See our forthcoming ER’98 paper for details.)