Integration of Friendly Data Islands on the Web. Information Extraction.

Jan 03, 2016

Page 1: Integration of Friendly Data Islands on the Web. Information Extraction.

Integration of Friendly Data Islands on the Web.

Information Extraction.

Page 2: Integration of Friendly Data Islands on the Web. Information Extraction.

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Conclusions

Page 4: Integration of Friendly Data Islands on the Web. Information Extraction.

The theory

• A wrapper is a building block that provides an ad-hoc, message-based API to an app
• Wrappers interface apps at one or more layers but, more often than not, they must deal with the user interface or the data layer

[Diagram: application layers: User Interface, Controller, Business Logic, Data Access Layer, Data Layer]

Page 5: Integration of Friendly Data Islands on the Web. Information Extraction.

The problem

The Da Vinci Code

Buy

Dan Brown

Doubleday, 2006

15.95 €

Robert Langdon is a Harvard Professor of Symbology…

Page 6: Integration of Friendly Data Islands on the Web. Information Extraction.

Features of current web documents

• Trillions of documents
• Generated on demand by software applications
• Change continuously
• Require navigation from search forms
• Written in telegraphic language
• Formatted according to HTML templates

Page 7: Integration of Friendly Data Islands on the Web. Information Extraction.

The solution

Page 8: Integration of Friendly Data Islands on the Web. Information Extraction.

Wrapping in a nutshell

• Goals
  – Endow data islands with APIs
  – Ease implementing software applications
• Implications
  – Form filling
  – Navigation
  – Info extraction
  – “Ontologisation”

Page 9: Integration of Friendly Data Islands on the Web. Information Extraction.

Look out!

• Information extraction has driven most research efforts
• Few wrapping systems are complete
• Wrapping is usually mistaken for information extraction
• This talk is about engineering information extraction for enabling information integration

Page 10: Integration of Friendly Data Islands on the Web. Information Extraction.

How IE works

[Diagram: an information extractor takes a document and extraction rules and produces attributes; templating/ontologisation rules then produce templates or ontology instances]

Attributes

The Da Vinci Code
Dan Brown
Doubleday
2006
15.95 €
Robert Langdon…

Templates

Message ID: MUC-0001
Message Template: Court resolution
Date of Event: April 30, 2007
Charge: Terrorist attack
Perpetrator: Salahuddin Amin
Perpetrator: Anthony Garcia
Perpetrator: Waheed Mahmood
Perpetrator: Omar Khyam
…

Ontology instances

P1, A1, B1, populated with the attribute values listed above

Page 11: Integration of Friendly Data Islands on the Web. Information Extraction.

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Side by side comparison
• Conclusions

Page 12: Integration of Friendly Data Islands on the Web. Information Extraction.

Running example

Page 13: Integration of Friendly Data Islands on the Web. Information Extraction.

Running example

<!-- Sample #1 -->
<html><body>
  <b>Book name:</b> Ontologies <br/>
  <b>Reviews:</b> <br/>
  <ul>
    <li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> <b>Text:</b> blah, blah </li>
    <li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> <b>Text:</b> yeah, yeah </li>
  </ul>
</body></html>

<!-- Sample #2 -->
<html><body>
  <b>Book name:</b> SPARQL in action <br/>
  <b>Reviews:</b> <br/>
  <ul>
    <li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> <b>Text:</b> cough, cough </li>
  </ul>
</body></html>

<!-- Sample #3 -->
<html><body>
  <b>Book name:</b> W4F explained <br/>
  <b>Reviews:</b> <br/>
  <ul> </ul>
</body></html>

Page 14: Integration of Friendly Data Islands on the Web. Information Extraction.

Kinds of extraction rules

• Regular expressions
• First-order logic rules
• Pointers into DOM tree
• Context-free grammars
• Tag trees

Page 15: Integration of Friendly Data Islands on the Web. Information Extraction.

Regular expressions

TSIMMIS

[Root, get("page.html"), "#"]
[BookReview, Root, "<body>#</body>"]
[BookName, BookReview, "</b>#<br/>"]
[Tmp, Root, "<ul>#</ul>"]
[Reviews, Tmp, "split(Tmp, '<li>')"]
[ReviewerNames, Reviews, "Reviewer:</b>#<br/>"]
[Ratings, Reviews, "Rating:</b>#<br/>"]
[Text, Reviews, "Text:</b>#<br/>"]

RoadRunner

$FileName
<html><body>
  <b>Book name:</b> $BookTitle <br/>
  <b>Reviews:</b> <br/>
  <ul>
    (( <li> <b>Reviewer:</b> $ReviewerName <br/> <b>Rating:</b> $Rating <br/> <b>Text:</b> $Text </li> )+)?
  </ul>
</body></html>
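To make the regular-expression style concrete, here is a minimal Python sketch that pulls the same fields out of Sample #1 with the standard `re` module. It is not TSIMMIS or RoadRunner syntax; the function name `extract` and the output layout are assumptions made for illustration only.

```python
import re

# Sample #1 of the running example, without the leading comment.
SAMPLE_1 = (
    "<html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> "
    "<b>Text:</b> blah, blah </li> "
    "<li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> "
    "<b>Text:</b> yeah, yeah </li> </ul></body></html>"
)

# One pattern per record; each group plays the role of a "#" or "$Variable" slot.
REVIEW_RE = re.compile(
    r"<li>\s*<b>Reviewer:</b>\s*(.*?)\s*<br/>"
    r"\s*<b>Rating:</b>\s*(.*?)\s*<br/>"
    r"\s*<b>Text:</b>\s*(.*?)\s*</li>",
    re.S,
)
TITLE_RE = re.compile(r"<b>Book name:</b>\s*(.*?)\s*<br/>", re.S)


def extract(page):
    """Return the book title and the list of reviews found in a page."""
    title = TITLE_RE.search(page).group(1)
    reviews = [
        {"reviewer": reviewer, "rating": int(rating), "text": text}
        for reviewer, rating, text in REVIEW_RE.findall(page)
    ]
    return {"title": title, "reviews": reviews}
```

A page with an empty review list, such as Sample #3, simply yields an empty `reviews` list, which mirrors the optional repetition in the RoadRunner pattern.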

Page 16: Integration of Friendly Data Islands on the Web. Information Extraction.

First-order logic rules

SRV

bookTitle(X) :- prev(X, "Book name:</b>"), next(X, "<br/>").
reviewerName(X) :- prev(X, "name:</b>"), next(X, "<br/>"), !bookTitle(X).
rating(X) :- isNatural(X), length(X, 1), inList(X).
text(X) :- prev(X, "Text:</b>"), next(X, "</li>").
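The bookTitle and text rules can be read as checks on what surrounds a text token. The sketch below is one hypothetical Python reading of them, not SRV itself: `tokenize`, `preceded_by`, and `followed_by` are names invented here, and a context such as "Book name:</b>" is interpreted as a label token followed by a closing `</b>` tag.

```python
import re

SAMPLE_1 = (
    "<html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> "
    "<b>Text:</b> blah, blah </li> "
    "<li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> "
    "<b>Text:</b> yeah, yeah </li> </ul></body></html>"
)


def tokenize(page):
    """Split a page into tag tokens and stripped, non-empty text tokens."""
    parts = re.findall(r"<[^>]+>|[^<]+", page)
    return [p.strip() for p in parts if p.strip()]


def preceded_by(tokens, i, label):
    """True if token i comes right after 'label</b>'."""
    return i >= 2 and tokens[i - 2] == label and tokens[i - 1] == "</b>"


def followed_by(tokens, i, tag):
    """True if token i is immediately followed by the given tag."""
    return i + 1 < len(tokens) and tokens[i + 1] == tag


def book_title(tokens):
    # bookTitle(X) :- prev(X, "Book name:</b>"), next(X, "<br/>").
    return [t for i, t in enumerate(tokens)
            if preceded_by(tokens, i, "Book name:") and followed_by(tokens, i, "<br/>")]


def review_text(tokens):
    # text(X) :- prev(X, "Text:</b>"), next(X, "</li>").
    return [t for i, t in enumerate(tokens)
            if preceded_by(tokens, i, "Text:") and followed_by(tokens, i, "</li>")]
```

Each rule is a conjunction of independent tests, which is what lets an induction algorithm add or drop conditions one at a time.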

Page 17: Integration of Friendly Data Islands on the Web. Information Extraction.

Pointer into the DOM tree

WebOQL

select x’.Text, y’.Text, y’’’’.Text, y’’’’’’’.Text
from x, y in browse("page.html")
where x.Text = "Book name:" and y.Text = "Reviewer:"
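The idea of locating a label node and then stepping to the node that follows it can be imitated in Python with the standard `xml.etree.ElementTree` module, since the samples happen to be well-formed XML. This is only an analogue of the query above, not WebOQL semantics; `extract_by_label` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

SAMPLE_1 = (
    "<html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> "
    "<b>Text:</b> blah, blah </li> "
    "<li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> "
    "<b>Text:</b> yeah, yeah </li> </ul></body></html>"
)


def extract_by_label(page):
    """Map each <b> label to the text that follows the label node."""
    root = ET.fromstring(page)
    values = {}
    for b in root.iter("b"):
        label = (b.text or "").strip().rstrip(":")
        # ElementTree stores the text after a closing tag in the 'tail' attribute.
        value = (b.tail or "").strip()
        if value:
            values.setdefault(label, []).append(value)
    return values
```

Labels with no text after them, such as "Reviews:", are skipped because their tail is blank.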

Page 18: Integration of Friendly Data Islands on the Web. Information Extraction.

Context-free grammars

Minerva

Page ::= $FileName <html><body> Review </body></html>
Review ::= <b>Book name:</b> $BookName <br/> <b>Reviews:</b> <br/> <ul> (<li> Reviewer Rating Text </li>)* </ul>
Reviewer ::= <b>Reviewer:</b> $Reviewer <br/>
Rating ::= <b>Rating:</b> $Rating <br/>
Text ::= <b>Text:</b> $Text

Page 19: Integration of Friendly Data Islands on the Web. Information Extraction.

Tag trees

DEPTA

[Figure: tag tree of a review item: a root li node whose children are three b nodes and two br nodes]

Page 20: Integration of Friendly Data Islands on the Web. Information Extraction.

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Conclusions

Page 21: Integration of Friendly Data Islands on the Web. Information Extraction.

Classification

• Hand-crafted
• Supervised induction
• Little-supervised induction
• Unsupervised induction

Page 22: Integration of Friendly Data Islands on the Web. Information Extraction.

Hand-crafted

[Speech bubble: The pattern to extract the title is “…”]

• Techniques
  – Natural intelligence
• Systems
  – TSIMMIS
  – Minerva
  – WebOQL
  – W4F
  – XWrap

Page 23: Integration of Friendly Data Islands on the Web. Information Extraction.

Supervised induction

• Techniques
  – Bottom-up ILP
  – Top-down ILP
  – Ad-hoc algorithms
• Systems
  – SRV
  – RAPIER
  – WIEN
  – WHISK
  – NoDoSE
  – SoftMealy
  – STALKER
  – DEByE

[Workflow: Raw documents → Labelled documents → Automated induction]

Page 24: Integration of Friendly Data Islands on the Web. Information Extraction.

Little-supervised induction

• Techniques
  – String alignment
  – Tree alignment
• Systems
  – OLERA
  – Thresher

[Workflow: Raw document → Record and attribute labelling → Automated induction]

Page 25: Integration of Friendly Data Islands on the Web. Information Extraction.

Unsupervised induction

• Techniques
  – String alignment
  – Tree alignment
  – Statistical roles
• Systems
  – DeLa
  – RoadRunner
  – EXALG
  – DEPTA
  – IEPAD

[Workflow: Raw documents → Automated induction → Pattern interpretation]

Page 26: Integration of Friendly Data Islands on the Web. Information Extraction.

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Conclusions

Page 27: Integration of Friendly Data Islands on the Web. Information Extraction.

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
  – RoadRunner
  – SRV
• Conclusions

Page 28: Integration of Friendly Data Islands on the Web. Information Extraction.

Token matching

<!-- Sample #1 --><html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/>… </li> <li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/>… </li> </ul></body></html>

<!-- Sample #2 --><html><body> <b>Book name:</b> SPARQL in action <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/>… </li> </ul></body></html>

String mismatch

$1

Page 29: Integration of Friendly Data Islands on the Web. Information Extraction.

...and matching…

[Samples #1 and #2, as on the "Token matching" slide]

Tag match

$1<html>

Page 30: Integration of Friendly Data Islands on the Web. Information Extraction.

...and matching…

[Samples #1 and #2, as on the "Token matching" slide]

Tag match

$1<html><body>

Page 31: Integration of Friendly Data Islands on the Web. Information Extraction.

...and matching…

[Samples #1 and #2, as on the "Token matching" slide]

Tag match, string match, …

$1<html><body> <b>Book name:</b>

Page 32: Integration of Friendly Data Islands on the Web. Information Extraction.

...and matching…

[Samples #1 and #2, as on the "Token matching" slide]

String mismatch, tag match

$1<html><body> <b>Book name:</b> $2 <br/>

Page 33: Integration of Friendly Data Islands on the Web. Information Extraction.

...and matching…

[Samples #1 and #2, as on the "Token matching" slide]

$1<html><body> <b>Book name:</b> $2 <br/> <ul> <li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>

Page 34: Integration of Friendly Data Islands on the Web. Information Extraction.

Stop: lists and optionals

[Samples #1 and #2, as on the "Token matching" slide]

Tag mismatch

$1<html><body> <b>Book name:</b> $2 <br/> <ul> <li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>

Page 35: Integration of Friendly Data Islands on the Web. Information Extraction.

Stop: lists and optionals

[Samples #1 and #2, as on the "Token matching" slide]

$1<html><body> <b>Book name:</b> $2 <br/> <ul> <li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>

Page 36: Integration of Friendly Data Islands on the Web. Information Extraction.

Stop: lists and optionals

[Samples #1 and #2, as on the "Token matching" slide]

$1<html><body> <b>Book name:</b> $2 <br/> <ul> (<li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>)+

Page 37: Integration of Friendly Data Islands on the Web. Information Extraction.

…and matching finishes

[Samples #1 and #2, as on the "Token matching" slide]

$1<html><body> <b>Book name:</b> $2 <br/> <ul> (<li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>)+ </ul></body></html>
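The string-mismatch part of this walkthrough can be sketched in a few lines of Python. This is a deliberately reduced illustration, not the RoadRunner algorithm: it only turns differing text tokens into variables and stops at the first structural difference, which is where the real system would start looking for lists and optionals. The page constants, `tokenize`, and `generalise` are assumptions; the pages are cut down to one review each, and variables are numbered from $1 because the file name is not modelled.

```python
import re

PAGE_A = ("<html><body> <b>Book name:</b> Ontologies <br/> <ul> "
          "<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> </li> "
          "</ul></body></html>")
PAGE_B = ("<html><body> <b>Book name:</b> SPARQL in action <br/> <ul> "
          "<li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> </li> "
          "</ul></body></html>")


def tokenize(page):
    """Split a page into tag tokens and stripped, non-empty text tokens."""
    parts = re.findall(r"<[^>]+>|[^<]+", page)
    return [p.strip() for p in parts if p.strip()]


def generalise(wrapper, sample):
    """Align two token lists; replace each string mismatch with a variable."""
    if len(wrapper) != len(sample):
        # Different token counts imply a tag mismatch (list or optional),
        # which this sketch does not attempt to resolve.
        raise ValueError("structural mismatch: lists/optionals not handled")
    result, counter = [], 0
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)                    # tag match or string match
        elif not w.startswith("<") and not s.startswith("<"):
            counter += 1
            result.append(f"${counter}")        # string mismatch -> variable
        else:
            raise ValueError(f"tag mismatch between {w!r} and {s!r}")
    return result
```

Joining the result of `generalise(tokenize(PAGE_A), tokenize(PAGE_B))` with spaces gives a template in which the title, reviewer, and rating have become $1, $2, and $3 while every label and tag is kept.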

Page 38: Integration of Friendly Data Islands on the Web. Information Extraction.

Just union-free grammars!

Page 39: Integration of Friendly Data Islands on the Web. Information Extraction.

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
  – RoadRunner
  – SRV
• Conclusions

Page 40: Integration of Friendly Data Islands on the Web. Information Extraction.

Exercise

• Support predicates: next(x,y), previous(x,y)
• Try to explain isCorD(X)

abcabdabbbcaabda

Page 41: Integration of Friendly Data Islands on the Web. Information Extraction.

Exercise

• Support predicates: next(x,y), previous(x,y)
• Now, try to explain isCorDorE(X)

abcabdabeebbcaabdaee

Page 42: Integration of Friendly Data Islands on the Web. Information Extraction.

Define target predicates

Target predicates

title: #PCDATA.
reviewer: #PCDATA.
rating: #PCDATA.
text: #PCDATA.

Page 43: Integration of Friendly Data Islands on the Web. Information Extraction.

Instantiate target predicates

[Samples #1 to #3, as on the "Running example" slide]

Page 44: Integration of Friendly Data Islands on the Web. Information Extraction.

Instantiate target predicates

Positive samples

title("Ontologies").
title("SPARQL in action").
title("W4F explained").
reviewer("John Doe").
reviewer("Alan Wohl").
reviewer("Dan Smith").
rating("7").
rating("8").
rating("9").
text("blah, blah").
text("yeah, yeah").
text("cough, cough").

Negative samples

!title("Book name:").
!reviewer("Book name:").
!rating("Book name:").
!text("Book name:").
!title("Reviews:").
!reviewer("Reviews:").
!rating("Reviews:").
!text("Reviews:").
!title("Reviewer:").
!reviewer("Reviewer:").
!rating("Reviewer:").
!text("Reviewer:").
!title("Rating:").
!reviewer("Rating:").
!rating("Rating:").

Page 45: Integration of Friendly Data Islands on the Web. Information Extraction.

Define support predicates

Support predicates

prev: #PCDATA, #PCDATA.
next: #PCDATA, #PCDATA.
length: #PCDATA, #PCDATA.
isNatural: #PCDATA.

Page 46: Integration of Friendly Data Islands on the Web. Information Extraction.

Instantiate support predicates

On positive samples

prev("Ontologies", "</b>").
next("Ontologies", "<br/>").
length("Ontologies", 10).
!isNatural("Ontologies").
prev("SPARQL in action", "</b>").
next("SPARQL in action", "<br/>").
length("SPARQL in action", 16).
!isNatural("SPARQL in action").
prev("W4F explained", "</b>").
next("W4F explained", "<br/>").
length("W4F explained", 13).
!isNatural("W4F explained").

On negative samples

prev("Book name:", "<b>").
next("Book name:", "</b>").
length("Book name:", 10).
!isNatural("Book name:").
prev("Reviews:", "<b>").
next("Reviews:", "</b>").
!isNatural("Reviews:").
prev("Reviewer:", "<b>").
next("Reviewer:", "</b>").
!isNatural("Reviewer:").
prev("Rating:", "<b>").
next("Rating:", "</b>").
!isNatural("Rating:").
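Facts like these need not be written by hand; they can be derived from the tokens of each page. The following Python sketch shows one way to do that for Sample #3. The tokenizer, the `support_facts` helper, and the dictionary layout are assumptions for illustration, and `isNatural` is approximated with `str.isdigit`.

```python
import re

SAMPLE_3 = ("<html><body> <b>Book name:</b> W4F explained <br/> "
            "<b>Reviews:</b> <br/> <ul> </ul></body></html>")


def tokenize(page):
    """Split a page into tag tokens and stripped, non-empty text tokens."""
    parts = re.findall(r"<[^>]+>|[^<]+", page)
    return [p.strip() for p in parts if p.strip()]


def support_facts(page):
    """Compute prev, next, length and isNatural for every text token."""
    tokens = tokenize(page)
    facts = {}
    for i, tok in enumerate(tokens):
        if tok.startswith("<"):
            continue  # only text tokens are candidate attribute values
        facts[tok] = {
            "prev": tokens[i - 1] if i > 0 else None,
            "next": tokens[i + 1] if i + 1 < len(tokens) else None,
            "length": len(tok),
            "isNatural": tok.isdigit(),
        }
    return facts
```

A text token that occurs more than once on a page overwrites its earlier entry here; a fuller version would key the facts by token position instead of by string.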

Page 47: Integration of Friendly Data Islands on the Web. Information Extraction.

Top-down induction

title(X) :- . (3, 14)
title(X) :- prev(X, X). (0, 0)
title(X) :- !prev(X, X). (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- !prev(X, Y). (?, ?)
title(X) :- next(X, X). (0, 0)
title(X) :- !next(X, X). (3, 14)
title(X) :- next(X, Y). (3, 14)
title(X) :- !next(X, Y). (?, ?)
title(X) :- length(X, X). (0, 0)
title(X) :- prev(X, "<b>"). (0, 5)
title(X) :- !prev(X, "<b>"). (3, 9)
title(X) :- prev(X, "</b>"). (3, 9)
title(X) :- !prev(X, "</b>"). (0, 5)
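The pairs in parentheses count how many positive and negative examples each candidate rule still covers. The Python sketch below reproduces several of them under one assumption that the slides do not state explicitly: the 14 negatives for title are taken to be the 9 values of the other attributes plus the 5 labels (including "Text:"). The names `PREV`, `TITLES`, and `coverage` are hypothetical.

```python
# prev fact for every text token in the three samples (assumed example set).
PREV = {}
TITLES = {"Ontologies", "SPARQL in action", "W4F explained"}
OTHER_VALUES = {"John Doe", "Alan Wohl", "Dan Smith", "7", "8", "9",
                "blah, blah", "yeah, yeah", "cough, cough"}
LABELS = {"Book name:", "Reviews:", "Reviewer:", "Rating:", "Text:"}

for value in TITLES | OTHER_VALUES:
    PREV[value] = "</b>"   # every attribute value follows a closing </b>
for label in LABELS:
    PREV[label] = "<b>"    # every label follows an opening <b>


def coverage(condition):
    """Return (positives covered, negatives covered) for a rule body."""
    covered = [x for x in PREV if condition(x)]
    positives = sum(1 for x in covered if x in TITLES)
    return positives, len(covered) - positives


empty_body = coverage(lambda x: True)                   # title(X) :- .
prev_close = coverage(lambda x: PREV[x] == "</b>")      # prev(X, "</b>")
prev_open = coverage(lambda x: PREV[x] == "<b>")        # prev(X, "<b>")
not_prev_open = coverage(lambda x: PREV[x] != "<b>")    # !prev(X, "<b>")
```

Under that assumption the four results are (3, 14), (3, 9), (0, 5) and (3, 9), matching the counts listed for the corresponding rules.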

Page 48: Integration of Friendly Data Islands on the Web. Information Extraction.

Rule selection

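$$\mathrm{Gain} = t \cdot \left( \ln\frac{p_1}{p_1 + n_1} - \ln\frac{p_0}{p_0 + n_0} \right)$$

p0 = # positive bindings of R

n0 = # negative bindings of R

p1 = # positive bindings of R&A

n1 = # negative bindings of R&A

t = # positive bindings of both R and R&A

The first logarithm measures the new covering, the second the old covering, and t the combined covering.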
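As a worked instance of this selection criterion, the sketch below computes a FOIL-style gain, t · (ln(p1 / (p1 + n1)) − ln(p0 / (p0 + n0))), for extending the empty rule R with the literal A = prev(X, "</b>"). The natural logarithm is used because the slide shows "ln"; a different base would only rescale the values and would not change which literal ranks highest. The function name and the handling of p1 = 0 are assumptions.

```python
import math


def gain(p0, n0, p1, n1, t):
    """Gain of refining rule R (p0, n0) into R&A (p1, n1).

    t is the number of positive bindings covered by both R and R&A.
    """
    if p1 == 0:
        # A refinement that covers no positives is useless; the logarithm
        # would be undefined, so it is ranked below everything else.
        return float("-inf")
    new_covering = math.log(p1 / (p1 + n1))
    old_covering = math.log(p0 / (p0 + n0))
    return t * (new_covering - old_covering)


# R = "title(X) :- ." covers (3, 14); adding prev(X, "</b>") covers (3, 9).
g_close = gain(p0=3, n0=14, p1=3, n1=9, t=3)
# Adding prev(X, "<b>") instead covers (0, 5): no positives remain.
g_open = gain(p0=3, n0=14, p1=0, n1=5, t=0)
```

Here `g_close` equals 3 · ln(17/12), roughly 1.04, so prev(X, "</b>") is preferred over prev(X, "<b>"), which keeps no positive example at all.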

Page 49: Integration of Friendly Data Islands on the Web. Information Extraction.

Induction goes on…

title(X) :- . (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- prev(X, Y), X = Y. (?, ?)
title(X) :- prev(X, Y), X != Y. (?, ?)
title(X) :- prev(X, Y), prev(X, X). (?, ?)
title(X) :- prev(X, Y), !prev(X, X). (?, ?)
title(X) :- prev(X, Y), prev(X, Z). (?, ?)
title(X) :- prev(X, Y), !prev(X, Z). (?, ?)
title(X) :- prev(X, Y), prev(Y, X). (?, ?)

Page 50: Integration of Friendly Data Islands on the Web. Information Extraction.

…and on…

title(X) :- . (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- prev(X, Y), Y = "</b>". (?, ?)
title(X) :- prev(X, Y), Y = "</b>", prev(X, X). (?, ?)
title(X) :- prev(X, Y), Y = "</b>", !prev(X, X). (?, ?)
title(X) :- prev(X, Y), Y = "</b>", prev(Y, Y). (?, ?)
title(X) :- prev(X, Y), Y = "</b>", !prev(Y, Y). (?, ?)
title(X) :- prev(X, Y), Y = "</b>", prev(X, Z). (?, ?)
title(X) :- prev(X, Y), Y = "</b>", !prev(X, Z). (?, ?)

Page 51: Integration of Friendly Data Islands on the Web. Information Extraction.

…and eventually finishes

title(X) :- . (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- prev(X, Y), Y = "</b>". (?, ?)
title(X) :- prev(X, Y), Y = "</b>", prev(Y, "Book name:"). (3, 0)

Page 52: Integration of Friendly Data Islands on the Web. Information Extraction.

Optimisations

• Intelligent predicates
  – Non-sense atoms
  – Non-sense atom combinations
  – Non-bindable variables
• Instantiated target predicates
• Statistical analysis of constants
• Keep track of non-instantiable predicates

Page 53: Integration of Friendly Data Islands on the Web. Information Extraction.

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Conclusions

Page 54: Integration of Friendly Data Islands on the Web. Information Extraction.

That's quite clear!

• Information extraction enables information integration

Page 55: Integration of Friendly Data Islands on the Web. Information Extraction.

Research challenges

• Information extraction
  – Efficient rule generation
  – Maintaining rules automatically
  – Non-union-free grammars (unsupervised)
• Ontologisation rules
  – Everything is a challenge

Page 56: Integration of Friendly Data Islands on the Web. Information Extraction.

Thanks!

Drop by our web site at http://www.tdg-seville.info