EXTRACTING DATA FROM THE WEB

Georg Gottlob

Oxford University

May 25, 2020

Page 2

Talk Outline

•  Motivation: the need for information extraction

•  Logical foundations of information extraction

•  The Lixto Visual Wrapper

•  The Diadem Project

Page 3

Traditional data-based decision making in enterprises.

[Diagram: Enterprise Data → ETL → Data Warehouse (entrepôt de données) → Analytics → DECISION (e.g. pricing)]

Page 4

Traditional data-based decision making in enterprises.

But often the most relevant data are outside the company, on the Web!

→ Online data intelligence, online market intelligence, automatic web data extraction.

[Diagram: Enterprise Data → ETL → Data Warehouse (entrepôt de données) → Analytics → DECISION (e.g. pricing)]

Page 5

Online Market Intelligence (OMI) (surveillance du marché)

- Electronics Retailer (détaillant d'électronique - composants): market overview, 20 competitors, 200,000 products/prices
- Supermarket Chain: price comparison; must quickly react to special offers (offres spéciales), new products, …
- Internet Travel Agency: gives best-price guarantee, wants to detect "pricing attacks", …
- Road Construction Company: find new public tenders ("appels d'offre")
- Hedge Fund ("fonds de placement"): obtain recent house price changes from real-estate agents' Web pages before the weekly index is published. Anticipating the consumer price index (index des prix à la consommation).
- Governmental/Policy Making …

Page 6

The Web: corporate news pages, airline booking sites, hotel reservation, real estate markets, environmental data, bookmakers, eBay, jobs, retail prices, tenders, blogs, news, quarterly reports in PDF, governmental info, etc. …

[Diagram: Enterprise Data → ETL → Data Warehouse (entrepôt de données) → Analytics → DECISION (e.g. pricing)]

Page 7

The Web: corporate news pages, airline booking sites, hotel reservation, real estate markets, environmental data, bookmakers, eBay, jobs, retail prices, tenders, blogs, news, quarterly reports in PDF, governmental info, etc. …

WEB ETL: automatic web data extraction; data aggregation & integration & cleaning

[Diagram: the Web sources enter through WEB ETL, alongside Enterprise Data → ETL → Data Warehouse (entrepôt de données) → Analytics → DECISION (e.g. pricing)]

Page 8

Marketing & Business Intelligence

[Diagram labels: Oracle 9; data warehouse (entrepôt de données); BI Tool; Business Objects report; Marketing Department; bottleneck (goulet d'étranglement)]

Page 9

The Wall Problem: Make web contents accessible to electronic data processing

[Diagram: WEB (HTML pages, layout) on one side; corporate EDP apps (structured data, databases, XML) on the other]

Page 10

[Diagram: WEB (HTML pages, layout) on one side; corporate EDP apps (structured data, databases, XML) on the other. Label: alienating work (travail aliénant)]

Page 12

Web wrapping

Goal: Make web contents accessible to electronic data processing

Wrappers (adapteur): HTML → select → extract → annotate → XML

[Diagram: WEB (HTML pages, layout) → WRAPPER → corporate EDP apps (structured data, databases, XML)]

Page 13

Record: data hierarchy (enregistrement : hiérarchie de données)

Page 14

Patterns:

Page 15

Different approaches in the past

•  Programming (Java, Perl, WebL, SQL+...)
   - very complicated & boring & expensive
   - testing very difficult
•  Simple screen scrapers ("gratte-écran")
   - no complex data structures extracted
•  Wrapper induction (apprentissage d'adapteurs)
   - requires larger amounts of sample data
   - accuracy not satisfactory in all situations
   - current systems text-based (not tree-based)

Page 17

Modern Solutions

•  Semi-automatic tool (outils)
   - based on solid theory
   - modular knowledge representation
   - easy to use
   - commercial product since 2002
•  Fully automated extraction
   - for specific application domains
   - extracts from 1000s of websites
   - current research

Page 19

Web documents are trees!

HTML: Hypertext Markup Language
XML: Extensible Markup Language

HTML, XML: context-free* languages. Represent a document by its parse tree (arbre syntaxique).

Page 20

HTML Content Extractor

Function f: HTML parse tree → subtrees

The leaves of the subtrees are among the leaves of the original tree.

Page 21

The Essence of Web Wrapping?

Functional view: a wrapper defines functions f

f: Tree → P(Tree), t ↦ T ⊆ subtrees(t)

Equivalent logical view: a wrapper defines monadic predicates P over the nodes (DOM tree, arbre DOM) of each input document.
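
The correspondence between the two views can be sketched in a few lines of Python. This is an illustration only, not code from the talk: `Tree`, `nodes`, and `wrapper_from_predicate` are hypothetical names. A monadic predicate over nodes induces the function f by returning the subtrees rooted at the selected nodes. For brevity the predicate here looks only at the node itself, whereas a wrapper predicate may depend on the node's position in the whole document.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class Tree:
    """A node together with the subtree rooted at it."""
    label: str
    children: List["Tree"] = field(default_factory=list)


def nodes(t: Tree) -> Iterator[Tree]:
    """Enumerate all nodes of t in document order."""
    yield t
    for child in t.children:
        yield from nodes(child)


def wrapper_from_predicate(p: Callable[[Tree], bool]) -> Callable[[Tree], List[Tree]]:
    """Logical view -> functional view: f(t) is the list of subtrees whose root satisfies p."""
    def f(t: Tree) -> List[Tree]:
        return [n for n in nodes(t) if p(n)]
    return f


doc = Tree("html", [Tree("body", [Tree("table", [Tree("tr"), Tree("tr")])])])
select_rows = wrapper_from_predicate(lambda n: n.label == "tr")
rows = select_rows(doc)
```

Every element of `rows` is a subtree of `doc`, which matches the condition T ⊆ subtrees(t).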

Page 22

An HTML page

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html> <body>
<h1>People @ DBAI</h1>
<table border="1" cellpadding="3" cellspacing="1">
<tr> <td>Georg Gottlob</td>
<td>gottlob@…</td>
<td>18420</td>
</tr>
<tr> <td>Christoph Koch</td>
<td>koch@…</td>
<td>18449</td>
</tr>
</table>
</body> </html>

Rendered page:

People @ DBAI
Georg Gottlob    gottlob@…    18420
Christoph Koch   koch@…       18449

Parse tree:

html
  body
    h1: People @ DBAI
    table
      tr
        td: Georg Gottlob
        td: gottlob@…
        td: 18420
      tr
        td: Christoph Koch
        td: koch@…
        td: 18449

Page 23

Predicate employeetable

[Figure: the "People @ DBAI" page (rendered view, HTML source, parse tree) with the nodes satisfying employeetable highlighted.]

Page 24

Predicate employee

[Figure: the "People @ DBAI" page (rendered view, HTML source, parse tree) with the nodes satisfying employee highlighted.]

Page 25

Predicate phone

[Figure: the "People @ DBAI" page (rendered view, HTML source, parse tree) with the nodes satisfying phone highlighted.]

Page 26

Expressiveness Yardstick: MSO (monadic second-order logic)

•  MSO captures exactly the essence of data extraction:
   - define sets of nodes of a document
•  Expressiveness, complexity, and semantics are well understood:
   - MSO over trees: perfect logical semantics
   - MSO over trees: high expressive power (tree automata)
   - MSO over trees: low data complexity
•  Drawbacks:
   - hard to use, no visual specification
   - high query complexity (complexité de requêtes) (→ bad scalability, mauvais passage à l'échelle)

Page 27

MSO on strings and trees

Rich theory:

•  Büchi: MSO = REG over strings (chaînes de caractères)
•  Thatcher and Wright, Rabin: MSO = REG over ranked trees (arbres bornés) = tree automata
•  Brüggemann-Klein/Wood/Murata: MSO = REG over unranked trees
•  Neven & Schwentick: Unranked Query Automata
•  Courcelle: MSO in LinTime on tree-like structures (treewidth <= k, data complexity)

Page 28

Ordered Trees as finite structures

[Figure: the parse tree of the "People @ DBAI" page, shown twice; the second copy is annotated with the relations listed below.]

Binary relations: firstchild, nextsibling
Unary relations: label_h1(), label_td(), root(), leaf()
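
To make the encoding concrete, the following Python sketch derives these relations from a simplified copy of the example page using the standard-library `html.parser`. It rests on assumptions that are not in the slides: `TreeBuilder` and `as_structure` are hypothetical names, nodes are numbered in document order, and text nodes are ignored, so `leaf` holds for elements without element children rather than for the text leaves drawn on the slide.

```python
from html.parser import HTMLParser

# Simplified copy of the example page (e-mail addresses elided as on the slide).
PAGE = """<html> <body>
<h1>People @ DBAI</h1>
<table border="1" cellpadding="3" cellspacing="1">
<tr> <td>Georg Gottlob</td> <td>gottlob@...</td> <td>18420</td> </tr>
<tr> <td>Christoph Koch</td> <td>koch@...</td> <td>18449</td> </tr>
</table>
</body> </html>"""


class TreeBuilder(HTMLParser):
    """Collects element nodes only; text nodes are ignored in this sketch."""

    def __init__(self):
        super().__init__()
        self.label = {}      # node id -> tag name
        self.children = {}   # node id -> ordered list of child ids
        self.stack = []      # ids of the currently open elements
        self.root = None

    def handle_starttag(self, tag, attrs):
        node = len(self.label)           # ids follow document order
        self.label[node] = tag
        self.children[node] = []
        if self.stack:
            self.children[self.stack[-1]].append(node)
        else:
            self.root = node
        self.stack.append(node)

    def handle_endtag(self, tag):
        if self.stack and self.label[self.stack[-1]] == tag:
            self.stack.pop()


def as_structure(page):
    """Return the tree as a finite relational structure."""
    builder = TreeBuilder()
    builder.feed(page)
    builder.close()
    kids = builder.children
    return {
        "root": {builder.root},
        "leaf": {n for n, cs in kids.items() if not cs},
        "label": builder.label,
        "firstchild": {(p, cs[0]) for p, cs in kids.items() if cs},
        "nextsibling": {(a, b) for cs in kids.values() for a, b in zip(cs, cs[1:])},
    }


structure = as_structure(PAGE)
```

Both `firstchild` and `nextsibling` are functional: each node has at most one first child and at most one next sibling, so the structure has size linear in the document.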

Page 29

MSO over Trees

Extract from a binary tree all roots of subtrees with an odd number of leaves:

∃S ∀x [ S(u) & (leaf(x) → S(x)) & ∀x,y,z ((firstchild(x,y) & nextsibling(y,z)) → (S(x) ↔ ¬(S(y) ↔ S(z)))) ]

Tree automaton: [Figure: run of a tree automaton on an example tree. Legend: auxiliary state; roots even subtree; roots odd subtree.]
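
The same query can be computed by a single bottom-up pass, which mirrors a two-state tree automaton (states "odd" and "even"). The sketch below is an illustration under stated assumptions, not the automaton from the slide: `Node` and `odd_leaf_roots` are hypothetical names, and every inner node is assumed to have exactly two children.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def odd_leaf_roots(root: Node) -> List[str]:
    """Names of all nodes whose subtree has an odd number of leaves (post-order)."""
    selected: List[str] = []

    def run(node: Node) -> bool:
        # The returned state is True for "odd" and False for "even".
        if node.left is None and node.right is None:
            odd = True                                   # leaf(x) -> S(x)
        else:
            odd = run(node.left) != run(node.right)      # S(x) <-> not (S(y) <-> S(z))
        if odd:
            selected.append(node.name)
        return odd

    run(root)
    return selected


# r has children a and b; a has the two leaves c and d.
tree = Node("r", Node("a", Node("c"), Node("d")), Node("b"))
result = odd_leaf_roots(tree)
```

Here `a` roots a subtree with two leaves and is not selected, while `r` roots a subtree with three leaves and is.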

Page 46

[Diagram: four levels (Logic heaven, DB theory heaven, DB programming heaven, Application design heaven) populated, in this order, with MSO, Monadic Datalog, Elog, the Lixto Visual Wrapper, and the Suite. MSO = Monadic Datalog; the remaining steps are inclusions (⊆).]

Page 47

Monadic Datalog as a Wrapping Language

[Figure: parse tree with root → html → body → table → two tr nodes, each with three td children.]

entry(X) :- root(R), firstchild(R,U), label[html](U), firstchild(U,V), label[body](V), firstchild(V,W), label[table](W), firstchild(W,X), label[tr](X).
entry(X) :- entry(Y), nextsibling(Y,X).

name(X) :- entry(E), firstchild(E, X), label[td](X).

email(X) :- name(N), nextsibling(N, X), label[td](X).

phone(X) :- email(M), nextsibling(M, X), label[td](X).
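
A naive fixpoint evaluation of these five rules can be sketched in Python as follows. This is a hand translation for illustration, not a general Datalog engine and not the evaluation strategy of any particular system. The node numbering, `LABEL`, `CHILDREN`, `follow`, and `evaluate` are assumptions; the tree is the one drawn on this slide (a document root above html, and no h1 element).

```python
# Example tree: root -> html -> body -> table -> two tr rows with three td cells each.
LABEL = {0: "root", 1: "html", 2: "body", 3: "table",
         4: "tr", 5: "td", 6: "td", 7: "td",
         8: "tr", 9: "td", 10: "td", 11: "td"}
CHILDREN = {0: [1], 1: [2], 2: [3], 3: [4, 8], 4: [5, 6, 7], 8: [9, 10, 11]}
ROOT = 0

# firstchild and nextsibling are functional, so plain dictionaries suffice.
firstchild = {p: kids[0] for p, kids in CHILDREN.items()}
nextsibling = {a: b for kids in CHILDREN.values() for a, b in zip(kids, kids[1:])}


def follow(start, labels):
    """Follow firstchild edges from start, requiring one label per step."""
    node = start
    for wanted in labels:
        node = firstchild.get(node)
        if node is None or LABEL[node] != wanted:
            return None
    return node


def evaluate():
    entry, name, email, phone = set(), set(), set(), set()
    while True:
        before = (len(entry), len(name), len(email), len(phone))
        # entry(X) :- root(R), firstchild(R,U), label[html](U), ..., label[tr](X).
        first_row = follow(ROOT, ["html", "body", "table", "tr"])
        if first_row is not None:
            entry.add(first_row)
        # entry(X) :- entry(Y), nextsibling(Y,X).
        entry |= {nextsibling[y] for y in entry if y in nextsibling}
        # name(X) :- entry(E), firstchild(E,X), label[td](X).
        name |= {firstchild[e] for e in entry
                 if e in firstchild and LABEL[firstchild[e]] == "td"}
        # email(X) :- name(N), nextsibling(N,X), label[td](X).
        email |= {nextsibling[n] for n in name
                  if n in nextsibling and LABEL[nextsibling[n]] == "td"}
        # phone(X) :- email(M), nextsibling(M,X), label[td](X).
        phone |= {nextsibling[m] for m in email
                  if m in nextsibling and LABEL[nextsibling[m]] == "td"}
        if before == (len(entry), len(name), len(email), len(phone)):
            return {"entry": entry, "name": name, "email": email, "phone": phone}


result = evaluate()
```

Each predicate ends up as a set of node ids, which is exactly what "monadic" means here: every derived predicate is unary.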

Page 53

[Figure: the wrapper program (rules for entry, name, email, phone) and the parse tree from the preceding slides.]

XML output:

<?xml version="1.0"?>
<peopledb>
<entry> <name>Georg Gottlob</name>
<email>gottlob@…</email>
<phone>18420</phone>
</entry>
<entry> <name>Christoph Koch</name>
<email>koch@…</email>
<phone>18449</phone>
</entry>
</peopledb>

Page 54

Monadic Datalog over XML

Select titles of articles authored by Chandra and Merlin

[Figure: tree with root paperDB and paper children; the paper shown has an author node (children chandra, merlin) followed by a title node (child "Conj. Queries"). Edges are labelled fc (firstchild) and ns (nextsibling).]

paper(X) ← root(R) & firstchild(R,X).
paper(X) ← paper(Y) & nextsibling(Y,X).
output(X) ← paper(P) & firstchild(P,A) & firstchild(A,Z) & label[Chandra](Z) & nextsibling(Z,V) & label[Merlin](V) & nextsibling(A,T) & firstchild(T,X).
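
Read operationally, the three rules walk the tree along firstchild and nextsibling edges. The Python sketch below is a direct hand translation for illustration, not a Datalog engine. The nested `(label, children)` encoding, `PAPER_DB`, and `output_titles` are assumptions; the second paper is made-up filler data, and labels are written in lower case as in the figure.

```python
# A node is a pair (label, ordered list of children).
PAPER_DB = ("paperDB", [
    ("paper", [("author", [("chandra", []), ("merlin", [])]),
               ("title", [("Conj. Queries", [])])]),
    # Hypothetical second entry, present only so that the query has something to reject.
    ("paper", [("author", [("x", []), ("y", [])]),
               ("title", [("Some other title", [])])]),
])


def output_titles(db):
    """Labels of the nodes X for which output(X) holds."""
    _, papers = db                          # root(R); papers are reached via fc, then ns
    titles = []
    for _, fields in papers:                # paper(P)
        if len(fields) < 2:
            continue
        (_, authors), (_, title_kids) = fields[0], fields[1]   # firstchild(P,A), nextsibling(A,T)
        if (len(authors) >= 2
                and authors[0][0] == "chandra"    # firstchild(A,Z), label[Chandra](Z)
                and authors[1][0] == "merlin"     # nextsibling(Z,V), label[Merlin](V)
                and title_kids):
            titles.append(title_kids[0][0])       # firstchild(T,X)
    return titles


titles = output_titles(PAPER_DB)
```

As in the rule, the match is positional: chandra must be the first author and merlin the immediately following sibling.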

Page 55

How expressive is monadic Datalog?

It was known that over arbitrary structures:
•  Monadic Datalog ⊆ Π1-MSO
•  Full Datalog = P (in presence of order)

Theorem [G. & Koch 2002]: Over trees, monadic Datalog = MSO. A unary query is definable in MSO iff it is definable via a monadic Datalog program.

Page 56

How complex is Monadic Datalog?

Theorem [G. & Koch 2002]: Monadic Datalog over trees has combined complexity O(|data| * |query|).

Query complexity: P-complete and linear-time.

Page 57

Proof idea:

1.) Transform the Datalog program + input tree in linear time into a "ground" propositional logic program (programme Datalog instancié).

•  Exploit functional dependencies: nextsibling(X,Y) has only a linear number of ground instances: nextsibling(ni,nj), etc.
•  Decouple independent atoms of rule bodies:

p(X) ← q(X) & r(Y) & nextsibling(X,Z) & s(Z).

becomes

p(X) ← q(X) & r & nextsibling(X,Z) & s(Z).
r ← r(Y).

2.) Execute the ground program in linear time by using well-known algorithms: [Beeri&Bernstein] [Dowling&Gallier] [Minoux]
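
Step 2 can be illustrated with a counter-based forward-chaining procedure in the spirit of the cited linear-time algorithms. The sketch is not taken from those papers: `horn_least_model` and the tiny ground program over the hypothetical nodes n1 and n2 are assumptions. Each rule keeps a count of body atoms not yet derived, and each atom is processed once, so the work is proportional to the size of the ground program.

```python
from collections import defaultdict, deque


def horn_least_model(rules):
    """rules: list of (head, body) with body a list of atoms; facts have an empty body."""
    remaining = [len(set(body)) for _, body in rules]   # underived body atoms per rule
    watching = defaultdict(list)                        # atom -> rules with it in the body
    for i, (_, body) in enumerate(rules):
        for atom in set(body):
            watching[atom].append(i)

    true = set()
    queue = deque(head for head, body in rules if not body)
    while queue:
        atom = queue.popleft()
        if atom in true:
            continue
        true.add(atom)
        for i in watching[atom]:
            remaining[i] -= 1
            if remaining[i] == 0:                       # whole body derived: fire the rule
                queue.append(rules[i][0])
    return true


# Ground instances of  p(X) <- q(X) & r & nextsibling(X,Z) & s(Z)  and  r <- r(Y).
GROUND = [
    ("q(n1)", []), ("s(n2)", []), ("r(n2)", []), ("nextsibling(n1,n2)", []),
    ("r", ["r(n1)"]),
    ("r", ["r(n2)"]),
    ("p(n1)", ["q(n1)", "r", "nextsibling(n1,n2)", "s(n2)"]),
]
model = horn_least_model(GROUND)
```

Because nextsibling is functional, the rule for p contributes one ground instance per node X rather than one per pair (X, Z), which is what keeps the ground program linear in the size of the tree.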

Page 59

[Screenshot of an auction listing page with annotated regions: one record; item description and link to detail page; price info; date; # of bids; next page link.]

Page 60

ELOG [Baumgartner, Flesca, G. VLDB'01]

Examples of special predicates:

subelem(S,X,Path,…)   (Path: XPath-like expression)
before(X,Y,…)
after(X,Y,…)
property(X,Attribute,Op,Value,…)
document(URL,D)
getdocumentFromHref(X,D), etc.

Additional features:
- stratified negation
- string processing
- ontological concepts, e.g. "phonenumber(X)"
- ranges: H(S,X) :- body(…)[1,5]
- object hierarchies
- distance tolerance, etc.

Page 61

<?xml version="1.0" encoding="UTF-8"?>

<document>

<record>

<number>409449118</number>

<item>98 Degrees - Notebook - New</item>

<picture/>

<price>2.99</price>

<currency>$</currency>

<bids>-</bids>

</record>

<record>

<number>413171469</number>

<item>Notebook - Compaq Presario 1207</item>

<price>730.00</price>

<currency>AU $</currency>

[...]

Page 62

ELOG Program for eBay pages

Page 63

[Diagram repeated: Logic heaven, DB theory heaven, DB programming heaven, Application design heaven, with MSO, Monadic Datalog, Elog, and the Lixto Visual Wrapper (tool: software suite; outil : suite logicielle).]

Page 64

Lixto Visual Wrapper Architecture

[Diagram: example page(s) → Visual Wrapper Generator → extraction program; Web (similarly structured pages) → Extraction Module, which runs the extraction program → XML.]

Further processing: tracking changes, delivering (email, SMS), ... (→ Transformation Server)

Page 65

Product Architecture

LiXto Extraction Engine

Transformation Server

Page 66

SHORT DEMO

Page 67

Talk Outline

•  Motivation: the need for information extraction

•  Logical foundations of information extraction

•  The Lixto Visual Wrapper

•  The Diadem Project: Fully automatic data extraction

Page 68

Need for Automatic Extraction Technology (2)

All search engine providers need it! Many work on it.

Keywords:
•  Vertical search
•  Object search
•  Semantic search

Raghu Ramakrishnan, Yahoo!, March 2009: "no one really has done this successfully at scale yet"

Alon Halevy, Google, Feb. 2009: "Current technologies are not good enough yet to provide what search engines really need. […] any successful approach would probably need a combination of knowledge and learning."

Page 71

The Blackbox we are constructing

[Diagram: URL (from an application domain with thousands of websites) → BLACKBOX → application-relevant structured data (XML or RDF)]

To achieve this, we combine a host of annotators with a new knowledge-based approach.

Page 72

How to achieve it?

Combine existing and new “low level” annotators with “high level” AI and reasoning.

Page 73

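[Figure: DOM tree of a search form with numbered nodes. Nodes and labels, in the order they appear: <table>113; <tr>134, <tr>115; “I’m interested in”; <td>119; <table>124 (radio buttons); <tr>125, <tr>126; <td>129, <td>130; “Buying”, “Renting”; <td>135; “Maximum price”; <select>136; <option>137, <option>138; <td>139, <td>140; “GBP”, “EUR”]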

Page 74

Bottom-up (low-level) annotation

[Diagram: low-level annotations and the concepts derived from them. Node labels: Monochromatic Rectangle, Geographic query form, Postcode input field, Active map, Price search facility, Geo-Price-Searchbox. Edge labels: ISA, Occurs in. Node references: 105, 127. Bounding box shown twice: [(02873,227) (03900,417)]]
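A minimal sketch of the bottom-up idea: low-level annotators emit typed facts about page regions, and simple rules lift them to domain concepts. The rule set, the label names, the node id 131, and all bounding boxes are assumptions made for illustration. They are not the rules or data used in Diadem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    node: int     # DOM node id
    label: str    # low-level annotation type
    box: tuple    # (x1, y1, x2, y2) bounding box in page coordinates


def inside(inner: tuple, outer: tuple) -> bool:
    """True if bounding box `inner` lies completely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


# Facts as simple annotators might produce them (made-up values).
facts = [
    Annotation(105, "monochromatic_rectangle", (100, 200, 500, 420)),
    Annotation(127, "postcode_input_field", (120, 220, 300, 250)),
    Annotation(131, "price_select", (120, 300, 300, 330)),
]

# Hypothetical ISA rules: low-level label -> higher-level concept.
ISA = {
    "postcode_input_field": "geographic_search_facility",
    "active_map": "geographic_search_facility",
    "price_select": "price_search_facility",
}


def classify_regions(facts):
    """Label each rectangle by the concepts that occur inside it."""
    derived = {}
    for region in (f for f in facts if f.label == "monochromatic_rectangle"):
        concepts = {ISA[f.label] for f in facts
                    if f.label in ISA and inside(f.box, region.box)}
        # Assumed rule: geographic search + price search => geo-price searchbox.
        if {"geographic_search_facility", "price_search_facility"} <= concepts:
            derived[region.node] = "geo_price_searchbox"
    return derived


derived = classify_regions(facts)
print(derived)
```

Keeping the facts and the rules separate means new annotators or new domain rules can be added without touching the other side.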

Page 75

Top-down reasoning

[Diagram: domain concepts Property Search Facility, Property List, Single Property Description, and Specially highlighted property, with a part-of edge labeled m : 1]

Page 76

Bottom-up processing, top-down reasoning

[Diagram: bottom-up side with Monochromatic Rectangle, Geographic search facility, Postcode input field, Active map, Price search facility, and Geo-Price-Searchbox, connected by ISA and Occurs in edges (node references 105, 127; bounding box [(02873,227) (03900,417)]). Top-down side with Property Search Facility, Property List, Single Property Description, and Specially highlighted property, with a part-of edge labeled m : 1. Additional label: Phenomenology]

Page 77

Phenomenological Record Segmentation

•  Goal: a set of uniform, non-overlapping records

•  Maximise the sequence of evenly segmented records (same distance between pivots)

•  Minimise the irregularity of records

[Figure: DOM tree of a data area made of div elements containing a, img, and p nodes, with price annotations £860, £900, and £500]
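The first two criteria can be illustrated with a small sketch. It assumes a strongly simplified setting that is not taken from the slides: the children of a data area are given as a flat list, a pivot is any child annotated as a price, and each record ends at its pivot. The third criterion (minimising irregularity) is not modelled.

```python
def longest_even_run(pivots):
    """Return the longest run of pivots whose consecutive gaps are all equal."""
    if len(pivots) < 2:
        return list(pivots)
    best_start, best_len = 0, 2
    start = 0
    for i in range(2, len(pivots)):
        if pivots[i] - pivots[i - 1] != pivots[start + 1] - pivots[start]:
            start = i - 1  # a new run begins with the previous pivot
        if i - start + 1 > best_len:
            best_start, best_len = start, i - start + 1
    return pivots[best_start:best_start + best_len]


def segment(children, is_pivot):
    """Split sibling nodes into uniform, non-overlapping records."""
    pivots = [i for i, child in enumerate(children) if is_pivot(child)]
    run = longest_even_run(pivots)
    if len(run) < 2:
        return []
    width = run[1] - run[0]  # every record spans the same number of siblings
    records = []
    for p in run:
        first = p - width + 1  # assumption: a record ends at its pivot
        if first >= 0:
            records.append(children[first:p + 1])
    return records


# Toy data area: tag names, with price strings standing in for pivot nodes.
children = ["a", "img", "£860", "a", "img", "£900", "a", "img", "£500", "p"]
records = segment(children, lambda c: c.startswith("£"))
print(records)
```

The trailing `p` node belongs to no record because it lies outside the evenly spaced pivot run.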

Page 78

[Charts: precision and recall. Two panels, titled Used Car (100 pages) and Real Estate (100 pages), show precision and recall for data areas, records, and attributes (axis from 98 to 100). A third panel shows precision and recall per attribute: price, postcode, location, bathroom, bedroom, reception, legal, type (axis from 90 to 100)]

Page 79

OPAL: Form Interpretation

Form Patterns Example

•  Small set of ubiquitous patterns

   •  ranges, dates, options, etc.

•  Ontology by instantiation

Page 80

OPAL-TL Example

•  Price range

   •  two successive fields in the same group

   •  at least one “price” type

   •  range connector in between

TEMPLATE concept_minmax<C,CM,A> {
  concept<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2), N1@A{e,d},
                    (concept<C>(N2) ∨ N2@A{e,d})
  concept<CM>(N2) ⇐ child(N1,G), child(N2,G), follows(N2,N1), concept<C>(N1),
                    N2@range_connector{e,d}, ¬(A1 ≺ A, N2@A1{d})
  concept<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2),
                    N1@A{e,p}, N2@A{e,p},
                    ((N1@min{e,p}, N2@max{e,p}) ∨ (N1@max{e,p}, N2@min{e,p}))
}
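The three conditions listed for a price range can be approximated in plain Python. This is a loose sketch, not OPAL-TL semantics: the `FormField` class, the connector word list, and the sample form are hypothetical, and the template's handling of annotation types and precedence is not reproduced.

```python
from dataclasses import dataclass, field


@dataclass
class FormField:
    name: str
    group: str                                # id of the enclosing form group
    types: set = field(default_factory=set)   # annotations, e.g. {"price"}
    text_before: str = ""                     # text between previous field and this one


RANGE_CONNECTORS = {"to", "-", "until", "and"}  # assumed connector words


def find_price_ranges(fields):
    """Return (first, second) field-name pairs that look like a price range."""
    ranges = []
    for first, second in zip(fields, fields[1:]):  # successive fields only
        same_group = first.group == second.group
        some_price = "price" in first.types or "price" in second.types
        connector = second.text_before.strip().lower() in RANGE_CONNECTORS
        if same_group and some_price and connector:
            ranges.append((first.name, second.name))
    return ranges


form = [
    FormField("min_price", "g1", {"price"}, "Price from"),
    FormField("max_price", "g1", set(), "to"),
    FormField("bedrooms", "g2", {"number"}, "Bedrooms"),
]
price_ranges = find_price_ranges(form)
print(price_ranges)
```

Only one of the two fields needs a price annotation. The second field inherits its role from the connector and from sharing a group with the first.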

Page 81

[Charts: Precision, Recall, and F-score. First panel (axis from 0.94 to 1): UK Real Estate (100), UK Used Car (100), ICQ (98), Tel-8 (436). Second panel (axis from 0.9 to 1): Airfare, Auto, Book, Job, US R.E.]

Dragut et al., VLDB, 2009
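For reference, the three measures named in these charts can be computed as follows. The sketch assumes the F-score is the balanced F1 (harmonic mean of precision and recall), which the slide does not state, and the field sets are invented for illustration.

```python
def precision_recall_f1(extracted: set, gold: set):
    """Compare extracted items against a gold standard."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical form fields: what a system labelled vs. what is correct.
extracted = {"min_price", "max_price", "bedrooms", "postcode"}
gold = {"min_price", "max_price", "bedrooms", "location"}
precision, recall, f1 = precision_recall_f1(extracted, gold)
print(precision, recall, f1)
```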

Page 82

Short Demo diadem-3min43.m4v
