The Web’s Many Models
Michael J. Cafarella University of Michigan
AKBC, May 19, 2010
Web Information Extraction
Much recent research in information extractors that operate over Web pages:
- Snowball (Agichtein and Gravano, 2001)
- TextRunner (Banko et al., 2007)
- YAGO (Suchanek et al., 2007)
- WebTables (Cafarella et al., 2008)
- DBpedia, ExDB, Freebase (make use of IE data)
A Web crawl plus domain-independent IE should allow comprehensive Web KBs with:
- Very high, "web-style" recall
- "More-expressive-than-search" query processing
But where is it?
Web Information Extraction: Omnivore
"Extracting and Querying a Comprehensive Web Database." Michael Cafarella. CIDR 2009. Asilomar, CA.
- Suggested remedies for data ingestion and user interaction
- This talk says why the ideas in that paper might already be out of date, and gives alternative ideas
- If there are mistakes here, then you have a chance to save me years of work!
Outline
- Introduction
- Data Ingestion
  - Previously: Parallel Extraction
  - Alternative: The Data-Centric Web
- User Interaction
  - Previously: Model Generation for Output
  - Alternative: Data Integration as UI
- Conclusion
Parallel Extraction
Previous hypothesis:
- There are many data models for interesting data, e.g., relational tables, E/R graphs, etc.
- We should build a large integration infrastructure to consume many extraction streams
Database Construction (1)
Start with a single large Web crawl
Database Construction (2)
Each of k extractors emits output that:
- Has an extractor-dependent model
- Has an extractor-and-Web-page-dependent schema
Database Construction (3)
For each extractor output, unfold it into a common entity-relation model
Database Construction (4)
Unify results
Database Construction (5)
Emit final database
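The five construction steps above can be sketched as a toy pipeline. The triple representation and the per-extractor unfold functions below are hypothetical illustrations; the talk does not fix an API.

```python
# Toy sketch of steps 2-5: extractors emit model-specific output,
# each output is unfolded into a common entity-relation
# (subject, predicate, object) form, and the streams are unified.
# All record shapes below are invented for illustration.

def unfold_table(table):
    """A relational-table extractor: one triple per non-key cell,
    using the first column as the entity name."""
    for row in table["rows"]:
        for col, val in zip(table["columns"][1:], row[1:]):
            yield (row[0], col, val)

def unfold_graph(edges):
    """An E/R-graph extractor already emits triples."""
    yield from edges

def unify(*streams):
    """Step 4: unify results -- here, a simple dedup of triples."""
    return sorted(set(t for s in streams for t in s))

table = {"columns": ["name", "affiliation"],
         "rows": [["serge abiteboul", "inria"]]}
edges = [("serge abiteboul", "affiliation", "inria"),
         ("serge abiteboul", "field", "databases")]

db = unify(unfold_table(table), unfold_graph(edges))  # final database
```

Real reconciliation is of course far harder than set-dedup; the sketch only shows where the intra-source reconciliation problem sits in the pipeline.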
Potential Problems
Pressing problems:
- Recall
- Simple intra-source reconciliation
- Time
Tables and entities are probably OK for now; many data sources (DBpedia, Facebook, IMDB) already match one of these two models pretty well
One possible different direction: the Data-Centric Web (addresses recall only)
The Data-Centric Web
Data-Centric Lists
Lists of Data-Centric Entities give hints:
- About what the target entity contains
- Whether all members of the set are DCEs, or not
- That members of the set belong to a class or type (e.g., program committee members)
Build the Data-Centric Web
1. Download the Web
2. Train classifiers to detect DCEs and DCLs
3. Filter out all pages that fail both tests
4. Use lists to fix up incorrect Data-Centric Entity classifications
5. Run attribute/value extractors on DCEs
Yields an E/R dataset, for insertion into DBpedia, YAGO, etc.
In progress now, with student Ashwin Balakrishnan; entity detector >95% accuracy
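Steps 2-4 of the pipeline can be sketched as follows. The detectors here are hypothetical rule-based stand-ins for the talk's learned classifiers; the list-repair logic shows how DCL membership can overrule a noisy per-page DCE decision.

```python
# Sketch of steps 2-4. `is_dce` and `is_dcl` are invented
# stand-ins; the interesting part is step 4, where a mostly-DCE
# list flips misclassified members back to DCE.

def is_dce(page):
    """Hypothetical Data-Centric Entity detector: a page 'about'
    one entity, approximated by an attribute count."""
    return len(page.get("attributes", {})) >= 2

def is_dcl(page):
    """Hypothetical Data-Centric List detector: a page linking to
    several entity pages."""
    return len(page.get("members", [])) >= 3

def build_data_centric_web(pages):
    # Step 3: keep only pages that pass at least one test
    dces = {p["url"] for p in pages if is_dce(p)}
    dcls = [p for p in pages if is_dcl(p)]
    # Step 4: if most members of a list are DCEs, the stragglers
    # were probably misclassified -- flip them to DCE as well
    for lst in dcls:
        members = lst["members"]
        hits = sum(1 for m in members if m in dces)
        if hits / len(members) > 0.5:
            dces.update(members)
    return dces

pages = [
    {"url": "a", "attributes": {"name": "x", "affil": "y"}},
    {"url": "b", "attributes": {}},             # missed by the detector
    {"url": "c", "attributes": {"name": "z", "affil": "w"}},
    {"url": "pc", "members": ["a", "b", "c"]},  # a DCL covering a, b, c
]
print(sorted(build_data_centric_web(pages)))    # "b" recovered via the list
```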
Research Question 1
How many useful entities...
- Lack a page in the Data-Centric Web? (That means no homepage, no Amazon page, no public Facebook page, etc.)
- AND are otherwise well-described enough online that IE can recover an entity-centric view?
Put differently: does every entity worth extracting already have a homepage on the Web?
Research Question 2
Does a single real-world entity have more than one "authoritative" URL?
Note that Wikipedia provides pretty minimal assistance in choosing the right entity, but does a good job
Outline
- Introduction
- Data Ingestion
  - Previously: Parallel Extraction
  - Alternative: The Data-Centric Web
- User Interaction
  - Previously: Model Generation for Output
  - Alternative: Data Integration as UI
- Conclusion
Model Generation for Output
Previous hypothesis:
- Many different user applications are built against a single back-end database
- The difficult task is translating from the back-end data model to the application's data model
Query Processing (1)
Query arrives at system
Query Processing (2)
Entity-relation database processor yields entity results
Query Processing (3)
Query Renderer chooses appropriate output schema
Query Processing (4)
User corrections are logged and fed into later iterations of db construction
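The four query-processing steps above can be sketched end to end. Every name here (EntityStore, render, the corrections log) is illustrative, not the system's real API.

```python
# Minimal sketch of the query-processing loop: (1) query arrives,
# (2) the entity-relation store yields entity results, (3) a
# renderer chooses an output schema, (4) user corrections are
# logged for later database construction.

class EntityStore:
    """Toy entity-relation store: entities as attribute dicts."""
    def __init__(self, entities):
        self.entities = entities

    def query(self, entity_type):
        return [e for e in self.entities if e["type"] == entity_type]

def render(results):
    """Step 3: choose an output schema -- here simply the union of
    attributes that appear in the result set."""
    columns = sorted({k for e in results for k in e if k != "type"})
    return columns, [[e.get(c) for c in columns] for e in results]

corrections_log = []  # step 4: feeds later iterations of db construction

def log_correction(row, column, new_value):
    corrections_log.append((row, column, new_value))

store = EntityStore([
    {"type": "person", "name": "serge abiteboul", "affil": "inria"},
    {"type": "person", "name": "gustavo alonso", "affil": "eth zurich"},
])
cols, rows = render(store.query("person"))  # steps 1-3
log_correction(0, "affil", "inria saclay")  # step 4
```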
Potential Problems
Many plausible front-end applications, but none yet totally compelling and novel:
- Ad- and search-driven ones are not novel
- Freebase, Wolfram Alpha are not compelling
- Raw input to learners: useful, but not an end-user application
Need to explore possible applications rather than build multi-app infrastructure
One possible different direction: data integration as a user primitive
Data Integration as UI
Can we combine tables to create new data sources?
There are many existing "mashup" tools, but they ignore realities of Web data:
- A lot of useful data is not in XML
- The user cannot know all sources in advance
- Transient integrations
- Dirty data
Interaction Challenge
Try to create a database of all "VLDB program committee members"
Octopus
- Provides a "workbench" of data integration operators to build a target database
- Most operators are not correct/incorrect, but high/low quality (like search)
- Also, prosaic traditional operators
- Originally ran on WebTables data [Cafarella, Khoussainova, Halevy, VLDB 2009]
Walkthrough - Operator #1
SEARCH("VLDB program committee members") returns candidate tables:

serge abiteboul | inria
anastassia ail… | carnegie…
gustavo alonso | etz zurich
… | …

serge abiteboul | inria
michael adiba | …grenoble
antonio albano | …pisa
… | …
Walkthrough - Operator #2
Recover relevant data: CONTEXT() is applied to each table's source page

serge abiteboul | inria
michael adiba | …grenoble
antonio albano | …pisa
… | …

serge abiteboul | inria
anastassia ail… | carnegie…
gustavo alonso | etz zurich
… | …
Walkthrough - Operator #2
Recover relevant data: CONTEXT() adds the year recovered from each source page

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
… | … | …

serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …
Walkthrough - Union
Combine datasets with Union():

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
… | … | …

serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …

Union()

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …
Walkthrough - Operator #3
Add a column to the data; similar to "join", but the join target is a topic

EXTEND("publications", col=0)

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …

serge abiteboul | inria | 1996 | "Large Scale P2P Dist…"
michael adiba | …grenoble | 1996 | "Exploiting bitemporal…"
antonio albano | …pisa | 1996 | "Another Example of a…"
serge abiteboul | inria | 2005 | "Large Scale P2P Dist…"
anastassia ail… | carnegie… | 2005 | "Efficient Use of the…"
gustavo alonso | etz zurich | 2005 | "A Dynamic and Flexible…"
… | … | … | …

• The user has integrated data sources with little effort
• No wrappers; the data was never intended for reuse
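The operator walkthrough above can be sketched over in-memory tables. The operator bodies here are illustrative stand-ins, not the published algorithms: real SEARCH ranks extracted Web tables, and real CONTEXT and EXTEND work against pages and a table corpus.

```python
# Toy versions of the Octopus operators: SEARCH, CONTEXT, Union,
# EXTEND. Data shapes and matching logic are invented for
# illustration.

def search(corpus, query):
    """SEARCH: return tables whose keywords overlap the query."""
    words = set(query.lower().split())
    return [t for t in corpus if words & set(t["keywords"])]

def context(table):
    """CONTEXT: append a value recovered from the source page
    (here, a year stored with the table) to every row."""
    return [row + [table["year"]] for row in table["rows"]]

def union(*tables):
    """Union: concatenate row lists from several tables."""
    return [row for t in tables for row in t]

def extend(rows, topic_index, col=0):
    """EXTEND: join each row against a topic index on column
    `col` -- like a join whose target is a topic, not a table."""
    return [row + [topic_index.get(row[col])] for row in rows]

corpus = [
    {"keywords": {"vldb", "committee", "1996"}, "year": 1996,
     "rows": [["serge abiteboul", "inria"]]},
    {"keywords": {"vldb", "committee", "2005"}, "year": 2005,
     "rows": [["gustavo alonso", "eth zurich"]]},
]
pubs = {"serge abiteboul": "Large Scale P2P Dist..."}

hits = search(corpus, "vldb committee")
combined = union(*(context(t) for t in hits))
result = extend(combined, pubs, col=0)
```

Note how the operators compose like a query plan even though each step is only "high quality" rather than exact, which is the workbench design point the slides make.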
CONTEXT Algorithms
Input: a table and its source page
Output: data values to add to the table
SignificantTerms sorts terms in the source page by "importance" (tf-idf)
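A minimal sketch of SignificantTerms: rank a source page's terms by tf-idf against a background corpus. The slides only name the scoring; the tokenization and the tiny corpus below are invented.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def significant_terms(page, corpus):
    """Return page terms sorted by descending tf-idf, where tf is
    the in-page count and idf penalizes corpus-common terms."""
    tf = Counter(tokenize(page))
    docs = [set(tokenize(d)) for d in corpus]

    def tfidf(term):
        df = sum(1 for d in docs if term in d)
        return tf[term] * math.log((1 + len(docs)) / (1 + df))

    return sorted(tf, key=tfidf, reverse=True)

corpus = [
    "the committee met in the morning",
    "the program was long",
    "the database group met",
]
page = "vldb 1996 program committee vldb"
print(significant_terms(page, corpus)[0])  # 'vldb': frequent on-page, rare in corpus
```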
Related View Partners
Looks for different "views" of the same data
CONTEXT Experiments
Data Integration as UI
Compelling for database researchers, but will large numbers of people use it?
Conclusion
- Automatic Web KBs are rapidly progressing
- Recall is still not good enough for many tasks, but progress is rapid
- It is not clear what those tasks should be, and progress there is much slower
  - Difficult to predict what's useful
  - Sometimes difficult to write a "new app" paper
- Omnivore's approach was not wrong, but it did not directly address these problems