The Web’s Many Models
Michael J. Cafarella University of Michigan
AKBC, May 19, 2010
Web Information Extraction
Much recent research in information extractors that operate over Web pages:
- Snowball (Agichtein and Gravano, 2001)
- TextRunner (Banko et al., 2007)
- YAGO (Suchanek et al., 2007)
- WebTables (Cafarella et al., 2008)
- DBpedia, ExDB, Freebase (make use of IE data)
A Web crawl plus domain-independent IE should allow comprehensive Web KBs with:
- Very high, "web-style" recall
- "More-expressive-than-search" query processing
But where is it?
Web Information Extraction: Omnivore
"Extracting and Querying a Comprehensive Web Database." Michael Cafarella. CIDR 2009. Asilomar, CA.
- Suggested remedies for data ingestion and user interaction
- This talk says why the ideas in that paper might already be out of date, and gives alternative ideas
- If there are mistakes here, then you have a chance to save me years of work!
Outline
- Introduction
- Data Ingestion
  - Previously: Parallel Extraction
  - Alternative: The Data-Centric Web
- User Interaction
  - Previously: Model Generation for Output
  - Alternative: Data Integration as UI
- Conclusion
Parallel Extraction
Previous hypothesis:
- There are many data models for interesting data, e.g., relational tables, E/R graphs, etc.
- We should build a large integration infrastructure to consume many extraction streams
Database Construction (1)
Start with a single large Web crawl
Database Construction (2)
Each of k extractors emits output that:
- Has an extractor-dependent model
- Has an extractor-and-Web-page-dependent schema
Database Construction (3)
For each extractor output, unfold it into a common entity-relation model
Database Construction (4)
Unify results
Database Construction (5)
Emit final database
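The five construction steps above can be sketched as a toy pipeline. The triple representation and the per-extractor unfold functions below are hypothetical illustrations; the talk does not fix an API.

```python
# Toy sketch of steps 2-5: extractors emit model-specific output,
# each output is unfolded into a common entity-relation
# (subject, predicate, object) form, and the streams are unified.
# All record shapes below are invented for illustration.

def unfold_table(table):
    """A relational-table extractor: one triple per non-key cell,
    using the first column as the entity name."""
    for row in table["rows"]:
        for col, val in zip(table["columns"][1:], row[1:]):
            yield (row[0], col, val)

def unfold_graph(edges):
    """An E/R-graph extractor already emits triples."""
    yield from edges

def unify(*streams):
    """Step 4: unify results -- here, a simple dedup of triples."""
    return sorted(set(t for s in streams for t in s))

table = {"columns": ["name", "affiliation"],
         "rows": [["serge abiteboul", "inria"]]}
edges = [("serge abiteboul", "affiliation", "inria"),
         ("serge abiteboul", "field", "databases")]

db = unify(unfold_table(table), unfold_graph(edges))  # final database
```

Real reconciliation is of course far harder than set-dedup; the sketch only shows where the intra-source reconciliation problem sits in the pipeline.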
Potential Problems
Pressing problems:
- Recall
- Simple intra-source reconciliation
- Time
Tables and entities are probably OK for now; many data sources (DBpedia, Facebook, IMDB) already match one of these two models pretty well
One possible different direction: the Data-Centric Web (addresses recall only)
The Data-Centric Web
Data-Centric Lists
Lists of Data-Centric Entities give hints:
- About what the target entity contains
- Whether all members of the set are DCEs, or not
- That members of the set belong to a class or type (e.g., program committee members)
Build the Data-Centric Web
1. Download the Web
2. Train classifiers to detect DCEs and DCLs
3. Filter out all pages that fail both tests
4. Use lists to fix up incorrect Data-Centric Entity classifications
5. Run attribute/value extractors on DCEs
Yields an E/R dataset, for insertion into DBpedia, YAGO, etc.
In progress now, with student Ashwin Balakrishnan; entity detector >95% accuracy
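Steps 2-4 of the pipeline can be sketched as follows. The detectors here are hypothetical rule-based stand-ins for the talk's learned classifiers; the list-repair logic shows how DCL membership can overrule a noisy per-page DCE decision.

```python
# Sketch of steps 2-4. `is_dce` and `is_dcl` are invented
# stand-ins; the interesting part is step 4, where a mostly-DCE
# list flips misclassified members back to DCE.

def is_dce(page):
    """Hypothetical Data-Centric Entity detector: a page 'about'
    one entity, approximated by an attribute count."""
    return len(page.get("attributes", {})) >= 2

def is_dcl(page):
    """Hypothetical Data-Centric List detector: a page linking to
    several entity pages."""
    return len(page.get("members", [])) >= 3

def build_data_centric_web(pages):
    # Step 3: keep only pages that pass at least one test
    dces = {p["url"] for p in pages if is_dce(p)}
    dcls = [p for p in pages if is_dcl(p)]
    # Step 4: if most members of a list are DCEs, the stragglers
    # were probably misclassified -- flip them to DCE as well
    for lst in dcls:
        members = lst["members"]
        hits = sum(1 for m in members if m in dces)
        if hits / len(members) > 0.5:
            dces.update(members)
    return dces

pages = [
    {"url": "a", "attributes": {"name": "x", "affil": "y"}},
    {"url": "b", "attributes": {}},             # missed by the detector
    {"url": "c", "attributes": {"name": "z", "affil": "w"}},
    {"url": "pc", "members": ["a", "b", "c"]},  # a DCL covering a, b, c
]
print(sorted(build_data_centric_web(pages)))    # "b" recovered via the list
```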
Research Question 1
How many useful entities...
- Lack a page in the Data-Centric Web? (That means no homepage, no Amazon page, no public Facebook page, etc.)
- AND are otherwise well-described enough online that IE can recover an entity-centric view?
Put differently: does every entity worth extracting already have a homepage on the Web?
Research Question 2
Does a single real-world entity have more than one "authoritative" URL?
Note that Wikipedia provides pretty minimal assistance in choosing the right entity, but does a good job
Outline
- Introduction
- Data Ingestion
  - Previously: Parallel Extraction
  - Alternative: The Data-Centric Web
- User Interaction
  - Previously: Model Generation for Output
  - Alternative: Data Integration as UI
- Conclusion
Model Generation for Output
Previous hypothesis:
- Many different user applications are built against a single back-end database
- The difficult task is translating from the back-end data model to the application's data model
Query Processing (1)
Query arrives at system
Query Processing (2)
Entity-relation database processor yields entity results
Query Processing (3)
Query Renderer chooses appropriate output schema
Query Processing (4)
User corrections are logged and fed into later iterations of db construction
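The four query-processing steps above can be sketched end to end. Every name here (EntityStore, render, the corrections log) is illustrative, not the system's real API.

```python
# Minimal sketch of the query-processing loop: (1) query arrives,
# (2) the entity-relation store yields entity results, (3) a
# renderer chooses an output schema, (4) user corrections are
# logged for later database construction.

class EntityStore:
    """Toy entity-relation store: entities as attribute dicts."""
    def __init__(self, entities):
        self.entities = entities

    def query(self, entity_type):
        return [e for e in self.entities if e["type"] == entity_type]

def render(results):
    """Step 3: choose an output schema -- here simply the union of
    attributes that appear in the result set."""
    columns = sorted({k for e in results for k in e if k != "type"})
    return columns, [[e.get(c) for c in columns] for e in results]

corrections_log = []  # step 4: feeds later iterations of db construction

def log_correction(row, column, new_value):
    corrections_log.append((row, column, new_value))

store = EntityStore([
    {"type": "person", "name": "serge abiteboul", "affil": "inria"},
    {"type": "person", "name": "gustavo alonso", "affil": "eth zurich"},
])
cols, rows = render(store.query("person"))  # steps 1-3
log_correction(0, "affil", "inria saclay")  # step 4
```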
Potential Problems
Many plausible front-end applications, but none yet totally compelling and novel:
- Ad- and search-driven ones are not novel
- Freebase, Wolfram Alpha are not compelling
- Raw input to learners: useful, but not an end-user application
Need to explore possible applications rather than build multi-app infrastructure
One possible different direction: data integration as a user primitive
Data Integration as UI
Can we combine tables to create new data sources?
There are many existing "mashup" tools, but they ignore realities of Web data:
- A lot of useful data is not in XML
- The user cannot know all sources in advance
- Transient integrations
- Dirty data
Interaction Challenge
Try to create a database of all "VLDB program committee members"
Octopus
- Provides a "workbench" of data integration operators to build a target database
- Most operators are not correct/incorrect, but high/low quality (like search)
- Also, prosaic traditional operators
- Originally ran on WebTables data [Cafarella, Khoussainova, Halevy, VLDB 2009]
Walkthrough - Operator #1
SEARCH("VLDB program committee members") returns candidate tables:

serge abiteboul | inria
anastassia ail… | carnegie…
gustavo alonso | etz zurich
… | …

serge abiteboul | inria
michael adiba | …grenoble
antonio albano | …pisa
… | …
Walkthrough - Operator #2
Recover relevant data: CONTEXT() is applied to each table's source page

serge abiteboul | inria
michael adiba | …grenoble
antonio albano | …pisa
… | …

serge abiteboul | inria
anastassia ail… | carnegie…
gustavo alonso | etz zurich
… | …
Walkthrough - Operator #2
Recover relevant data: CONTEXT() adds the year recovered from each source page

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
… | … | …

serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …
Walkthrough - Union
Combine datasets with Union():

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
… | … | …

serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …

Union()

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …
Walkthrough - Operator #3
Add a column to the data; similar to "join", but the join target is a topic

EXTEND("publications", col=0)

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …

serge abiteboul | inria | 1996 | "Large Scale P2P Dist…"
michael adiba | …grenoble | 1996 | "Exploiting bitemporal…"
antonio albano | …pisa | 1996 | "Another Example of a…"
serge abiteboul | inria | 2005 | "Large Scale P2P Dist…"
anastassia ail… | carnegie… | 2005 | "Efficient Use of the…"
gustavo alonso | etz zurich | 2005 | "A Dynamic and Flexible…"
… | … | … | …

• The user has integrated data sources with little effort
• No wrappers; the data was never intended for reuse
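The operator walkthrough above can be sketched over in-memory tables. The operator bodies here are illustrative stand-ins, not the published algorithms: real SEARCH ranks extracted Web tables, and real CONTEXT and EXTEND work against pages and a table corpus.

```python
# Toy versions of the Octopus operators: SEARCH, CONTEXT, Union,
# EXTEND. Data shapes and matching logic are invented for
# illustration.

def search(corpus, query):
    """SEARCH: return tables whose keywords overlap the query."""
    words = set(query.lower().split())
    return [t for t in corpus if words & set(t["keywords"])]

def context(table):
    """CONTEXT: append a value recovered from the source page
    (here, a year stored with the table) to every row."""
    return [row + [table["year"]] for row in table["rows"]]

def union(*tables):
    """Union: concatenate row lists from several tables."""
    return [row for t in tables for row in t]

def extend(rows, topic_index, col=0):
    """EXTEND: join each row against a topic index on column
    `col` -- like a join whose target is a topic, not a table."""
    return [row + [topic_index.get(row[col])] for row in rows]

corpus = [
    {"keywords": {"vldb", "committee", "1996"}, "year": 1996,
     "rows": [["serge abiteboul", "inria"]]},
    {"keywords": {"vldb", "committee", "2005"}, "year": 2005,
     "rows": [["gustavo alonso", "eth zurich"]]},
]
pubs = {"serge abiteboul": "Large Scale P2P Dist..."}

hits = search(corpus, "vldb committee")
combined = union(*(context(t) for t in hits))
result = extend(combined, pubs, col=0)
```

Note how the operators compose like a query plan even though each step is only "high quality" rather than exact, which is the workbench design point the slides make.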
CONTEXT Algorithms
Input: a table and its source page
Output: data values to add to the table
SignificantTerms sorts terms in the source page by "importance" (tf-idf)
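A minimal sketch of SignificantTerms: rank a source page's terms by tf-idf against a background corpus. The slides only name the scoring; the tokenization and the tiny corpus below are invented.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def significant_terms(page, corpus):
    """Return page terms sorted by descending tf-idf, where tf is
    the in-page count and idf penalizes corpus-common terms."""
    tf = Counter(tokenize(page))
    docs = [set(tokenize(d)) for d in corpus]

    def tfidf(term):
        df = sum(1 for d in docs if term in d)
        return tf[term] * math.log((1 + len(docs)) / (1 + df))

    return sorted(tf, key=tfidf, reverse=True)

corpus = [
    "the committee met in the morning",
    "the program was long",
    "the database group met",
]
page = "vldb 1996 program committee vldb"
print(significant_terms(page, corpus)[0])  # 'vldb': frequent on-page, rare in corpus
```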
Related View Partners
Looks for different "views" of the same data
CONTEXT Experiments
Data Integration as UI
Compelling for database researchers, but will large numbers of people use it?
Conclusion
- Automatic Web KBs are rapidly progressing
- Recall is still not good enough for many tasks, but progress is rapid
- It is not clear what those tasks should be, and progress there is much slower
  - Difficult to predict what's useful
  - Sometimes difficult to write a "new app" paper
- Omnivore's approach was not wrong, but it did not directly address these problems