Slide 1 International Semantic Web Conference Riva del Garda, Italy, 22.10.2014 Semantic Web Challenge – Big Data Track Extending Tables with Data from over a Million Websites Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Kai Eckert, Heiko Paulheim, Christian Bizer
Extending Tables with Data from over a Million Websites
The slideset describes the Mannheim Search Join Engine and was used to present our submission to the Semantic Web Challenge 2014.
More information about the Semantic Web Challenge: http://challenge.semanticweb.org/
Paper about the Mannheim Search Join Engine: http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/Lehmberg-Ritze-Ristoski-Eckert-Paulheim-Bizer-TableExtension-SemanticWebChallenge-ISWC2014-Paper.pdf
Abstract:
This Big Data Track submission demonstrates how the BTC 2014 dataset, Microdata annotations from thousands of websites, as well as millions of HTML tables are used to extend local tables with additional columns. Table extension is a useful operation within a wide range of application scenarios: Imagine you are an analyst having a local table describing companies and you want to extend this table with the headquarters of each company. Or imagine you are a film lover and want to extend a table describing films with attributes like director, genre, and release date of each film. The Mannheim SearchJoin Engine automatically performs such table extension operations based on a large data corpus gathered from over a million websites that publish structured data in various formats. Given a local table, the SearchJoin Engine searches the corpus for additional data describing the entities of the input table. The discovered data are then joined with the local table and their content is consolidated using schema matching and data fusion methods. As a result, the user is presented with an extended table and given the opportunity to examine the provenance of the added data. Our experiments show that the Mannheim SearchJoin Engine achieves a coverage close to 100% and a precision of around 90% within different application scenarios.
Transcript
Slide 1
International Semantic Web Conference
Riva del Garda, Italy, 22.10.2014
Semantic Web Challenge – Big Data Track
Extending Tables with Data from over a Million Websites
Oliver Lehmberg, Dominique Ritze, Petar Ristoski,
Kai Eckert, Heiko Paulheim, Christian Bizer
Slide 2
Extend a local table with additional columns using different types of Web data.
Subject Column = Name of the entity
HTML tables: most unique string column, break ties by taking the leftmost.
Table generation from Linked Data and Microdata: generate one table per class and website
subject column: rdfs:label, foaf:name, x:name
we exploit common vocabularies
Rank  Film                   Studio     Director      Length
1.    Star Wars – Episode 1  Lucasfilm  George Lucas  121 min
2.    Alien                  Brandwine  Ridley Scott  117 min
3.    Black Moon             NEF        Louis Malle   100 min
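The subject column heuristic for HTML tables (most unique string column, ties broken by taking the leftmost) can be sketched roughly as follows. This is a minimal illustration, not the engine's actual code: the function names `is_numeric` and `detect_subject_column` are hypothetical, and the rule that a column counts as a string column unless more than half of its cells parse as numbers is an assumption made for this example.

```python
from typing import List, Optional


def is_numeric(value: str) -> bool:
    """Return True if the cell parses as a number (e.g. "1." or "42")."""
    try:
        float(value.strip())
        return True
    except ValueError:
        return False


def detect_subject_column(rows: List[List[str]]) -> Optional[int]:
    """Pick the string column with the highest share of distinct values."""
    if not rows:
        return None
    best_index: Optional[int] = None
    best_uniqueness = 0.0
    for col in range(len(rows[0])):
        values = [row[col].strip() for row in rows]
        # Assumption: a column is a string column unless more than half
        # of its cells parse as numbers.
        if sum(is_numeric(v) for v in values) * 2 > len(values):
            continue
        uniqueness = len(set(values)) / len(values)
        # Strict ">" keeps the leftmost column when uniqueness ties.
        if uniqueness > best_uniqueness:
            best_index, best_uniqueness = col, uniqueness
    return best_index


# Data rows of the example film table (header row omitted).
films = [
    ["1.", "Star Wars – Episode 1", "Lucasfilm", "George Lucas", "121 min"],
    ["2.", "Alien", "Brandwine", "Ridley Scott", "117 min"],
    ["3.", "Black Moon", "NEF", "Louis Malle", "100 min"],
]
print(detect_subject_column(films))  # prints 1, the "Film" column
```

In the film table every text column is fully unique, so the tie-break decides: "Rank" is skipped as numeric and "Film" wins as the leftmost string column.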
Slide 12
Indexed Tables
Selection Conditions:
1. Minimum size of 3 columns and 5 rows
2. Subject column detection successful
Total # of tables: 36.3 million
Total # of PLDs: ~ 1.5 million
Total # of triples: 3.0 billion
Slide 13
The Mannheim Search Joins Engine (MSJE)
1. Table Indexing: Collection of tables, Table Normalization, Table Storage, Table Index
2. Table Search: Input query table, Table Preprocessing, Search, Top k Candidates
3. Data Consolidation: MultiJoin, Consolidation, User Preferences, Data collection
Slide 14
The Search Operator
Table Ranking: subject column value overlap, extended Jaccard Similarity (FastJoin)
Select TopK Tables: 1000 tables in the single column experiments
The Search operator determines the set of relevant Web tables.
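A rough sketch of ranking by subject column value overlap is given below. Note the substitution: the slide names an extended Jaccard similarity (FastJoin), which also tolerates near-matching strings, whereas this example uses plain set-based Jaccard on normalized values to stay short. The names `normalize`, `jaccard`, and `search_top_k`, and the dictionary-based corpus layout, are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def normalize(value: str) -> str:
    """Lower-case and collapse whitespace so trivially different labels match."""
    return " ".join(value.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Plain Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search_top_k(
    query_values: List[str], corpus: Dict[str, List[str]], k: int
) -> List[Tuple[str, float]]:
    """Rank tables by overlap of their subject column with the query's."""
    query_set = {normalize(v) for v in query_values}
    scored = []
    for table_id, subject_values in corpus.items():
        score = jaccard(query_set, {normalize(v) for v in subject_values})
        if score > 0:  # tables without any overlap are not relevant
            scored.append((table_id, score))
    # Highest score first; table id as a deterministic tie-breaker.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Hypothetical corpus: table id -> values of that table's subject column.
corpus = {
    "t1": ["alsace", "Lorraine", "Centre"],
    "t2": ["Alsace", "Bavaria"],
    "t3": ["Tokyo"],
}
query = ["Alsace", "Lorraine", "Guadeloupe", "Centre"]
top = search_top_k(query, corpus, k=2)
```

Here `t1` shares three of four distinct values with the query and ranks first, `t2` shares one, and `t3` is dropped because it has no overlap.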
Slide 15
Multi-Join Operator
The MultiJoin operator performs a series of left-outer joins between the query table and all tables in the input set.
No.  Region      Unemploy  Unemploy  GDP       GDP per C
1    Alsace      11 %      NULL      45.914 €  45.000 €
2    Lorraine    12 %      NULL      51.233 €  NULL
3    Guadeloupe  28 %      NULL      NULL      19.000 €
4    Centre      10 %      9.4 %     NULL      59.500 €
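The series of left-outer joins can be illustrated with a small sketch. Several things here are assumptions for the example rather than facts from the slides: the function name `multi_join`, the exact-match join on the subject column, the `#index` suffix used to keep equally labelled columns apart, and the way the four added columns are split across four web tables.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def multi_join(query: List[Row], key: str, web_tables: List[List[Row]]) -> List[Row]:
    """Left-outer join the query table with each web table in turn."""
    result = [dict(row) for row in query]  # copy, keep every query row
    for index, table in enumerate(web_tables):
        lookup: Dict[Optional[str], Row] = {}
        columns: List[str] = []
        for row in table:
            lookup.setdefault(row[key], row)  # first matching row wins
            for name in row:
                if name != key and name not in columns:
                    columns.append(name)
        for out in result:
            match = lookup.get(out[key])
            for name in columns:
                # Unmatched rows get None, the NULL of a left-outer join.
                out[f"{name}#{index}"] = match.get(name) if match else None
    return result


query = [
    {"No.": "1", "Region": "Alsace"},
    {"No.": "2", "Region": "Lorraine"},
    {"No.": "3", "Region": "Guadeloupe"},
    {"No.": "4", "Region": "Centre"},
]
# Assumed split of the example values over four web tables.
web_tables = [
    [{"Region": "Alsace", "Unemploy": "11 %"}, {"Region": "Lorraine", "Unemploy": "12 %"},
     {"Region": "Guadeloupe", "Unemploy": "28 %"}, {"Region": "Centre", "Unemploy": "10 %"}],
    [{"Region": "Centre", "Unemploy": "9.4 %"}],
    [{"Region": "Alsace", "GDP": "45.914 €"}, {"Region": "Lorraine", "GDP": "51.233 €"}],
    [{"Region": "Alsace", "GDP per C": "45.000 €"}, {"Region": "Guadeloupe", "GDP per C": "19.000 €"},
     {"Region": "Centre", "GDP per C": "59.500 €"}],
]
extended = multi_join(query, "Region", web_tables)
```

The result keeps all four query rows and adds one sparsely filled column per joined web table, which is the wide intermediate table that a later consolidation step would merge.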
Slide 16
Consolidation Operator
Column Matching: combination of label- and instance-based techniques
Conflict Resolution:
Strings: majority vote
Numeric values: average, median, clustering and vote
The consolidation operator merges corresponding columns and fuses values in order to return a concise result table.
No  Region      Unemploy  GDP
1   Alsace      11 %      45.914 €
2   Lorraine    12 %      51.233 €
3   Guadeloupe  28 %      19.000 €
4   Centre      10 %      59.500 €
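Two of the listed conflict resolution strategies, majority vote for strings and average or median for numeric values, can be sketched as below. The function names `fuse_strings` and `fuse_numbers` are hypothetical, the "clustering and vote" strategy is not shown, and the sketch does not claim to reproduce the fused values in the example table above.

```python
from collections import Counter
from statistics import median
from typing import List, Optional


def fuse_strings(values: List[Optional[str]]) -> Optional[str]:
    """Majority vote over the non-missing string values."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    # most_common keeps first-seen order among equally frequent values.
    return Counter(present).most_common(1)[0][0]


def fuse_numbers(values: List[Optional[float]], strategy: str = "median") -> Optional[float]:
    """Fuse the non-missing numeric values by average or median."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    if strategy == "average":
        return sum(present) / len(present)
    if strategy == "median":
        return median(present)
    raise ValueError(f"unknown strategy: {strategy}")


director = fuse_strings(["Ridley Scott", "Ridley Scott", "R. Scott", None])
unemployment = fuse_numbers([11.0, 12.0, 28.0], strategy="median")
gdp = fuse_numbers([45914.0, 45000.0, None], strategy="average")
```

Missing values (NULL after the joins) are ignored rather than counted as votes, so a column filled by only one web table simply passes its value through.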
Slide 17
http://searchjoins.webdatacommons.org
Slide 18
Result: Extend with Single Column
Slide 19
Provenance Summary
Slide 20
Provenance Details
Slide 21
Evaluation Results
Attributes shown in the results chart: Author, Headquarter, Industry, Area, Capital, Code, Currency, Population, Ingredient, Cast, Director, Genre, Year, Artist, Team