Towards Domain-Independent Information Extraction from Web Tables Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krupl, and Bernhard Pollak Presented by Aaron Stewart BYU CS 652 Table Extraction Using Spatial Reasoning in the CSS2 Visual Box Model Database and Artificial Intelligence Group Vienna University of Technology, Austria Wolfgang Gatterbauer and Paul Bohunsky
Towards Domain-Independent Information Extraction from Web Tables
Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krupl, and Bernhard Pollak
Presented by Aaron Stewart
BYU CS 652
Table Extraction Using Spatial Reasoning in the CSS2 Visual Box Model
Database and Artificial Intelligence Group
Vienna University of Technology, Austria
Wolfgang Gatterbauer and Paul Bohunsky
Contributions
1. Classify visually structured data
2. Non-tree IE formalism
3. Argue to defer semantic interpretation of output
4. Ground truthing method
5. Web table test set
6. Visual results
Introduction
Source: Gatterbauer et al. 2007
Visually Structured Data on the Web
• Tables
• Lists
• Aligned Graphs
Visually Structured Data on the Web
Source: Gatterbauer et al. 2007
Formal Setup
• DOM Tree Representation
• Visual Box Representation
  – Visualized Element Nodes (VENs)
    • DOM nodes with bounding boxes
  – Visualized Words
    • Text words with bounding boxes
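The two visual box types listed above can be pictured as plain records that pair content with a rectangle. The following is only an illustrative sketch, not the authors' implementation: the names `Box`, `VisualizedWord`, `VisualizedElementNode`, and `left_aligned`, the pixel coordinates, and the 1-pixel tolerance are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page pixels, origin at the top-left (assumed)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Box") -> bool:
        """True if `other` lies entirely inside this box."""
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


@dataclass
class VisualizedWord:
    """A text word together with its rendered bounding box."""
    text: str
    box: Box


@dataclass
class VisualizedElementNode:
    """A DOM element node together with its rendered bounding box."""
    tag: str
    box: Box
    children: List["VisualizedElementNode"] = field(default_factory=list)
    words: List[VisualizedWord] = field(default_factory=list)


def left_aligned(a: Box, b: Box, tol: float = 1.0) -> bool:
    """Hypothetical spatial test: do two boxes share a left edge within `tol` pixels?"""
    return abs(a.left - b.left) <= tol


# Two stacked cells, as a browser might lay out one table column.
cell_a = VisualizedElementNode(
    "td", Box(10, 10, 110, 30),
    words=[VisualizedWord("Price", Box(12, 12, 50, 28))],
)
cell_b = VisualizedElementNode("td", Box(10, 30, 110, 50))
```

Keeping the rectangle on every node means spatial relations such as alignment or containment can be checked on the boxes alone, without consulting the nodes' positions in the DOM tree.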
Formal Setup
Source: Gatterbauer et al. 2007
Information Extraction
• Visualized Element Nodes Table extraction (VENTex)