Schema Matching and Data Extraction over HTML Tables

Jan 18, 2016

Transcript
Page 1: Schema Matching and Data Extraction over HTML Tables

Schema Matching and Data Extraction over HTML Tables

Cui Tao

Data Extraction Research Group

Department of Computer Science

Brigham Young University

supported by NSF

Page 2: Schema Matching and Data Extraction over HTML Tables

Introduction

Many tables on the Web
How to integrate data stored in different tables?
  Detect the table of interest
  Form attribute-value pairs (adjust if necessary)
  Do extraction
  Infer mappings from extraction patterns

Page 3: Schema Matching and Data Extraction over HTML Tables

Problem: Detecting the Table of Interest

Page 4: Schema Matching and Data Extraction over HTML Tables

Problem: Different Schemas

Different source table schemas
  {Run #, Yr, Make, Model, Tran, Color, Dr}
  {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
  {Vehicle, Distance, Price, Mileage}
  {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
Target database schema
  {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Page 5: Schema Matching and Data Extraction over HTML Tables

Problem: Attribute is Value

Page 6: Schema Matching and Data Extraction over HTML Tables

Problem: Attribute-Value is Value

Page 7: Schema Matching and Data Extraction over HTML Tables

Problem: Value is not Value

Page 8: Schema Matching and Data Extraction over HTML Tables

Problem: Factored Values

Page 9: Schema Matching and Data Extraction over HTML Tables

Problem: Split Values

Page 10: Schema Matching and Data Extraction over HTML Tables

Problem: Merged Values

Page 11: Schema Matching and Data Extraction over HTML Tables

Problem: Information Behind Links

Single-column table (formatted as list)
Table extending over several pages

Page 12: Schema Matching and Data Extraction over HTML Tables

Solution

Detect the table of interest
Form attribute-value pairs (adjust if necessary)
Do extraction
Infer mappings from extraction patterns

Page 13: Schema Matching and Data Extraction over HTML Tables

Solution: Detect the Table of Interest

‘Real’ table test
  Same number of values
  Table size
Attribute test
Density measure test
  (# of ontology-extracted values) / (total # of values in the table)
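
The density measure above is a simple ratio, and a minimal sketch may make it concrete. The function names, the toy recognizer, and the sample cells are all assumptions made for illustration. The slides do not show how the ontology recognizes values or what threshold is applied.

```python
def density(table_values, recognizes):
    """Return the fraction of non-empty table values the recognizer accepts."""
    values = [v for v in table_values if v.strip()]
    if not values:
        return 0.0
    hits = sum(1 for v in values if recognizes(v))
    return hits / len(values)


# Hypothetical stand-in for an extraction ontology's value recognizers.
KNOWN_MAKES = {"honda", "ford", "toyota"}


def toy_recognizer(value):
    """Accept four-digit numbers (as years) and a few known car makes."""
    v = value.strip().lower()
    is_year = v.isdigit() and len(v) == 4
    return is_year or v in KNOWN_MAKES


cells = ["Honda", "1995", "Civic EX", "Ford", "2000", "n/a"]
score = density(cells, toy_recognizer)  # 4 of the 6 values are recognized
```

A table whose score is high relative to other tables on the page would be the likelier table of interest. Where to set the cut-off is left open here.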

Page 14: Schema Matching and Data Extraction over HTML Tables

Solution: Remove Factoring

[Figure: table in which the factored year values (2001, 2000, 1999) are repeated in every row they apply to]
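
Removing factoring amounts to copying a shared value down into each row it covers. The sketch below assumes the table has already been parsed into a list of rows and that a factored cell shows up as an empty string or `None`. The function name `unfactor` and the sample rows are hypothetical, and the slides do not say how factored cells are detected.

```python
def unfactor(rows, column):
    """Copy the last seen value of `column` into rows where that cell is blank."""
    filled = []
    last = None
    for row in rows:
        row = list(row)  # work on a copy so the input is left untouched
        if row[column] in ("", None):
            row[column] = last
        else:
            last = row[column]
        filled.append(row)
    return filled


rows = [
    ["2001", "Honda"],
    ["", "Ford"],
    ["2000", "Toyota"],
    ["", "Honda"],
]
result = unfactor(rows, 0)
```

After the call, every row carries its own year, so each row can be read as a complete record.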

Page 15: Schema Matching and Data Extraction over HTML Tables

Solution: Replace Boolean Values

Page 16: Schema Matching and Data Extraction over HTML Tables

Solution: Form Attribute-Value Pairs

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
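
The pairs above suggest that a Boolean cell marked true is replaced by its attribute name (for example `<Auto, Auto>`), while a cell marked false yields no pair. The sketch below illustrates that reading. The marker sets, the function name `form_pairs`, and the sample row are assumptions, not taken from the slides.

```python
# Assumed spellings of Boolean markers; real tables may use others.
BOOLEAN_TRUE = {"yes", "y", "x", "true"}
BOOLEAN_FALSE = {"no", "n", "false", ""}


def form_pairs(header, row):
    """Pair each attribute with its value, replacing Boolean markers."""
    pairs = []
    for attribute, value in zip(header, row):
        marker = value.strip().lower()
        if marker in BOOLEAN_TRUE:
            # A true marker becomes the attribute name itself.
            pairs.append((attribute, attribute))
        elif marker in BOOLEAN_FALSE:
            # A false or empty cell contributes no pair.
            continue
        else:
            pairs.append((attribute, value))
    return pairs


header = ["Make", "Model", "Year", "Auto", "CD"]
row = ["Honda", "Civic EX", "1995", "Yes", "No"]
pairs = form_pairs(header, row)
```

Here `pairs` holds `("Auto", "Auto")` and has no entry for `CD`, which mirrors the shape of the pairs listed on the slide.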

Page 17: Schema Matching and Data Extraction over HTML Tables

Solution: Adjust Attribute-Value Pairs

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>

Page 18: Schema Matching and Data Extraction over HTML Tables

Solution: Add Information Hidden Behind Links

Unstructured and semi-structured: concatenate
Single attribute-value pairs: pair them together
  <Price, $7,988>, <Mileage, 63,168 miles>, <Body Type, Car>, <Body Style, 4 DR Sedan>, <Transmission, Automatic>, <Engine, 3.0 L V-6>, <Doors, 4>, <Fuel Type, Gas>, <Stock Number, 22764>, <VIN, 1FAFP52U2WA139879>
List: mark the beginning and the end

Page 19: Schema Matching and Data Extraction over HTML Tables

Solution: Inferred Mapping Creation

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Page 20: Schema Matching and Data Extraction over HTML Tables

Solution: Inferred Mapping Creation

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Each row is a car.
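
One way to picture inferring mappings from extraction patterns is to let recognizers for the target attributes vote on each source column. This is only an illustrative sketch and not the authors' algorithm. The regular expressions, the 0.5 threshold, the function name `infer_mappings`, and the sample table are all assumptions.

```python
import re
from collections import Counter

# Hypothetical recognizers standing in for an extraction ontology.
RECOGNIZERS = {
    "Year": re.compile(r"^(19|20)\d{2}$"),
    "Price": re.compile(r"^\$\d[\d,]*$"),
    "Mileage": re.compile(r"^\d[\d,]*\s*miles$", re.IGNORECASE),
}


def infer_mappings(header, rows, threshold=0.5):
    """Map each source attribute to the target attribute most of its values match."""
    mappings = {}
    for index, source_attribute in enumerate(header):
        votes = Counter()
        for row in rows:
            for target, pattern in RECOGNIZERS.items():
                if pattern.match(row[index].strip()):
                    votes[target] += 1
        if votes:
            target, count = votes.most_common(1)[0]
            # Keep the mapping only if enough rows (one row per car) agree.
            if count / len(rows) >= threshold:
                mappings[source_attribute] = target
    return mappings


header = ["Yr", "Cost", "Distance"]
rows = [
    ["1995", "$6,300", "63,168 miles"],
    ["2000", "$7,988", "41,000 miles"],
]
mappings = infer_mappings(header, rows)
```

With this sample, `Yr` maps to `Year`, `Cost` to `Price`, and `Distance` to `Mileage`, even though none of the source attribute names match the target schema.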

Page 26: Schema Matching and Data Extraction over HTML Tables

Experimental Results

Car Advertisement application domain
10 “training” tables
  100% of the 57 mappings (no false mappings)
  94.6% precision of the values in linked pages (5.4% false declarations)
50 test tables
  94.7% of the 300 mappings (no false mappings)
  On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision

Page 27: Schema Matching and Data Extraction over HTML Tables

Other Applications

Cell Phone Plan application domain
Soccer Player application domain

Page 28: Schema Matching and Data Extraction over HTML Tables

Contribution

Provides an approach to extract information automatically from HTML tables
Suggests a different way to solve the problem of schema matching