Data Extraction From HTML Tables Cui Tao Department of Computer Science Brigham Young University

Dec 20, 2015

Page 1

Data Extraction From HTML Tables

Cui Tao

Department of Computer Science

Brigham Young University

Page 2

Information In Tables

Nowadays, a significant portion of the information on the Web is stored in tables.

Page 3

The Ontology-Based Extraction

Page 4

The Ontology-Based Extraction

Page 5

Major Problems

In tables, values and their corresponding attributes are stored separately, but the ontology can extract data only when they appear together.

Sometimes the attributes in the table are values in the database, and the values in the table serve only as identifiers of those attributes.

Sometimes the value in a single table cell provides several attribute values in the database.

Page 6

Attribute-Value Pair

Attribute: (part of the) constant/keyword rule

Page 7

How To Solve This Problem?

Put the attribute-value pair together. Try both orders.
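
The pairing step can be sketched as follows. This is only an illustration, assuming the table has already been parsed into a header row and data rows; the function name `pair_attribute_values` and the sample data are hypothetical and do not come from the slides. Each cell is written out twice, once as "attribute value" and once as "value attribute", so that a keyword/constant rule can match whichever order it expects.

```python
def pair_attribute_values(headers, rows):
    """For each row, join every attribute with its cell value in both orders.

    headers: list of column attribute names (assumed already extracted).
    rows:    list of rows, each a list of cell strings aligned with headers.
    Returns one list of text snippets per row.
    """
    records = []
    for row in rows:
        snippets = []
        for attribute, value in zip(headers, row):
            value = value.strip()
            if not value:
                continue  # an empty cell has nothing to pair with
            snippets.append(f"{attribute} {value}")  # attribute first
            snippets.append(f"{value} {attribute}")  # value first
        records.append(snippets)
    return records


# Hypothetical sample table, for illustration only.
headers = ["Year", "Make", "Price"]
rows = [["1999", "Honda", "$4,500"]]

for snippet in pair_attribute_values(headers, rows)[0]:
    print(snippet)
```

Emitting both orders doubles the text handed to the extractor, but it avoids having to know in advance which order each rule was written for.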

Page 8

More General…

Page 9

The attributes in the table are actually values in the database…

Attribute Value

Page 10

How To Solve This Problem?

Put the attribute in the file depending on the Boolean value.
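
A minimal sketch of this rule, assuming the table cells hold yes/no style markers. The marker set `TRUE_MARKERS`, the function name `attributes_marked_true`, and the sample columns are assumptions made for illustration, not details from the slides. An attribute name is kept for the output only when its cell reads as true.

```python
# Assumed set of cell contents that count as "true"; a real table may use others.
TRUE_MARKERS = {"yes", "y", "x", "true", "1"}


def attributes_marked_true(headers, row, true_markers=TRUE_MARKERS):
    """Return the attribute names whose cell in this row holds a true marker."""
    return [
        attribute
        for attribute, cell in zip(headers, row)
        if cell.strip().lower() in true_markers
    ]


# Hypothetical sample row, for illustration only.
headers = ["Air Conditioning", "Sunroof", "CD Player"]
row = ["Yes", "No", "X"]

print(attributes_marked_true(headers, row))
```

The returned names can then be written to the text given to the extractor, so that column headings become the values the database expects.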

Page 11

Values Carrying Multiple Pieces of Information

Page 12

More Problems …