Towards Automatic Structured Web Data Extraction System Tomas Grigalis, 2nd year PhD student Scientific supervisor: prof. habil. dr. Antanas Čenys

Dec 25, 2015

Page 1

Towards Automatic Structured Web Data Extraction System

Tomas Grigalis, 2nd year PhD student

Scientific supervisor: prof. habil. dr. Antanas Čenys

Page 2

Outline

• Introduction

• The ClustVX approach

• Experiments

• Conclusions

Page 3

Structured Web Data

Page 4

Structured Web Data

Title | Model | Price | <...>
Fuji FinePix Z110EXR 14MP | 562/6283 | £119.99
Fujifilm XP30 14MP Waterproof | 559/5101 | £129.99
Samsung ST200F Smart | 559/7635 | £111.99
<...>

Database

Table with structured data

Data Record

Browser

Rendered view in a web browser

Web server

Page 5

The GOAL

Structured data

Unsupervised and domain-independent structured web data extraction system

Web pages with structured data

Page 6

Key Problems

• Web pages that look alike often have entirely different underlying HTML source code

• There are millions of web pages, each with its own design and HTML source code

• Web 2.0 introduced asynchronous JavaScript HTTP requests (AJAX), which modify the HTML source code on the fly

Page 7

The ClustVX approach

ClustVX is based on two fundamental observations:

1) A vast amount of information on the Web is presented using fixed templates filled with data from underlying databases.

2) Although the templates and underlying data differ from site to site, humans understand them easily by analyzing repeating visual patterns on a given Web page.

Page 8

HTML TREE

Page 9

Repeating patterns in HTML TREE (1st observation)
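
One way to make the first observation concrete is to count how often each root-to-element tag path occurs in a page: paths produced by a template repeat once per data record. The following is a minimal sketch under that assumption, not the ClustVX implementation; the names TagPathCounter and repeating_paths are hypothetical, and only Python's standard html.parser module is used.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCounter(HTMLParser):
    """Counts how often each root-to-element tag path occurs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, root first
        self.counts = Counter()  # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        path = "/" + "/".join(self.stack + [tag])
        self.counts[path] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def repeating_paths(html, min_count=2):
    """Return the tag paths that occur at least min_count times."""
    parser = TagPathCounter()
    parser.feed(html)
    return {p: n for p, n in parser.counts.items() if n >= min_count}


page = (
    "<html><body><ul>"
    "<li><b>A</b><span>1</span></li>"
    "<li><b>B</b><span>2</span></li>"
    "<li><b>C</b><span>3</span></li>"
    "</ul></body></html>"
)
print(repeating_paths(page))
```

In this toy page only the paths under the list item repeat, which is the kind of signal a template-generated record list would give.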

Page 10

Data with the same semantic meaning is displayed in the same visual style (2nd observation)

PRICE

Page 11

ClustVX: First, cluster visually similar web page elements
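
The clustering step can be illustrated by grouping elements that share a visual signature. This is only a sketch: it assumes the rendered style of each element has already been obtained from a browser engine, and the property names in STYLE_KEYS and the function cluster_by_style are hypothetical rather than taken from ClustVX. The sample texts come from the camera listing shown earlier; the style values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical set of rendered style properties used as the visual signature.
STYLE_KEYS = ("font_family", "font_size", "font_weight", "color")


def cluster_by_style(elements):
    """Group element texts by their visual signature, keeping document order."""
    clusters = defaultdict(list)
    for el in elements:
        signature = tuple(el.get(key) for key in STYLE_KEYS)
        clusters[signature].append(el["text"])
    return dict(clusters)


# Illustrative style values (assumptions, not measured from a real page).
title_style = {"font_family": "Arial", "font_size": 14, "font_weight": "bold", "color": "#003366"}
price_style = {"font_family": "Arial", "font_size": 16, "font_weight": "bold", "color": "#cc0000"}

elements = [
    {"text": "Fuji FinePix Z110EXR 14MP", **title_style},
    {"text": "£119.99", **price_style},
    {"text": "Fujifilm XP30 14MP Waterproof", **title_style},
    {"text": "£129.99", **price_style},
    {"text": "Samsung ST200F Smart", **title_style},
    {"text": "£111.99", **price_style},
]

for signature, texts in cluster_by_style(elements).items():
    print(signature, texts)
```

With these inputs all titles fall into one cluster and all prices into another, even though they are interleaved in document order.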

Page 12

ClustVX: Second, analyze clusters to identify data records
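
The slide does not spell out how clusters are analyzed, so the following is a deliberately simple sketch under a strong assumption: clusters that belong to the record list all have the same size, and their i-th elements together form the i-th data record. The function records_from_clusters and the field names are hypothetical and are not part of ClustVX.

```python
from collections import Counter


def records_from_clusters(clusters):
    """Align equally sized clusters index by index into data records.

    clusters maps a field name to its element texts in document order.
    Clusters whose size differs from the most common size are ignored.
    """
    if not clusters:
        return []
    sizes = Counter(len(texts) for texts in clusters.values())
    record_count, _ = sizes.most_common(1)[0]
    fields = [name for name, texts in clusters.items() if len(texts) == record_count]
    return [{name: clusters[name][i] for name in fields} for i in range(record_count)]


clusters = {
    "title": ["Fuji FinePix Z110EXR 14MP", "Fujifilm XP30 14MP Waterproof", "Samsung ST200F Smart"],
    "price": ["£119.99", "£129.99", "£111.99"],
    "page_heading": ["Digital cameras"],  # hypothetical non-record cluster
}

for record in records_from_clusters(clusters):
    print(record)
```

Real pages have optional fields and nested records, so a production system would need a more robust alignment than this index-based pairing.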

Page 13

Experiments: Data Sets

• To evaluate the ClustVX approach, we use three publicly available benchmark data sets containing a total of 7098 data records.

• These data sets contain web search result pages generated from databases

Page 14

Experiments: Evaluation

• We use precision and recall, two measures widely used in the information retrieval field, to evaluate the performance of the ClustVX system
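
For reference, precision is the share of extracted records that are correct, and recall is the share of true records that were extracted. A minimal sketch of the computation follows; the function name precision_recall and the record identifiers are hypothetical, and records are assumed to be comparable by identity.

```python
def precision_recall(extracted, ground_truth):
    """Return (precision, recall) for a set of extracted data records."""
    extracted, ground_truth = set(extracted), set(ground_truth)
    correct = len(extracted & ground_truth)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical example: 4 records extracted, 3 of them among the 5 true records.
extracted = ["r1", "r2", "r3", "x9"]
ground_truth = ["r1", "r2", "r3", "r4", "r5"]
print(precision_recall(extracted, ground_truth))  # (0.75, 0.6)
```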

Page 15

Experiments: Results

• We compare the evaluation results of the ClustVX system with those of other state-of-the-art automatic structured web data extraction systems.

• As shown in the following table, where the best results are marked in bold, ClustVX consistently outperforms the other approaches.

Page 16

Conclusions

• We presented the ClustVX system, which extracts structured data by exploiting visual and structural features of web page elements.

• A preliminary evaluation of ClustVX on three publicly available benchmark data sets showed that our method can achieve very high precision and recall.

• Our future work will concentrate on creating a new, large benchmark data set to test the applicability of the system in real-world settings.

Page 17

Thank you,

Questions?