ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites Valter Crescenzi Giansalvatore Mecca Paolo Merialdo VLDB 2001

Dec 21, 2015

Page 1:

ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Valter Crescenzi

Giansalvatore Mecca

Paolo Merialdo

VLDB 2001

Page 2:

Overview

Automatically generates a wrapper from large structured Web pages
Supports nested structures
Efficient approach to large, complex pages with regular structures

Page 3:

Approach

Given a set of example pages
Generate a Union-free Regular Expression (UFRE)
Find the least upper bound on the RE lattice to generate a wrapper
Reduces to finding the least upper bound of two UFREs
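
To make the idea of a union-free regular expression used as a wrapper more concrete, here is a minimal sketch. It models a UFRE as a tree with four node kinds (fixed token, #PCDATA field, optional, iterator) and compiles it to a Python `re` pattern. The class names, the `compile_ufre` helper, and the sample pages are illustrative assumptions and do not come from the slides.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Tok:
    """A fixed token of the template (a tag or a constant string)."""
    text: str


@dataclass
class PCData:
    """A data field, written #PCDATA in the slides."""


@dataclass
class Opt:
    """An optional sub-expression: (...)?"""
    body: List["Node"]


@dataclass
class Plus:
    """An iterated sub-expression: (...)+"""
    body: List["Node"]


Node = Union[Tok, PCData, Opt, Plus]


def compile_ufre(nodes: List[Node]) -> str:
    """Translate a UFRE tree into a Python regex string.

    Only concatenation, '?' and '+' are produced; the alternation
    operator '|' is never emitted, which is what "union-free" means.
    """
    parts = []
    for n in nodes:
        if isinstance(n, Tok):
            parts.append(re.escape(n.text))
        elif isinstance(n, PCData):
            parts.append(r"[^<]+")  # assumption: a field is any run of non-tag text
        elif isinstance(n, Opt):
            parts.append("(?:" + compile_ufre(n.body) + ")?")
        elif isinstance(n, Plus):
            parts.append("(?:" + compile_ufre(n.body) + ")+")
    return "".join(parts)


# Hypothetical wrapper: <b>#PCDATA</b>(<img/>)?<ul>(<li>#PCDATA</li>)+</ul>
wrapper = [
    Tok("<b>"), PCData(), Tok("</b>"),
    Opt([Tok("<img/>")]),
    Tok("<ul>"), Plus([Tok("<li>"), PCData(), Tok("</li>")]), Tok("</ul>"),
]
pattern = re.compile(compile_ufre(wrapper))

page_a = "<b>John Smith</b><ul><li>DB Primer</li></ul>"
page_b = "<b>Paul Jones</b><img/><ul><li>XML at Work</li><li>HTML Scripts</li></ul>"
print(bool(pattern.fullmatch(page_a)), bool(pattern.fullmatch(page_b)))
```

Both hypothetical pages are accepted by the same expression, which is the sense in which the wrapper is an upper bound of the sample pages.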

Page 4:

Matching/Mismatching

Start with the first page and create an RE that defines the wrapper
Match each successive sample against the wrapper
Mismatches result in generalizations of the regular expression
Types of mismatches
– String mismatches
– Tag mismatches
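
The matching step can be sketched as a token-by-token comparison that stops at the first mismatch and classifies it. This is only an illustration under simplifying assumptions: the tokenizer is a plain regex, the wrapper is a flat token list, and the names `tokenize` and `first_mismatch` as well as the sample pages are hypothetical.

```python
import re
from typing import List, Optional, Tuple

TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html: str) -> List[str]:
    """Split HTML into tag tokens and string tokens, dropping whitespace-only text."""
    return [t.strip() for t in TOKEN_RE.findall(html) if t.strip()]


def is_tag(token: str) -> bool:
    return token.startswith("<")


def first_mismatch(wrapper: List[str], sample: List[str]) -> Optional[Tuple[int, str]]:
    """Return (position, kind) of the first mismatch, or None if the sample matches.

    kind is "string" when both tokens are text, otherwise "tag".
    A "#PCDATA" token in the wrapper matches any text token.
    """
    for i, (w, s) in enumerate(zip(wrapper, sample)):
        if w == s or (w == "#PCDATA" and not is_tag(s)):
            continue
        kind = "string" if not is_tag(w) and not is_tag(s) else "tag"
        return i, kind
    if len(wrapper) != len(sample):
        # One side ran out of tokens: treated here as a tag mismatch.
        return min(len(wrapper), len(sample)), "tag"
    return None


wrapper = tokenize("<html>Books of:<b>John Smith</b><ul><li>DB Primer</li></ul></html>")
sample = tokenize("<html>Books of:<b>Paul Jones</b><img/><ul><li>XML at Work</li></ul></html>")

print(first_mismatch(wrapper, sample))  # string mismatch on the author name
wrapper[3] = "#PCDATA"                  # generalize the field
print(first_mismatch(wrapper, sample))  # next mismatch is a tag mismatch
```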

Page 5:

Example Pages

Page 6:

Example

#PCDATA

String mismatches are used to discover fields of the documents

Wrapper is generated by replacing “John Smith” with #PCDATA
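
A minimal sketch of this field-discovery step, assuming the two token lists already agree on their tags, so only text tokens can differ. The function name `generalize_strings` and the second sample page are hypothetical.

```python
from typing import List

PCDATA = "#PCDATA"


def generalize_strings(wrapper: List[str], sample: List[str]) -> List[str]:
    """Replace every text token that differs between wrapper and sample with #PCDATA.

    Assumes both lists have the same tag structure; tag mismatches
    (optionals, iterators) must be resolved by other means first.
    """
    if len(wrapper) != len(sample):
        raise ValueError("token lists differ in length: resolve tag mismatches first")
    result = []
    for w, s in zip(wrapper, sample):
        if w.startswith("<") or s.startswith("<"):
            if w != s:
                raise ValueError(f"tag mismatch: {w!r} vs {s!r}")
            result.append(w)
        elif w == s:
            result.append(w)       # same text on both pages: part of the template
        else:
            result.append(PCDATA)  # different text: a data field
    return result


page1 = ["<html>", "Books of:", "<b>", "John Smith", "</b>", "</html>"]
page2 = ["<html>", "Books of:", "<b>", "Paul Jones", "</b>", "</html>"]
print(generalize_strings(page1, page2))
```

Text that is identical on both pages ("Books of:") stays in the template, while the differing author name becomes a field.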

Page 7:

Example (Cont.)

#PCDATA

Tag Mismatches: Discovering Optionals
First check to see if mismatch is caused by an iterator
If not, could be an optional field in wrapper or sample
Cross search used to determine possible optionals
Image field determined to be optional
– (<img src=…/>)?
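
The cross search can be sketched as follows. At a tag mismatch, look for the wrapper's token further along in the sample, and for the sample's token further along in the wrapper; whichever side has to be skipped is a candidate optional. Picking the shorter skip when both searches succeed is an assumption of this sketch, and the names `find_optional` and `add_optional`, the `("opt", ...)` encoding, and the sample tokens are hypothetical.

```python
from typing import List, Optional, Tuple


def find_optional(wrapper: list, sample: list, i: int, j: int) -> Optional[Tuple[str, int]]:
    """Cross search at a tag mismatch wrapper[i] != sample[j].

    Returns ("sample", k) if skipping sample[j:k] reaches wrapper[i],
    ("wrapper", k) if skipping wrapper[i:k] reaches sample[j], else None.
    """
    candidates = []
    if wrapper[i] in sample[j + 1:]:
        k = sample.index(wrapper[i], j + 1)
        candidates.append((k - j, "sample", k))
    if sample[j] in wrapper[i + 1:]:
        k = wrapper.index(sample[j], i + 1)
        candidates.append((k - i, "wrapper", k))
    if not candidates:
        return None
    _, side, k = min(candidates)  # assumption: prefer the shorter skipped region
    return side, k


def add_optional(wrapper: list, sample: list, i: int, j: int) -> Optional[list]:
    """Return a new wrapper with the skipped region marked as ("opt", tokens)."""
    found = find_optional(wrapper, sample, i, j)
    if found is None:
        return None
    side, k = found
    if side == "sample":
        # The tokens exist only on the sample: insert them into the wrapper as optional.
        return wrapper[:i] + [("opt", tuple(sample[j:k]))] + wrapper[i:]
    # The tokens exist only on the wrapper: make them optional in place.
    return wrapper[:i] + [("opt", tuple(wrapper[i:k]))] + wrapper[k:]


wrapper = ["<b>", "#PCDATA", "</b>", "<ul>", "<li>", "#PCDATA", "</li>", "</ul>"]
sample = ["<b>", "Paul Jones", "</b>", "<img/>", "<ul>", "<li>", "XML at Work", "</li>", "</ul>"]
print(add_optional(wrapper, sample, 3, 3))
```

Here the wrapper's `<ul>` is found one token later in the sample, so the skipped `<img/>` is added to the wrapper as an optional.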

Page 8:

Example (Cont.)

#PCDATA

(<IMG src=…/>)?

Page 9:

Example (Cont.)

#PCDATA

(<IMG src=…/>)?

#PCDATA

#PCDATA

Tag Mismatches: Discovering Iterators
Assume mismatch is caused by repeated elements in a list
Match possible squares against earlier squares
Generalize the wrapper by finding all contiguous repeated occurrences
– (<li><i>Title:</i>#PCDATA</li>)+
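
A rough sketch of iterator discovery for the case where the extra repetition sits on the sample side. The token just before the mismatch is taken as the last token of the candidate square, the candidate is matched backwards against the preceding wrapper tokens, and contiguous repeats are then collapsed into one iterator. The names `find_square` and `collapse`, the `("plus", ...)` encoding, and the sample tokens are hypothetical, and a square on the wrapper side is not handled.

```python
from typing import List, Optional


def is_tag(token: str) -> bool:
    return token.startswith("<")


def find_square(wrapper: List[str], sample: List[str], i: int, j: int) -> Optional[List[str]]:
    """At a tag mismatch wrapper[i] != sample[j], look for a repeated block in the sample."""
    if i == 0:
        return None
    terminal = wrapper[i - 1]            # last token of the candidate square
    try:
        end = sample.index(terminal, j)  # first terminal token at or after j
    except ValueError:
        return None
    candidate = sample[j:end + 1]
    n = len(candidate)
    if n > i:
        return None
    square = []
    # Match the candidate against the wrapper tokens just before the mismatch.
    for w, s in zip(wrapper[i - n:i], candidate):
        if w == s:
            square.append(w)
        elif not is_tag(w) and not is_tag(s):
            square.append("#PCDATA")     # differing text inside the square is a field
        else:
            return None
    return square


def collapse(tokens: list, square: List[str]) -> list:
    """Replace each run of contiguous blocks matching `square` with ("plus", square)."""
    n = len(square)

    def matches(block: list) -> bool:
        return len(block) == n and all(
            q == t or (q == "#PCDATA" and isinstance(t, str) and not is_tag(t))
            for q, t in zip(square, block)
        )

    out, k = [], 0
    while k < len(tokens):
        if matches(tokens[k:k + n]):
            while matches(tokens[k:k + n]):
                k += n
            out.append(("plus", tuple(square)))
        else:
            out.append(tokens[k])
            k += 1
    return out


wrapper = ["<ul>", "<li>", "<i>", "Title:", "</i>", "#PCDATA", "</li>", "</ul>"]
sample = ["<ul>",
          "<li>", "<i>", "Title:", "</i>", "XML at Work", "</li>",
          "<li>", "<i>", "Title:", "</i>", "HTML Scripts", "</li>",
          "</ul>"]

square = find_square(wrapper, sample, 7, 7)  # mismatch: </ul> vs <li>
print(collapse(wrapper, square))
```

Applying `collapse` with the same square to the sample tokens yields the same generalized list, so the two-item page is covered by the new wrapper.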

Page 10:

Extracted Result

Page 11:

Recursive Example

Page 12:

Complexity

Page 13:

Discussion

Assumptions
– Pages are well-structured
– Want to extract at the level of entire fields
– Structure can be modeled without disjunctions
Search space for explaining mismatches is huge
– Uses a number of heuristics to prune space
    Limited backtracking
    Limit on number of choices to explore
    Patterns cannot be delimited by optionals
– Will result in pruning possible wrappers

Page 14:

Experimental Result

Page 15:

Comparison with Other Works

Page 16:

Columns: Name | Structure | Semi | Free | Single-slot | Multi-slot | Missing items | Permutations | Nested data | Resilient

WIEN: X X X
SoftMealy: X X X X X X*
STALKER: X X X * X X X
RAPIER: X X ? X X X ?
SRV: X X ? X X X ?
WHISK: X X X X X X X* ?
AutoSlog: X X X X
ROADRUNNER: X X X X X
BYU Onto: X X ? X X X X X X

X means the information extraction system has the capability; X* means the system has the ability as long as the training corpus can accommodate the required training data; ? means the system has the ability to some degree; * means that the extraction pattern itself doesn’t show the ability, but the overall system has the capability.