Query Rewriting for Extracting Data Behind HTML Forms
Xueqi Chen
Department of Computer Science
Brigham Young University
March, 2003
Funded by National Science Foundation
Motivation
• Web information is stored in databases
• Databases are accessed through forms
• Automated agents are of great value
• The process is difficult because of the nature of forms
System Flowchart
[Figure: system flowchart. Components shown: User Query, Application Ontology, Site Form, Input Analyzer, Retrieved Page(s), Output Analyzer, Extracted Information]
User Query Acquisition
Our system provides a query form generated from an application-specific ontology
Site Form Analysis
Understand type, name, and/or values for each field
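As an illustration of what this analysis step might collect, here is a minimal sketch using Python's standard `html.parser` module. The class name `FormFieldParser`, the helper `analyze_form`, and the sample form are hypothetical and are not taken from the system described in these slides; the sketch only records the type, name, and listed values of `<input>` and `<select>` fields.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect the type, name, and listed values of each form field.

    Hypothetical helper for illustration; handles only <input> and <select>.
    """

    def __init__(self):
        super().__init__()
        self.fields = []      # one dict per field, in document order
        self._select = None   # the <select> field currently being parsed

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input":
            field = {"type": a.get("type", "text"),
                     "name": a.get("name"),
                     "values": []}
            if a.get("value") is not None:
                field["values"].append(a["value"])
            self.fields.append(field)
        elif tag == "select":
            self._select = {"type": "select",
                            "name": a.get("name"),
                            "values": []}
            self.fields.append(self._select)
        elif tag == "option" and self._select is not None:
            if a.get("value") is not None:
                self._select["values"].append(a["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def analyze_form(html):
    """Return a list of field descriptions found in the given HTML."""
    parser = FormFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Made-up sample form, for illustration only.
sample = ('<form><input type="text" name="make">'
          '<select name="color"><option value="red">Red</option>'
          '<option value="blue">Blue</option></select></form>')
fields = analyze_form(sample)
```

Fields that list their values (such as the `color` selection above) are the ones for which value matching is possible; free-text fields carry only a type and a name.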
Form Filling
• Name matching
  - Regular expressions (for fields with values provided)
  - Stemming
  - Levenshtein edit distance
  - Longest common subsequences
  - Soundex
  - WordNet
• Value matching
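Two of the name-matching techniques listed above, Levenshtein edit distance and longest common subsequence, can be sketched as follows. This is a minimal illustration, not the system's implementation: the function `best_match`, the threshold `max_distance=2`, and the sample names are assumptions made for the example.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def best_match(field_name, ontology_names, max_distance=2):
    """Return the ontology name closest to field_name, or None.

    Hypothetical helper: max_distance is an assumed threshold.
    """
    if not ontology_names:
        return None
    name = field_name.lower()
    dist, candidate = min((levenshtein(name, c.lower()), c)
                          for c in ontology_names)
    return candidate if dist <= max_distance else None


# Made-up ontology names, for illustration only.
names = ["make", "color", "price"]
match = best_match("colour", names)
no_match = best_match("zip", names)
```

A threshold on the edit distance keeps unrelated names from being paired; in practice several of the listed techniques would likely be combined rather than used alone.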
Value Matching: Case 1
Value Matching: Case 2
Value Matching: Case 3
Value Matching: Case 4
Value Matching: Case 5
Value Matching: Case 6
Value Matching: Case 7
Measurements
• Matching Efficiency
• Submission Efficiency
• Post-processing Efficiency
Measurements (cont'd)
Matching Efficiency

$$\text{recall} = \frac{\text{No. of correctly matched fields}}{\text{No. of fields that could have been matched}}$$

$$\text{precision} = \frac{\text{No. of correctly matched fields}}{\text{No. of matched fields}}$$
Measurements (cont'd)
Submission Efficiency

$$\text{recall} = \frac{\text{No. of correct queries submitted}}{\text{No. of queries that could have been submitted}}$$

$$\text{precision} = \frac{\text{No. of correct queries submitted}}{\text{No. of queries submitted}}$$
Measurements (cont'd)
Post-processing Efficiency

$$\text{recall} = \frac{\text{No. of correct records our system returned}}{\text{No. of records that could have been returned}}$$

$$\text{precision} = \frac{\text{No. of correct records our system returned}}{\text{No. of records our system returned}}$$
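All three measurements share one pattern: precision divides the correct items by the items the system produced, and recall divides them by the items it could have produced. A minimal sketch of that computation follows; the function name `precision_recall` and the counts in the usage line are made up for illustration and are not results from this work.

```python
def precision_recall(num_correct, num_produced, num_possible):
    """Return (precision, recall) for one measurement.

    num_correct  -- correctly matched fields, correct queries submitted,
                    or correct records returned
    num_produced -- all fields matched, queries submitted, or records returned
    num_possible -- all fields, queries, or records that could have been
                    matched, submitted, or returned
    A zero denominator yields 0.0 (an assumed convention).
    """
    precision = num_correct / num_produced if num_produced else 0.0
    recall = num_correct / num_possible if num_possible else 0.0
    return precision, recall


# Illustrative counts only: 8 correct out of 10 produced, 16 possible.
p, r = precision_recall(8, 10, 16)
```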
Contributions
It enhances the effectiveness of the data-extraction process
It presents another technique, in addition to [RGa01], to access data behind HTML forms.