1
Automating the Extraction of Domain Specific Information from the Web
A Case Study for the Genealogical Domain
Troy Walker
Thesis Proposal
January 2004
Research funded by NSF
2
Genealogical Information on the Web
Hundreds of thousands of sites
“Walker genealogy” on Google: 199,000 results
1 page/minute = 5 months to go through
Why not enlist the help of a computer?
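The reading-time estimate on this slide can be checked with a few lines of Python. This is only a rough sketch: it assumes nonstop reading at exactly one page per minute and a 30-day month, both simplifications chosen for illustration.

```python
# Rough check of the reading-time estimate from the slide.
RESULTS = 199_000          # Google results for "Walker genealogy" (from the slide)
MINUTES_PER_PAGE = 1       # reading rate stated on the slide

MINUTES_PER_DAY = 60 * 24
DAYS_PER_MONTH = 30        # assumption: a 30-day month

total_minutes = RESULTS * MINUTES_PER_PAGE
days = total_minutes / MINUTES_PER_DAY
months = days / DAYS_PER_MONTH

print(f"{days:.0f} days, about {months:.1f} months of nonstop reading")
```

Under these assumptions the result is a little over four and a half months, which the slide rounds to 5 months.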
3
Problems
No standard way of presenting data
  Text formatted with HTML tags
  Tables
  Forms to access information
Sites have differing schemas
4
Proposed Solution
Based on Ontos and other work done by the BYU Data Extraction Group (DEG)
Able to extract from:
  Single-Record or Multiple-Record Documents
  Tables
  Forms
Scalable and robust to changes in pages
Easily adaptable to other domains
5
Text
6
Tables
7
Forms
8
Forms
9
System Overview
[Diagram] Components shown:
  URL Selector
  Form Engine
  Table Engine
  Single- or Multiple-Record Engine
  URL List
  User Query
  Result Filter
  Document Retriever and Structure Recognizer
  Data Constrainer
  Ontology
  Result Presenter
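The component names suggest a dispatch step: once a document is retrieved and its structure recognized, it could be handed to the form, table, or record engine. The Python sketch below illustrates one way such a dispatch might look. The connections between components are not recoverable from this transcript, so the routing rule, the tag-based `recognize_structure` heuristic, and all function names are assumptions made for illustration, not the proposal's design.

```python
import re
from enum import Enum


class Structure(Enum):
    FORM = "form"
    TABLE = "table"
    RECORDS = "records"


def recognize_structure(html: str) -> Structure:
    """Hypothetical heuristic: classify a page by the HTML tags it contains."""
    lowered = html.lower()
    if re.search(r"<form\b", lowered):
        return Structure.FORM
    if re.search(r"<table\b", lowered):
        return Structure.TABLE
    # Assumption: anything else is treated as single- or multiple-record text.
    return Structure.RECORDS


def choose_engine(html: str) -> str:
    """Map the recognized structure to an engine name taken from the diagram."""
    engines = {
        Structure.FORM: "Form Engine",
        Structure.TABLE: "Table Engine",
        Structure.RECORDS: "Single- or Multiple-Record Engine",
    }
    return engines[recognize_structure(html)]
```

A real structure recognizer would need more than tag presence, since many pages use tables purely for layout; the sketch only shows where such a decision would sit.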
10
User Query
Generated from ontology
Generated once per application domain
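One way to picture a query generated from the ontology is to derive one input field per ontology concept. The Python sketch below does this for a toy ontology. The `ONTOLOGY` dictionary, its concept names, the choice of an HTML form as the query interface, and `build_query_form` are all hypothetical; they do not reflect the actual ontology format used by Ontos.

```python
from html import escape

# Hypothetical toy ontology: concept name -> human-readable label.
ONTOLOGY = {
    "name": "Name",
    "birth_date": "Birth date",
    "birth_place": "Birth place",
}


def build_query_form(ontology: dict[str, str]) -> str:
    """Return an HTML form with one text field per ontology concept."""
    rows = [
        f'<label>{escape(label)} <input type="text" name="{escape(key)}"></label>'
        for key, label in ontology.items()
    ]
    return "<form>\n" + "\n".join(rows) + "\n</form>"


print(build_query_form(ONTOLOGY))
```

Because the form in this sketch depends only on the ontology, it would need to be built just once per application domain, in line with the second point on the slide.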
11
User Query
12
URL List and URL Selector
Contains genealogy URLs
Search each URL: too much time
Select likely URLs
Distribute document processing using