Dec 21, 2015
1
Querying the Web for Genealogical Information
Troy Walker
Spring Research Conference 2003
Research funded by NSF
2
Genealogical Information on the Web
Hundreds of thousands of sites
  Some professional (Ancestry.com, Familysearch.org)
  Mostly hobbyist (Cyndislist.com)
Search engines
  “Walker genealogy” on Google: 199,000 results
  1 page/minute = 5 months to go through
Why not enlist the help of a computer?
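The reading-time estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the nonstop reading schedule and the 30-day month are assumptions, not figures from the slide.

```python
# Back-of-the-envelope check of the slide's estimate.
RESULTS = 199_000        # Google hits for "Walker genealogy" (from the slide)
MINUTES_PER_PAGE = 1     # reading rate given on the slide

total_minutes = RESULTS * MINUTES_PER_PAGE
total_days = total_minutes / (60 * 24)   # assumption: reading nonstop, 24 hours a day
total_months = total_days / 30           # assumption: a 30-day month

print(f"{total_days:.0f} days, about {total_months:.1f} months")
```

Under these assumptions the result is a little over four and a half months, which the slide rounds to 5 months.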
3
Problems
No standard way of presenting data
  Text formatted with HTML tags
  Tables
  Forms to access information
Each site has its own idea of what genealogical information is: differing schemas
4
Proposed solution
Based on Ontos and other work done at the BYU Data Extraction Group
Able to extract from:
  Semi-structured or unstructured text
  Tables
  Forms
Scalable and robust to changes in pages
Built for genealogy but easily adaptable to other domains
5
Text
6
Tables
7
Forms
8
Forms
9
System Overview
[Diagram components: User Query, URL Database, Document Retriever, Document Structure Recognizer, Form Engine, Table Engine, Unstructured or Semi-Structured Text Engine, Data Extraction Engine, Mapping Information, Result Filter]
[Diagram legend: To be implemented, To be improved, To be integrated]
10
User Query
Form generated from ontology
Query by example
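One way to picture a form generated from the ontology and filled in by example is sketched below. The `ONTOLOGY` dictionary, the `Field` class, and both function names are hypothetical illustrations; the object and field names are borrowed from the result tables shown later in the talk, and the real system's ontology format is not described on the slide.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    kind: str = "text"   # hypothetical input type for the generated form


# Hypothetical, heavily simplified genealogy ontology: object set -> fields.
ONTOLOGY = {
    "Person": [Field("Name"), Field("Gender")],
    "Event": [Field("Event"), Field("Date", "date"), Field("Location")],
}


def generate_form(ontology):
    """Return one (label, input kind) pair per ontology field."""
    return [
        (f"{obj}.{fld.name}", fld.kind)
        for obj, fields in ontology.items()
        for fld in fields
    ]


def query_by_example(form, example):
    """Turn a partially filled form into query constraints.

    Blank fields mean "no constraint"; labels not on the form are rejected.
    """
    labels = {label for label, _ in form}
    unknown = set(example) - labels
    if unknown:
        raise KeyError(f"not on the form: {sorted(unknown)}")
    return {label: value for label, value in example.items() if value}


form = generate_form(ONTOLOGY)
query = query_by_example(form, {"Person.Name": "Walker", "Event.Location": ""})
print(query)
```

Because the form is derived from the ontology, swapping in an ontology for another domain would change the form without touching the code.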
11
URL Database and Document Retriever
Contains genealogy URLs
Search each URL: too much time
Filter likely URLs

URL | Filter
http://www.ancestry.com/search/main.htm?lfl=adv | (no filter listed)
http://userdb.rootsweb.com/deaths/cgi-bin/deaths.cgi | Death Date > 1880
http://www.camcomp.com/users/jwalker/johngene/johngenes.htm | Name: Bates, Boyle, Damon, Eliot, … Walker, Woodsworth
http://www.rootsweb.com/~gaupson/cedarcem.htm | Burial Location: Thomaston, GA
http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html | Name: Adams
http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html | Name: Walker
http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Warley.html | Name: Warley
http://homepages.rootsweb.com/~gemmell/walkdesc.htm | Name: Walker
http://www.smartnouveau.com/jbplace/Kemp/f0000425.html | Name: Anderson, Burt, Summers, Walker
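The URL filter can be read as a cheap pre-check that decides whether a page is worth retrieving at all. The sketch below assumes a hypothetical `UrlEntry` record and `select_urls` function, and uses a subset of the URLs from the table. An entry with no filter is always retrieved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UrlEntry:
    url: str
    names: frozenset = frozenset()      # surnames the page covers; empty = unknown
    death_after: Optional[int] = None   # page only holds deaths after this year


# A subset of the slide's table; the record layout itself is an assumption.
URL_DATABASE = [
    UrlEntry("http://www.ancestry.com/search/main.htm?lfl=adv"),
    UrlEntry("http://userdb.rootsweb.com/deaths/cgi-bin/deaths.cgi", death_after=1880),
    UrlEntry("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html",
             names=frozenset({"Adams"})),
    UrlEntry("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html",
             names=frozenset({"Walker"})),
    UrlEntry("http://homepages.rootsweb.com/~gemmell/walkdesc.htm",
             names=frozenset({"Walker"})),
    UrlEntry("http://www.smartnouveau.com/jbplace/Kemp/f0000425.html",
             names=frozenset({"Anderson", "Burt", "Summers", "Walker"})),
]


def may_match(entry, surname, death_year=None):
    """Return False only when the entry's filter rules the query out."""
    if entry.names and surname not in entry.names:
        return False
    if (entry.death_after is not None and death_year is not None
            and death_year <= entry.death_after):
        return False
    return True


def select_urls(database, surname, death_year=None):
    """Keep the URLs that are still worth retrieving for this query."""
    return [e.url for e in database if may_match(e, surname, death_year)]


likely = select_urls(URL_DATABASE, "Walker", death_year=1952)
print(len(likely), "of", len(URL_DATABASE), "URLs kept")
```

The filter is deliberately conservative: a missing constraint on either side never excludes a URL, so pages are skipped only when the stored filter contradicts the query.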
12
Method Selector
Analyze page
Select appropriate method
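A minimal sketch of the analyze-then-select step is given below, using the standard-library `html.parser`. The tag-count heuristics and the row threshold are assumptions made for illustration; the slide does not say how the real recognizer decides.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags so the page structure can be summarized."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def select_method(html):
    """Return "form", "table", or "text" for a page (assumed heuristics)."""
    counter = TagCounter()
    counter.feed(html)
    c = counter.counts
    if c["form"] and (c["input"] or c["select"]):
        return "form"
    if c["table"] and c["tr"] >= 3:   # hypothetical threshold for a data table
        return "table"
    return "text"


print(select_method("<form action='/search'><input name='surname'></form>"))
print(select_method("<p>Ezra Erastus Walker, born 27 Sep 1885</p>"))
```

Counting tags is crude, since many pages use tables purely for layout, so a real recognizer would likely need stronger evidence before committing to the table engine.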
13
Preprocessing Engines
Text
  Improved record separation
  Ability to handle single-record pages
Table
Forms
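Record separation for the text engine can be illustrated with a deliberately simplified stand-in: split the page on its most frequent candidate separator tag, and treat a page with no such tag as a single record. The candidate list and both function names are assumptions; this is not the heuristic the text engine actually uses.

```python
import re
from collections import Counter

# Assumed candidate separator tags; a real system would rank many more signals.
CANDIDATE_SEPARATORS = ("hr", "p", "li", "br")


def strip_tags(fragment):
    """Remove markup and collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", fragment).split())


def split_records(html):
    """Split a page into record strings; a page without separators is one record."""
    opening_tags = re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html)
    counts = Counter(
        tag.lower() for tag in opening_tags if tag.lower() in CANDIDATE_SEPARATORS
    )
    if not counts:
        text = strip_tags(html)
        return [text] if text else []
    separator = counts.most_common(1)[0][0]
    chunks = re.split(rf"<\s*{separator}\b[^>]*>", html, flags=re.IGNORECASE)
    records = [strip_tags(chunk) for chunk in chunks]
    return [record for record in records if record]


multi = "<hr>John Walker b. 1885<hr>Mary Walker b. 1890<hr>"
single = "<h1>Ezra Erastus Walker</h1> Born 27 Sep 1885"
print(split_records(multi))
print(split_records(single))
```

The single-record fallback mirrors the second bullet above: a page describing one person should come through as one record rather than being dropped.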
14
Extraction Engine
Ontos
Cache schema matches
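Caching schema matches presumably means that a source whose layout has already been mapped to the ontology does not need to be matched again. The sketch below assumes a hypothetical `SchemaMatchCache` keyed by URL and column headers, with a toy synonym matcher standing in for the real matching step; the URL is a placeholder.

```python
class SchemaMatchCache:
    """Remembers header-to-ontology mappings per source (illustrative only)."""

    def __init__(self, matcher):
        self._matcher = matcher
        self._cache = {}
        self.misses = 0   # how often the expensive matcher actually ran

    def mapping_for(self, url, headers):
        key = (url, tuple(headers))
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._matcher(headers)
        return self._cache[key]


# Toy stand-in for schema matching: a fixed, hypothetical synonym table.
SYNONYMS = {
    "name": "Name",
    "born": "Birth Date",
    "birth": "Birth Date",
    "died": "Death Date",
    "death": "Death Date",
}


def naive_matcher(headers):
    """Map each recognized header to an ontology field name."""
    return {
        h: SYNONYMS[h.strip().lower()]
        for h in headers
        if h.strip().lower() in SYNONYMS
    }


cache = SchemaMatchCache(naive_matcher)
first = cache.mapping_for("http://example.org/list.html", ["Name", "Born", "Died"])
second = cache.mapping_for("http://example.org/list.html", ["Name", "Born", "Died"])
print(first, cache.misses)
```

Including the headers in the cache key means a page whose layout changes is simply matched again, which fits the stated goal of being robust to changes in pages.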
15
Result Filter
Filters objects relevant to query
Presents to user

Person | Name | Gender
1 | Ezra Erastus Walker | M

Person | Event | Date | Location
1 | Birth | 27 Sep 1885 | Taylor, Apache, AZ
1 | Death | 19 Sep 1952 |
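The filtering step can be sketched over the two tables above. The dictionary layout and the `filter_results` function are assumptions, and the second person is a made-up placeholder added only so the filter has something to reject.

```python
persons = [
    {"id": 1, "name": "Ezra Erastus Walker", "gender": "M"},   # from the slide
    {"id": 2, "name": "Jane Example", "gender": "F"},          # hypothetical filler
]

events = [
    {"person": 1, "event": "Birth", "date": "27 Sep 1885",
     "location": "Taylor, Apache, AZ"},
    {"person": 1, "event": "Death", "date": "19 Sep 1952", "location": None},
]


def filter_results(persons, events, surname):
    """Keep persons whose last name matches the query and attach their events."""
    wanted = surname.lower()
    results = []
    for person in persons:
        last_name = person["name"].split()[-1].lower()
        if last_name != wanted:
            continue
        own_events = [e for e in events if e["person"] == person["id"]]
        results.append({**person, "events": own_events})
    return results


for person in filter_results(persons, events, "Walker"):
    print(person["name"], person["gender"])
    for event in person["events"]:
        print(" ", event["event"], event["date"], event["location"] or "")
```

Matching on the last whitespace-separated token is a simplification; names with suffixes or multi-word surnames would need a proper name parser.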
16
Conclusion
Integrates and builds on previous DEG work
Extracts from:
  Semi-structured or unstructured text
  Tables
  Forms
Scalable: only searches probable pages
Robust to changes in pages
Ontology based: easily adapted to other domains