Querying the Web for Genealogical Information. Troy Walker, Spring Research Conference 2003. Research funded by NSF.

Dec 21, 2015

Page 1: Querying the Web for Genealogical Information

Troy Walker

Spring Research Conference 2003

Research funded by NSF

Page 2: Genealogical Information on the Web

Hundreds of thousands of sites
  Some professional (Ancestry.com, Familysearch.org)
  Mostly hobbyist (Cyndislist.com)

Search engines
  "Walker genealogy" on Google: 199,000 results
  1 page/minute = 5 months to go through

Why not enlist the help of a computer?
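The "5 months" figure can be checked with a little arithmetic. The sketch below assumes nonstop reading, 24 hours a day, and a 30-day month; neither assumption is stated on the slide.

```python
# Back-of-envelope check of the slide's estimate.
# Assumptions (not stated on the slide): reading nonstop, 24 hours a day,
# and a month counted as 30 days.
RESULTS = 199_000        # Google results for "Walker genealogy"
MINUTES_PER_PAGE = 1

total_minutes = RESULTS * MINUTES_PER_PAGE
days = total_minutes / (60 * 24)
months = days / 30

print(f"{days:.0f} days, about {months:.1f} months")
```

Under these assumptions the total comes to roughly four and a half months of uninterrupted reading, in line with the rounded figure on the slide.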

Page 3: Problems

No standard way of presenting data
  Text formatted with HTML tags
  Tables
  Forms to access information

Each site has its own idea of what genealogical information is: differing schemas

Page 4: Proposed solution

Based on Ontos and other work done at the BYU Data Extraction Group

Able to extract from:
  Semi-structured or unstructured text
  Tables
  Forms

Scalable and robust to changes in pages
Built for genealogy but easily adaptable to other domains

Page 5: Text

Page 6: Tables

Page 7: Forms

Page 8: Forms

Page 9: System Overview

[Diagram. Components: User Query, URL Database, Document Retriever, Document Structure Recognizer, Form Engine, Table Engine, Unstructured or Semi-Structured Text Engine, Data Extraction Engine, Mapping Information, Result Filter. Legend: to be implemented, to be improved, to be integrated.]

Page 10: User Query

Form generated from ontology
Query by example
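As a rough illustration of how a query form might be derived from an ontology and filled in by example, here is a minimal sketch. The `ONTOLOGY` dictionary, the field labels, and both helper functions are hypothetical stand-ins; the slides do not describe the actual ontology format or form generator.

```python
# Hypothetical, much-simplified stand-in for a genealogy ontology:
# each object set maps to the fields a user may constrain.
ONTOLOGY = {
    "Person": ["Name", "Gender"],
    "Birth": ["Date", "Location"],
    "Death": ["Date", "Location"],
}

def form_fields(ontology):
    """Return one form field label per (object set, field) pair."""
    return [f"{obj} {field}" for obj, fields in ontology.items() for field in fields]

def query_from_example(filled):
    """Keep only the fields the user filled in; blanks mean 'any value'."""
    return {label: value.strip() for label, value in filled.items() if value.strip()}

fields = form_fields(ONTOLOGY)

# Query by example: the user fills in the fields they care about.
example = dict.fromkeys(fields, "")
example["Person Name"] = "Walker"
example["Death Date"] = "> 1880"
query = query_from_example(example)
print(query)
```

Because the form is generated from the ontology rather than written by hand, changing the ontology changes the form, which fits the claim that the system adapts to other domains.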

Page 11: URL Database and Document Retriever

Contains Genealogy URLs
Search each URL: too much time
Filter likely URLs

URL | Filter
http://www.ancestry.com/search/main.htm?lfl=adv |
http://userdb.rootsweb.com/deaths/cgi-bin/deaths.cgi | Death Date > 1880
http://www.camcomp.com/users/jwalker/johngene/johngenes.htm | Name: Bates, Boyle, Damon, Eliot, … Walker, Woodsworth
http://www.rootsweb.com/~gaupson/cedarcem.htm | Burial Location: Thomaston, GA
http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html | Name: Adams
http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html | Name: Walker
http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Warley.html | Name: Warley
http://homepages.rootsweb.com/~gemmell/walkdesc.htm | Name: Walker
http://www.smartnouveau.com/jbplace/Kemp/f0000425.html | Name: Anderson, Burt, Summers, Walker
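The idea of filtering likely URLs before retrieving them could be sketched as follows. The `URL_DB` structure and `likely_urls` function are assumptions made for illustration; only the URL and filter pairs come from the slide, and only equality filters are modeled (a range filter such as "Death Date > 1880" is left out).

```python
# Hypothetical in-memory URL database; entries are taken from the slide.
# A filter of None means nothing is known about the page, so it is always kept.
URL_DB = [
    ("http://www.ancestry.com/search/main.htm?lfl=adv", None),
    ("http://www.rootsweb.com/~gaupson/cedarcem.htm",
     ("Burial Location", {"Thomaston, GA"})),
    ("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html",
     ("Name", {"Adams"})),
    ("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html",
     ("Name", {"Walker"})),
    ("http://homepages.rootsweb.com/~gemmell/walkdesc.htm",
     ("Name", {"Walker"})),
]

def likely_urls(query, url_db):
    """Return URLs whose stored filter does not rule out the query."""
    keep = []
    for url, flt in url_db:
        if flt is None:
            keep.append(url)
            continue
        field, allowed = flt
        wanted = query.get(field)
        # If the query does not constrain this field, the page may still match.
        if wanted is None or wanted in allowed:
            keep.append(url)
    return keep

hits = likely_urls({"Name": "Walker"}, URL_DB)
for url in hits:
    print(url)
```

For a query on the name Walker, this sketch drops only the Adams page: its filter is the only one that contradicts the query, while pages with no filter or with a filter on another field are kept as possible matches.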

Page 12: Method Selector

Analyze page
Select appropriate method
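One crude way to analyze a page and pick a method is to count HTML tags. The heuristic below, including the `min_rows` threshold and the order of the checks, is an assumption for illustration and not the selection logic used in the project.

```python
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts the start tags seen in a page."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1

def select_method(html, min_rows=3):
    """Pick 'form', 'table', or 'text' from tag counts (assumed heuristic)."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    c = counter.counts
    if c.get("form", 0) and c.get("input", 0):
        return "form"
    if c.get("table", 0) and c.get("tr", 0) >= min_rows:
        return "table"
    return "text"

print(select_method('<form action="/s"><input name="surname"></form>'))
```

A real selector would need to tell layout tables from data tables and search forms from login forms; this sketch only shows where such a decision sits in the pipeline.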

Page 13: Preprocessing Engines

Text
  Improved record-separation
  Ability to handle single-record pages
Table
Forms

Page 14: Extraction Engine

Ontos
Cache schema matches
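Caching schema matches might look something like the sketch below, which reuses a computed mapping from a site's column headers to ontology fields. The `SYNONYMS` table, the cache key (host plus header tuple), and both functions are hypothetical; the slide does not say how Ontos matches schemas or what is cached.

```python
from urllib.parse import urlparse

# Hypothetical synonym table standing in for real schema matching.
SYNONYMS = {
    "surname": "Name",
    "name": "Name",
    "born": "Birth Date",
    "died": "Death Date",
}

schema_cache = {}

def match_schema(headers):
    """Map a page's column headers to ontology fields; unknown headers are skipped."""
    return {h: SYNONYMS[h.lower()] for h in headers if h.lower() in SYNONYMS}

def cached_match(url, headers):
    """Reuse a previous match for the same host and header set."""
    key = (urlparse(url).netloc, tuple(headers))
    if key not in schema_cache:
        schema_cache[key] = match_schema(headers)
    return schema_cache[key]

first = cached_match(
    "http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html", ["Surname", "Born"]
)
second = cached_match(
    "http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html", ["Surname", "Born"]
)
print(first, first is second)
```

Keying on the host assumes that pages from one site share a schema, so the second page with the same headers skips the matching step.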

Page 15: Result Filter

Filters objects relevant to query
Presents to user

Person | Name | Gender
1 | Ezra Erastus Walker | M

Person | Event | Date | Location
1 | Birth | 27 Sep 1885 | Taylor, Apache, AZ
1 | Death | 19 Sep 1952 |
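A minimal sketch of the filtering step, assuming extracted objects arrive as rows shaped like the Person and Event tables on this slide. The second person is made up purely to give the filter something to discard, and `filter_results` is a hypothetical helper, not the project's code.

```python
# Extracted records shaped like the slide's Person and Event tables.
# "John Adams" is an invented row, included only so the filter has
# something to remove.
people = [
    {"Person": 1, "Name": "Ezra Erastus Walker", "Gender": "M"},
    {"Person": 2, "Name": "John Adams", "Gender": "M"},
]
events = [
    {"Person": 1, "Event": "Birth", "Date": "27 Sep 1885",
     "Location": "Taylor, Apache, AZ"},
    {"Person": 1, "Event": "Death", "Date": "19 Sep 1952", "Location": ""},
]

def filter_results(name_contains, people, events):
    """Keep people whose name contains the query text, plus their events."""
    needle = name_contains.lower()
    kept = [p for p in people if needle in p["Name"].lower()]
    ids = {p["Person"] for p in kept}
    return kept, [e for e in events if e["Person"] in ids]

kept, kept_events = filter_results("walker", people, events)
for person in kept:
    print(person["Name"], [e["Event"] for e in kept_events])
```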

Page 16: Conclusion

Integrates, builds on previous DEG work
Extracts from:
  Semi-structured or unstructured text
  Tables
  Forms
Scalable: only searches probable pages
Robust to changes in pages
Ontology based: easily adapted to other domains