Automatic Extraction of Individual and Family Information from Primary Genealogical Records
By Charla Woodbury
October 17, 2006
2
Digital Images – Human Index
• Large number of competing family history websites
• Digital images
• Human indexes – Double entry
• Researchers hunting through records and indexes to put families together
3
Problem
• Large amounts of primary genealogical data
• Big projects to index and extract records
• Two independent indexers and adjudication
• Millions of human hours used to index or match records for names and families
• Create a specialized extraction ontology to interpret and label genealogical data
• Develop expert logic and rules that
  • Match and merge individuals
  • Group them into families
5
Methods
1. Prepare for the records extraction
2. Run a 1st PASS to extract the information
3. Run a 2nd PASS to match individuals and link families
4. Evaluate and optimize the results
6
Prepare for Records Extraction
• Build an Ontology
  • BYU ontology software Ontos to interpret and correctly label genealogical data using
    • Dataframes
    • Regular expressions
    • Lexicons
    • Conversion functions
  • “encapsulates knowledge about the appearance, behavior, and context of a collection of data elements” Dr. David Embley
• Collect machine-readable records
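To make the combination of regular expressions, lexicons, and conversion functions concrete, here is a minimal Python sketch of a recognizer for one kind of value, a date. It is not Ontos code and does not use its data frame syntax. The names `MONTHS`, `DATE_PATTERN`, and `to_iso_date` are assumptions made for illustration, and the pattern covers only the "day Mon year" style, with an optional dual year such as 1677/8.

```python
import re

# Hypothetical lexicon: month abbreviations mapped to month numbers.
MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

# Hypothetical recognizer: day, month abbreviation, year, optional dual-year suffix.
DATE_PATTERN = re.compile(r"\b(\d{1,2}) ([A-Z][a-z]{2}) (\d{4})(?:/(\d{1,2}))?\b")


def to_iso_date(text):
    """Conversion function: label the first date-like match and return it in ISO form.

    Returns ("Date", "YYYY-MM-DD"), or None when no usable date is found.
    """
    match = DATE_PATTERN.search(text)
    if match is None or match.group(2) not in MONTHS:
        return None
    day = int(match.group(1))
    month = MONTHS[match.group(2)]
    year = int(match.group(3))
    if match.group(4) is not None:
        # Dual date such as 1677/8: this sketch keeps the later year.
        year += 1
    return ("Date", f"{year:04d}-{month:02d}-{day:02d}")
```

For example, `to_iso_date("Christian 3 Mar 1677/8")` labels the value as a date and converts it, while a line such as "Elisabeth about 1694" yields `None` because it carries no exact date.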
7
Ontology – Entity Level
8
Danish GIVEN NAME LEXICON

MALE
Anders – And.
Andreas
Christen – Kristen
Christian – Kristian
Erik – Eric
Gregers
Hans
Ib – Jep – Jeppe
Jacob
Jens
Johan – Johannes – Joh.
Jorgen – Jørgen
Knud
Lars – Laurs – Laurids – Lauritz
Mads – Mats – Mats

FEMALE
Ane – Anna – Anne
Birthe – Birte
Bodil
Caroline
Dorthe – Dorte
Ellen – Helene – Elene
Elisabeth – Elsbeth – Lisbeth
Else – Ilse
Ingeborg
Inger
Karen
Kirsten – Christen – Kirstine –
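A lexicon like this can be used to map spelling variants to a single form before records are compared. The Python sketch below shows one possible way to do that with a few of the MALE entries. Treating the first spelling on each line as the canonical form is an assumption, and `build_lookup` and `normalize_given_name` are hypothetical helper names, not part of the original system.

```python
# A few MALE entries copied from the lexicon; the first spelling on each
# line is treated as the canonical form (an assumption of this sketch).
MALE_LEXICON = [
    ["Anders", "And."],
    ["Christen", "Kristen"],
    ["Erik", "Eric"],
    ["Ib", "Jep", "Jeppe"],
    ["Lars", "Laurs", "Laurids", "Lauritz"],
]


def build_lookup(lexicon):
    """Map every lower-cased variant to the canonical spelling of its entry."""
    lookup = {}
    for variants in lexicon:
        canonical = variants[0]
        for variant in variants:
            lookup[variant.lower()] = canonical
    return lookup


MALE_LOOKUP = build_lookup(MALE_LEXICON)


def normalize_given_name(token, lookup=MALE_LOOKUP):
    """Return the canonical given name for a token, or None if it is not in the lexicon."""
    return lookup.get(token.lower())
```

Note that "Christen" appears under both MALE and FEMALE in the lexicon, so a lookup on the name alone cannot settle the gender.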
[BODY] Truust Dom. 23 p: Trinit: laest over Niels Baches SØREN fadd. Johannes Michelsens og Niels Mollers hustruer af Søebyevad, Peder Rasmussen af Søebyevad, Jens Bachis søn Peder og Niels Thylkes s. Peder af Truust
19
Populate RDF-data file
• Hilton Campbell’s design
• PERSON
• EVENT
• LINKS – PERSON(S) to EVENT
20
EVENT – birth of Rachel
PERSONs – Sarah and Rachel
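The PERSON, EVENT, and LINKS structure can be pictured as a small set of subject-predicate-object triples. The Python sketch below uses plain tuples rather than an RDF library, and it is not Hilton Campbell's actual design. The identifiers (`person:1`, `event:1`) and predicate names (`type`, `name`, `kind`, `hasPerson`) are assumptions for illustration. No roles are assigned to Sarah and Rachel because the slide does not state them.

```python
# Hypothetical identifiers and predicate names; the slides name only
# PERSON, EVENT, and LINKS from PERSON(S) to EVENT.
triples = [
    ("person:1", "type", "PERSON"),
    ("person:1", "name", "Sarah"),
    ("person:2", "type", "PERSON"),
    ("person:2", "name", "Rachel"),
    ("event:1", "type", "EVENT"),
    ("event:1", "kind", "birth"),
    ("event:1", "hasPerson", "person:1"),  # LINK: person to event
    ("event:1", "hasPerson", "person:2"),  # LINK: person to event
]


def people_in_event(event_id, data):
    """Return the names of all persons linked to the given event, in link order."""
    person_ids = [o for s, p, o in data if s == event_id and p == "hasPerson"]
    names = {s: o for s, p, o in data if p == "name"}
    return [names[pid] for pid in person_ids]
```

Keeping links as separate triples means one event can point to any number of persons without changing the PERSON or EVENT entries.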
21
3. Run a SECOND PASS to match individuals and to link families
• FORMULATE RULES in Rule Engine language for RDF-data file
  • Match individuals
  • Check family data
  • Link families up
• APPLY RULES through the Java Rules API
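As a rough idea of what a "match individuals" rule might look like, here is a minimal sketch written in plain Python instead of a rule engine language or the Java Rules API. The `Mention` record, the one-year tolerance, and the greedy grouping in `merge_mentions` are all assumptions for illustration. The rules actually used in the second pass are not given in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mention:
    """One extracted reference to a person (hypothetical structure)."""
    given: str
    surname: str
    birth_year: Optional[int]


def same_individual(a, b, tolerance=1):
    """Hypothetical rule: same names and birth years within `tolerance` years."""
    if a.given.lower() != b.given.lower():
        return False
    if a.surname.lower() != b.surname.lower():
        return False
    if a.birth_year is None or b.birth_year is None:
        return False  # not enough evidence to merge in this sketch
    return abs(a.birth_year - b.birth_year) <= tolerance


def merge_mentions(mentions):
    """Greedy grouping: add each mention to the first cluster whose first member it matches."""
    clusters = []
    for mention in mentions:
        for cluster in clusters:
            if same_individual(cluster[0], mention):
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters
```

Under this rule, two mentions of an Isaac Woodbury born in 1680 would merge, while an Isaac Woodbury born in 1697 would stay separate.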
22
4. Evaluate and Optimize Results
• Evaluate the preliminary results
• Optimize the rules
• Improve the whole process
23
VALIDATION I
Classification by Record Type:
980 Entries TOTAL LABELED ‘NAME’
The higher the number, the better
25
Isaac WOODBURY Children
1. Robert 4 Jul 1672
2. Mary 6 Oct 1674
3. Christian 3 Mar 1677/8
4. Isaac 6 Apr 1680
5. Deliverance 1 Feb 1682/3
6. Joshua 1 Jan 1684/5
7. Elizabeth 17 Jan 1688
8. Nickolas 12 Aug 1688
9. Ann 29 Jun 1689
10. Lidia 1 Feb 1691/2
11. Elisabeth about 1694
12. Isaac 20 Jul 1697
13. Benjamin 20 Aug 1699
26
Isaac WOODBURY SON of HUMPHREY
Mary WILKES MARRIAGE 9 Oct 1671
1. Robert 4 Jul 1672
2. Mary 6 Oct 1674
3. Christian 3 Mar 1677/8
4. Isaac 6 Apr 1680
5. Deliverance 1 Feb 1682/3
6. Joshua 1 Jan 1684/5
7. Elizabeth 17 Jan 1688

Isaac WOODBURY SON of NICHOLAS
Elizabeth MARRIAGE ________
1. Nickolas 12 Aug 1688
2. Ann 29 Jun 1689
3. Lidia 1 Feb 1691/2
4. Elisabeth about 1694
5. Isaac 20 Jul 1697
6. Benjamin 20 Aug 1699
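One kind of "check family data" rule that could flag a case like this is a birth-spacing check: two children of the same mother are unlikely to be born only a few months apart. The Python sketch below applies that idea to the combined child list. The 270-day threshold and the function name `spacing_conflicts` are assumptions, dual dates such as 1677/8 are read as the later year, and the dates are otherwise taken as written. It is an illustration, not the rule used in the project.

```python
from datetime import date

# Birth dates from the combined child list, with dual dates (e.g. 1677/8)
# read as the later year. "Elisabeth about 1694" is omitted because it has
# no exact date.
children = [
    ("Robert", date(1672, 7, 4)),
    ("Mary", date(1674, 10, 6)),
    ("Christian", date(1678, 3, 3)),
    ("Isaac", date(1680, 4, 6)),
    ("Deliverance", date(1683, 2, 1)),
    ("Joshua", date(1685, 1, 1)),
    ("Elizabeth", date(1688, 1, 17)),
    ("Nickolas", date(1688, 8, 12)),
    ("Ann", date(1689, 6, 29)),
    ("Lidia", date(1692, 2, 1)),
    ("Isaac", date(1697, 7, 20)),
    ("Benjamin", date(1699, 8, 20)),
]

MIN_GAP_DAYS = 270  # assumed minimum spacing between non-twin siblings


def spacing_conflicts(kids, min_gap=MIN_GAP_DAYS):
    """Return pairs of consecutive children born less than `min_gap` days apart."""
    ordered = sorted(kids, key=lambda kid: kid[1])
    conflicts = []
    for (name1, born1), (name2, born2) in zip(ordered, ordered[1:]):
        gap = (born2 - born1).days
        if 0 < gap < min_gap:  # a gap of 0 would indicate twins, not a conflict
            conflicts.append((name1, name2))
    return conflicts
```

On this data the only flagged pair is Elizabeth and Nickolas, which is exactly where the slide divides the children between the two Isaac Woodbury families. The repeated given name Isaac (1680 and 1697) could serve as a second, weaker hint.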
27
VALIDATION III
Grouping by FAMILY:

total # merges + splits to correct families after 2nd PASS
__________________________________________________________
total # merges + splits to correct families after 1st PASS

The lower the number, the better
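The ratio can be computed directly once the manual corrections have been counted. The Python sketch below is a plain restatement of the formula. The function name `family_grouping_ratio` and its parameter names are assumptions, and any counts passed to it in an example are hypothetical, not results from the study.

```python
def family_grouping_ratio(merges_after_2nd, splits_after_2nd,
                          merges_after_1st, splits_after_1st):
    """Corrections still needed after the 2nd pass, relative to after the 1st pass.

    Each argument is a count of manual merges or splits needed to reach the
    correct families. A lower result means the 2nd pass helped more.
    """
    after_first = merges_after_1st + splits_after_1st
    if after_first <= 0:
        raise ValueError("the 1st pass must need at least one correction")
    after_second = merges_after_2nd + splits_after_2nd
    return after_second / after_first
```

With hypothetical counts of 3 merges and 2 splits after the 2nd pass against 12 merges and 8 splits after the 1st pass, the ratio is 5 / 20 = 0.25.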
28
Optimize the Rules
• Add
• Remove
• Fine-tune
• Change the order
Improve the whole process
• Until the metrics no