Ancestry.com Data: DEG Results, Observations, & Future Directions

Dec 21, 2015

Page 1: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Ancestry.com Data: DEG Results, Observations, & Future Directions

Page 2: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Acknowledgements

Page 3: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Extraction Ontologies

• EOs:
  – Conceptual models for some domain
  – Data frames for each concept
    • Instance recognizers & IO converters
    • Operations with recognizers and instance recognizers for parameters
  – Used for information extraction, free-form query processing, information integration, semantic-web applications, …
• But, for the Ancestry.com project, only instance recognition
  – Regular-expression recognition rules
  – Dictionary-based recognition rules
  – Combinations
  – Heuristic rules
    • Disambiguation within context
    • Specialized case sensitivity

Page 4: Ancestry.com Data: DEG Results, Observations, & Future Directions.

EO Rules – Names

• Dictionaries
  – First & last names
    • From 1990 census (http://www.census.gov/genealogy/names/)
    • Minus stopwords (http://www.lextek.com/manuals/onix/stopwords2.html)
  – Titles dictionary: {Mr, Mrs, Miss, Dr, Rev, …}
• Simple regular expressions
• Dictionary & regular-expression combinations

Page 5: Ancestry.com Data: DEG Results, Observations, & Future Directions.

EO Rules – Putting it all together

• Baseline: dictionary
  – Two or more consecutive dictionary words
  – Require capitalization
• Extraction Ontology (EO)
  – Regular expressions
  – Exclusions: schools, addresses, …
• Patterns (various)
  – Based on document structure
  – Not generally applicable elsewhere
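The baseline rule (two or more consecutive capitalized dictionary words) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `NAME_WORDS` is a tiny hypothetical word set standing in for the census dictionaries, and the function name `baseline_names` and the tokenization are illustrative, not the project's code.

```python
import re

# Hypothetical miniature dictionary; the slides use 1990 census name lists.
NAME_WORDS = {"william", "blake", "giles", "stanton", "john", "doe"}

def baseline_names(text, dictionary=NAME_WORDS):
    """Return runs of two or more consecutive capitalized dictionary words."""
    names, run = [], []
    # Tokens are alphabetic words or single punctuation characters.
    for token in re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text):
        if token[0].isupper() and token.lower() in dictionary:
            run.append(token)
            continue
        # Any other token ends the current run.
        if len(run) >= 2:
            names.append(" ".join(run))
        run = []
    if len(run) >= 2:
        names.append(" ".join(run))
    return names

found = baseline_names("the son of Giles Blake of Little Baddow and william blake")
```

Here only "Giles Blake" qualifies: the lowercase "william blake" fails the capitalization requirement, and "Little Baddow" is not in the sample dictionary.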

Page 6: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Extraction Ontology (EO)

• \b({Title}\s+){0,1}{First}\s+([A-Z]\s+){0,1}{Last}\b
  – John Doe
  – Mr John Q Doe
• \b{Title}\s+{Last}\b
  – Mr Doe
• \b{Title}( [A-Z][A-Za-z]*){1,3}\b
  – Mr Anythingcapitalized Upto Threewords
• \b({Last})(\s+{Title})?(\s+{First}|\s+[A-Z]){1,2}\b
  – Doe John
  – Doe Miss Jane
  – Doe Anything Capitalized
• Also exclusions
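One way to read the rule templates on this slide is that each {Title}, {First}, and {Last} placeholder stands for an alternation over the corresponding dictionary. The sketch below assumes that reading. The dictionaries are tiny hypothetical stand-ins, and the `expand` helper is an illustration, not the EO engine's actual mechanism.

```python
import re

# Hypothetical miniature dictionaries standing in for the census and title lists.
DICTIONARIES = {
    "Title": ["Mr", "Mrs", "Miss", "Dr", "Rev"],
    "First": ["John", "Jane", "William"],
    "Last": ["Doe", "Blake", "Somerby"],
}

def expand(template, dictionaries=DICTIONARIES):
    """Replace each {Name} placeholder with a non-capturing alternation.

    Only alphabetic placeholders are expanded, so regex quantifiers such
    as {0,1} in the template are left untouched.
    """
    def alternation(match):
        # Longest entries first so "Mrs" is tried before "Mr".
        words = sorted(dictionaries[match.group(1)], key=len, reverse=True)
        return "(?:" + "|".join(re.escape(w) for w in words) + ")"
    return re.sub(r"\{([A-Za-z]+)\}", alternation, template)

FULL_NAME = re.compile(expand(r"\b({Title}\s+){0,1}{First}\s+([A-Z]\s+){0,1}{Last}\b"))
TITLE_LAST = re.compile(expand(r"\b{Title}\s+{Last}\b"))

full_match = FULL_NAME.search("a letter from Mr John Q Doe of Boston").group(0)
short_match = FULL_NAME.search("John Doe").group(0)
title_match = TITLE_LAST.search("Mr Doe").group(0)
```

With these sample dictionaries the three searches reproduce the slide's examples "Mr John Q Doe", "John Doe", and "Mr Doe".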

Page 7: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Numbers

• Using training data examples
• Should use dev-test numbers

Page 8: Ancestry.com Data: DEG Results, Observations, & Future Directions.

EO Results – Example 1

THE BLAKE FAMILY IN ENGLAND IN a Genealogical History of William Blake of Dorchester published in 1857 appears the statement that the emigrant to New England was the son of Giles Blake of Little Baddow Essex and the record of several generations of the family is given The sub - stance of this record is trustworthy as being a copy from Morant ' s History of Essex but the statement that the Dorchester settler was of this family was unwarranted by any evidence Subsequently the late H G Somerby Esq by request of Stanton Blake Esq made extended researches in England to determine the origin of the American family He finally located it at Over Stcwey Somerset and the results of his investigations were published in 1881 by W H Whitmore Esq in A Record of the Blakes of Somersetshire The evidences upon which Mr Somerby based his conclusions - were first the record of aN baptism in 1594 at Over Stowey of a William Blake ……

Page 9: Ancestry.com Data: DEG Results, Observations, & Future Directions.

EO Results – Example 1

Correct matches: William Blake Giles Blake William Blake Mr Blake William Blake

Missed entities: H G Somerby Esq Stanton Blake Esq W H Whitmore Esq Mr Somerby Rev Charles M Blake William Blake Esq William Arthur Jones Esq A M Edward J Blake Esq

False positives: Stanton Blake Rev Charles William Arthur Jones William Blake

Method        Correct    Precision   Recall    F1
Dictionary    5 / 13     55.56%      38.46%    45.45%

Results measured on set-aside TRAINING data
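The Dictionary scores on this slide are consistent with the usual definitions: precision = correct / (correct + false positives), recall = correct / total entities, and F1 as their harmonic mean. Below is a minimal sketch assuming those definitions and four false positives, which is what a precision of 5 / 9 = 55.56% implies. The helper name `prf` is hypothetical.

```python
def prf(correct, false_positives, gold_total):
    """Precision, recall, and F1 as percentages rounded to two decimals.

    Assumes at least one extraction and at least one correct match,
    so none of the divisions is by zero.
    """
    precision = correct / (correct + false_positives)
    recall = correct / gold_total
    f1 = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * value, 2) for value in (precision, recall, f1))

# Dictionary baseline, Example 1: 5 correct, 4 false positives, 13 entities.
dictionary_scores = prf(correct=5, false_positives=4, gold_total=13)
```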

Page 10: Ancestry.com Data: DEG Results, Observations, & Future Directions.

EO Results – Example 1

Correct matches: William Blake Giles Blake William Blake Mr Blake William Blake

Missed entities: H G Somerby Esq Stanton Blake Esq W H Whitmore Esq Mr Somerby Rev Charles M Blake William Blake Esq William Arthur Jones Esq A M Edward J Blake Esq

False positives: Stanton Blake Rev Charles William Arthur Jones William Blake

Method        Correct    Precision   Recall    F1
Dictionary    5 / 13     55.56%      38.46%    45.45%
(Adjusted)    8 / 13     88.89%      61.54%    72.73%

Results measured on set-aside TRAINING data

Page 11: Ancestry.com Data: DEG Results, Observations, & Future Directions.

EO Results – Example 1

Correct matches: William Blake Giles Blake Mr Somerby William Blake Rev Charles M Blake Mr Blake William Blake

Missed entities: H G Somerby Esq Stanton Blake Esq W H Whitmore Esq William Blake Esq William Arthur Jones Esq A M Edward J Blake Esq

False positives: Stanton Blake William Arthur William Blake Edward J Blake

Method        Correct    Precision   Recall    F1
Dictionary    5 / 13     55.56%      38.46%    45.45%
(Adjusted)    8 / 13     88.89%      61.54%    72.73%
EO            7 / 13     63.64%      53.85%    58.33%

Results measured on set-aside TRAINING data

Page 12: Ancestry.com Data: DEG Results, Observations, & Future Directions.

EO Results – Example 1

Correct matches: William Blake Giles Blake Mr Somerby William Blake Rev Charles M Blake Mr Blake William Blake

Missed entities: H G Somerby Esq Stanton Blake Esq W H Whitmore Esq William Blake Esq William Arthur Jones Esq A M Edward J Blake Esq

False positives: Stanton Blake William Arthur William Blake Edward J Blake

Method        Correct    Precision   Recall    F1
Dictionary    5 / 13     55.56%      38.46%    45.45%
(Adjusted)    8 / 13     88.89%      61.54%    72.73%
EO            7 / 13     63.64%      53.85%    58.33%
(Adjusted)    10 / 13    90.90%      76.92%    83.33%

Results measured on set-aside TRAINING data

Page 13: Ancestry.com Data: DEG Results, Observations, & Future Directions.

EO Results – Example 2

msM & ffi , 113 2lst Street . Wholesale and Retail dealers inall grade * of Cook and Heating stoves . Also the celebra tedJno , Vann , Wm . Kiser and Char * ter OakBangea . 1S4 K . L .POLK & CO . ' S Coker Henry W , dry goods and grocer ,Washington ave bet Hiekman and Maple , Irondale CokeNewton , lab Aliee Furnace Coker Wesley T , elk W R Coker ,res Washington ave cor Hiekman , Irondale Coker Wm R , drygood . and grocer , Washington ave , cor Hiekman , IrondaleC ' ola Carlo , lunch house . 19 20th s , res same Colby Mips Alice , bds 280 ! ) 2d ave ColdirrU see also { { . ' aldircll Coldwell K A , bds 22 : 50 4th ave Coldwell Wm ,…

Birmingham City Directory

Page 14: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Example 2 - Dictionary

Results measured on set-aside TRAINING data

Correct matches: Coke Newton Coldwell Wm Cole Alexander Cole Burt Cole Charles Wm Hood Cole Frank Cole James Cole John Cole John … and others …

Missed entities: Coker Henry W Coker Wesley T Coker Coker Wm R C ' ola Carlo Colby Mips Alice Coldwell K A Cole Artolphus … and others…

False positives: Coker Henry Furnace Coker Wesley Coker Wm Rolling Mill Peter Zins Birmingham Rolling Mill Cleveland Cole Charles Cole Franklin Cole James Cole John … and others …

Method        Correct    Precision   Recall    F1
Dictionary    31 / 78    51.67%      39.74%    44.93%

Page 15: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Example 2 - EO

Results measured on set-aside TRAINING data

Correct matches: Coker Henry W Coke Newton Coker Wesley T Coker Wm R Coldwell K A Coldwell Wm Cole Alexander Cole Burt Cole Charles Wm Hood … and others …

Missed entities: Coker C ' ola Carlo Colby Mips Alice Cole Artolphus Cole Cradford Cole Charles H ] ) orter II Herxfcld Cole P W … and others…

False positives: Peter Zins Cleveland Cole Charles Furnace Cole P Peter Zins Cole Samuel I Alice Furnace Bains Cole Wm Cole Win Bessemer Cole Wm Davis Coleman Bettie … and others …

Method        Correct    Precision   Recall    F1
Dictionary    31 / 78    51.67%      39.74%    44.93%
EO            47 / 78    72.31%      60.26%    65.73%

Exclusion patterns: no effect

Page 16: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Can we do better?

What if we knew the record boundaries?

Manually add boundaries…

Use the knowledge that a record usually starts with a name…

This will measure the value of boundary information.

Page 17: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Example 2 – Segmented

msM & ffi , 113 2lst Street . Wholesale and Retail dealers inall grade * of Cook and Heating stoves . Also the celebra tedJno , Vann , Wm . Kiser and Char * ter OakBangea . 1S4 K . L .POLK & CO . ' S ##Coker Henry W , dry goods and grocer , Washington ave bet Hiekman and Maple , Irondale ##Coke Newton , lab Aliee Furnace ##Coker Wesley T , elk W R ##Coker , res Washington ave cor Hiekman , Irondale ##Coker Wm R , dry good . and grocer , Washington ave , cor Hiekman , Irondale ##C ' ola Carlo , lunch house . 19 20th s , res same ##Colby Mips Alice , bds 280 ! ) 2d ave ##ColdirrU see also { { . ' aldircll ##Coldwell K A , bds 22 : 50 4th ave ##Coldwell Wm ,…

Birmingham City Directory
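A minimal sketch of how the "##" boundary markers could be used, given that a record usually starts with a name. The slides do not say how the end of the name is found. Taking everything before the first comma is an assumption made here for illustration, and `names_from_segments` is a hypothetical helper.

```python
def names_from_segments(text, marker="##"):
    """Take the text before the first comma of each marked record as a name."""
    # Text before the first marker is page header material, not a record.
    records = text.split(marker)[1:]
    names = []
    for record in records:
        head = record.split(",", 1)[0].strip()
        if head:
            names.append(head)
    return names

# Excerpt of the segmented directory text shown on this slide.
sample = (
    "POLK & CO . ' S ##Coker Henry W , dry goods and grocer , "
    "Washington ave ##Coke Newton , lab Aliee Furnace "
    "##Coker Wesley T , elk W R ##Coker , res Washington ave"
)
segmented = names_from_segments(sample)
```

A record without a comma would be returned whole under this rule, so a real implementation would need a stricter end-of-name test.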

Page 18: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Example 2 – Segmented (Ideal)

Results measured on set-aside TRAINING data

Correct matches: Coker Henry W Coke Newton Coker Wesley T Coker Coker Wm R C ' ola Carlo Colby Mips Alice Coldwell K A … and others …

Missed entities: Wm Hood ] ) orter II Herxfcld ( t \ \ ' Bains j > orter Tompson Francis iV Cheiiovveth Cole Wm C Cole J L Davis … and others…

False positives: ColdirrU see also { { Cole Wm C ( Reamer Cole & Co )

Method        Correct    Precision   Recall    F1
Dictionary    31 / 78    51.67%      39.74%    44.93%
EO            47 / 78    72.31%      60.26%    65.73%
Segmented*    65 / 78    95.59%      83.33%    89.04%

Page 19: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Example 2 – Segmented + EO (Ideal)

Results measured on set-aside TRAINING data

Method        Correct    Precision   Recall    F1
Dictionary    31 / 78    51.67%      39.74%    44.93%
EO            47 / 78    72.31%      60.26%    65.73%
Segmented*    65 / 78    95.59%      83.33%    89.04%
Seg.*+EO      66 / 78    86.84%      84.62%    85.71%

Correct matches: Coker Henry W Coke Newton Coker Wesley T Coker Coker Wm R C ' ola Carlo Colby Mips Alice Coldwell K A … and others …

Missed entities: ] ) orter II Herxfcld ( t \ \ ' Bains j > orter Tompson Francis iV Cheiiovveth Cole Wm C Cole J L Davis E K Fulton … and others…

False positives: ColdirrU see also { { Cole Wm C ( Reamer Cole & Co ) Peter Zins Peter Zins Alice Furnace Whilden A Alice Furnace Alice Furnace Loo M

Page 20: Ancestry.com Data: DEG Results, Observations, & Future Directions.

EO Results – Example 3

4 Index to the Precinct ReSisters of Tuolumne County i Index to the Precinct Registers of Tuolumne County 5Name co Address No Name C Address No Name ICo Address No Name Address No Gore Asihford Gray lohnA G alt John Goss An lrew Gerken Ilerman 11 Getchell Everett G Gallup Will Seneca Gr ndl W illiam Ilarten J osep IIenderson Robert Harper Edwin F Hall ……

Page 21: Ancestry.com Data: DEG Results, Observations, & Future Directions.

EO Results – Example 3

Correct matches: Kelly Martin Leslie Christopher Fred Madison Charles Albert Marconi Frank Frederick

Missed entities: Gore Asihford Gray lohn A G alt John Goss An lrew Gerken Ilerman 11 Getchell Everett G Gallup Will Seneca Gr ndl W illiam Ilarten J osep IIenderson Robert … and others…

False positives: John Goss Getchell Everett Robert Harper Edwin Robert Barkley Thomas Michael Thomas Douglas Frank Hayes William George Jordan Joh Kelly Patrick Rufus Clifton Kurr … and others …

Method        Correct    Precision   Recall    F1
Dictionary    4 / 72     19.05%      5.56%     8.60%

Results measured on set-aside TRAINING data

Page 22: Ancestry.com Data: DEG Results, Observations, & Future Directions.

EO Results – Example 3

Correct matches: Getchell Everett G Harper Edwin F Hayes William George Kelly Patrick V Kelly Martin Kessler Peter Frederick Lang Charles Lewis Leslie Christopher Fred Madison Charles Albert Marconi Frank Frederick

Missed entities: Gore Asihford Gray lohn A G alt John Goss An lrew Gerken Ilerman 11 Gallup Will Seneca Gr ndl W illiam Ilarten J osep IIenderson Robert… and others…

False positives: John Goss Robert Barkley Thomas Michael Thomas Douglas George P Harp Jordan Joh Rufus Clifton Forrest Lumsden Paul B Christian F Morrison Robert David V Meiser Frederick S John Felix

Method        Correct    Precision   Recall    F1
Dictionary    4 / 72     19.05%      5.56%     8.60%
EO            10 / 72    41.67%      13.89%    20.83%

Results measured on set-aside TRAINING data

Page 23: Ancestry.com Data: DEG Results, Observations, & Future Directions.

EO Results – Example 3

Correct matches: Gore Asihford Gray lohn A G alt John Goss An lrew Getchell Everett G Gallup Will Seneca Gr ndl W illiam Ilarten J osep… and others …

Missed entities: Gerken Ilerman 11 HIarlp'r C harles Frank IIughes Jolhn .1 liughes Charles .1 Haill Ed ward .1 Jordan Joh nAlfred 1honalp I Lind say Alexalnder… and others …

False positives: Name co Address No Name C Address No Name ICo Address No Name Address No Gerken Ilerman HIarlp IIughes Jolhn liughes Charles Haill Ed ward Jordan Joh nAlfred Bi ak Flat… and others …

Method        Correct    Precision   Recall    F1
Dictionary    4 / 72     19.05%      5.56%     8.60%
EO            10 / 72    41.67%      13.89%    20.83%
New lines*    62 / 72    73.81%      86.11%    79.49%

Results measured on set-aside TRAINING data

Pattern 1: Alphabetical data up to new lines, all columns

Page 24: Ancestry.com Data: DEG Results, Observations, & Future Directions.

EO Results – Example 3

Correct matches: Gore Asihford Gray lohn A G alt John Goss An lrew Getchell Everett G Gallup Will Seneca Gr ndl W illiam Ilarten J osep IIenderson Robert… and others …

Missed entities: Gerken Ilerman 11 HIarlp'r C harles Frank IIughes Jolhn .1 liughes Charles .1 Haill Ed ward .1 Jordan Joh nAlfred 1honalp I Lind say Alexalnder… and others …

False positives: Gerken Ilerman HIarlp IIughes Jolhn liughes Charles Haill Ed ward Jordan Joh nAlfred Lind say Alexalnder i C urpliy PaI Mocalrtoe lhester lioujalin i

Method        Correct    Precision   Recall    F1
Dictionary    4 / 72     19.05%      5.56%     8.60%
EO            10 / 72    41.67%      13.89%    20.83%
New lines 1   62 / 72    73.81%      86.11%    79.49%
New lines 2   62 / 72    87.32%      86.11%    86.71%

Results measured on set-aside TRAINING data

Pattern 2: Alphabetical data up to new lines, selected columns

Page 25: Ancestry.com Data: DEG Results, Observations, & Future Directions.

EO Results – Example 3

Correct matches: Gore Asihford G alt John Getchell Everett G Gr ndl W illiam IIenderson Robert Hall CharlesPs rry Ilartoen Thomas Douglas Hayes William George… and others …

Missed entities: Gray lohn A Goss An lrew Gerken Ilerman 11 Gallup Will Seneca Ilarten J osep Harper Edwin F liarlan Robert Barkley… and others …

False positives: Gerken Ilerman HIarlp'r C harles Frank liughes Charles . Haill Ed ward . Jordan Joh nAlfred Lind say Alexalnder i C urpliy PaI'trick Mocalrtoe lhester lioujalin i

Method        Correct    Precision   Recall    F1
Dictionary    4 / 72     19.05%      5.56%     8.60%
EO            10 / 72    41.67%      13.89%    20.83%
New lines 1   62 / 72    73.81%      86.11%    79.49%
New lines 2   62 / 72    87.32%      86.11%    86.71%
New lines 3   45 / 72    84.91%      62.50%    72.00%

Results measured on set-aside TRAINING data

Pattern 3: All data up to new line or digit, selected columns
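A minimal sketch of the "up to new line or digit" rule in Pattern 3. It assumes the page has already been split into one entry per line. The line breaks are lost in this transcript, so the sample lines below are a hypothetical reconstruction, and column selection is ignored. The function names are assumptions.

```python
import re

def leading_field(line):
    """Return the text before the first digit on a line, stripped."""
    return re.split(r"\d", line, maxsplit=1)[0].strip()

def extract_by_line(page_text):
    """Apply the rule to every line and drop lines that yield nothing."""
    fields = (leading_field(line) for line in page_text.splitlines())
    return [field for field in fields if field]

# Hypothetical line layout built from entries quoted on the Example 3 slides.
page = "Gerken Ilerman 11\nGetchell Everett G\n5 Name Address No\nGallup Will Seneca"
entries = extract_by_line(page)
```

Cutting at the digit turns "Gerken Ilerman 11" into "Gerken Ilerman", which is how that entry appears among this slide's false positives.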

Page 26: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Context Exclusion

• Name not part of something else
  – Address – Hollis Long Island N Y
  – Schools – Trenton State Normal School
  – Companies – K. L. Polk & Co.

• Note: It might be interesting to recognize these names too, but mark them as being part of something else.
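A context-exclusion check can be sketched as a filter over candidate names. This is only an illustration: the cue words, the three-token window, the helper name `exclude_in_context`, and the sample sentence are all assumptions, not the project's exclusion rules.

```python
# Hypothetical cue words suggesting a school, street address, or company.
CUES = {"School", "College", "Ave", "Co"}

def exclude_in_context(text, candidates, window=3, cues=CUES):
    """Drop candidates that contain, or are closely followed by, a cue word."""
    kept = []
    for name in candidates:
        start = text.find(name)  # first occurrence only, for simplicity
        following = text[start + len(name):].split()[:window] if start >= 0 else []
        tokens = name.split() + following
        if not any(token.strip(".") in cues for token in tokens):
            kept.append(name)
    return kept

# Invented sentence combining phrases from the Example 4 document.
text = ("Martha C Bunnell taught at Trenton State Normal School "
        "and lived on Morris Ave Union N J")
kept = exclude_in_context(text, ["Martha C Bunnell", "Trenton State", "Morris Ave"])
```

Following the note above, a variant could keep the excluded candidates and tag them as part of a school or address instead of discarding them.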

Page 27: Ancestry.com Data: DEG Results, Observations, & Future Directions.

EO Results – Example 4

I GERTRUDE SMITH (Mrs William E Haines deceased) Married shortly after graduation Died at age of 22 Was musician and taught piano lessons 1898 HOBART L BENEDICT Millburn Essex County N J Graduated from Rutgers 1902 and from New York Law School in 1904 with degrees of B Sc M Sc and LL B Married April 9 1907 to Martha C Bunnell One daughter Elizabeth Benedict Counsellor at law with offices in Elizabeth and Millburn MARTHA BUNNELL (Mrs Hobart L Benedict) Millburn Essex County N J Married to Hobart L Benedict on date above 1899 CORA SMITH (Mrs Louis Slingerland) 557 Third St South St Peters- burg Florida Married ……

Page 28: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Example 4 - Dictionary

Correct matches: Elizabeth Benedict MARTHA BUNNELL Mrs Louis Slingerland Louis Slingerland Mrs Harry Engel Mrs Leslie Ward

Missed entities: GERTRUDE SMITH Mrs William E Haines HOBART L BENEDICT Martha C Bunnell Mrs Hobart L Benedict Hobart L Benedict CORA SMITH Mr Slingerland JENNIE HAINES STELLA ILLSLEY WALTER BOSCHEN GEORGE McQUAIDE MARGARET HAINES ABBY HEADLEY CLARENCE GRIGGS

False positives: Mrs William York Law School Mrs Hobart High School Mr Slingerland Ave Union Union School Trenton Looker School Union Town Hollis Long Island Morris Ave Union Battin High School Rutgers College Ave Union Ave Union Trenton State School Roselle

Results measured on TRAINING data

Method        Correct    Precision   Recall    F1
Dictionary    6          27.27%      28.57%    27.91%

Page 29: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Example 4 - Tuned

Correct matches: GERTRUDE SMITH Mrs William E Haines Martha C Bunnell Elizabeth Benedict MARTHA BUNNELL Mrs Hobart L Benedict CORA SMITH Mrs Louis Slingerland Louis Slingerland Mr Slingerland JENNIE HAINES STELLA ILLSLEY Mrs Harry Engel GEORGE McQUAIDE MARGARET HAINES ABBY HEADLEY Mrs Leslie Ward CLARENCE GRIGGS

Missed entities: HOBART L BENEDICT Hobart L Benedict WALTER BOSCHEN

False positives: Hollis Long

Results measured on TRAINING data

Method        Correct    Precision   Recall    F1
Dictionary    6          27.27%      28.57%    27.91%
Tuned EO      18         94.74%      85.71%    90.00%

Without exclusion: 18 correct, 85.71% precision, 85.71% recall, 85.71% F1
Additional false positives:
• Morris Ave
• Trenton State

Page 30: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Example 4 – Broadened

Correct matches: Mrs William E Haines HOBART L BENEDICT Martha C Bunnell Elizabeth Benedict MARTHA BUNNELL Mrs Hobart L Benedict Hobart L Benedict Mrs Louis Slingerland Louis Slingerland Mr Slingerland Mrs Harry Engel Mrs Leslie Ward

Missed entities: GERTRUDE SMITH CORA SMITH JENNIE HAINES STELLA ILLSLEY WALTER BOSCHEN GEORGE McQUAIDE MARGARET HAINES ABBY HEADLEY CLARENCE GRIGGS

False positives: Hollis Long Island N Y Union N J Springfield N J Union N J Elizabeth N J Newark N J Union N J Newark N J

Results measured on TRAINING data

Method        Correct    Precision   Recall    F1
Dictionary    6          27.27%      28.57%    27.91%
Tuned EO      18         94.74%      85.71%    90.00%
Broadened     12         57.14%      57.14%    57.14%

Candidates for exclusion (place names)

Page 31: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Example 4 – Patterns

Correct matches: GERTRUDE SMITH MARTHA BUNNELL Mrs Hobart L Benedict CORA SMITH Mrs Louis Slingerland JENNIE HAINES STELLA ILLSLEY Mrs Harry Engel WALTER BOSCHEN MARGARET HAINES ABBY HEADLEY Mrs Leslie Ward CLARENCE GRIGGS

Missed entities: Mrs William E Haines HOBART L BENEDICT Martha C Bunnell Elizabeth Benedict Hobart L Benedict Louis Slingerland Mr Slingerland GEORGE McQUAIDE

False positives: None

Results measured on TRAINING data

Method        Correct    Precision   Recall    F1
Dictionary    6          27.27%      28.57%    27.91%
Tuned EO      18         94.74%      85.71%    90.00%
Broadened     12         57.14%      57.14%    57.14%
Pattern*      13         100.00%     61.90%    76.47%

Patterns:
• ALL CAPS, multiple words of multiple letters
• Initial capitals inside of parentheses
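The two document patterns named on this slide can be approximated with regular expressions. The expressions below are one possible reading of the descriptions, not the rules actually used, and the names `ALL_CAPS` and `IN_PARENS` are hypothetical.

```python
import re

# ALL CAPS: two or more consecutive words, each of two or more capital letters.
ALL_CAPS = re.compile(r"\b[A-Z]{2,}(?:\s+[A-Z]{2,})+\b")

# Initial capitals inside parentheses: every word between "(" and ")"
# must start with a capital letter.
IN_PARENS = re.compile(r"\(([A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*)\)")

# Excerpt assembled from the Example 4 text.
sample = (
    "I GERTRUDE SMITH (Mrs William E Haines deceased) Married shortly after "
    "graduation 1898 HOBART L BENEDICT Millburn Essex County N J "
    "MARTHA BUNNELL (Mrs Hobart L Benedict) Millburn"
)
pattern_names = ALL_CAPS.findall(sample) + IN_PARENS.findall(sample)
```

Under this reading, "HOBART L BENEDICT" is skipped because of the single-letter initial, and "(Mrs William E Haines deceased)" is skipped because of the lowercase word. Both names also appear among this slide's missed entities.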

Page 32: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Extraction Ontologies – circa May 09

• May 09 version of the EO engine
  – OCR problems reduce potential accuracy
  – Can tune rules, potentially even for OCR, but …
• Summer 09 observations
  – Document patterns can help
  – But must apply judiciously
    • Discover Pattern (possibly more than one simultaneously)
    • Discover Extent of Pattern

Page 33: Ancestry.com Data: DEG Results, Observations, & Future Directions.

Future Work

• Test-set trials for EO
  – Code backend conversion to work with Thomas’s evaluator
  – Tune expressions, as needed, on training set, …
  – Run trials
• Revise EO engine for patterns
  – Pattern Discovery
    • Extract with dictionary
    • Test for patterns
      – Internal patterns such as all caps or name order
      – External patterns such as within parens, bounded by …
  – Pattern-Extent Discovery
    • Pattern sequence begin and end
    • Multiple patterns within the same extent