Extracting Names Using Layout Clues in Genealogical Books Aaron Stewart David W. Embley March 20, 2010

Dec 14, 2015

Extracting Names Using Layout Clues

in Genealogical Books

Aaron Stewart

David W. Embley

March 20, 2010

Problem

Process

Finding Names

• Name recognition in genealogical texts

• Focus: Lists, Directories

Finding Names

It’s easy for us to spot names… But how does a computer do it?

Which side was easier?

Finding Names

Stanford Named Entity Recognizer

Apache UIMA Framework

CRF, MEMM

Natural Language Processing

?

BYU OntoES Ontology Extraction System

• Dictionary

• Regular Expressions
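A dictionary-plus-regular-expression extractor of the kind listed above can be sketched as follows. This is only an illustration, not OntoES itself: the tiny `GIVEN_NAMES` lexicon, the `CANDIDATE` pattern, and the `extract_names` function are all hypothetical.

```python
import re

# Hypothetical toy dictionary; a real lexicon would be far larger.
GIVEN_NAMES = {"John", "Mary", "William", "Sarah"}

# Assumed pattern: a capitalized word, an optional middle initial,
# and one more capitalized word.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z]\.)?(?: [A-Z][a-z]+)\b")


def extract_names(text):
    """Return regex candidates whose first token is in the dictionary."""
    names = []
    for match in CANDIDATE.finditer(text):
        candidate = match.group(0)
        if candidate.split()[0] in GIVEN_NAMES:
            names.append(candidate)
    return names
```

The regular expression proposes capitalized sequences and the dictionary filters them, so a place name such as "Salt Lake" is proposed but rejected because "Salt" is not a known given name.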

Part 1: Preprocessing

Ancestry.com Data

• Word text

• Word bounding boxes

• Genres:
  – Genealogical Books
  – City Directories
  – Yearbooks
  – Newspapers

Page Separator

Line Segment Identifier
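The slide gives only the component name, so the following is a guess at the general idea rather than the authors' method: given word text and word bounding boxes, words whose vertical centers are close together are grouped into one line. The `Word` fields, the `group_into_lines` function, and the `tolerance` value are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR word with its bounding box (field names are assumed)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float

    @property
    def center_y(self):
        return (self.top + self.bottom) / 2


def group_into_lines(words, tolerance=5.0):
    """Group words into lines by vertical center, then order each line left to right."""
    lines = []
    for word in sorted(words, key=lambda w: w.center_y):
        # Start a new line when the vertical gap to the previous word is too large.
        if lines and abs(word.center_y - lines[-1][-1].center_y) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [sorted(line, key=lambda w: w.left) for line in lines]
```

The left edge of the first word in each resulting line is the kind of point a margin finder could then consume.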

RANSAC Margin Finder
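As a rough sketch of how RANSAC could locate a margin, the generic algorithm can be applied to the left-edge points of text lines: repeatedly fit a line through two random points and keep the fit with the most inliers. The slides do not give the actual implementation; `ransac_margin`, its parameters, and the `x = slope * y + intercept` parameterization are assumptions.

```python
import random


def ransac_margin(points, iterations=200, threshold=3.0, seed=0):
    """Fit x = slope * y + intercept to (x, y) left-edge points.

    Returns (slope, intercept, inliers) for the candidate with most inliers.
    """
    rng = random.Random(seed)
    best = (0.0, 0.0, [])
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if y1 == y2:
            continue  # two points on the same row cannot define a margin line
        slope = (x2 - x1) / (y2 - y1)
        intercept = x1 - slope * y1
        inliers = [
            (x, y) for x, y in points
            if abs(x - (slope * y + intercept)) <= threshold
        ]
        if len(inliers) > len(best[2]):
            best = (slope, intercept, inliers)
    return best
```

Points that are not inliers of the dominant margin (for example, indented lines) could be passed through the same procedure again to find a second margin level.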

Margin Finder – Future Work

Key: Left, Center, Right

Margin Finder – Future Work

• ABBYY FineReader handles:
  – Paragraphs
  – Newspaper columns

• But has trouble with:
  – Hanging indents
  – Outline indentation (possibly)

Part 2: Pattern Finding

Pattern Finding

1. Apply baseline name extractor (OntoES)

2. Apply margin finder and insert markers

3. Find left and right context for each name

4. Apply common contexts to extract more names
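The four steps above can be sketched on toy data. Everything here is illustrative: the `<LEVEL1>`/`<LEVEL2>` marker tokens, the single-token left and right contexts, and the function names are assumptions, and the baseline names from step 1 are simply passed in as a set.

```python
from collections import Counter


def find_contexts(lines, known_names):
    """Step 3: count (left token, right token) pairs around each known name."""
    counts = Counter()
    for tokens in lines:
        for name in known_names:
            parts = name.split()
            n = len(parts)
            for i in range(1, len(tokens) - n):
                if tokens[i:i + n] == parts:
                    counts[(tokens[i - 1], tokens[i + n])] += 1
    return counts


def apply_context(lines, context, max_len=3):
    """Step 4: extract runs of 1..max_len tokens enclosed by the context pair."""
    left, right = context
    found = []
    for tokens in lines:
        for i, token in enumerate(tokens):
            if token != left:
                continue
            for j in range(i + 2, min(i + 2 + max_len, len(tokens))):
                if tokens[j] == right:
                    found.append(" ".join(tokens[i + 1:j]))
                    break
    return found


# Toy page: each line starts with the margin marker inserted in step 2.
lines = [
    ["<LEVEL1>", "John", "Smith", ",", "farmer"],
    ["<LEVEL1>", "Ezra", "Whitlock", ",", "clerk"],
    ["<LEVEL2>", "son", "of", "Mary", "Jones", "."],
    ["<LEVEL1>", "Mary", "Jones", ",", "teacher"],
]
baseline = {"John Smith", "Mary Jones"}  # stand-in for step 1 output

best_context = find_contexts(lines, baseline).most_common(1)[0][0]
extracted = apply_context(lines, best_context)
```

On this toy page the most common context is a `<LEVEL1>` marker on the left and a comma on the right, and applying it recovers "Ezra Whitlock", which the baseline set did not contain.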

Pattern Finding

1. Apply baseline name extractor (OntoES)

Pattern Finding

[Figure: sample page annotated with margin markers: six LEVEL 1, two LEVEL 2]

2. Apply margin finder and insert markers

Pattern Finding

[Figure: sample page annotated with margin markers: six LEVEL 1, two LEVEL 2]

3. Find left and right context for each name

Pattern Finding

[Figure: sample page annotated with margin markers: six LEVEL 1, two LEVEL 2]

4. Apply common context patterns to extract more names

Pattern Finding – Sample Results

Baseline Results
• Precision: 40%
• Recall: 31.25%
• F1: 35.09%

Results of Most Salient Pattern
• Precision: 51.52%
• Recall: 53.12%
• F1: 52.31%

Not all results are this good!
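The reported F1 values follow from the precision and recall figures via the harmonic mean. A minimal check, with `f1_score` as an illustrative helper name:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


baseline_f1 = round(100 * f1_score(0.40, 0.3125), 2)    # 35.09
pattern_f1 = round(100 * f1_score(0.5152, 0.5312), 2)   # 52.31
```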

Challenges

• Evaluation
  – More aligned data
  – Annotation tool

• Other books
  – Centered and right-aligned text
  – Knowing when to apply patterns