Page 1: Extracting Names Using Layout Clues in Genealogical Books Aaron Stewart David W. Embley March 20, 2010.

Extracting Names Using Layout Clues in Genealogical Books

Aaron Stewart

David W. Embley

March 20, 2010

Page 2:

Problem

Page 3:

Process

Page 4:

Finding Names

• Name recognition in genealogical texts

• Focus: Lists, Directories

Page 5:

Finding Names

It’s easy for us to spot names… But how does a computer do it?

Which side was easier?

Page 6:

Finding Names

Stanford Named Entity Recognizer

Apache UIMA Framework

CRF MEMM

Natural Language Processing

?

Page 7:

BYU OntoES Ontology Extraction System

• Dictionary

• Regular Expressions
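
A baseline that combines a dictionary with regular expressions might look roughly like the sketch below. This is not the OntoES implementation: the `GIVEN_NAMES` lexicon, the `NAME_PATTERN` expression, and the `extract_names` function are hypothetical names chosen for illustration.

```python
import re

# Hypothetical mini-dictionary of given names; a real lexicon would be far larger.
GIVEN_NAMES = {"John", "Mary", "Aaron", "David", "Sarah"}

# Assumed pattern: a capitalized word, an optional middle initial,
# and a capitalized surname.
NAME_PATTERN = re.compile(r"\b([A-Z][a-z]+)(?: [A-Z]\.)? ([A-Z][a-z]+)\b")


def extract_names(text):
    """Return regex matches whose first token appears in the dictionary."""
    return [
        match.group(0)
        for match in NAME_PATTERN.finditer(text)
        if match.group(1) in GIVEN_NAMES
    ]


print(extract_names("Born to John Smith and Mary A. Jones in Salem Village."))
# prints ['John Smith', 'Mary A. Jones']
```

The dictionary check is what rejects "Salem Village", which the regular expression alone would accept. It is also why such a baseline misses names that are not in the lexicon.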

Page 8:

Part 1: Preprocessing

Page 9:

Ancestry.com Data

• Word text

• Word bounding boxes

• Genres:
  – Genealogical Books
  – City Directories
  – Yearbooks
  – Newspapers

Page 10:

Page Separator

Page 11:

Line Segment Identifier

Page 12:

RANSAC Margin Finder
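
The slides name RANSAC but give no details of the margin finder, so the following is only a generic RANSAC line fit under stated assumptions: each text line contributes one `(x, y)` point for the left edge of its first word, and a margin is modeled as the line `x = slope * y + intercept`. The function name `ransac_margin` and the `tolerance`, `iterations`, and `seed` parameters are hypothetical.

```python
import random


def ransac_margin(points, tolerance=3.0, iterations=200, seed=0):
    """Fit a margin line x = slope * y + intercept to (x, y) left-edge points.

    Returns (slope, intercept, inliers) for the candidate with the most inliers.
    """
    rng = random.Random(seed)
    best = (0.0, 0.0, [])
    for _ in range(iterations):
        # Hypothesize a line through two randomly chosen points.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if y1 == y2:
            continue  # two points at the same height cannot define a margin
        slope = (x2 - x1) / (y2 - y1)
        intercept = x1 - slope * y1
        # Score the hypothesis by how many points lie close to the line.
        inliers = [
            (x, y)
            for x, y in points
            if abs(x - (slope * y + intercept)) <= tolerance
        ]
        if len(inliers) > len(best[2]):
            best = (slope, intercept, inliers)
    return best


# Six lines starting at x = 100 and two indented lines starting at x = 130.
edges = [(100, y) for y in range(0, 120, 20)] + [(130, 120), (130, 140)]
slope, intercept, inliers = ransac_margin(edges)
print(len(inliers), intercept)
```

One way to find a second margin, such as a hanging indent, would be to remove the inliers and run the fit again on the remaining points.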

Page 13:

Margin Finder – Future Work

Key: Left, Center, Right

Page 14:

Margin Finder – Future Work

• ABBYY FineReader handles:
  – Paragraphs
  – Newspaper columns

• But has trouble with:
  – Hanging indents
  – Outline indentation (possibly)

Page 15:

Part 2: Pattern Finding

Page 16:

Pattern Finding

1. Apply baseline name extractor (OntoES)

2. Apply margin finder and insert markers

3. Find left and right context for each name

4. Apply common contexts to extract more names
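
Steps 3 and 4 can be sketched as follows. The slides do not specify the representation, so this assumes a flat token list in which the margin finder has inserted markers such as `<L1>`, and names from the baseline given as `(start, end)` token spans. The functions `find_contexts` and `apply_context` and the `max_len` limit are hypothetical.

```python
from collections import Counter


def find_contexts(tokens, name_spans):
    """Count (left token, right token) pairs surrounding each known name span."""
    contexts = Counter()
    for start, end in name_spans:
        left = tokens[start - 1] if start > 0 else "<START>"
        right = tokens[end] if end < len(tokens) else "<END>"
        contexts[(left, right)] += 1
    return contexts


def apply_context(tokens, context, max_len=3):
    """Return spans of 1 to max_len tokens enclosed by the given context."""
    left, right = context
    spans = []
    for i, token in enumerate(tokens):
        if token != left:
            continue
        # Look for the right context after at least one candidate token.
        for j in range(i + 2, min(i + 2 + max_len, len(tokens))):
            if tokens[j] == right:
                spans.append((i + 1, j))
                break
    return spans


tokens = ["<L1>", "John", "Smith", ",", "farmer",
          "<L1>", "Mary", "Jones", ",", "teacher",
          "<L1>", "Ezra", "Pratt", ",", "miller"]
baseline_spans = [(1, 3), (6, 8)]  # names the baseline extractor found

best_context = find_contexts(tokens, baseline_spans).most_common(1)[0][0]
names = [" ".join(tokens[s:e]) for s, e in apply_context(tokens, best_context)]
print(names)
# prints ['John Smith', 'Mary Jones', 'Ezra Pratt']
```

Here the most common context is a level-1 margin marker on the left and a comma on the right. Applying it recovers "Ezra Pratt", which the baseline missed.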

Page 17:

Pattern Finding

1. Apply baseline name extractor (OntoES)

Page 18:

Pattern Finding

[Figure: sample page annotated with margin markers, six LEVEL 1 and two LEVEL 2]

2. Apply margin finder and insert markers

Page 19:

Pattern Finding

[Figure: sample page annotated with margin markers, six LEVEL 1 and two LEVEL 2]

3. Find left and right context for each name

Page 20:

Pattern Finding

[Figure: sample page annotated with margin markers, six LEVEL 1 and two LEVEL 2]

4. Apply common context patterns to extract more names

Page 21:

Pattern Finding – Sample Results

Baseline Results
• Precision: 40%
• Recall: 31.25%
• F1: 35.09%

Results of Most Salient Pattern
• Precision: 51.52%
• Recall: 53.12%
• F1: 52.31%

Not all results are this good!
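
The F1 figures reported on this slide are consistent with the standard definition of F1 as the harmonic mean of precision and recall. A small check, where `f1_score` is a helper name introduced here for illustration:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(100 * f1_score(0.40, 0.3125), 2))    # baseline: prints 35.09
print(round(100 * f1_score(0.5152, 0.5312), 2))  # salient pattern: prints 52.31
```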

Page 22:

Challenges

• Evaluation
  – More aligned data
  – Annotation tool

• Other books
  – Centered and right-aligned text
  – Knowing when to apply patterns

