Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin Collection Bruce Robertson, Mount Allison University

Dec 19, 2015

Page 1:

Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin Collection

Bruce Robertson, Mount Allison University

Page 2:

ἀλήθεια: truth

Ἀλήθεια

• ‘Breathing’ marks on vowels at beginning of a word

• Accents possible on all vowels
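
Why these marks matter for OCR can be illustrated with a small sketch, which is not part of the original talk. It uses Python's standard `unicodedata` module to split each letter of ἀλήθεια into a base letter plus its combining marks; the helper name `split_marks` is hypothetical.

```python
import unicodedata

def split_marks(word):
    """Return a list of (base_letter, [mark names]) pairs for a word."""
    result = []
    # NFD separates a precomposed letter into base letter + combining marks.
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and result:
            # Attach the mark to the letter that precedes it.
            result[-1][1].append(unicodedata.name(ch))
        else:
            result.append((ch, []))
    return result

# The word from the slide, written with escapes: alpha with smooth
# breathing, lambda, eta with acute accent, theta, epsilon, iota, alpha.
word = "\u1f00\u03bb\u03ae\u03b8\u03b5\u03b9\u03b1"
for base, marks in split_marks(word):
    print(base, marks)
```

Each vowel can carry several such marks at once, so a classifier has to either learn every composite glyph as its own class or recognise base letters and marks separately and recombine them.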

Page 3:

Diversity of Greek Fonts in 19th C.

Page 4:

Other Examples

Page 5:

Greek OCR With Gamera

• Dalitz and Brandt provide an experimental framework
  – I added splitting, grouping, SQL output, etc.
• Teams of undergraduates making multiple classifiers
  – Based on families of fonts
  – Comparing strategies of composite characters, splitting, etc.
  – Must also train for Latin scripts used
• Not yet working on post-processing

Page 6:

Good Results

Page 7:

Page 8:

Systematic Approach to Automated Greek OCR

• Remove the curator from the loop
  – Especially important for journals, monographs, etc.
  – Assign classifier by computational means
• Using:
  – Federico Boschetti’s ground-truth-less Greek text evaluator
  – Atlantic Computational Excellence Network, Atlantic Canada’s parallel computing network

Page 9:

Process

• 160 Greek-heavy texts chosen
• Of these, random samples of 10 pages were taken
• Each was processed with each of the 20 classifiers made this summer
• The results were evaluated and given a ‘Boschetti score’ from 0 to 1
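
The selection loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the project: `run_ocr` and `score_text` are hypothetical callables standing in for the Gamera OCR run and the ground-truth-less evaluator, and averaging the page scores is an assumed way of combining them.

```python
import random
from statistics import mean

def pick_classifier(pages, classifiers, run_ocr, score_text,
                    sample_size=10, seed=0):
    """Score every classifier on a random page sample; return the best.

    run_ocr(page, classifier_name) -> text   (hypothetical stand-in)
    score_text(text) -> float in [0, 1]      (hypothetical stand-in)
    """
    rng = random.Random(seed)
    # Sample pages without replacement, as in "random samples of 10 pages".
    sample = rng.sample(pages, min(sample_size, len(pages)))
    averages = {}
    for name in classifiers:
        # Assumption: a classifier's score for a text is its mean page score.
        averages[name] = mean(score_text(run_ocr(page, name))
                              for page in sample)
    best = max(averages, key=averages.get)
    return best, averages
```

Because each (text, classifier) pair is independent of the others, the work divides naturally across the nodes of a parallel computing network.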

Page 10:

[Chart: results per volume and classifier; the value axis runs from 0 to 0.6.]

Volumes (Google Books IDs): 0EQOAAAAYAAJ, 0OkuAAAAQAAJ, 0qBEAAAAMAAJ, 0w8ANA2-puEC, 0xcOAAAAYAAJ, 0zABAAAAMAAJ, 14lfAAAAMAAJ, 190NAAAAYAAJ, 1DUrAAAAYAAJ

Classifiers: 16thcent, Alpha_Font, Aristides_Dindorf, Aristides_Dindorf_1, Bekker, Bude, Cambridge, Early_Teubner, Etymologicum, gamera-greekocr-training-loeb-separatistic-2011-04-25, Kurke, Lexicon, Littre, Loeb_Wholistic, New_Teubner, Oribase_Font, Oribase_Font_1, Oribase_Font_2, Oribase_Test, Oxford, Smyth, Super_Swirly, Super_Swirly2, Teubner_Latin, Teubner_SansSerif, Teubner_Similar, Teubner_Similar2, Teubner_Slim

Page 11:

Page 12:

Page 13:

Page 14:

Google/ABBYY Line Splitting

Page 15:

Gamera’s Text Line Finding (bbox_merging)

Page 16:

Replaced with runlength_smearing
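
For readers unfamiliar with the technique, the idea behind run-length smearing can be sketched in plain Python. This is a simplified illustration of the horizontal pass only, written for this transcript; it is not Gamera's implementation, and the function names and the `max_gap` parameter are hypothetical.

```python
def smear_row(row, max_gap):
    """Fill runs of 0s (white) no longer than max_gap between two 1s (black)."""
    out = list(row)
    last_black = None
    for i, px in enumerate(row):
        if px:
            gap = i - last_black - 1 if last_black is not None else 0
            if 0 < gap <= max_gap:
                # Short white gap between black pixels: paint it black.
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out

def smear_image(image, max_gap):
    """Apply the horizontal smear to every row of a binary image."""
    return [smear_row(row, max_gap) for row in image]

# Gaps of length 2 are bridged; the gap of length 4 is left open.
print(smear_row([1, 0, 0, 1, 0, 0, 0, 0, 1, 0], max_gap=2))
```

After smearing, the glyphs on one text line merge into long connected blobs, so a line can be found as a single region rather than assembled from many small bounding boxes.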

Page 17:

Two-step processing

Page 18:

Future Work

• Combining and re-optimizing classifiers?
• Assign classifier based on Latin text
  – Is ‘Oxford’, ‘Clarendon’ or ‘Oxonii’ in the first pages of output?
• Align with Google’s output, and provide Google with corrected Greek
• Implement line-splitting from other OCR engines
• Discover badly OCR’d Greek in others’ output
• Implement OCR correction frameworks described here
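
The idea of assigning a classifier from the Latin-script text could look roughly like the sketch below. It is only an illustration of the proposal: the keyword table, the mapping to a classifier named "Oxford", and the function name are all assumptions.

```python
# Hypothetical table: classifier name -> imprint keywords that suggest it.
IMPRINT_HINTS = {
    "Oxford": ("oxford", "clarendon", "oxonii"),
}

def guess_classifier(first_pages_text, hints=IMPRINT_HINTS):
    """Return the first classifier whose keywords occur in the text, else None."""
    text = first_pages_text.lower()
    for classifier, keywords in hints.items():
        if any(keyword in text for keyword in keywords):
            return classifier
    return None

print(guess_classifier("OXONII e typographeo Clarendoniano"))
```

A match would only be a prior for which classifier to try first; texts with no recognisable imprint would still need the full scoring run.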

Page 19:

Common Problems

• Assessments of pre-processing strategies and tools
• Schemas for page description

Page 20:

Thanks

• Colleagues in Dynamic Variorum Editions:
  – Greg Crane at Perseus / Tufts
  – Brian Fuchs at Imperial College
• Federico Boschetti
• AceNet, especially tech. support of Sergiy Khan