Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin Collection

Bruce Robertson, Mount Allison University

ἀλήθεια: truth

Ἀλήθεια

• ‘Breathing’ marks on vowels at beginning of a word

• Accents possible on all vowels
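
To make the stacking of breathings and accents concrete, the following minimal sketch decomposes a Greek word into base letters and combining marks using Python's standard `unicodedata` module. The helper name `split_marks` is illustrative only and is not part of the OCR work described in these slides.

```python
import unicodedata

def split_marks(word):
    """Return (base letters, names of combining marks) for a word."""
    # NFD separates each precomposed letter into base letter + diacritics.
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]
    return base, marks

base, marks = split_marks("ἀλήθεια")
print(base)   # the word with its diacritics stripped
print(marks)  # one breathing mark and one accent
```

An OCR classifier can treat a letter plus its marks either as one composite glyph or as separate components, which is one reason the number of classes grows so quickly for polytonic Greek.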

Diversity of Greek Fonts in 19th C.

Other Examples

Greek OCR With Gamera

• Dalitz and Brandt provide an experimental framework
– I added splitting, grouping, SQL output, etc.
• Teams of undergraduates making multiple classifiers
– Based on families of fonts
– Comparing strategies of composite characters, splitting, etc.
– Must also train for Latin scripts used
• Not yet working on post-processing

Good Results

Systematic Approach to Automated Greek OCR

• Remove the curator from the loop
– Especially important for journals, monographs, etc.
– Assign classifier by computational means
• Using:
– Federico Boschetti’s ground-truth-less Greek text evaluator
– Atlantic Computational Excellence Network, Atlantic Canada’s parallel computing network

Process

• 160 Greek-heavy texts chosen
• Of these, random samples of 10 pages were taken
• Each was processed with each of the 20 classifiers made this summer
• The results were evaluated and given a ‘Boschetti score’ from 0 to 1
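
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code that was run: `ocr` and `evaluate` are hypothetical callables standing in for a Gamera run with a given classifier and for the Boschetti evaluator, and averaging the per-page scores is an assumption, since the slides do not say how page scores were combined.

```python
import random

def score_classifiers(pages, classifiers, ocr, evaluate, sample_size=10, seed=0):
    """Mean evaluator score per classifier over a random sample of pages.

    ocr(page, classifier_name) -> text and evaluate(text) -> float in [0, 1]
    are hypothetical stand-ins supplied by the caller.
    """
    if not pages:
        raise ValueError("no pages to sample from")
    rng = random.Random(seed)
    sample = rng.sample(pages, min(sample_size, len(pages)))
    scores = {}
    for name in classifiers:
        page_scores = [evaluate(ocr(page, name)) for page in sample]
        scores[name] = sum(page_scores) / len(page_scores)  # assumed: simple mean
    return scores

def best_classifier(scores):
    """Name of the classifier with the highest score."""
    return max(scores, key=scores.get)

# Toy stand-ins so the sketch runs end to end; the numbers are made up.
pages = [f"page-{i}" for i in range(30)]
fake_quality = {"Oxford": 0.5, "Bekker": 0.25}
scores = score_classifiers(
    pages,
    fake_quality,
    ocr=lambda page, name: name,
    evaluate=lambda text: fake_quality[text],
)
print(best_classifier(scores))
```

Because every text/classifier pair is independent, a loop of this shape is easy to spread across a parallel computing network.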

[Chart: scores per classifier for each sampled volume; vertical axis ticks from 0 to 0.6.
Volume labels: 0EQOAAAAYAAJ, 0OkuAAAAQAAJ, 0qBEAAAAMAAJ, 0w8ANA2-puEC, 0xcOAAAAYAAJ, 0zABAAAAMAAJ, 14lfAAAAMAAJ, 190NAAAAYAAJ, 1DUrAAAAYAAJ.
Legend (classifiers): 16thcentAlpha_Font, Aristides_Dindorf, Aristides_Dindorf_1, Bekker, Bude, Cambridge, Early_Teubner, Etymologicum, gamera-greekocr-training-loeb-separatistic-2011-04-25, Kurke, Lexicon, Littre, Loeb_Wholistic, New_Teubner, Oribase_Font, Oribase_Font_1, Oribase_Font_2, Oribase_Test, Oxford, Smyth, Super_Swirly, Super_Swirly2, Teubner_Latin, Teubner_SansSerif, Teubner_Similar, Teubner_Similar2, Teubner_Slim]

Google/ABBYY Line Splitting

Gamera’s Text Line Finding (bbox_merging)

Replaced with runlength_smearing
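
As background, run-length smearing in general closes short white gaps between black pixels so that the glyphs of one text line merge into a single blob. The sketch below shows only the horizontal pass on a binary image stored as lists of 0 (white) and 1 (black). It is a generic illustration, not Gamera's `runlength_smearing` implementation, and `max_gap` is an illustrative parameter name.

```python
def smear_row(row, max_gap):
    """Fill white runs of length <= max_gap that lie between two black pixels."""
    out = list(row)
    last_black = None
    for i, pixel in enumerate(row):
        if pixel:
            gap = i - last_black - 1 if last_black is not None else 0
            if 0 < gap <= max_gap:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out

def smear_horizontal(image, max_gap):
    """Apply horizontal smearing to every row of a binary image."""
    return [smear_row(row, max_gap) for row in image]

row = [1, 0, 0, 1, 0, 0, 0, 0, 1]
print(smear_row(row, max_gap=2))  # the 2-pixel gap is filled, the 4-pixel gap is kept
```

A full segmentation would typically add a vertical pass and a connected-component step; those are omitted here to keep the sketch short.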

Two-step processing

Future Work

• Combining and re-optimizing classifiers?
• Assign classifier based on Latin text
– Is ‘Oxford’, ‘Clarendon’ or ‘Oxonii’ in the first pages of output?
• Align with Google’s output, and provide Google with corrected Greek
• Implement line-splitting from other OCR engines
• Discover badly OCR’d Greek in others’ output
• Implement OCR correction frameworks described here
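
The idea of assigning a classifier from the Latin-script text of the opening pages could be sketched as a simple keyword lookup. The mapping and the function name `guess_classifier` below are hypothetical; only the three keywords and the existence of an ‘Oxford’ classifier come from the slides.

```python
# Hypothetical mapping from imprint keywords to a classifier name.
IMPRINT_KEYWORDS = {
    "Oxford": "Oxford",
    "Clarendon": "Oxford",
    "Oxonii": "Oxford",
}

def guess_classifier(first_pages_text, keywords=IMPRINT_KEYWORDS, default=None):
    """Return the classifier for the first keyword found, else `default`."""
    lowered = first_pages_text.lower()
    for keyword, classifier in keywords.items():
        # Case-insensitive substring match, so title pages set in capitals still hit.
        if keyword.lower() in lowered:
            return classifier
    return default

print(guess_classifier("OXONII: E TYPOGRAPHEO CLARENDONIANO"))
print(guess_classifier("Lipsiae, in aedibus B. G. Teubneri"))
```

A substring match is deliberately loose, so that inflected forms such as ‘Clarendoniano’ still match; noisy OCR of the title page may call for fuzzier matching.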

Common Problems

• Assessments of pre-processing strategies and tools
• Schemas for page description

Thanks

• Colleagues in Dynamic Variorum Editions:
– Greg Crane at Perseus / Tufts
– Brian Fuchs at Imperial College
• Federico Boschetti
• AceNet, especially tech. support of Sergiy Khan
