Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin Collection
Bruce Robertson, Mount Allison University
Dec 19, 2015
ἀλήθεια ‘truth’
Ἀλήθεια
• ‘Breathing’ marks on vowels at the beginning of a word
• Accents possible on all vowels
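To make the breathing and accent marks concrete, here is a minimal sketch that uses Python's standard `unicodedata` module to split each letter of ἀλήθεια into a base character plus its combining marks. The helper name `decompose` is illustrative only and is not part of Gamera or of the classifiers described in these slides.

```python
import unicodedata


def decompose(word):
    """Return (base character, [combining mark names]) for each letter.

    Uses canonical decomposition (NFD), so a precomposed letter such as
    alpha-with-smooth-breathing becomes a plain alpha followed by a mark.
    """
    result = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and result:
            # Non-zero combining class: attach the mark to the previous base.
            result[-1][1].append(unicodedata.name(ch))
        else:
            result.append((ch, []))
    return result


# The word from the slide, written with escapes to avoid encoding ambiguity.
word = "\u1f00\u03bb\u03ae\u03b8\u03b5\u03b9\u03b1"  # ἀλήθεια
for base, marks in decompose(word):
    print(base, marks)
```

The initial alpha carries the smooth breathing (reported by Unicode as a combining comma above) and the eta carries the acute accent. An OCR classifier has to recognise each such combination either as one composite glyph or as separate parts.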
Greek OCR With Gamera
• Dalitz and Brandt provide an experimental framework
– I added splitting, grouping, SQL output, etc.
• Teams of undergraduates making multiple classifiers
– Based on families of fonts
– Comparing strategies for composite characters, splitting, etc.
– Must also train for Latin scripts used
• Not yet working on post-processing
Systematic Approach to Automated Greek OCR
• Remove the curator from the loop
– Especially important for journals, monographs, etc.
– Assign classifier by computational means
• Using:
– Federico Boschetti’s ground-truth-less Greek text evaluator
– Atlantic Computational Excellence Network, Atlantic Canada’s parallel computing network
Process
• 160 Greek-heavy texts chosen
• Of these, random samples of 10 pages were taken
• Each was processed with each of the 20 classifiers made this summer
• The results were evaluated and given a ‘Boschetti score’ from 0 to 1
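The process above can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample is taken to be 10 pages per text, page scores are averaged per classifier (the slides do not say how scores were aggregated), and `classifiers` and `score_fn` are hypothetical stand-ins for the Gamera classifiers and for Boschetti's evaluator, whose real interfaces are not shown here.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple


def sample_pages(pages: Sequence[str], k: int = 10, seed: int = 0) -> List[str]:
    """Draw a random sample of up to k pages from one text."""
    if len(pages) <= k:
        return list(pages)
    return random.Random(seed).sample(list(pages), k)


def best_classifier(
    pages: Sequence[str],
    classifiers: Dict[str, Callable[[str], str]],  # hypothetical: name -> OCR function
    score_fn: Callable[[str], float],              # hypothetical: OCR text -> score in [0, 1]
) -> Tuple[str, Dict[str, float]]:
    """Run every classifier on the sampled pages and return the top scorer.

    Assumption: a classifier's score for a text is the mean of its page scores.
    """
    if not pages:
        raise ValueError("no pages to evaluate")
    scores = {}
    for name, run_ocr in classifiers.items():
        page_scores = [score_fn(run_ocr(page)) for page in pages]
        scores[name] = sum(page_scores) / len(page_scores)
    best = max(scores, key=scores.get)
    return best, scores
```

Because each (text, classifier) pair is independent, the outer loop parallelises naturally, which is why a parallel computing network suits the job.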
[Chart: results for sample volumes 0EQOAAAAYAAJ, 0OkuAAAAQAAJ, 0qBEAAAAMAAJ, 0w8ANA2-puEC, 0xcOAAAAYAAJ, 0zABAAAAMAAJ, 14lfAAAAMAAJ, 190NAAAAYAAJ and 1DUrAAAAYAAJ, with the value axis running from 0 to 0.6 and one series per classifier. Legend: 16thcent, Alpha_Font, Aristides_Dindorf, Aristides_Dindorf_1, Bekker, Bude, Cambridge, Early_Teubner, Etymologicum, gamera-greekocr-training-loeb-separatistic-2011-04-25, Kurke, Lexicon, Littre, Loeb_Wholistic, New_Teubner, Oribase_Font, Oribase_Font_1, Oribase_Font_2, Oribase_Test, Oxford, Smyth, Super_Swirly, Super_Swirly2, Teubner_Latin, Teubner_SansSerif, Teubner_Similar, Teubner_Similar2, Teubner_Slim]
Future Work
• Combining and re-optimizing classifiers?
• Assign classifier based on Latin text
– Is ‘Oxford’, ‘Clarendon’ or ‘Oxonii’ in the first pages of output?
• Align with Google’s output, and provide Google with corrected Greek
• Implement line-splitting from other OCR engines
• Discover badly OCR’d Greek in others’ output
• Implement OCR correction frameworks described here
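The idea of assigning a classifier from Latin-script text on the opening pages could be sketched as below. This is a hypothetical illustration of the proposed future work, not something already implemented: the function name, the case-insensitive substring match, and the choice to map these three words to an Oxford-style classifier are all assumptions.

```python
# Imprint words named on the slide as possible signs of an Oxford edition.
OXFORD_HINTS = ("Oxford", "Clarendon", "Oxonii")


def suggests_oxford(first_pages_text: str) -> bool:
    """Return True if any Oxford imprint word occurs in the OCR output.

    Matching is a case-insensitive substring test, so inflected forms such
    as 'Clarendoniano' also count. Real OCR output is noisy, so a fuzzy
    match might be needed in practice.
    """
    lowered = first_pages_text.lower()
    return any(hint.lower() in lowered for hint in OXFORD_HINTS)


print(suggests_oxford("OXONII e typographeo Clarendoniano"))
print(suggests_oxford("Lipsiae in aedibus B. G. Teubneri"))
```

A text flagged this way could then be routed directly to an Oxford-trained classifier, skipping the run over every classifier.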