An Integrated System for Polytonic Greek OCR I. Generating the Data Bruce Robertson, Dept. of Classics, Mount Allison University, New Brunswick, Canada Digital Classicist Seminar, Institute of Classical Studies, London, UK, July 19, 2013

Digital Classicist London Seminars 2013 - Seminar 7 - Federico Boschetti & Bruce Robertson

Jun 24, 2015


An Integrated System For Generating And Correcting Polytonic Greek OCR
Federico Boschetti (CNR, Pisa) and Bruce Robertson (Mount Allison University, Canada)

Digital Classicist London & Institute of Classical Studies seminar 2013
Friday July 19th at 16:30, in Room S264, Senate House, Malet Street, London WC1E 7HU

In many fields, the digital books revolution provides wide and highly detailed access to pertinent texts; but this revolution has left behind scholars working with ancient Greek. While it is true that Hellenists have had digitized canonical texts for many years, these collections' relatively limited scope and restrictive licenses are increasingly at odds with recent currents in computer-based humanities research: linked data, large-scale text mining, and syntactic treebanking, to name a few. Perhaps the most important impediments to digitizing polytonic Greek have been the lack of high-quality optical character recognition for this script, especially under open-source licenses, and the lack of an assisted editor for polytonic Greek OCR output. In this seminar, we present an integrated system that fills these critical gaps, making it possible for polytonic Greek texts to be digitized en masse.

Rigaudon OCR is a complete suite of scripts, Python code and data required for producing polytonic Greek OCR. It comprises an OCR engine based on Gamera, with many features specific to the recognition of polytonic Greek, and specific classifiers to identify the characters in Teubner, Teubner-sans-serif, OCT/Loeb, and Didot editions. It includes an automatic spellchecker designed to correct Greek OCR errors, and it has a process for combining existing, high-quality Latin-script OCR output with parallel Greek output, as illustrated by this papyrological text. Finally, it coordinates these processes through the Sun Grid Engine scripts required to queue and parallelize them.
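As a rough illustration of the spellchecking idea, the Python sketch below compares a decomposed OCR string with dictionary words using an edit distance whose substitution costs come from a small table, so that diacritic confusions are cheaper than other edits. The table entries, the cost values, the acceptance threshold, and the function names are all assumptions made for this example; they are not taken from the Rigaudon code.

```python
import unicodedata

# Hypothetical weighting table: (ocr_char, dictionary_char) -> substitution cost.
# Only a few diacritic confusions are listed; all values are illustrative.
WEIGHTS = {
    ("\u0300", "\u0301"): 0.25,  # combining grave read where acute is expected
    ("\u0301", "\u0300"): 0.25,  # combining acute read where grave is expected
    ("\u0313", "\u0314"): 0.25,  # smooth breathing read for rough breathing
    ("\u0314", "\u0313"): 0.25,  # rough breathing read for smooth breathing
}
DEFAULT_COST = 1.0  # cost of any insertion, deletion, or unlisted substitution


def decompose(word):
    """Split precomposed polytonic characters into base letter + combining marks."""
    return unicodedata.normalize("NFD", word)


def weighted_distance(ocr_word, dict_word, weights=WEIGHTS):
    """Levenshtein distance over decomposed strings with table-driven substitutions."""
    a, b = decompose(ocr_word), decompose(dict_word)
    prev = [j * DEFAULT_COST for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * DEFAULT_COST]
        for j, cb in enumerate(b, 1):
            sub = 0.0 if ca == cb else weights.get((ca, cb), DEFAULT_COST)
            cur.append(min(prev[j] + DEFAULT_COST,      # delete ca
                           cur[j - 1] + DEFAULT_COST,   # insert cb
                           prev[j - 1] + sub))          # substitute or match
        prev = cur
    return prev[-1]


def best_correction(ocr_word, dictionary, max_cost=0.5):
    """Return the closest dictionary word, or the input if nothing is close enough."""
    if not dictionary:
        return ocr_word
    cost, word = min((weighted_distance(ocr_word, w), w) for w in dictionary)
    return word if cost <= max_cost else ocr_word
```

With this table, a word whose only fault is a grave accent where the dictionary has an acute is corrected, while a string that would need several full-cost edits is left untouched, which matches the 'light' correction the slides describe.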
Transcript
  • 1. An Integrated System for Polytonic Greek OCR I. Generating the Data Bruce Robertson, Dept. of Classics, Mount Allison University, New Brunswick, Canada Digital Classicist Seminar, Institute of Classical Studies, London, UK, July 19, 2013

  • 2. I. Generating the Data: A. Reason
  • 3-6. Why Ancient Greek OCR? (1) Rapid digitization of Greek texts not yet in digital libraries; (2) study of textual variants and app. crit.; (3) text reuse analysis; (4) general-purpose OCR search, like Google Books
  • 7-8. [Table: Use (Digitization, Textual Variants, Text Reuse, OCR Search) against Manual Editing? or Automatic Spell-checking?; cell values not preserved in the transcript]
  • 9. I. Generating the Data: B. Challenge
  • 10. Example Text: John 1:1. Ἐν ἀρχῇ ἦν ὁ λόγος, καὶ ὁ λόγος ἦν πρὸς τὸν θεόν, καὶ θεὸς ἦν ὁ λόγος.
  • 11. Acute, Grave and Circumflex Accents
  • 12. Smooth and Rough Breathing Marks
  • 13. Iota Subscript
  • 14. Diversity of Greek Fonts in 19th C. archive.org Texts
  • 15-16. Recognizing Lines
  • 17. Accurate Binarization
  • 18. Binarization Important to Results
  • 19. I. Generating the Data: C. Resources
  • 20. Contextless 'Greekness' Index: devised by Dr. Boschetti; based on dictionary and likely sequences of letters, etc.; named 'B-score' in these slides
  • 21. Archive.org Provides: thousands of volumes rendered in high-resolution (400 ppi+) colour images; OCR results from ABBYY FineReader, with excellent Latin-script recognition, poor Greek results, and top-quality line segmentation
  • 22. Open-source OCR Engines: Gamera (current focus of my team); Tesseract (Nick White has worked extensively on this to generate good results); OCRopus (Dr. Boschetti recently has been able to use Tesseract training sets for this engine)
  • 23. Interchange Format: HOCR
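hOCR stores OCR output as HTML in which each recognized element carries a class (for example ocr_line or ocrx_word) and a title attribute holding its bounding box. The Python sketch below shows one way to pull words and their boxes out of such markup. The sample fragment and the function name are invented for illustration and are not Rigaudon output, and the sketch assumes the fragment is well-formed XML without namespaces.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample: one page, one line, two words, each with a bbox in its title.
SAMPLE_HOCR = """<div class="ocr_page" title="bbox 0 0 1000 1500">
  <span class="ocr_line" title="bbox 100 200 900 240">
    <span class="ocrx_word" title="bbox 100 200 180 240">Ἐν</span>
    <span class="ocrx_word" title="bbox 200 200 330 240">ἀρχῇ</span>
  </span>
</div>"""

# hOCR writes the box as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def words_with_boxes(hocr_fragment):
    """Return (text, (x0, y0, x1, y1)) for every ocrx_word element, in order."""
    root = ET.fromstring(hocr_fragment)
    found = []
    for element in root.iter("span"):
        if element.get("class") != "ocrx_word":
            continue
        match = BBOX.search(element.get("title", ""))
        if match:
            box = tuple(int(value) for value in match.groups())
            found.append((element.text or "", box))
    return found


for text, box in words_with_boxes(SAMPLE_HOCR):
    print(text, box)
```

Because every word keeps its page coordinates, two OCR runs over the same image can be compared word by word at the same physical location.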


  • 24. I. Generating the Data: D. Method
  • 25. [Flowchart: Rigaudon Greek OCR Process, in four stages: OCR; HOCR "Blending"; Automatic OCR Spellcheck; HOCR Latin / Greek Combining. Labels recoverable from the diagram: images from Archive.org (JP2 input); ABBYY OCR information file; ABBYY to HOCR conversion; page segmentation through HOCR input; Gamera 3.3.3 image recognition; Greek OCR library for Gamera; classifiers for Teubner Sans, Teubner Serif, Oxford (Loeb, etc.); parallel process x35 cores; HOCR results at a range of binarization thresholds; Boschetti scoring; score table for binarization thresholds; select highest-ranking binarization page; replace non-dictionary words with dictionary words from other binarization pages; reduction to unique Greek strings; 410,000-word dictionary from open Perseus Greek texts; weighted edit table for classifier; weighted Levenshtein automatic OCR spellchecker (x14 cores); per-volume spellcheck table; replace spellchecked words; "Does ABBYY OCR file contain Latin-script output?"; replace Latin-script output words with ones in the same position from Archive's ABBYY output; HOCR output]
  • 26. Raw HOCR Production Using Gamera: A plugin for Gamera OCR allows it to import high-quality line-segmentation information, compensating for Gamera's poor results in this critical function. A plugin outputs HOCR. A wrapper function generates a range of output pages based on binarization threshold (typically 10-20 per page).
  • 27. HOCR 'Blending': This step aims to gather, word by word, the 'best' results from the range of result pages for each image. It selects the highest-scoring result page overall. Where a Greek word in this page is not in the dictionary and another page has a dictionary word in the exact same physical location, it replaces the word with the dictionary word.
  • 28. Automatic Spellcheck: All pages in a volume are reduced to a set of unique, decomposed Greek strings. These are compared to the dictionary using Levenshtein distances. A 'weighting table', suitable for a given font, indicates which edits are preferable or allowed. The result is 'light' correction, esp. of diacritics.
  • 29. Automatic Spellcheck: Weighting Table [Greek characters inside the quoted strings were lost in extraction]
      ['replace', ur'', ur'', 1],  #for lunate fonts
      ['replace', ur'c', ur'', 1],  #for lunate fonts
      ['replace', ur'T', ur'', 1],
      ['replace', ur'r', ur'', 1],
      ['replace', ur'Uu', ur'', 1],
      ['replace', ur'Y', ur'', 1],
      ['replace', ur'E', ur'', 1],
      ['replace', ur'E', ur'', 2],
      ['replace', ur'Z', ur'', 1],
      ['replace', ur'K', ur'', 1],
  • 30. Optionally Injecting Greek into Original Latin HOCR: We don't want to try to get excellent Greek and Latin results, esp. when ABBYY and others do a better job with Latin. In the case that archive.org provides Latin OCR: if Rigaudon's output word is Greek, replace archive.org's ABBYY output word with Rigaudon's.
  • 31. Reporting
  • 32. I. Generating the Data: F. Results
  • 33-34. Results [sample output; Greek text not preserved in the transcript]
  • 35. I. Generating the Data: G. Future
  • 36. Multiple OCR Engines: Take ABBYY data out of the process; with 'cleaning', Tesseract's line segmentation is often as good. Use Nick White's general-purpose polytonic classifier and ones specifically designed for a font. [Flowchart labels: Tesseract, Gamera, OCRopus; Line Segmentation; OCR; HOCR "Blending"; HOCR results at a range of binarization thresholds; Boschetti scoring; score table for binarization thresholds; select highest-ranking binarization page; replace non-dictionary words with dictionary words from other binarization pages]
  • 37. Resources: Output: http://heml.mta.ca/rigaudon; Code: https://github.com/brobertson/rigaudon. Further Topics: HPC computing with Grid Engine; Python Flask web microframework; making book images
  • 38. An Integrated System for Generating and Correcting Polytonic Greek OCR, Bruce Robertson and Federico Boschetti. Part II: The Proof-reading Process. Federico Boschetti, ILC-CNR of Pisa. Digital Classicist Seminars, London, 19 July 2013
  • 39. Introduction: Manual corrections on OCR output may be performed by: Experts (classicists devoted to proof-reading for a long-term project); Data Entry Firms (professional proof-readers not skilled in the target language(s)); Crowd Sourcing (students who are learning the target language(s)); Random Volunteers (people with heterogeneous education and skills)
  • 40. Introduction: For this reason, proof-reading tools focused on ancient languages should be: centralized; easy to use; based on image / text comparison line by line; optimized to draw attention to possible errors, distinguished by category; efficient in providing the most probable correction
  • 41. Overview: 1 Information Aggregation (Enriched hocr files; Alignment with other editions; False negatives); 2 Proof-reader Web Application; 3 False positives
  • 42. Enriched hocr files: The OCR output is formatted in the hocr microformat. The hocr output produced by Rigaudon is postprocessed in order to add information managed by the Proof-reading Web Application. Multiple sources: dictionaries with and without diacritics; multiple editions of the same work (if available); syllabic repertory
  • 43. Dictionaries: In order to identify possible errors and provide good suggestions to correct them, the OCR output is spell-checked and the potential errors are processed step by step. The spell-checker is based on dictionaries generated from the Perseus text collection. An upper-case dictionary is used to evaluate whether a character sequence is a word with a wrong accent or breathing mark.
  • 44. Alignment with other editions: When another edition of the same work is available, the two editions are aligned word by word by applying the Needleman-Wunsch algorithm. [Alignment example; Greek text not preserved in the transcript]
  • 45. False negatives and the risk of digital contaminatio: An example. Rigaudon on the Anecdota Graeca edited by Cramer recognizes the word [Greek word lost], which is rejected by the current spellchecker. The spell-checker suggests [Greek word lost] as a correction. The alignment with Koster's edition of the Prolegomena de comoedia also suggests [Greek word lost]. But the page image contains [Greek word lost], a late form attested from Athenaeus to the Byzantine period.
  • 46. Syllabication: In order to prevent false negatives due to (rare) variants ignored by the dictionaries, the system distinguishes between random character sequences and well-formed syllabic sequences. Each potential error is divided into syllables, and each syllable is evaluated according to its position. For example, [Greek syllables lost] is a well-formed syllabic sequence: the first is a valid Greek initial syllable and the second is a valid Greek final syllable.
  • 47. Overview: 1 Information Aggregation; 2 Proof-reader Web Application (The web interface; Cues; Self-corrections); 3 False positives
  • 48. Centralization: The proof-reader is a web application inspired by the Mozilla hocr Editor interface, but it employs the WikiSource collaborative philosophy. Texts are stored in a central native XML database.
  • 49. The Control Panel
  • 50. Image / Text Pairs
  • 51. Cues: Wrong accents and breathing marks (attention is focused on diacritics); Self-corrections (special care is necessary to avoid the risk of contaminatio); Errors (random errors)
  • 52. Example
  • 53. Self-corrections and suggestions generated by alignment: In a self-correction, the reading has been replaced by the aligned word of another edition. Self-corrections need three conditions: (1) the character sequence is refused by the spell-checker; (2) the edit distance between the character sequence and the aligned edition is very small; (3) the character sequence is not a well-formed syllabic sequence.
  • 54. Example
  • 55. Dynamic Dictionaries: Dictionaries used by the spell-checker are dynamically rebuilt when a milestone in proof-reading is reached. As the dictionaries are enlarged, rare variants are acquired and used to spell-check the next works.
  • 56. Overview: 1 Information Aggregation; 2 Proof-reader Web Application; 3 False positives
  • 57. False positives are deceitful: By definition, false positives pass the spell-checking. Especially if they are graphically similar to the correct word, such as [Greek words lost] in Greek or m and ni in Latin, they are difficult to see, in particular for proof-readers not skilled in the target language(s).
  • 58-59. Example
  • 60. Semantic Distance: Semantic distance is calculated along the nodes of WordNet's hierarchy, i.e. along the chain of hyponyms / hypernyms, in order to reach co-hyponyms. Different translations of the same concepts (e.g. vis in Latin and efficacia in Italian or efficacy in English) have semantic distance equal to zero. Semantically unrelated words (e.g. vinum in Latin and efficacia in Italian) have a large semantic distance.
  • 61. AncientWordNet: Synsets of AncientGreekWordNet and LatinWordNet have been extracted from bilingual dictionaries. They are aligned to modern languages such as English, Italian, etc.
  • 62. Conclusion: The proof-reading Web Application puts together the main features of the individual and collaborative proof-reading tools currently available. The entire work-flow is circular: Training OCR - Performing OCR - Spell-checking OCR - Correcting OCR - Enlarging dictionaries - Retraining OCR
  • 63. Thank you for your attention
  • 64. References
      S. Feng, R. Manmatha: A Hierarchical, HMM-based Automatic Evaluation of OCR Accuracy for a Digital Library of Books. JCDL 2006, 109-118 (2006)
      W.B. Lund, E.K. Ringger: Improving Optical Character Recognition through Efficient Multiple System Alignment. JCDL (2009)
      M. Reynaert: Non-interactive OCR Post-correction for Giga-Scale Digitization Projects. A. Gelbukh (ed.): CICLing 2008, LNCS 4919, 617-630 (2008)
      M. Reynaert: All, and only, the Errors: more Complete and Consistent Spelling and OCR-Error Correction Evaluation. 6th International Conference on Language Resources and Evaluation 2008, 1867-1872 (2008)
      C. Ringlstetter, K. Schulz, S. Mihov, K. Louka: The same is not the same - postcorrection of alphabet confusion errors in mixed-alphabet OCR recognition. 8th International Conference on Document Analysis and Recognition, 1, 406-410 (2005)
      M. Spencer, C. Howe: Collating texts using progressive multiple alignment. Computers and the Humanities, 37, 1, 97-109 (2003)
      G. Stewart, G. Crane, A. Babeu: A New Generation of Textual Corpora. JCDL 2007, 356-365 (2007)
      L. Zhuang, X. Zhu: An OCR Post-processing Approach Based on Multi-knowledge. 9th International Conference on Knowledge-Based Intelligent Information and Engineering Systems, 346-352 (2005)
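The word-by-word alignment of two editions mentioned in Part II can be illustrated with a small Needleman-Wunsch sketch in Python. It treats each word as one symbol and returns aligned pairs, with None marking a gap. The scoring values (match 1, mismatch -1, gap -1), the exact-equality comparison, and the function name are assumptions for this example; the slides do not say which scores or word comparison the proof-reading system uses.

```python
def align_words(edition_a, edition_b, match=1, mismatch=-1, gap=-1):
    """Globally align two word lists; return (word_a, word_b) pairs, None for a gap."""
    n, m = len(edition_a), len(edition_b)

    def pair_score(i, j):
        # Score for aligning word i of A with word j of B (1-based indices).
        return match if edition_a[i - 1] == edition_b[j - 1] else mismatch

    # score[i][j] = best score for aligning the first i words of A with the first j of B
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair_score(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            pairs.append((edition_a[i - 1], edition_b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((edition_a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, edition_b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


ocr_words = ["ἐν", "ἀρχῇ", "ἦν", "λόγος"]
other_edition = ["ἐν", "ἀρχῇ", "ἦν", "ὁ", "λόγος"]
for left, right in align_words(ocr_words, other_edition):
    print(left, right)
```

In practice a system of this kind would probably score near-matches (for instance by edit distance) rather than exact equality, so that an OCR misreading still lines up with its counterpart in the other edition.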