IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT Lexica in OCR and IR
Evaluation for Bulgarian, Czech, Dutch, English, French, German, Polish, Slovenian, Spanish
Jesse de Does
Contents
OCR evaluation
– Use of lexica in OCR
– Evaluation method
– (Non-final) results
IR evaluation
– Use of lexica in IR
– Evaluation method
– (Very preliminary) results
Use of lexica in OCR
This is not about post-correction, but about what happens during OCR.
Using the “Finereader Engine External Dictionary Interface”
Functionality:
– Any procedure that prunes a set of candidates and assigns weights can be implemented in this way
– Such a procedure need not be limited to the use of static word lists
– Permits dynamic implementations (spelling variation rules, morphology, …)
Finereader SDK external dictionaries
SDK users have to implement a COM interface to prune a set of “Fuzzy Words”.
Fuzzy Word: set of character recognition candidates for each position in a word
[Figure: a Fuzzy Word with alternative character candidates (c, f, o) per position; candidate words eerde and eerste]
The external dictionary prunes this to the linguistically possible ones (in this case: {eerste, eerde})
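The pruning idea can be sketched in a few lines of Python. This is not the Finereader COM interface; it only illustrates the principle, under the simplifying assumption that a Fuzzy Word is one set of candidate characters per position (a real Fuzzy Word may also cover candidates of different lengths, as with eerde and eerste). The names `expand` and `prune` and the sample lexicon are hypothetical.

```python
from itertools import product

def expand(fuzzy_word):
    """Enumerate every string allowed by the per-position candidates."""
    return ("".join(chars) for chars in product(*fuzzy_word))

def prune(fuzzy_word, lexicon):
    """Keep only the candidate strings that occur in the lexicon."""
    return sorted(w for w in expand(fuzzy_word) if w in lexicon)

# Hypothetical Fuzzy Word: one set of recognition candidates per position.
fuzzy_word = [{"e", "c"}, {"e", "c"}, {"r"}, {"s", "f"}, {"t"}, {"e", "c", "o"}]
# Hypothetical lexicon (illustrative only).
lexicon = {"eerste", "eerde", "eerst"}

candidates = prune(fuzzy_word, lexicon)
```

Enumerating all combinations is only practical for short words; a dynamic implementation (spelling variation rules, morphology) would replace the plain set lookup in `prune`.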
Finereader SDK external dictionaries
Of course, a lot of things may go wrong in this simple scenario:
– The lexicon may be too small (you will never have all spelling variations, compounds, …)
– The lexicon may include typical OCR errors (eu, cn, …)
– The Fuzzy Word may be too restricted (or, of course, too comprehensive)
[Figure: the Fuzzy Word example (eerde, eerste) with the pruned set {eerste, eerde} marked “x”]
OCR Evaluation
Evaluation of:
– Finereader SDK 10 with the default included dictionary
– Finereader SDK 10 with both the default dictionary AND a historical lexicon
Main performance indicator: word recall. After alignment, how many of the words in the ground truth have a (case-insensitive) match in the OCR? Errors on punctuation are not penalized.
Specific evaluation tool (word accuracy only):
– Workaround for region segmentation problems
– Displays specific information about dictionary coverage, performance on dictionary words, false friends, …
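As a rough illustration of the word recall measure, the sketch below aligns the ground truth and OCR word sequences with Python's `difflib.SequenceMatcher` and counts the ground-truth words that end up matched. The alignment method is a stand-in chosen for this example, not the project's evaluation tool; `words` and `word_recall` are hypothetical names, and the sample strings are invented.

```python
import re
from difflib import SequenceMatcher

def words(text):
    """Lower-case the text and keep only word tokens (punctuation is dropped)."""
    return re.findall(r"\w+", text.lower())

def word_recall(ground_truth, ocr):
    """Fraction of ground-truth words with a case-insensitive match after alignment."""
    gt, hyp = words(ground_truth), words(ocr)
    if not gt:
        return 0.0  # assumption: recall of an empty ground truth is reported as 0
    matcher = SequenceMatcher(a=gt, b=hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(gt)

# Invented example: one misrecognized word, differences in case and punctuation.
score = word_recall("De eerste dag, in Utrecht.", "De eerfte dag in utrecht")
```

Here four of the five ground-truth words are matched; the case and punctuation differences do not count as errors.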
Dictionary “cleaning”
Dictionary hallucinations:
– Many in-dictionary errors (“false friends”)
– Many errors on short words
Dictionary cleaning procedures:
– Remove false friends (words related by a frequent OCR substitution to much more frequent words)
– Remove infrequent short words (even if correct)
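The two cleaning procedures could look roughly like the following Python sketch. The thresholds (`ratio`, `min_len`, `min_freq`), the function names, and the sample frequencies and confusion pairs are all assumptions made for illustration; the slides do not state the values actually used.

```python
def substitutions(word, confusions):
    """Yield each string obtained by applying one OCR->GT substitution once."""
    for ocr, gt in confusions:
        start = word.find(ocr)
        while start != -1:
            yield word[:start] + gt + word[start + len(ocr):]
            start = word.find(ocr, start + 1)

def clean_lexicon(freq, confusions, ratio=100, min_len=3, min_freq=5):
    """Drop infrequent short words and probable false friends from a lexicon."""
    kept = {}
    for word, count in freq.items():
        # Infrequent short word: removed even if it is a correct word.
        if len(word) < min_len and count < min_freq:
            continue
        # False friend: one frequent OCR substitution away from a much more frequent word.
        if any(freq.get(v, 0) >= ratio * count for v in substitutions(word, confusions)):
            continue
        kept[word] = count
    return kept

# Hypothetical word frequencies and OCR->GT confusion pairs.
freq = {"the": 5000, "tbe": 3, "house": 400, "ye": 2, "of": 3000}
confusions = [("b", "h"), ("c", "e")]
cleaned = clean_lexicon(freq, confusions)
```

In this toy lexicon "tbe" is dropped as a false friend of "the", and "ye" is dropped as an infrequent short word, while the frequent short word "of" is kept.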
OCR→GT Freq.
п→и 968
ж→ѫ 825
н→и 732
н→п 579
е→с 441
п→н 378
и→н 356
ь→ъ 270
ъ→ь 256
г→т 247
и→п 242
ш→ні 218
д→л 114
OCR→GT Freq.
п→и 733
н→и 599
н→п 463
п→н 354
ь→ъ 330
ж→ѫ 283
и→н 249
ш→ні 220
ъ→ь 217
и→п 205
е→с 200
г→т 185
е→ѣ 165
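A confusion table of this kind can be obtained by counting the character spans that differ between aligned OCR and ground-truth words. The sketch below uses `difflib.SequenceMatcher` for the character alignment; this is an assumption for illustration and not necessarily how the counts above were produced. The function names and the word pairs are invented.

```python
from collections import Counter
from difflib import SequenceMatcher

def confusions(ocr_word, gt_word):
    """Yield (ocr_substring, gt_substring) for each replaced span."""
    matcher = SequenceMatcher(a=ocr_word, b=gt_word, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            yield ocr_word[i1:i2], gt_word[j1:j2]

def confusion_table(aligned_pairs):
    """Count OCR->GT substitutions over a list of (ocr_word, gt_word) pairs."""
    counts = Counter()
    for ocr_word, gt_word in aligned_pairs:
        counts.update(confusions(ocr_word, gt_word))
    return counts

# Invented (OCR, ground truth) word pairs.
pairs = [("пма", "има"), ("депь", "день"), ("пли", "или")]
table = confusion_table(pairs)
most_frequent = table.most_common(1)
```

Because replaced spans can be longer than one character, the same counting also captures substitutions such as ш→ні.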
1. Czech
– Co jest konstituce?, čili, Krátký, prostonárodní wýklad hlawnějších zásad konstitucí ewropejských, 1848
– Ferina Lišák z Kuliferdy a na Klukově, čili, Kratičká historye zlopověstných kousků starého Reinecke, 1848
– Homerowa Iliada, 1802
– Na den narození neimocněišího, a neijasněišího cysare rímského, téz dědičného rakauského a krále ceského, Frantiska II., w Praze 12. den mesyce Unora, léta 1805, 1805
– Plody sborů učenců řeči českoslowanské prešporského, 1836
– Rozprawy o gmenách, počátkách i starožitnostech národu Slawského a geho kmeni /, 1830
– Sokol, 1872
– Základowé pitwy (Anatomie), čili, Soustawnj rozbor a popis těla lidského a gednotliwých geho částek, 1840
2. Dutch
18th and 19th century books, newspapers, parliamentary papers, …
– Provinciale Overijsselsche en Zwolsche courant : staats-, handels-, nieuws- en advertentieblad, 1852-1852
– Rechtsgeleerd advis in de zaak van den gewezen stadhouder, en over deszelfs schryven aan de gouverneurs van de Oost- en West-Indische bezittingen van den staat [...]. Ingelevert [...] op den 7 january 1796. / By B. Voorda et al, 1796-1796
– Verhaal van het levensgevaar, waar in zig drie Rotterdamsche burgers [...] bevonden hebben, te Utrecht, 1784-1784
– Vrijmoedige aanmerkingen, over de uitsluiting van allen die door publieke armkassen bedeeld worden, als stemgerechtigden [...] bij eene oproeping van het Nederlandsche volk tot eene Nationaale Conventie, 1795-1795
English
Standard Finereader language: OldEnglish
15th-19th century material
2 sets:
– One general set, 15th-19th century
– One 17th century-specific set
General set with various choices of dictionary – no improvement!
More distinct improvement on 17th century set with special dictionary compiled from OED quotations dated 1580-1720
French
Standard Finereader language: OldFrench
17th century books
– Conduite du jugement naturel où tous les bons esprits de l'un et l'autre sexe pourront facilement puiser la pureté de la science, par M. Jacques Forton, sieur de S. Ange,..., 1653
– Dissertation de la philosophie en général, 1668
– La Dialectique du sieur de Launay, contenant l'art de raisonner juste sur toute sorte de matières..., 1673
– Lettre de M. Gadroys à M. de La Grange Trianon,... pour servir de réponse à celle que M. de Castelet a écrite contre les raisons de M. Descartes touchant le flux et le reflux de la mer. - Seconde lettre de M. Gadroys... [au même, sur le même sujet.], 1677
– Traitez de métaphysique démontrée selon la méthode des géomètres. [Par le sieur de La Coudraye.], 1693
German
Standard Finereader language: OldGerman
– Das Buch des heyligen Römischen Reichs unnderhalltunge, 1501
– Die Poesie ihr Wesen und ihre Formen mit Grundzügen der vergleichenden Literaturgeschichte, 1884
– Echo Deß Hochzeitlichen Te Deum Laudamus, 1722
– Ergebnisse der Erhebungen über die Beschäftigung gewerblicher Arbeiter an Sonn- und Festtagen, Bd.:1, Gruppe I bis VII der Gewerbestatistik, Berlin, 1887, 1887
– Quedlinburgisches Kreis-Tags-Memorial, 1673
– Von der Regierung der Kirche und den unterschiedlichen Würden der Geistlichkeit *(full title in comments), 1779
– Warhaffter und grundlicher Bericht uß was Ursachen Martinus du Voysin (zu Basel verburgerter Krämer) inn der Statt Surseew im Aargöw, ..., den 13. Tag Octobris deß 1608. Jars erstlich enthauptet, und volgends verbrennt worden, 1609
Polish
– Adwersaria, albo terminata sprawy wojennej, która się toczyła w wołoskiej ziemi z tureckim cesarzem, 1621
– Chorągiew Sarmacka w Wołoszech, to jest pospolite ruszenie i szczęśliwy powrót Polaków z Wołoch w roku 1621, 1621
– Diariusz wiadomości od wyjazdu króla z Wilna do Smoleńska, 1610
– Discurs o cenie pieniedzy teraznieyszey y o niektorych skutkach iey…, 1632
– Nowe Ateny, albo Akademia wszelkiey scyencyi pełna, na różne tytuły iak na classes podzielona, mądrym dla memoryału, idiotom dla nauki, politykom dla praktyki, melancholikom dla rozrywki erygowana ... . Część 3 albo Supplement., 1746
– Pasja żołnierzy obojga narodów w stolicy moskiewskiej krótko opisana, 1613
– Powodzenia niebezpiecznego ale szczęśliwego wojska j. k. m. w Multanach opisanie, 1601
– Relacja chwalebnej ekspedycji Jana Kazimierza, króla polskiego i szwedzkiego, 1650
– Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej, 1634
– Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej_BW, 1634
– Żałosne opisanie upadku króla hiszpańskiego na morzu i na lądzie, 1589
Slovene
– Genovefa, 1841
– Gosp. Krištofa Šmida korarja avgustanskiga, zgodBe S. Pisma za mlade ljud..., 1850
– Kmetijske in rokodelske novice, 1844
– Kratkozhasne uganke, 1788
– Kuharske Bukve, 1799
– Marianske Kempensar, ali Dvoje bukuvze, 1769
– Novice kmetijskih, rokodelnih in narodskih reči, 1851
– Sgodbe svetiga pisma za mlade ljudi, 1830
– Ta male katechismus, 1768
– Vezhna pratika od gospodarstva, 1789
– Zerkviza na skali, 1855
Spanish
– Carta athenagorica, 1690
– Commentarios reales, 1609
– El Parnasso español, 1648
– Obras de Garcilasso de la Vega con las anotaciones por el Mtro. Francisco Sánchez Brocense, 1612
– Obras de Lope de Vega, 1604
– Vida de Lazarillo de Tormes, 1652
Results
Summary
Evaluation of “IR”
Main question: Are we able to retrieve historical variants of words?
Practical evaluation criterion: Measure accuracy of modern lemma assignment
(If we can do this, good retrieval is possible)
More complete evaluation to follow soon – all partners are finishing the work
Evaluation method
Each language partner annotates ~10,000 tokens of Ground Truth with modern lemma and/or equivalent word form.
We measure performance of:
– Lemmatization with a modern lexicon
– Lemmatization with a modern lexicon and spelling variation patterns
– Lemmatization with a historical lexicon, a modern lexicon and spelling variation patterns
No context information is used.
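The three configurations can be pictured as one lookup cascade without context. The Python sketch below is only an illustration: the lookup order, the function names, and the tiny lexica and patterns are assumptions, not the IMPACT implementation. The match labels mirror the categories reported in the results.

```python
def apply_patterns(word, patterns):
    """Yield modernized spellings: each historical->modern rewrite applied once."""
    for old, new in patterns:
        start = word.find(old)
        while start != -1:
            yield word[:start] + new + word[start + len(old):]
            start = word.find(old, start + 1)

def lemmatize(word, modern, patterns=(), historical=None):
    """Return (set of candidate lemmata, match type) for a single token."""
    word = word.lower()
    if historical and word in historical:
        return historical[word], "historical exact"
    if word in modern:
        return modern[word], "modern exact"
    lemmata = set()
    for candidate in apply_patterns(word, patterns):
        lemmata |= modern.get(candidate, set())
    if lemmata:
        return lemmata, "modern with patterns"
    return set(), "no match"

# Hypothetical lexica (word form -> lemmata) and spelling variation patterns.
modern = {"love": {"love"}, "loved": {"love"}, "joy": {"joy"}}
historical = {"ioye": {"joy"}}
patterns = [("u", "v"), ("ie", "y")]

with_patterns = lemmatize("loue", modern, patterns)
with_historical = lemmatize("ioye", modern, patterns, historical)
unmatched = lemmatize("xyzzy", modern, patterns)
```

A token counts as correct in such an evaluation when the annotated modern lemma is among the returned candidates.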
English
Using the OED IR lexicon and a very restricted set of spelling variation patterns
Considered tokens: 9409
8994 had a correct lemma (recall 0.956)
Total correct suggestions: 8994
Average rank of correct lemma: 1.086280
Total possible lemmata: 23859
No match at all: 265
Matched with patterns: 1330
Exact match: 7814
Spanish
Using the Apertium modern Spanish lexicon and the IMPACT historical Spanish IR lexicon
9298 tokens considered
With only modern lexicon and patterns:
– 7473 with at least one correct lemma and 1825 without (recall 0.80)
– Average rank of correct lemma: 1.1
– Total suggestions: 9699
– No match at all: 991
– Modern exact: 7471
– Modern with patterns: 836
With historical lexicon, modern lexicon and patterns:
– 8864 with at least one correct lemma and 434 without (recall 0.926)
– Average rank of correct lemma: 1.16
– Total suggestions: 12417
– Historical lexicon exact match: 8265
– Modern exact: 305
– Modern with patterns: 186
– No match at all: 542
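The recall figures follow from the token counts as recall = tokens with a correct lemma / tokens considered. The short Python check below recomputes that ratio from the counts reported above and verifies that the match categories add up to the number of tokens; the dictionary layout and names are just a convenient assumption for this sketch.

```python
# Counts copied from the English and Spanish results above.
runs = {
    "English": {"tokens": 9409, "correct": 8994,
                "categories": [7814, 1330, 265]},
    "Spanish, modern + patterns": {"tokens": 9298, "correct": 7473,
                                   "categories": [7471, 836, 991]},
    "Spanish, historical + modern + patterns": {"tokens": 9298, "correct": 8864,
                                                "categories": [8265, 305, 186, 542]},
}

summary = {}
for name, run in runs.items():
    recall = round(run["correct"] / run["tokens"], 3)
    categories_add_up = sum(run["categories"]) == run["tokens"]
    summary[name] = (recall, categories_add_up)
```

The ratio gives 0.956 for English and 0.804 for Spanish with the modern lexicon and patterns, in line with the reported values. For the configuration with the historical lexicon the same ratio gives about 0.953, which differs from the reported 0.926; the slides do not indicate which of the two figures is off.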