IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Computer Lexica in OCR and Retrieval Katrien Depuydt, Jesse de Does (Instituut voor Nederlandse Lexicologie, Leiden)
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Computer Lexica in OCR and Retrieval
Katrien Depuydt, Jesse de Does (Instituut voor Nederlandse Lexicologie, Leiden)
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
4 March 2009 presentation The Hague 2
Can we handle ‘de wereld’ (‘the world’)’?
werreid
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT <Demo Day BL, 12 July 2011> 3
OCR:Abbyy Finereader SDK with built in standard Dutch dictionary
OCR:Abbyy Finereader SDK combining built in modernDutch dictionary with IMPACT external historical lexicon of Dutch:
werreld
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT workshop, Bratislava, May 7, 2010
5
The long s problem: An example ….
OCR at start of project
A. De eerde was de gevaarlykflti om de verlei¬ding aan 't Hof; de tweede de ftillie en veiligde;de derde de zwaarde, daar hy byna drie millioenenharde en onbefchaafde Menfchen beftieren moest.
.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT workshop, Bratislava, May 7, 2010
6
The long s problem: An example ….
OCR at start of project Results April 2010
A. De eerde was de gevaarlykflti om de verlei¬ding aan 't Hof; de tweede de ftillie en veiligde;de derde de zwaarde, daar hy byna drie millioenenharde en onbefchaafde Menfchen beftieren moest.
A. De eerste was de gevaarlykste om de verlei-ding aan 't Hof; de tweede de stilste en veiligste;de derde de zwaarste, daar hy byna drie millioenenharde en onbeschaafde Menschen bestieren moest.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT workshop, Bratislava, May 7, 2010
7
The long s problem: An example ….
OCR at start of project Results April 2010
A. De eerde was de gevaarlykflti om de verlei¬ding aan 't Hof; de tweede de ftillie en veiligde;de derde de zwaarde, daar hy byna drie millioenenharde en onbefchaafde Menfchen beftieren moest.
A. De eerste was de gevaarlykste om de verlei-ding aan 't Hof; de tweede de stilste en veiligste;de derde de zwaarste, daar hy byna drie millioenenharde en onbeschaafde Menschen bestieren moest.
Workaround: “integrated postcorrection” tell the engine that “eerfte” is OK and postcorrect it afterwards with the lexicon.
In this way we keep it from turning to “eerde” (earth) instead of “eerste” (first)
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT <Demo Day BL, 12 July 2011> 8
Overview
What is a computer lexicon
Lexica in IMPACT
Tools for lexicon building and applying lexica
Some results
Searching Demonstration
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT <Demo Day BL, 12 July 2011> 9
What is a computer lexicon?
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT <Demo Day BL, 12 July 2011> 10
Computer lexicon vs electronic dictionary (1)
An electronic dictionary is: Digitised full text (no pictures) For human use Ideally: searchable with explicitely coded material (XML), such as a lemma, part of speech (PoS), meaning, quotes etc. Examples: OED online, WNT online
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT <Demo Day BL, 12 July 2011> 11
Dictionary XML (example)
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT <Demo Day BL, 12 July 2011> 12
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT <Demo Day BL, 12 July 2011> 13
Computer Lexicon vs Electronic Dictionary (2)
A computer lexicon is: Always in a structured digital format (XML, relational database) Main purpose: computer application Explicitely coded information (e.g. lemma wereld, part of speech noun, morphology werelden, werelds … , syntax)
Examples of use:
Linguistic enrichment of text material ‘Advanced’ searching (words with all spelling variant and inflections) Automatic summarization, keyword extraction…
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT <Demo Day BL, 12 July 2011> 14
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT <Demo Day BL, 12 July 2011> 15
Lexica in IMPACT
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT <Demo Day BL, 12 July 2011> 16
The OCR lexiconAn OCR lexicon is
A checked list of words in a language Based on a corpus (collection) of dated texts (selection!) Preferably with frequency information Preferably from the same time period or of the same text type as the texts you wish to digitize
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT <Demo Day BL, 12 July 2011> 18
The IR lexicon IR lexicon: most
important information categoriesword forms (lists of words) +
- frequency information- quotes
(dated sources) from corpora or electronic dictionaries- MODERN LEMMA (// entrance dictionary) linked to spelling variants and inflected forms of the
same wordT
he modern lemma is used for searching in textsS
tandard use in corpus linguistics and modern historical lexicography
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT <Demo Day BL, 12 July 2011> 19
<?xml version='1.0'?><!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'><lexicon><lexical_entry><lemma_id>219490</lemma_id><modern_lemma>aantuilen</modern_lemma><gloss></gloss><POS>VRB</POS><ne_label></ne_label><language_id></language_id><portmanteau_lemma_id></portmanteau_lemma_id>
<wordform><form_representation><wordform_id>850026</wordform_id><written_form>tuyld</written_form><attestation><id>92141</id><token_id></token_id><quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote><derivation_id>0</derivation_id><document_id>204</document_id><start_pos>119</start_pos><end_pos>124</end_pos></attestation></form_representation></wordform>
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT <Demo Day BL, 12 July 2011> 20
Tools for lexicon building and application of lexica
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Supervised rule (pattern) induction from pairs (“modern” word, historical word), yielding patterns like aa/ae, s/z, ….
Pattern weights are computed from example material
Additional approaches possible, eg. : Use of aligned data (parallel historical text and modern version)
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT workshop, Bratislava, May 7, 2010
26
Lemmatization Reduction of historical word forms to modern lemma Historical word standard (“modern”) spelling lemma form (pattern matching) (lemmatizer)
Dystels (1) distels (2) distel
When we have a perfect or near-perfect modern full form lexicon, the second step is simply lexicon lookup.
But: 1) We will not have full form information for many lemmata
(especially the historical ones)2) Even lemmata present in modern language may have historical
inflected forms different from the present-day paradigm
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT workshop, Bratislava, May 7, 2010
27
Lemmatization and reverse lemmatizationWe also need a lemmatization process for these situations A typical lemmatizer assigns some standard form (infinitive,
nominative, stem) to inflected forms. Usually based on patterns relating the inflected form to the standard form.
But: Matching these patterns can be hard to combine with matching
both spelling variation patterns and OCR errors (bok/bokken/bokkeu)
We adopt the solution of actually expanding the “hypothetical modern full form lexicon” containing the most plausible possible paradigmatic expansions of lemmata
This construction is carried out by means of a statistical reverse lemmatizer
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT workshop, Bratislava, May 7, 2010
28
Attestation From hypothetical (non-witnessed) lexicon content to attested word forms in
“real” text Automatic selection of candidate attestations Manual work: verification and correction
Two approaches Dictionary based (INL): Woordenboek der Nederlandsche Taal Corpus based (LMU, INL): Dutch DBNL corpus
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT workshop, Bratislava, May 7, 2010
29
IMPACT Dictionary Attestation Tool
work
• We are working on what works.
• Depart from me, ye that worke iniquity.
• She worcketh knittinge of stockings.
headword
Quotations
variants
Task Find the variants of a headword as they occur in the quotations
Lexicon building at work: Verifying attestations in historical dictionaries
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT workshop, Bratislava, May 7, 2010
30
IMPACT Dictionary Attestation Tool
Automatically (preprocessing)
• match literally e.g: work work, Work
• match using existing lexica and lists e.g: work works, worked, wrought
• approximate matching e.g: work worke
By hand (using the tool)
• correct automatic mismatches e.g: works words, worms
• find missed matches e.g: work worketh, wrowght
Task Find the variants of a headword as they occur in the quotations
Electronic
historical
dictionary Database
with lemmata
and quotatioms
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT workshop, Bratislava, May 7, 2010
31
IMPACT Attestation ToolTool
Lemma headword
Quotations
Sorted by uncertainty
Up-to-date overview of what is done and needs to be done
Done by this user so far
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT workshop, Bratislava, May 7, 2010
32
IMPACT Lexicon Tool
Automatically (preprocessing = apply lemmatizer)
• match literally e.g: work work, Work
• match using existing lexica and lists e.g: work works, worked, wrought
• matching using spelling variation module e.g: uiterlijk uyterlick
By hand (using the tool)
• assign correct lemma e.g: was (N) zijn (V)
• group tokens belonging together e.g: konings zoon koningszoon
• select attestations
Task Find and verify attestations in a historical corpus
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT workshop, Bratislava, May 7, 2010
34
General vocabulary vs. Named entitiesT
ools for lexicon building described so far: applicable to general lexiconT
ools for NE recognition, classification and variant matching
- library requirement- distinguish general vocabulary from NE’s- avoid unpleasant mixups like Abimelech apemelk! (b/p; i/e; e/0; k/ch)
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT workshop, Bratislava, May 7, 2010
35
Improvement of state of the art / innovation
We use existing computational linguistic approaches, but figure out how to apply them to historical language
We develop a workflow to deal with the problems posed by historical language, figuring out how all pieces fit together Data selection and acquisition Manual work Computational linguistics tools
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
languages in IMPACTD
utch, German, English, Spanish, FrenchP
olish, Czech, Slovene and Bulgarian
-Cross language perspective paper
-Parallel OCR and IR experiments
-GT datasets
-Language tools: language independent
-Except from 3 core languages: proof of concept lexica
IMPACT <Demo Day BL, 12 July 2011> 36
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
OCR evaluation results(preliminary!)
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
1. Czech Co jest konstituce?, čili, Krátký, prostonárodní wýklad hlawnějších
zásad konstitucí ewropejských, 1848 Ferina Lišák z Kuliferdy a na Klukově, čili, Kratičká historye
zlopověstných kousků starého Reinecke, 1848 Homerowa Iliada, 1802 Na den narození neimocněišího, a neijasněišího cysare rímského,
téz dědičného rakauského a krále ceského, Frantiska II., w Praze 12. den mesyce Unora, léta 1805, 1805
Plody sborů učenců řeči českoslowanské prešporského, 1836 Rozprawy o gmenách, počátkách i starožitnostech národu
Slawského a geho kmeni /, 1830 Sokol, 1872 Základowé pitwy (Anatomie), čili, Soustawnj rozbor a popis těla
lidského a gednotliwých geho částek, 1840
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
2.Dutch1
8th and 19th century books, newspapers, parliamentary papers
Provinciale Overijsselsche en Zwolsche courant : staats-, handels-, nieuws- en advertentieblad, 1852-1852
Rechtsgeleerd advis in de zaak van den gewezen stadhouder, en over deszelfs schryven aan de gouverneurs van de Oost- en West-Indische bezittingen van den staat [...]. Ingelevert [...] op den 7 january 1796. / By B. Voorda et al, 1796-1796
Verhaal van het levensgevaar, waar in zig drie Rotterdamsche burgers [...] bevonden hebben, te Utrecht, 1784-1784
Vrijmoedige aanmerkingen, over de uitsluiting van allen die door publieke armkassen bedeeld worden, als stemgerechtigden [...] bij eene oproeping van het Nederlandsche volk tot eene Nationaale Conventie, 1795-1795
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
English1
6th-19th century materialS
ources for lexicon building: OED, ECCO
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
French1
7th century books
Conduite du jugement naturel où tous les bons esprits de l'un et l'autre sexe pourront facilement puiser la pureté de la science, par M. Jacques Forton, sieur de S. Ange,..., 1653
Dissertation de la philosophie en général, 1668
La Dialectique du sieur de Launay, contenant l'art de raisonner juste sur toute sorte de matières..., 1673
Lettre de M. Gadroys à M. de La Grange Trianon,... pour servir de réponse à celle que M. de Castelet a écrite contre les raisons de M. Descartes touchant le flux et le reflux de la mer. - Seconde lettre de M. Gadroys... [au même, sur le même sujet.], 1677
Traitez de métaphysique démontrée selon la méthode des géomètres. [Par le sieur de La Coudraye.], 1693
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
German Das Buch des heyligen Römischen Reichs unnderhalltunge, 1501 Die Poesie ihr Wesen und ihre Formen mit Grundzügen der vergleichenden
Literaturgeschichte, 1884 Echo Deß Hochzeitlichen Te Deum Laudamus, 1722 Ergebnisse der Erhebungen über die Beschäftigung gewerblicher Arbeiter an
Sonn- und Festtagen, Bd.:1, Gruppe I bis VII der Gewerbestatistik, Berlin, 1887, 1887
Quedlinburgisches Kreis-Tags-Memorial, 1673 Von der Regierung der Kirche und den unterschiedlichen Würden der
Geistlichkeit *(full title in comments), 1779 Warhaffter und grundlicher Bericht uß was Ursachen Martinus du Voysin (zu
Basel verburgerter Krämer) inn der Statt Surseew im Aargöw, ..., den 13. Tag Octobris deß 1608. Jars erstlich enthauptet, und volgends verbrennt worden, 1609
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Polish Adwersaria, albo terminata sprawy wojennej, która się toczyła w wołoskiej ziemi z
tureckim cesarzem, 1621 Chorągiew Sarmacka w Wołoszech, to jest pospolite ruszenie i szczęśliwy powrót
Polaków z Wołoch w roku 1621, 1621 Diariusz wiadomości od wyjazdu króla z Wilna do Smoleńska, 1610 Discurs o cenie pieniedzy teraznieyszey y o niektorych skutkach iey…, 1632 Nowe Ateny, albo Akademia wszelkiey scyencyi pełna, na różne tytuły iak na classes
podzielona, mądrym dla memoryału, idiotom dla nauki, politykom dla praktyki, melancholikom dla rozrywki erygowana ... . Część 3 albo Supplement., 1746
Pasja żołnierzy obojga narodów w stolicy moskiewskiej krótko opisana, 1613 Powodzenia niebezpiecznego ale szczęśliwego wojska j. k. m. w Multanach opisanie,
1601 Relacja chwalebnej ekspedycji Jana Kazimierza, króla polskiego i szwedzkiego, 1650 Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej,
1634 Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony
Polskiej_BW, 1634 Żałosne opisanie upadku króla hiszpańskiego na morzu i na lądzie, 1589
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Slovene Genovefa, 1841 Gosp. Krištofa Šmida korarja avgustanskiga, zgodBe S. Pisma za
mlade ljud..., 1850 Kmetijske in rokodelske novice, 1844 Kratkozhasne uganke, 1788 Kuharske Bukve, 1799 Marianske Kempensar, ali Dvoje bukuvze, 1769 Novice kmetijskih, rokodelnih in narodskih reči, 1851 Sgodbe svetiga pisma za mlade ljudi, 1830 Ta male katechismus, 1768 Vezhna pratika od gospodarstva, 1789 Zerkviza na skali, 1855
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT <Demo Day BL, 12 July 2011> 53
Retrieval demonstrator
Indexing and retrieval library (java) implemented on the lucene search engine
Lexicon in MySQL database
OCR with Finereader SDK and external dictionary interface of about 2000 images of the Dutch Ground Truth selection
Page XML output [in framework]
NE tagging
Indexing and retrieval while using lexicon and NE tagging
53
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.