IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Computer Lexica in OCR and Retrieval
Katrien Depuydt (Instituut voor Nederlandse Lexicologie, Leiden)
IMPACT <Demo Day BL, 12 July 2011> 2
Overview
What is a computer lexicon
Lexica in IMPACT
Tools for lexicon building and applying lexica
Some results
Searching Demonstration
What is a computer lexicon?
Computer lexicon vs electronic dictionary (1)
An electronic dictionary is:
- Digitised full text (no pictures)
- For human use
- Ideally: searchable, with explicitly coded material (XML), such as lemma, part of speech (PoS), meaning, quotes, etc.
Examples: OED online, WNT online
Dictionary XML (example)
Computer lexicon vs electronic dictionary (2)
A computer lexicon is:
- Always in a structured digital format (XML, relational database)
- Main purpose: computer application
- Explicitly coded information (e.g. lemma, part of speech, morphology, syntax)
Examples of use:
- Linguistic enrichment of text material
- ‘Advanced’ searching (words with all spelling variants and inflections)
- Automatic summarization, keyword extraction…
Lexica in IMPACT
The OCR lexicon
An OCR lexicon is:
- A checked list of words in a language
- Based on a corpus (collection) of dated texts (selection!)
- Preferably with frequency information
- Preferably from the same time period or of the same text type as the texts you wish to digitize
The IR lexicon
IR lexicon: most important information categories
Word forms (lists of words) +
- frequency information
- quotes (dated sources) from corpora or electronic dictionaries
- MODERN LEMMA (cf. a dictionary entry), linked to spelling variants and inflected forms of the same word
The modern lemma is used for searching in texts
Standard practice in corpus linguistics and modern historical lexicography
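How a modern lemma can serve as the search entry may be sketched as follows. This is only a minimal illustration with a toy in-memory lexicon: the names `IR_LEXICON`, `expand_query` and `search` are hypothetical, and the word form `tuylt` is invented for the example (only `aantuilen` and `tuyld` occur in the IMPACT material shown in this deck).

```python
# Toy IR lexicon: modern lemma -> historical word forms.
# "tuyld" is attested in the deck; "tuylt" is a made-up form for illustration.
IR_LEXICON = {
    "aantuilen": {"tuyld", "tuylt"},
}


def expand_query(lemma, lexicon):
    """Return the lemma itself plus every word form linked to it."""
    return {lemma} | lexicon.get(lemma, set())


def search(lemma, tokens, lexicon):
    """Return the positions of tokens that match the lemma or a linked form."""
    wanted = expand_query(lemma, lexicon)
    return [i for i, tok in enumerate(tokens) if tok.lower() in wanted]


# Fragment of the attestation quote, with the comma stripped before splitting.
text = "Sy acht het boertery, en tuyld daer weer op an".replace(",", "").split()
hits = search("aantuilen", text, IR_LEXICON)
```

A user who types the modern lemma thus also finds the historical spelling, without having to know it in advance.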
<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <gloss></gloss>
    <POS>VRB</POS>
    <ne_label></ne_label>
    <language_id></language_id>
    <portmanteau_lemma_id></portmanteau_lemma_id>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <token_id></token_id>
          <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
          <derivation_id>0</derivation_id>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
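A lexicon entry of this shape could be read into a lemma-to-word-form table roughly as follows. This is a sketch, not IMPACT tooling: the snippet is a shortened copy of the entry above, the closing `</lexical_entry>` and `</lexicon>` tags are assumed (the slide shows only a fragment), and `lemma_to_forms` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the entry; closing tags are an assumption, the quote is omitted.
SNIPPET = """<lexicon><lexical_entry>
<lemma_id>219490</lemma_id><modern_lemma>aantuilen</modern_lemma><POS>VRB</POS>
<wordform><form_representation><wordform_id>850026</wordform_id>
<written_form>tuyld</written_form>
<attestation><id>92141</id><document_id>204</document_id>
<start_pos>119</start_pos><end_pos>124</end_pos></attestation>
</form_representation></wordform>
</lexical_entry></lexicon>"""


def lemma_to_forms(xml_text):
    """Map each modern lemma to the list of written forms found under it."""
    root = ET.fromstring(xml_text)
    table = {}
    for entry in root.iter("lexical_entry"):
        lemma = entry.findtext("modern_lemma")
        forms = [w.text for w in entry.iter("written_form")]
        table.setdefault(lemma, []).extend(forms)
    return table


forms_by_lemma = lemma_to_forms(SNIPPET)
```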
Tools for lexicon building and application of lexica
(a number are predictable with patterns; others need to be taken from a lexicon)
Computer lexica
- For OCR and OCR post-correction
- Improving searchability of historical text material by building a lexicon with variants and by using a modern lemma as a search entry
- Tools for lexicon building
- Tools for application of the lexicon in search engines
- Lexicon cookbook
- Guidelines and tools to use the lexica in OCR
Tools (more specific)
- Lexicon building from corpus material and dictionaries
- Use of lexica in search engines
- Tool to extract spelling variation patterns from historical material
- Tool to relate previously unrecognised spelling variants to their standard form
- Tool to reduce previously unrecognised inflected forms to their base form
Ordinary words vs Names (NEs)
- Tools for the automatic recognition, classification and finding of variant names
- Wish of the libraries
- Separate regular vocabulary from names
- Reduce unpleasant results: Abimelech → apemelk! (b/p; i/e; e/0; k/ch) (apemelk means "monkey milk")
- NE lexica
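Why unrestricted spelling-variation patterns produce such a match can be shown with a small sketch. It assumes that each pattern is read as a one-directional rewrite rule (b→p, i→e, e→nothing, ch→k) and that rules may be applied repeatedly; the function names are hypothetical and this is not the IMPACT matching tool.

```python
# The four patterns from the slide, read as rewrite rules (an assumption).
RULES = [("b", "p"), ("i", "e"), ("e", ""), ("ch", "k")]


def apply_once(word, rules):
    """All words reachable by applying one rule at one position."""
    out = set()
    for src, dst in rules:
        start = word.find(src)
        while start != -1:
            out.add(word[:start] + dst + word[start + len(src):])
            start = word.find(src, start + 1)
    return out


def variants(word, rules, max_steps):
    """All words reachable within max_steps rule applications."""
    seen = {word}
    frontier = {word}
    for _ in range(max_steps):
        frontier = {v for w in frontier for v in apply_once(w, rules)} - seen
        seen |= frontier
    return seen


# abimelech -> apimelech -> apemelech -> apemelch -> apemelk
reachable = variants("abimelech", RULES, max_steps=4)
```

With names kept in a separate NE lexicon, a matcher can refuse to apply general vocabulary patterns to them and so avoid this kind of result.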
A number of results for Dutch and German
Ground truth data: Dutch
Type and genre                              # words
Gold Standard Book                          300k
Random Set Books                            340k
Random Set Staten Generaal (Legal Papers)   2.5M
Gold Standard Staten Generaal               500k
Gold Standard Newspapers 1                  3.4M
Gold Standard Newspapers 2                  170k
Random Set Newspapers                       3.2M
total                                       13.1M
Lexicon coverage (1: ground truth books)
                                  Type coverage   Token coverage
1. Modern lexicon (e-Lex)         46%             76%
2. Core general lexicon           56%             84%
1 + 2                             63%             89%
Expansion with corpus material    78%             95%
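The difference between the two coverage columns can be made concrete with a short sketch: type coverage counts each distinct word form once, token coverage counts every occurrence. The helper name `coverage` and the toy data are assumptions for illustration, not IMPACT evaluation code.

```python
from collections import Counter


def coverage(tokens, lexicon):
    """Return (type coverage, token coverage) of a lexicon over a token list."""
    counts = Counter(tokens)
    known = [t for t in counts if t in lexicon]
    type_cov = len(known) / len(counts)
    token_cov = sum(counts[t] for t in known) / sum(counts.values())
    return type_cov, token_cov


# Toy data: 3 distinct word forms, 5 running words.
tokens = ["de", "de", "de", "eerste", "gevaarlykste"]
lexicon = {"de", "eerste"}
type_cov, token_cov = coverage(tokens, lexicon)
```

Here 2 of 3 types are covered (about 67%) but 4 of 5 tokens (80%): frequent words are more likely to be in the lexicon, which is why token coverage is consistently the higher figure in these tables.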
Lexicon coverage (2: GT newspapers 18th-19th C.)
                                  Type coverage   Token coverage
1. Modern lexicon (e-Lex)         40%             83%
2. Core general lexicon           41%             84%
1 + 2                             51%             89%
Expansion with corpus material    62%             95%
Lexicon coverage (3: GT Staten Generaal 19th C.)
                                  Type coverage   Token coverage
1. Modern lexicon (e-Lex)         51%             89%
2. Core general lexicon           47%             88%
1 + 2                             58%             93%
Expansion with corpus material    68%             97%
Lexicon coverage (4: GT Staten Generaal 20th C.)
                                  Type coverage   Token coverage
1. Modern lexicon (e-Lex)         70%             93%
2. Core general lexicon           66%             93%
1 + 2                             76%             96%
Expansion with corpus material    81%             98%
Lexicon coverage (5: Genesis, 1637 bible)
                                  Type coverage   Token coverage
1. Modern lexicon (e-Lex)         31%             61%
2. Core lexicon                   62%             83%
1 + 2                             65%             89%
Expansion with corpus material    87%             98.6%
Lexicon coverage (6: P.C. Hooft, histories)
                                  Type coverage   Token coverage
1. Modern lexicon (e-Lex)         26%             67%
2. Core lexicon                   47%             88%
1 + 2                             50%             90%
Expansion with corpus material    58%             96%
- Translation of corpus frequencies to weights 0-100
- Broken words, case-sensitivity, …
- Problem with long ‘s’ (workaround)
Lexicon Data
- IMPACT OCR lexicon for Dutch
- FineReader internal lexicon
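The slides do not say how corpus frequencies were translated to weights. One plausible approach, shown here purely as an assumption, is logarithmic scaling relative to the most frequent word; the function name and the sample counts are hypothetical.

```python
import math


def frequency_weights(freqs, top=100):
    """Scale raw corpus frequencies to integer weights in the range 0..top.

    Log scaling is an assumption, not the documented IMPACT method.
    """
    if not freqs or max(freqs.values()) == 0:
        return {word: 0 for word in freqs}
    max_log = math.log1p(max(freqs.values()))
    return {w: round(top * math.log1p(f) / max_log) for w, f in freqs.items()}


# Invented counts for illustration only.
weights = frequency_weights({"de": 1000, "hof": 10, "zeldzaam": 0})
```

Log scaling keeps rare but valid words from collapsing to weight 0 next to very frequent function words; a linear mapping would be simpler but would flatten most of the vocabulary.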
OCR results: word recognition rate
Dataset: DPO35
- With ABBYY internal Dutch lexicon: 88.8%
- With IMPACT lexicon for Dutch (case, hyphenation): 90.9%
- With IMPACT lexicon for Dutch (case, hyphenation + long s problem): 93.5%
An example:
OCR at the beginning of the project:
A. De eerde was de gevaarlykflti om de verlei¬ding aan 't Hof; de tweede de ftillie en veiligde;de derde de zwaarde, daar hy byna drie millioenenharde en onbefchaafde Menfchen beftieren moest.
Results:
A. De eerste was de gevaarlykste om de verlei-ding aan 't Hof; de tweede de stilste en veiligste;de derde de zwaarste, daar hy byna drie millioenenharde en onbeschaafde Menschen bestieren moest.
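Several errors in the first version come from the long s being read as "f" (onbefchaafde, Menfchen, beftieren). A lexicon can resolve these cases, as the sketch below suggests. It is an illustration only, not the workaround used in IMPACT: the function names and the four-word lexicon are assumptions, and it handles only the f/s confusion, not errors such as "gevaarlykflti".

```python
from itertools import product


def long_s_candidates(token):
    """Yield every spelling obtained by reading each 'f' as either 'f' or 's'."""
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        yield "".join(chars)


def repair_long_s(token, lexicon):
    """Replace the token only if exactly one candidate is in the lexicon."""
    if token in lexicon:
        return token
    matches = {c for c in long_s_candidates(token) if c in lexicon}
    return matches.pop() if len(matches) == 1 else token


# Tiny lexicon built from the corrected text above.
LEXICON = {"onbeschaafde", "Menschen", "bestieren", "Hof"}
ocr_tokens = ["onbefchaafde", "Menfchen", "beftieren", "Hof"]
repaired = [repair_long_s(t, LEXICON) for t in ocr_tokens]
```

The number of candidates doubles with every "f" in a token, which is acceptable for single words but is one reason to restrict the check to tokens that are not already in the lexicon.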
Dictionary              16th century          18th century          19th century
                        Errors   Reduction    Errors   Reduction    Errors   Reduction
No Lexicon              1306     -            827      -            2074     -
Optimal Lexicon         756      42%          395      52%          612      70%
Modern Lexicon          1096     16%          501      39%          888      57%
W.Historical Lexicon    938      28%          481      42%          856      59%
Modern + Virtual H.L.   1011     25%          480      42%          849      59%
(Errors = no. of word errors; Reduction = reduction of error rate)
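The reduction of the error rate appears to be computed relative to the "No Lexicon" row of the same century. A small sketch, with a hypothetical function name, using the "Optimal Lexicon" figures from the table:

```python
def error_reduction(baseline_errors, errors):
    """Relative reduction in word errors compared with the no-lexicon baseline."""
    return (baseline_errors - errors) / baseline_errors


# Optimal Lexicon row, per century: (baseline errors, errors with lexicon).
optimal = {
    "16th": (1306, 756),
    "18th": (827, 395),
    "19th": (2074, 612),
}
reductions = {c: round(100 * error_reduction(b, e)) for c, (b, e) in optimal.items()}
```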
Languages in IMPACT
Dutch, German, English, Spanish, French
Polish, Czech, Slovene and Bulgarian
- Cross-language perspective paper
- Parallel OCR and IR experiments
- GT datasets
- Language tools: language independent
- Apart from the 3 core languages: proof-of-concept lexica
English in IMPACT
Lexicon building using OED
- OCR lexicon from quotations full text, possibly supplemented with corpus material
- IR lexicon from headword variants in quotations (small demo)
Named Entity Recognition on newspaper material
- NE lexicon
- Gold standard corpus NE recognition (CoNLL)
(Named Entity Recognition Task Definition, by N. Chinchor, E. Brown, L. Ferro, and P. Robinson, Version 1.4 (1999))
PER, LOC, ORG
Research into the possible benefits from exclusion of modern words from the OCR lexicon
An indemnity shall be granted to the surfer….
… bikini …
Retrieval demonstrator
Indexing and retrieval library (Java) implemented on the Lucene search engine
Lexicon in MySQL database
OCR with FineReader SDK and external dictionary interface of about 2000 images of the Dutch Ground Truth selection
PAGE XML output [in framework]
NE tagging
Indexing and retrieval while using lexicon and NE tagging