Page 1: Language Tools for OCR with Katrien Depuydt

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

Computer Lexica in OCR and Retrieval

Katrien Depuydt (Instituut voor Nederlandse Lexicologie, Leiden)

Page 2: Language Tools for OCR with Katrien Depuydt

IMPACT Demo Day BL, 12 July 2011

Overview

What is a computer lexicon

Lexica in IMPACT

Tools for lexicon building and applying lexica

Some results

Searching Demonstration

Page 3: Language Tools for OCR with Katrien Depuydt

What is a computer lexicon?

Page 4: Language Tools for OCR with Katrien Depuydt

Computer lexicon vs electronic dictionary (1)

An electronic dictionary is:

- Digitised full text (no pictures)
- For human use
- Ideally: searchable with explicitly coded material (XML), such as a lemma, part of speech (PoS), meaning, quotes etc.
- Examples: OED online, WNT online

Page 5: Language Tools for OCR with Katrien Depuydt

Dictionary XML (example)

Page 6: Language Tools for OCR with Katrien Depuydt

Page 7: Language Tools for OCR with Katrien Depuydt

Computer Lexicon vs Electronic Dictionary (2)

A computer lexicon is:

- Always in a structured digital format (XML, relational database)
- Main purpose: computer application
- Explicitly coded information (e.g. lemma, part of speech, morphology, syntax)

Examples of use:

- Linguistic enrichment of text material
- ‘Advanced’ searching (words with all spelling variants and inflections)
- Automatic summarization, keyword extraction…

Page 8: Language Tools for OCR with Katrien Depuydt

Page 9: Language Tools for OCR with Katrien Depuydt

Lexica in IMPACT

Page 10: Language Tools for OCR with Katrien Depuydt

The OCR lexicon

An OCR lexicon is:

- A checked list of words in a language
- Based on a corpus (collection) of dated texts (selection!)
- Preferably with frequency information
- Preferably from the same time period or of the same text type as the texts you wish to digitize
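A frequency list of this kind can be sketched in a few lines. The snippet below is only an illustration and not the IMPACT tooling: the function name `build_ocr_lexicon`, the `min_freq` threshold, the tokenisation regex and the sample sentences are all assumptions. Manual checking of the resulting list, which the slide calls for, is not modelled.

```python
import re
from collections import Counter

def build_ocr_lexicon(texts, min_freq=1):
    """Count word forms in a collection of texts; keep forms seen at least min_freq times."""
    counts = Counter()
    for text in texts:
        # Case is kept on purpose, so "De" and "de" are counted as separate forms.
        counts.update(re.findall(r"[^\W\d_]+", text))
    return {word: n for word, n in counts.items() if n >= min_freq}

# Hypothetical two-sentence "corpus".
corpus = ["De werelt is groot", "de werelt en de weerelt"]
lexicon = build_ocr_lexicon(corpus)

# Print the list ordered by descending frequency, then alphabetically.
for word, freq in sorted(lexicon.items(), key=lambda item: (-item[1], item[0])):
    print(word, freq)
```

Restricting `texts` to material from the target period or text type corresponds to the last two points on the slide.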

Page 11: Language Tools for OCR with Katrien Depuydt

OCR lexicon: example (word, frequency)

1550-1750:
song 820, rihte 818, theire 818, manye 818, sume 815, Do 814, Whiche 811, fyrst 811, while 811, Water 810, wt 809, shalbe 808, thingis 807, again 806, sona 806, wa 805, mode 804, work 802, between 801, law 799, moder 798, mis 798, softe 798

> 1900:
television 418, electronic 375, video 194, hormone 176, jazz 162, eco 142, software 136, vitamin 128, movie 121, taxi 113, isotopic 108, electronics 95, radar 86, basically 71, sabotage 71, homozygote 70, psychedelic 67, phonemic 66, insulin 64, zap 64, antibody 61, fungicidal 61

Page 12: Language Tools for OCR with Katrien Depuydt

The IR lexicon

IR lexicon: most important information categories

word forms (lists of words) +

- frequency information

- quotes (dated sources) from corpora or electronic dictionaries

- MODERN LEMMA (comparable to a dictionary headword) linked to spelling variants and inflected forms of the same word

The modern lemma is used for searching in texts

Standard use in corpus linguistics and modern historical lexicography

Page 13: Language Tools for OCR with Katrien Depuydt

<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <gloss></gloss>
    <POS>VRB</POS>
    <ne_label></ne_label>
    <language_id></language_id>
    <portmanteau_lemma_id></portmanteau_lemma_id>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <token_id></token_id>
          <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
          <derivation_id>0</derivation_id>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
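To show how an entry of this shape links a historical written form to its modern lemma, here is a minimal parsing sketch. The sample is a shortened copy of the entry above; the closing tags are added here so that it parses, and the DOCTYPE line is left out. The function name `index_wordforms` is hypothetical, and the real NL_Structure.dtd may allow elements that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Shortened entry modelled on the slide; closing tags added so the sample is well-formed.
SAMPLE = """<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <POS>VRB</POS>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <quote>Sy acht het boertery, en tuyld daer weer op an,</quote>
          <document_id>204</document_id>
        </attestation>
      </form_representation>
    </wordform>
  </lexical_entry>
</lexicon>"""

def index_wordforms(xml_text):
    """Map each written form to the (modern lemma, part of speech) pairs it belongs to."""
    index = {}
    root = ET.fromstring(xml_text)
    for entry in root.iter("lexical_entry"):
        lemma = entry.findtext("modern_lemma")
        pos = entry.findtext("POS")
        for form in entry.iter("written_form"):
            index.setdefault(form.text, []).append((lemma, pos))
    return index

index = index_wordforms(SAMPLE)
print(index["tuyld"])
```

An index in this direction (written form to lemma) supports text enrichment; inverting it (lemma to written forms) supports searching by modern lemma.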

Page 14: Language Tools for OCR with Katrien Depuydt

Tools for lexicon building and application of lexica

Page 15: Language Tools for OCR with Katrien Depuydt

Types of variation (spelling, inflection…)

uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk

I

werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled

II

(patterns to predict variation)

(some variants are predictable with patterns; others need to be taken from a lexicon)
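The division of labour between patterns and lexicon can be sketched as follows. The four rewrite rules and both small lexica are hypothetical, chosen only so that some forms from the lists above are resolved by patterns and others are not; they are not the patterns extracted in IMPACT.

```python
import re

# Hypothetical rewrite patterns (historical -> modern spelling), applied in this order.
PATTERNS = [
    (r"^wt", "uit"),
    (r"uy", "ui"),
    (r"ck", "k"),
    (r"y", "ij"),
]

MODERN_LEXICON = {"uiterlijk", "wereld"}
# Variants the patterns above do not predict, so they are listed explicitly.
VARIANT_LEXICON = {"uuterlick": "uiterlijk", "weerelt": "wereld"}

def to_modern(word):
    """Try the patterns first; fall back to the explicit variant lexicon."""
    candidate = word
    for pattern, replacement in PATTERNS:
        candidate = re.sub(pattern, replacement, candidate)
    if candidate in MODERN_LEXICON:
        return candidate
    return VARIANT_LEXICON.get(word)  # None when the form is unknown

for form in ["uyterlijck", "wterlijck", "uiterlyk", "uuterlick", "weerelt"]:
    print(form, "->", to_modern(form))
```

Checking the rewritten candidate against a modern lexicon keeps the patterns from inventing words that do not exist.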

Page 16: Language Tools for OCR with Katrien Depuydt

Neil Fitzgerald, 7th July 2011 16

Page 17: Language Tools for OCR with Katrien Depuydt

Computer lexica

- For OCR and OCR post-correction
- Improving searchability of historical text material by:
  - building a lexicon with variants
  - using a modern lemma as a search entry
- Tools for lexicon building
- Tools for application of the lexicon in search engines
- Lexicon cookbook
- Guidelines and tools to use the lexica in OCR

Page 18: Language Tools for OCR with Katrien Depuydt

Tools (more specific)

- Lexicon building from corpus material and dictionaries
- Use of lexica in search engines
- Tool to extract spelling variation patterns from historical material
- Tool to relate previously unrecognised spelling variants to their standard form
- Tool to reduce previously unrecognised inflected forms to their base form

Page 19: Language Tools for OCR with Katrien Depuydt

Ordinary words vs Names (NEs)

- Tools for the automatic recognition, classification and finding of variant names
- Wish of the libraries
- Separate regular vocabulary from names
- Reduce unpleasant results: Abimelech → apemelk! (b/p; i/e; e/0; k/ch) (apemelk means "monkey milk")

NE lexica
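The Abimelech example can be reproduced with a small search over pattern applications. This is a sketch under assumptions: each pair on the slide is read as "left may be rewritten as right", `e/0` is read as deleting one `e`, and the functions `one_step` and `reachable` are hypothetical names, not part of the IMPACT tools.

```python
# The four patterns from the slide, read as (historical, replacement) pairs.
RULES = [("b", "p"), ("i", "e"), ("e", ""), ("ch", "k")]

def one_step(word):
    """Return every string obtained by applying one rule at one position."""
    results = set()
    for old, new in RULES:
        start = word.find(old)
        while start != -1:
            results.add(word[:start] + new + word[start + len(old):])
            start = word.find(old, start + 1)
    return results

def reachable(word, max_steps):
    """Return all strings reachable from word in at most max_steps rule applications."""
    seen = {word}
    frontier = {word}
    for _ in range(max_steps):
        frontier = {v for w in frontier for v in one_step(w)} - seen
        seen |= frontier
    return seen

# abimelech -> apimelech -> apemelech -> apemelch -> apemelk
print("apemelk" in reachable("abimelech", 4))
```

Keeping names in a separate NE lexicon means a matcher of this kind does not have to be run on them at all.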

Page 20: Language Tools for OCR with Katrien Depuydt

A number of results for Dutch and German

Page 21: Language Tools for OCR with Katrien Depuydt

Ground truth data: Dutch

Type and genre                               # words
Gold Standard Book                           300k
Random Set Books                             340k
Random Set Staten Generaal (Legal Papers)    2.5M
Gold Standard Staten Generaal                500k
Gold Standard Newspapers 1                   3.4M
Gold Standard Newspapers 2                   170k
Random Set Newspapers                        3.2M
total                                        13.1M

Page 22: Language Tools for OCR with Katrien Depuydt

Lexicon coverage (1: ground truth books)

                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            46%             76%
Core general lexicon              56%             84%
1 + 2                             63%             89%
Expansion with corpus material    78%             95%
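Type coverage counts each distinct word form once, while token coverage counts every occurrence, which is why the token figures are consistently higher: frequent words are the ones most likely to be in a lexicon. A minimal sketch of both measures, with a hypothetical token list and lexicon:

```python
from collections import Counter

def coverage(tokens, lexicon):
    """Return (type coverage, token coverage) of a lexicon over a non-empty token list."""
    counts = Counter(tokens)
    covered_types = sum(1 for word in counts if word in lexicon)
    covered_tokens = sum(n for word, n in counts.items() if word in lexicon)
    return covered_types / len(counts), covered_tokens / len(tokens)

# Hypothetical data: 5 distinct forms, 8 running words.
tokens = ["de", "werelt", "de", "wereld", "de", "weerelt", "is", "de"]
lexicon = {"de", "is", "wereld"}

type_cov, token_cov = coverage(tokens, lexicon)
print(f"type coverage {type_cov:.0%}, token coverage {token_cov:.0%}")
# prints "type coverage 60%, token coverage 75%"
```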

Page 23: Language Tools for OCR with Katrien Depuydt

Lexicon coverage (2: GT newspapers 18th-19th C.)

                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            40%             83%
Core general lexicon              41%             84%
1 + 2                             51%             89%
Expansion with corpus material    62%             95%

Page 24: Language Tools for OCR with Katrien Depuydt

Lexicon coverage (3: GT Staten Generaal 19th C.)

                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            51%             89%
Core general lexicon              47%             88%
1 + 2                             58%             93%
Expansion with corpus material    68%             97%

Page 25: Language Tools for OCR with Katrien Depuydt

Lexicon coverage (4: GT Staten Generaal 20th C.)

                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            70%             93%
Core general lexicon              66%             93%
1 + 2                             76%             96%
Expansion with corpus material    81%             98%

Page 26: Language Tools for OCR with Katrien Depuydt

Lexicon coverage (5: Genesis, 1637 bible)

                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            31%             61%
Core lexicon                      62%             83%
1 + 2                             65%             89%
Expansion with corpus material    87%             98.6%

Page 27: Language Tools for OCR with Katrien Depuydt

Lexicon coverage (6: P.C. Hooft, histories)

                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            26%             67%
Core lexicon                      47%             88%
1 + 2                             50%             90%
Expansion with corpus material    58%             96%

Page 28: Language Tools for OCR with Katrien Depuydt

Evaluation of OCR

- FineReader SDK (versions 9, 10)
- External dictionary interface (implementation module)
- Challenges:
  - Translation of corpus frequencies to weights 0-100
  - Broken words, case sensitivity, …
  - Problem with long ‘s’ (workaround)
- Lexicon data:
  - IMPACT OCR lexicon for Dutch
  - FineReader internal lexicon
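The slide does not say how corpus frequencies were translated to weights. One plausible approach, shown here purely as an assumption, is logarithmic scaling, which keeps rare words from collapsing to zero next to very frequent ones. The function name `frequency_to_weight` is hypothetical and no FineReader SDK call is involved.

```python
import math

def frequency_to_weight(freq, max_freq):
    """Map a corpus frequency to an integer weight from 0 to 100 on a log scale.

    Assumption: this log scaling is illustrative, not the mapping used in IMPACT.
    """
    if max_freq < 1 or not 0 <= freq <= max_freq:
        raise ValueError("expected max_freq >= 1 and 0 <= freq <= max_freq")
    return round(100 * math.log(1 + freq) / math.log(1 + max_freq))

# Frequencies borrowed from the 1550-1750 example list; 820 is the highest count there.
for word, freq in [("song", 820), ("law", 799), ("unseen", 0)]:
    print(word, frequency_to_weight(freq, 820))
```

A linear mapping would be simpler, but with typical word frequency distributions it would push almost every word to a weight near zero.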

Page 29: Language Tools for OCR with Katrien Depuydt

OCR results: word recognition rate

Dataset: DPO35

With ABBYY internal Dutch lexicon                                       88.8%
With IMPACT lexicon for Dutch (case, hyphenation)                       90.9%
With IMPACT lexicon for Dutch (case, hyphenation + long s problem)      93.5%

Page 30: Language Tools for OCR with Katrien Depuydt

An example:

OCR at the beginning of the project:

A. De eerde was de gevaarlykflti om de verlei¬ding aan 't Hof; de tweede de ftillie en veiligde; de derde de zwaarde, daar hy byna drie millioenen harde en onbefchaafde Menfchen beftieren moest.

Results:

A. De eerste was de gevaarlykste om de verlei-ding aan 't Hof; de tweede de stilste en veiligste; de derde de zwaarste, daar hy byna drie millioenen harde en onbeschaafde Menschen bestieren moest.
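Several errors in the first version come from the long s being read as `f` (for example `onbefchaafde`, `Menfchen`). The workaround used in the project is not described here; the sketch below shows one possible post-correction idea under that caveat: try reading any subset of the `f` characters as `s` and keep the candidates that a lexicon confirms. The function names and the four-word lexicon are hypothetical.

```python
from itertools import product

def long_s_candidates(token):
    """Yield every spelling obtained by reading some subset of the 'f' characters as 's'."""
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        yield "".join(chars)

def repair_long_s(token, lexicon):
    """Return the token itself if known, else the lexicon words reachable by f -> s."""
    if token in lexicon:
        return [token]
    return sorted(c for c in long_s_candidates(token) if c in lexicon)

# Hypothetical lexicon with forms taken from the corrected text.
lexicon = {"onbeschaafde", "bestieren", "Menschen", "Hof"}
for token in ["onbefchaafde", "beftieren", "Menfchen", "Hof"]:
    print(token, "->", repair_long_s(token, lexicon))
```

Note that `onbefchaafde` needs only its first `f` replaced, which is why a blanket substitution of every `f` would not work. The number of candidates doubles with each `f`, which is acceptable for single words.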

Page 31: Language Tools for OCR with Katrien Depuydt

Errors = no. of word errors; Reduction = reduction of error rate

Dictionary               16th century          18th century          19th century
                         Errors   Reduction    Errors   Reduction    Errors   Reduction
No Lexicon               1306     -            827      -            2074     -
Optimal Lexicon          756      42%          395      52%          612      70%
Modern Lexicon           1096     16%          501      39%          888      57%
W.Historical Lexicon     938      28%          481      42%          856      59%
Modern + Virtual H.L.    1011     25%          480      42%          849      59%
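The reduction figures appear to be relative to the "No Lexicon" row, that is, (baseline errors − errors) / baseline errors. This reading is inferred from the numbers rather than stated on the slide. A short check against the "Optimal Lexicon" row:

```python
def error_reduction(baseline_errors, errors):
    """Relative reduction in word errors compared with a baseline run."""
    return (baseline_errors - errors) / baseline_errors

# Word error counts from the "No Lexicon" and "Optimal Lexicon" rows of the table.
no_lexicon = {"16th": 1306, "18th": 827, "19th": 2074}
optimal = {"16th": 756, "18th": 395, "19th": 612}

for century in no_lexicon:
    reduction = error_reduction(no_lexicon[century], optimal[century])
    print(f"{century} century: {reduction:.0%}")
# prints 42%, 52% and 70%, matching the table
```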

Page 32: Language Tools for OCR with Katrien Depuydt

Languages in IMPACT

Dutch, German, English, Spanish, French

Polish, Czech, Slovene and Bulgarian

- Cross-language perspective paper
- Parallel OCR and IR experiments
- GT datasets
- Language tools: language independent
- Except for the 3 core languages: proof-of-concept lexica

Page 33: Language Tools for OCR with Katrien Depuydt

English in IMPACT

Lexicon building using OED

- OCR lexicon from quotations full text, possibly supplemented with corpus material
- IR lexicon from headword variants in quotations (small demo)

Named Entity Recognition on newspaper material

- NE lexicon
- Gold standard corpus NE recognition (CoNLL)
  (Named Entity Recognition Task Definition, by N. Chinchor, E. Brown, L. Ferro, and P. Robinson, Version 1.4 (1999))
  PER, LOC, ORG

Research into the possible benefits of excluding modern words from the OCR lexicon
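Excluding modern words amounts to filtering the OCR lexicon before it is handed to the OCR engine. A minimal sketch, in which the function name is hypothetical and the small lexicon merges words and counts from the two example frequency lists purely for illustration:

```python
def exclude_modern_words(ocr_lexicon, modern_words):
    """Return a copy of the lexicon without the words flagged as modern."""
    return {word: freq for word, freq in ocr_lexicon.items() if word not in modern_words}

# Words and counts borrowed from the two example lists earlier in the slides;
# combining them into one lexicon is done here only for illustration.
ocr_lexicon = {"song": 820, "law": 799, "television": 418, "video": 194, "taxi": 113}
modern_words = {"television", "video", "taxi"}

historical_lexicon = exclude_modern_words(ocr_lexicon, modern_words)
print(sorted(historical_lexicon))
```

How the set of modern words is obtained (for example from dated quotations) is the open question the research point refers to and is not modelled here.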

Page 34: Language Tools for OCR with Katrien Depuydt

An indemnity shall be granted to the surfer….

… bikini …

Page 35: Language Tools for OCR with Katrien Depuydt

Retrieval demonstrator

Indexing and retrieval library (Java) implemented on the Lucene search engine

Lexicon in MySQL database

OCR of about 2000 images of the Dutch Ground Truth selection, using the FineReader SDK and the external dictionary interface

Page XML output [in framework]

NE tagging

Indexing and retrieval while using lexicon and NE tagging
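The core idea of lexicon-based retrieval, searching on a modern lemma and matching all its historical forms, can be sketched without Lucene or MySQL. The example below uses Python's built-in `sqlite3` as a stand-in for the MySQL lexicon and a plain dictionary as a stand-in for the index. The table name, column names, sample forms and documents are all assumptions, and NE tagging is left out.

```python
import sqlite3

# In-memory stand-in for the lexicon database; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wordforms (modern_lemma TEXT, written_form TEXT)")
conn.executemany(
    "INSERT INTO wordforms VALUES (?, ?)",
    [("wereld", "wereld"), ("wereld", "werelt"), ("wereld", "weerelt"), ("wereld", "waereld")],
)

def expand_query(lemma):
    """Return all written forms recorded for a modern lemma."""
    rows = conn.execute(
        "SELECT written_form FROM wordforms WHERE modern_lemma = ? ORDER BY written_form",
        (lemma,),
    )
    return [row[0] for row in rows]

# Toy "index": document id -> OCR text.
documents = {
    1: "de weerelt is groot",
    2: "een ander boek",
    3: "in de waereld",
}

def search(lemma):
    """Return ids of documents containing any written form of the lemma."""
    forms = set(expand_query(lemma))
    return sorted(doc_id for doc_id, text in documents.items() if forms & set(text.split()))

print(expand_query("wereld"))
print(search("wereld"))
```

Expanding the query at search time, as here, is one design option; the alternative is to store the modern lemma in the index alongside each historical form.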
