Jun 14, 2015
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Overview of the Language Work in IMPACT
Katrien Depuydt
Institute for Dutch Lexicology
Leiden
IMProving ACcess to Text
Can we handle ‘de wereld’ (‘the world’)?
OCR:
werreid
OCR: ABBYY FineReader SDK with built-in standard Dutch dictionary
OCR: ABBYY FineReader SDK combining the built-in modern Dutch dictionary with the IMPACT external historical lexicon of Dutch:
werreld
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
RETRIEVAL: key in modern WERELD and find all
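The retrieval idea above can be sketched as query expansion: the user types the modern lemma and the search matches every historical form linked to it. This is only an illustration under assumptions. The names `LEXICON`, `expand_query` and `search` are hypothetical, the miniature lexicon holds just a few of the forms listed above, and the sample pages are invented.

```python
# Hypothetical miniature lexicon: modern lemma -> attested historical word forms.
# The forms come from the list above; the data structure is an assumption.
LEXICON = {
    "wereld": {"werelt", "weerelt", "waereld", "werreld", "vvereld", "werlt"},
}


def expand_query(lemma):
    """Return the lemma itself plus all historical forms linked to it."""
    key = lemma.lower()
    return {key} | LEXICON.get(key, set())


def search(lemma, documents):
    """Return the ids of documents containing any form of the lemma."""
    forms = expand_query(lemma)
    hits = []
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        if any(token in forms for token in tokens):
            hits.append(doc_id)
    return hits


# Invented sample pages, for illustration only.
pages = {
    "a": "de gantsche werelt",
    "b": "een ander boek",
    "c": "in de Waereld",
}
print(search("WERELD", pages))
```

Expanding the query keeps the index untouched; the alternative is to normalise word forms to their lemma at indexing time.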
Lexica in IMPACT
The OCR lexicon
A checked list of words in a language
- Based on a corpus (collection) of dated texts (selection!)
- Preferably with frequency information
- Preferably from the same time period or of the same text type as the texts you wish to digitize
For OCR and OCR postcorrection
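A frequency list of this kind could be drawn from dated texts roughly as follows. This is a sketch under assumptions: `build_ocr_lexicon` is a hypothetical helper, the input is taken to be (year, text) pairs, and the manual checking step that makes it a "checked list" is not covered. Case is preserved rather than folded.

```python
import re
from collections import Counter


def build_ocr_lexicon(dated_texts, start_year, end_year):
    """Count word forms in texts dated within [start_year, end_year].

    dated_texts: iterable of (year, text) pairs (assumed input format).
    Returns a Counter mapping word form -> frequency.
    """
    counts = Counter()
    for year, text in dated_texts:
        if start_year <= year <= end_year:
            # Runs of letters only; digits, punctuation and underscores split words.
            counts.update(re.findall(r"[^\W\d_]+", text))
    return counts


# Invented sample data, for illustration only.
sample = [
    (1600, "water water song"),
    (1950, "television water"),
]
lexicon = build_ocr_lexicon(sample, 1550, 1750)
print(lexicon.most_common(2))
```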
OCR lexicon: example
1550-1750: song 820, rihte 818, theire 818, manye 818, sume 815, Do 814, Whiche 811, fyrst 811, while 811, Water 810, wt 809, shalbe 808, thingis 807, again 806, sona 806, wa 805, mode 804, work 802, between 801, law 799, moder 798, mis 798, softe 798
> 1900: television 418, electronic 375, video 194, hormone 176, jazz 162, eco 142, software 136, vitamin 128, movie 121, taxi 113, isotopic 108, electronics 95, radar 86, basically 71, sabotage 71, homozygote 70, psychedelic 67, phonemic 66, insulin 64, zap 64, antibody 61, fungicidal 61
The IR lexicon
IR lexicon: most important information categories
- word forms (lists of words)
- frequency information
- quotes (dated sources) from corpora or electronic dictionaries
- MODERN LEMMA (comparable to a dictionary headword) linked to spelling variants and inflected forms of the same word
The modern lemma is used for searching in texts
Standard use in corpus linguistics and modern historical lexicography
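One way to picture these information categories is as a small record type. The sketch below is an assumption about shape, not the IMPACT database schema: the class and field names are hypothetical, and the quote, year and source values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Attestation:
    """A dated quote in which a word form occurs."""
    quote: str
    year: int
    source: str


@dataclass
class WordForm:
    """A spelling variant or inflected form, with its frequency and quotes."""
    form: str
    frequency: int = 0
    attestations: list = field(default_factory=list)


@dataclass
class LexiconEntry:
    """A modern lemma linked to all of its historical word forms."""
    modern_lemma: str
    word_forms: dict = field(default_factory=dict)

    def add_attestation(self, form, quote, year, source):
        word_form = self.word_forms.setdefault(form, WordForm(form))
        word_form.frequency += 1
        word_form.attestations.append(Attestation(quote, year, source))


def form_index(entries):
    """Map each word form to the set of modern lemmas it may belong to."""
    index = {}
    for entry in entries:
        for form in entry.word_forms:
            index.setdefault(form, set()).add(entry.modern_lemma)
    return index


# Placeholder data, for illustration only.
entry = LexiconEntry("wereld")
entry.add_attestation("werelt", "(placeholder quote)", 1650, "(placeholder source)")
entry.add_attestation("werelt", "(placeholder quote)", 1672, "(placeholder source)")
entry.add_attestation("waereld", "(placeholder quote)", 1720, "(placeholder source)")
print(form_index([entry]))
```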
NE (named entity) lexica
Lexica for OCR and NE recognition and variant matching in historical documents!
- English, German and Dutch
- Stanford NE tagger with additional IMPACT module
- NE repository with gazetteers and authority files
Parallel session: Frank Landsbergen on the NE work in IMPACT
Strategies, material and Toolbox for Lexicon building
Types of variation (spelling, inflection…)
uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk
I
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
II
(patterns to predict variation)
(some variants are predictable with patterns; others need to be taken from a lexicon)
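Predicting variation with patterns can be sketched as applying optional rewrite rules to a modern form and collecting every combination. The three rules in `RULES` are hand-picked from the forms shown above purely for illustration; they are assumptions, not rules produced by the IMPACT tools, and `generate_variants` is a hypothetical helper.

```python
from itertools import product

# Hypothetical rules: modern substring -> possible historical spellings.
RULES = {
    "ui": ["uy", "w"],
    "ij": ["y"],
    "k": ["ck"],
}


def generate_variants(modern, rules=RULES):
    """Return the modern form plus every variant the rules can produce."""
    segments = []
    i = 0
    while i < len(modern):
        # Prefer the longest rule that matches at this position.
        for source in sorted(rules, key=len, reverse=True):
            if modern.startswith(source, i):
                segments.append([source] + rules[source])
                i += len(source)
                break
        else:
            # No rule applies here: keep the character unchanged.
            segments.append([modern[i]])
            i += 1
    return {"".join(choice) for choice in product(*segments)}


variants = generate_variants("uiterlijk")
print(sorted(variants))
```

Rules like these also produce forms that are not attested, so generated candidates still need checking against corpus or lexicon evidence.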
Material for lexicon building:
- historical dictionaries with quotations (OED, WNT)
- corpus material, ground truth quality
- list of dictionary entries
- modern or historical language computational lexica
Toolbox (a selection):
- Tool to automatically derive spelling variation rules from a dataset of historical word forms with their modern equivalents
  e.g. to be used to predict historical forms starting from a modern lexicon
- Tool to automatically expand a list of dictionary entries with inflectional variants (“reverse lemmatisation”)
- Tool to lemmatise word forms
  Historical word > (1) pattern matching > standard (“modern”) spelling > (2) lemmatizer > lemma form
  Dystels > (1) > distels > (2) > distel
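The two-step lemmatisation of "Dystels" can be sketched as follows: step 1 rewrites the historical spelling with patterns and keeps only candidates found in a modern lexicon, and step 2 looks up the lemma of each surviving modern form. `NORMALISATION_RULES` and `MODERN_LEXICON` are tiny hypothetical stand-ins, and the function names are assumptions rather than the toolbox's interface.

```python
import re

# Hypothetical rewrite rules: historical pattern -> modern replacement.
NORMALISATION_RULES = [
    (r"y", "i"),
    (r"ck", "k"),
    (r"ae", "aa"),
    (r"lt$", "ld"),
]

# Hypothetical modern full-form lexicon: word form -> lemma.
MODERN_LEXICON = {
    "distels": "distel",
    "distel": "distel",
    "wereld": "wereld",
}


def modernise(historical):
    """Step 1: modern spelling candidates that occur in the modern lexicon."""
    candidates = {historical.lower()}
    for pattern, replacement in NORMALISATION_RULES:
        # Each rule is optional: keep the old candidates and add rewritten ones.
        candidates |= {re.sub(pattern, replacement, c) for c in candidates}
    return sorted(c for c in candidates if c in MODERN_LEXICON)


def lemmatise(historical):
    """Step 2: lemma lookup for each accepted modern spelling."""
    return sorted({MODERN_LEXICON[form] for form in modernise(historical)})


print(modernise("Dystels"), lemmatise("Dystels"))
```

Because `re.sub` rewrites every occurrence of a pattern at once, this sketch cannot rewrite one occurrence and leave another untouched; a fuller matcher would treat each position separately.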
Corpus-based lexicon building (COBALT)
Improvement of state of the art / innovation
We use existing computational linguistic approaches, but figure out how to apply them to historical language
We develop a workflow to deal with the problems posed by historical language, figuring out how all pieces fit together:
- Data selection and acquisition
- Manual work
- Computational linguistics tools
Cross-language perspective on lexicon building
Multi-Language Lexicon Building: Challenges
• Different points of departure
– For which periods does historical lexicon building make sense?
– What language resources (lexica, corpora, dictionaries) are available?
– What tools are available?
– Special character sets (Polish, Bulgarian)
• Set up fruitful cooperation with many institutes (“training”)
– General meetings
– Individual training sessions by LMU and INL
– Extensive testing of tools, additional feature requirements
Different languages, different periods
Resources for lexicon building
Language | Dictionaries | Corpora | Lexica
Bulgarian | - | Ground Truth, Early OCR | -
Czech | Jungmann, Kott | Ground Truth, Czech National Corpus | Based on modern dictionary
English | OED | Ground Truth | -
French | - | Ground Truth, Frantext | Morphalou
Polish | The dictionary of 17th and early 18th century Polish | Ground Truth | Grammatical dictionary of Polish
Slovene | - | AHLib, Wikisource, Ground Truth | Multext-East lexicon
Spanish | Diccionario de Autoridades, Real Academia Española | Cervantes Virtual Library, Ground Truth | Apertium lexicon
Issues and challenges
Language | Issues | Countermeasures
Bulgarian | Some characters in late 19th century Bulgarian not recognized by FineReader; Old Church Slavonic printing not implemented at all; lack of sufficient corpus material | Special font training; lexicon development, ground truth
Czech | Lack of sufficient corpus material | Lexicon development, ground truth
Polish | Special glyphs; lack of sufficient corpus material | Lexicon development, ground truth
COBALT
New features:
• Page XML and TEI import
• Major adaptation: make the tool suitable for lexicon building with OCR material
• Highlighting of (suspicious) words in page image
• Editing of word forms
• Many small enhancements to improve usability at the request of users (new language partners)
Parallel language session
- Annette Gotscharek: Work on 16th Century German
- Janusz Bien: Work on Polish language
- Tomaž Erjavec: Work on Slovene
Use of Lexica in OCR
ABBYY External dictionary interface
Use of historical lexica within FineReader SDK (FR 9 and 10)
Implemented as web service in OC5 framework
Possible enhancements:
  Morphological structure: integrated in the external dictionary implementation
  Historical spelling variation patterns
Cf. the talk by Jesse de Does on, among other things, OCR results
Lexica in Retrieval
Retrieval demonstrator
Indexing and retrieval library (Java) built on the Lucene search engine
Lexicon in MySQL database
Page XML [in framework], also suitable for other XML-formats
NE tagging
Indexing and retrieval while using lexicon and NE tagging
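Indexing with a lexicon can be pictured as normalising each token to its modern lemma before it enters the index, so that a query on the lemma finds all variants. The sketch below is a toy illustration under assumptions, not the demonstrator's design: it uses Python with an in-memory SQLite table standing in for the MySQL lexicon and a plain dictionary standing in for the Lucene index, the table and function names are hypothetical, the sample pages are invented, and NE tagging is left out.

```python
import sqlite3
from collections import defaultdict


def build_lexicon_db(rows):
    """Create an in-memory lexicon table of (word form, modern lemma) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wordform (form TEXT, lemma TEXT)")
    conn.executemany("INSERT INTO wordform VALUES (?, ?)", rows)
    return conn


def build_lemma_index(conn, pages):
    """Index each page under the modern lemma of every token it contains."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in text.lower().split():
            rows = conn.execute(
                "SELECT lemma FROM wordform WHERE form = ?", (token,)
            ).fetchall()
            # Tokens unknown to the lexicon are indexed as they are.
            keys = [row[0] for row in rows] or [token]
            for key in keys:
                index[key].add(page_id)
    return index


# Invented sample data, for illustration only.
lexicon_db = build_lexicon_db([("werelt", "wereld"), ("waereld", "wereld")])
sample_pages = {
    "p1": "de werelt",
    "p2": "de Waereld is groot",
    "p3": "een boek",
}
lemma_index = build_lemma_index(lexicon_db, sample_pages)
print(sorted(lemma_index["wereld"]))
```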
Final remarks
- Cross-language perspective paper
- Lexicon cookbook + toolbox
- Lexica