IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Overview of the Language Work in IMPACT Katrien Depuydt Institute for Dutch Lexicology Leiden

IMPACT Final Conference - Katrien Depuydt

Jun 14, 2015

Overview of the IMPACT language work with Katrien Depuydt from the INL
Transcript
Page 1: IMPACT Final Conference - Katrien Depuydt

Overview of the Language Work in IMPACT

Katrien Depuydt

Institute for Dutch Lexicology

Leiden

Page 2: IMPACT Final Conference - Katrien Depuydt

IMProving ACcess to Text

Page 3: IMPACT Final Conference - Katrien Depuydt

Can we handle ‘de wereld’ (‘the world’)?

OCR:

werreid

Page 4: IMPACT Final Conference - Katrien Depuydt

OCR: ABBYY FineReader SDK with built-in standard Dutch dictionary

OCR: ABBYY FineReader SDK combining built-in modern Dutch dictionary with IMPACT external historical lexicon of Dutch:

werreld

Page 5: IMPACT Final Conference - Katrien Depuydt

werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled

RETRIEVAL: key in modern WERELD and find all
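
The retrieval idea on this slide can be sketched as query expansion: the user types the modern lemma and the lexicon supplies the historical forms to search for. This is a minimal sketch, not the IMPACT implementation. The `LEXICON` dictionary, the function names and the three sample documents are invented for illustration; the word forms are a few of the variants listed above.

```python
# Hypothetical miniature lexicon: modern lemma -> attested historical word forms.
# A real lexicon would be far larger and stored outside the code.
LEXICON = {
    "wereld": {"wereld", "werelt", "weerelt", "waereld", "werreld", "vvereld", "werlt"},
}

def expand_query(lemma, lexicon=LEXICON):
    """Return every word form to search for when the user types a modern lemma."""
    forms = set(lexicon.get(lemma.lower(), ()))
    forms.add(lemma.lower())  # always search for the typed form itself
    return forms

def search(lemma, documents, lexicon=LEXICON):
    """Return the ids of documents containing any variant of the lemma."""
    forms = expand_query(lemma, lexicon)
    return [doc_id for doc_id, text in documents.items()
            if forms & set(text.lower().split())]

# Invented sample documents.
documents = {
    "doc1": "de gantsche werelt is vol",
    "doc2": "in de waereld van heden",
    "doc3": "een boek over planten",
}
print(search("WERELD", documents))  # ['doc1', 'doc2']
```

Expanding at query time keeps the index untouched, at the cost of longer queries for lemmas with many variants.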

Page 6: IMPACT Final Conference - Katrien Depuydt

Lexica in IMPACT

Page 7: IMPACT Final Conference - Katrien Depuydt

The OCR lexicon

A checked list of words in a language
- Based on a corpus (collection) of dated texts (selection!)
- Preferably with frequency information
- Preferably from the same time period or of the same text type as the texts you wish to digitize

For OCR and OCR postcorrection
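
As a rough illustration of the points above, a candidate word list with frequencies can be counted from dated texts restricted to the period of interest. This is a sketch under assumptions: the function name, the `(year, text)` input format and the sample sentences are invented, and the result is only a candidate list that would still need manual checking.

```python
import re
from collections import Counter

def build_ocr_lexicon(texts, start_year, end_year, min_freq=1):
    """Count word forms in texts dated within [start_year, end_year].

    texts is an iterable of (year, text) pairs. Case is preserved, so
    "De" and "de" are counted separately.
    """
    counts = Counter()
    for year, text in texts:
        if start_year <= year <= end_year:
            # Sequences of letters only; digits and punctuation are skipped.
            counts.update(re.findall(r"[^\W\d_]+", text))
    return [(word, n) for word, n in counts.most_common() if n >= min_freq]

# Invented sample corpus of dated texts.
texts = [
    (1620, "de werelt is groot"),
    (1650, "de werelt en de zee"),
    (1950, "de wereld is groot"),
]
for word, freq in build_ocr_lexicon(texts, 1550, 1750, min_freq=2):
    print(word, freq)
```

Only the two texts from 1550-1750 contribute, so the modern spelling from the 1950 text does not enter the list.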

Page 8: IMPACT Final Conference - Katrien Depuydt

OCR lexicon: example

1550-1750 (word, frequency):
song 820, rihte 818, theire 818, manye 818, sume 815, Do 814, Whiche 811, fyrst 811, while 811, Water 810, wt 809, shalbe 808, thingis 807, again 806, sona 806, wa 805, mode 804, work 802, between 801, law 799, moder 798, mis 798, softe 798

> 1900 (word, frequency):
television 418, electronic 375, video 194, hormone 176, jazz 162, eco 142, software 136, vitamin 128, movie 121, taxi 113, isotopic 108, electronics 95, radar 86, basically 71, sabotage 71, homozygote 70, psychedelic 67, phonemic 66, insulin 64, zap 64, antibody 61, fungicidal 61

Page 9: IMPACT Final Conference - Katrien Depuydt

The IR lexicon

IR lexicon: most important information categories

word forms (lists of words) +
- frequency information
- quotes (dated sources) from corpora or electronic dictionaries
- MODERN LEMMA (cf. dictionary entry) linked to spelling variants and inflected forms of the same word

The modern lemma is used for searching in texts

Standard use in corpus linguistics and modern historical lexicography
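
One possible way to hold these information categories in memory is shown below. It is a sketch only: the class and field names are assumptions, not the IMPACT lexicon schema, and the quotes in the usage lines are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Attestation:
    """A dated quote in which a word form occurs."""
    year: int
    quote: str

@dataclass
class WordForm:
    """One spelling or inflected form, with its frequency and dated quotes."""
    form: str
    frequency: int = 0
    attestations: list = field(default_factory=list)

@dataclass
class LemmaEntry:
    """A modern lemma linked to all of its historical word forms."""
    lemma: str
    word_forms: dict = field(default_factory=dict)

    def add(self, form, year, quote):
        """Record one occurrence of a word form in a dated quote."""
        word_form = self.word_forms.setdefault(form, WordForm(form))
        word_form.frequency += 1
        word_form.attestations.append(Attestation(year, quote))

    def forms_between(self, start_year, end_year):
        """Word forms attested at least once within the given period."""
        return sorted(
            wf.form for wf in self.word_forms.values()
            if any(start_year <= a.year <= end_year for a in wf.attestations)
        )

# Invented quotes, for illustration only.
entry = LemmaEntry("wereld")
entry.add("werelt", 1620, "de gantsche werelt")
entry.add("werelt", 1640, "in dese werelt")
entry.add("waereld", 1780, "de waereld van heden")
print(entry.forms_between(1600, 1700))  # ['werelt']
```

Keeping the dated quotes with each form makes it possible to restrict variants to the period of the texts being searched.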

Page 10: IMPACT Final Conference - Katrien Depuydt

NE lexica

Lexica for OCR and NE recognition and variant matching in historical documents!

- English, German and Dutch
- Stanford NE tagger with additional IMPACT module
- NE repository with gazetteers and authority files

Parallel session: Frank Landsbergen on the NE work in IMPACT

Page 11: IMPACT Final Conference - Katrien Depuydt

Strategies, material and Toolbox for Lexicon building

Page 12: IMPACT Final Conference - Katrien Depuydt

Types of variation (spelling, inflection…)

uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk

I

werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled

II

(patterns to predict variation)

(a number are predictable with patterns, others need to be taken from a lexicon)
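
The difference between predictable and unpredictable variants can be illustrated with a small pattern expander. This is a sketch, assuming three rewrite patterns read off the uiterlijk list above; the `RULES` table and the function name are hypothetical and are not the patterns used in IMPACT.

```python
from itertools import product

# Assumed patterns: modern spelling -> possible historical spellings
# (the modern spelling itself is kept as one of the options).
RULES = {
    "ui": ["ui", "uy", "uij"],
    "ij": ["ij", "y"],
    "k": ["k", "ck"],
}

def predict_variants(modern, rules=RULES):
    """Generate candidate historical spellings of a modern word form."""
    segments = []  # one list of alternative spellings per matched segment
    i = 0
    while i < len(modern):
        # Prefer the longest pattern that matches at position i.
        match = max((lhs for lhs in rules if modern.startswith(lhs, i)),
                    key=len, default=None)
        if match:
            segments.append(rules[match])
            i += len(match)
        else:
            segments.append([modern[i]])
            i += 1
    return {"".join(choice) for choice in product(*segments)}

variants = predict_variants("uiterlijk")
print(len(variants))                     # 12
print("uyterlijck" in variants)          # True
print(predict_variants("wereld"))        # {'wereld'}
```

With these patterns, forms such as uyterlijck or uijterlijk are generated, whereas a form like wald for wereld is out of reach of any simple pattern and has to come from the lexicon.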

Page 13: IMPACT Final Conference - Katrien Depuydt

Material for lexicon building:

- historical dictionaries with quotations (OED, WNT)
- corpus material, ground truth quality
- list of dictionary entries
- modern or historical language computational lexica

Page 14: IMPACT Final Conference - Katrien Depuydt

Toolbox (a selection):

- Tool to automatically derive spelling variation rules from a dataset of historical word forms with their modern equivalents
  E.g. to be used to predict historical forms starting from a modern lexicon

- Tool to automatically expand a list of dictionary entries with inflectional variants (“reverse lemmatisation”)

- Tool to lemmatise word forms
  Historical word > (1) pattern matching > standard (“modern”) spelling > (2) lemmatizer > lemma form

  Dystels > (1) > distels > (2) > distel
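
The first and third tool can be illustrated together: derive rewrite rules from pairs of historical and modern forms, then use them to modernise a word before looking up its lemma. This is a simplified sketch, not the IMPACT tools. It uses Python's `difflib.SequenceMatcher` for the alignment, and the training pairs, the `MODERN_FORMS` table and all function names are invented for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher

def derive_rules(pairs):
    """Count character rewrites (historical -> modern) observed in word pairs."""
    rules = Counter()
    for historical, modern in pairs:
        matcher = SequenceMatcher(None, historical, modern, autojunk=False)
        for tag, h0, h1, m0, m1 in matcher.get_opcodes():
            if tag == "replace":
                rules[(historical[h0:h1], modern[m0:m1])] += 1
    return rules

def modernise(word, rules, modern_forms, max_steps=1000):
    """Step (1): rewrite the word until a known modern form is reached."""
    queue, seen = [word.lower()], {word.lower()}
    for _ in range(max_steps):  # guard against rules that keep growing a word
        if not queue:
            break
        form = queue.pop(0)
        if form in modern_forms:
            return form
        for old, new in rules:
            rewritten = form.replace(old, new)
            if rewritten not in seen:
                seen.add(rewritten)
                queue.append(rewritten)
    return None

def lemmatise(word, rules, modern_forms):
    """Step (2): look up the lemma of the modernised form."""
    return modern_forms.get(modernise(word, rules, modern_forms))

# Invented training pairs and a tiny modern lexicon (word form -> lemma).
PAIRS = [("kynd", "kind"), ("tyd", "tijd")]
MODERN_FORMS = {"distels": "distel", "distel": "distel"}

rules = derive_rules(PAIRS)
print(lemmatise("Dystels", rules, MODERN_FORMS))  # distel
```

Checking each rewritten candidate against the modern lexicon is what keeps ambiguous rules (here y > i and y > ij) from producing non-words.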

Page 17: IMPACT Final Conference - Katrien Depuydt

Corpus-based lexicon building (COBALT)

Page 18: IMPACT Final Conference - Katrien Depuydt

Improvement of state of the art / innovation

We use existing computational linguistic approaches, but figure out how to apply them to historical language

We develop a workflow to deal with the problems posed by historical language, figuring out how all pieces fit together:
- Data selection and acquisition
- Manual work
- Computational linguistics tools

Page 19: IMPACT Final Conference - Katrien Depuydt

Cross-language perspective on lexicon building

Page 20: IMPACT Final Conference - Katrien Depuydt

Multi-Language Lexicon Building: Challenges

• Different points of departure
  – For which periods does historical lexicon building make sense?
  – What language resources (lexica, corpora, dictionaries) are available?
  – What tools are available?
  – Special character sets (Polish, Bulgarian)

• Set up fruitful cooperation with many institutes (“training”)
  – General meetings
  – Individual training sessions by LMU and INL
  – Extensive testing of tools, additional feature requirements

Page 21: IMPACT Final Conference - Katrien Depuydt

Different languages, different periods

Page 22: IMPACT Final Conference - Katrien Depuydt

Resources for lexicon building

Language  | Dictionaries | Corpora | Lexica
Bulgarian | - | Ground Truth, Early OCR | -
Czech     | Jungmann, Kott | Ground Truth, Czech National Corpus | Based on modern dictionary
English   | OED | Ground Truth | -
French    | - | Ground Truth, Frantext | Morphalou
Polish    | The dictionary of 17th and early 18th century Polish | Ground Truth | Grammatical Dictionary of Polish
Slovene   | - | AHLib, Wikisource, Ground Truth | Multext-East lexicon
Spanish   | Diccionario de Autoridades, Real Academia Española | Cervantes Virtual Library, Ground Truth | Apertium lexicon

Page 23: IMPACT Final Conference - Katrien Depuydt

Issues and challenges

Language  | Issues | Countermeasures
Bulgarian | Some characters in late 19th century Bulgarian not recognized by FineReader; Old Church Slavonic printing not at all implemented; lack of sufficient corpus material | Special font training; lexicon development ground truth
Czech     | Lack of sufficient corpus material | lexicon development ground truth
Polish    | Special glyphs; lack of sufficient corpus material | lexicon development ground truth

Page 24: IMPACT Final Conference - Katrien Depuydt

COBALT

New features:
• Page XML and TEI import
• Major adaptation: make tool suitable for lexicon building with OCR material
• Highlighting of (suspicious) words in page image
• Editing of word forms
• Many small enhancements to improve usability at the request of users (new language partners)

Page 25: IMPACT Final Conference - Katrien Depuydt

Parallel language session

- Annette Gotscharek: Work on 16th Century German

- Janusz Bien: Work on Polish language

- Tomaž Erjavec: Work on Slovene

Page 26: IMPACT Final Conference - Katrien Depuydt

Use of Lexica in OCR

Page 27: IMPACT Final Conference - Katrien Depuydt

ABBYY External dictionary interface

- Use of historical lexica within FineReader SDK (FR 9 and 10)
- Implemented as web service in OC5 framework
- Possible enhancements
  - Morphological structure: integrated in the external dictionary implementation
  - Historical spelling variation patterns

Cf. talk by Jesse de Does on, among other things, OCR results

Page 29: IMPACT Final Conference - Katrien Depuydt

Lexica in Retrieval

Page 30: IMPACT Final Conference - Katrien Depuydt

Retrieval demonstrator

Indexing and retrieval library (Java) built on the Lucene search engine

Lexicon in MySQL database

Page XML [in framework], also suitable for other XML-formats

NE tagging

Indexing and retrieval while using lexicon and NE tagging
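
The combination of lexicon and NE tagging at indexing time can be sketched with a toy inverted index. The demonstrator itself is a Java library on Lucene with the lexicon in MySQL; the Python below is only a language-neutral illustration, and the two lookup tables, the key prefixes and the sample documents are assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical lookup tables standing in for the lexicon and the NE repository.
FORM_TO_LEMMA = {"werelt": "wereld", "waereld": "wereld", "wereld": "wereld"}
GAZETTEER = {"leyden": ("LOC", "Leiden")}  # variant -> (NE type, normalised name)

def build_index(documents):
    """Map each search key to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index["form:" + token].add(doc_id)  # literal word form
            if token in FORM_TO_LEMMA:          # modern lemma from the lexicon
                index["lemma:" + FORM_TO_LEMMA[token]].add(doc_id)
            if token in GAZETTEER:              # named entity from the gazetteer
                ne_type, name = GAZETTEER[token]
                index["ne:" + ne_type + ":" + name.lower()].add(doc_id)
    return index

# Invented sample documents.
documents = {
    "d1": "de gantsche werelt",
    "d2": "de waereld gezien vanuit leyden",
}
index = build_index(documents)
print(sorted(index["lemma:wereld"]))   # ['d1', 'd2']
print(sorted(index["ne:LOC:leiden"]))  # ['d2']
```

Storing the lemma and the normalised entity name next to the literal form lets a user choose between exact and variant-aware search without changing the query.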

Page 31: IMPACT Final Conference - Katrien Depuydt

Neil Fitzgerald, 7th July 2011

Page 46: IMPACT Final Conference - Katrien Depuydt

Final remarks

- Cross language perspective paper

- Lexicon cookbook + toolbox

- Lexica