READ-COOP SCE European Cooperative with limited liability

Public Models in Transkribus

Last update of this guide: 17/03/2020

This document gives an overview of the publicly available models currently offered in Transkribus. For each model you will find a short description of the training material, the languages the model is useful for, and who created and trained it. We are working on making more models available to Transkribus users, so that they can benefit from the network effect and save time and effort.

The models in this document are listed in alphabetical order. The abbreviation “CER” stands for “Character Error Rate”: the percentage of characters that the neural network transcribed incorrectly.
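
To make the CER figures in this overview more concrete, here is a minimal sketch of how a character error rate is commonly computed: the edit distance (insertions, deletions and substitutions) between a reference transcription and the recognised text, divided by the length of the reference. This guide does not state how Transkribus itself calculates CER, so the definition, the function names and the sample strings below are assumptions for illustration only.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `reference` into `hypothesis`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing in hypothesis
                current[j - 1] + 1,      # extra character in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER in percent (assumed definition: edit distance / reference length)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one of 11 characters is missing in the recognised text.
print(round(character_error_rate("Transkribus", "Trankribus"), 2))  # 9.09
```

Under this definition, a CER of 4.28% means that roughly four characters in every hundred of the reference text would need to be corrected.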

Download the Transkribus Expert Client, or make sure you are using the latest version:

- https://transkribus.eu/

Consult the Transkribus Wiki for further information and other how-to guides:

- https://transkribus.eu/wiki/

Transkribus and the technology behind it are made available via the following projects and sites:

- https://read.transkribus.eu/

- https://transcriptorium.eu/

- https://github.com/transkribus/

Contact

- The Transkribus Team: [email protected]

Contents

Danish 19th-20th century
Danish 20th century
Danish Fraktur 19th century
Danish Handwriting 1881-1913
Devanagari Mixed 19th-20th century
Devanagari Nagara 19th century
Dutch Gothic Print 16th-18th century
Dutch Handwriting
Dutch late 17th century
Dutch Notarial 18th century
Dutch Poetry 1603-1636
Dutch Romantype Print 16th-19th century
English Handwriting 18th-19th century
Finnish 19th century
French 18th Century Print
French 17th Century Print
French and Latin Chancery documents
French Livre Rouge
German Fraktur 19th-20th century
German Fraktur 18th-20th century
German Kurrent and Sütterlin 17th-20th century
Latin (Greek, German, English, Italian) 16th-18th century
Russian Church Slavonic
Swedish 17th century
Credits

The Transkribus Platform is provided by the European Cooperative READ-COOP SCE.

Until June 2019 Transkribus was financed as part of the Horizon 2020 READ-project under grant agreement No. 674943.

Danish 19th-20th century

Model name: Danish 1870-1950

Creator: Aarhus City Archives

This is a general model for Danish handwriting from the late 19th and the 20th century. It is based on the model that follows next in this document (RoyalDanishLibrary_20thCentury+) and on parish council minutes from Aarhus City Archives; the CER is 4.28%. The model was created by Jan Mattias Jonsson Agger at Aarhus City Archives, building on work by volunteers at the City Archives and by Jakob K. Meile and the staff at the Royal Danish Library.

Danish 20th century

Model name: RoyalDanishLibrary_20thCentury+

Creator: Royal Danish Library

This is a general model for Danish cursive handwriting of the 20th century, based on 16 different scribes. It was created by Jakob K. Meile and his colleagues at the Royal Danish Library. The model was trained on about 580 400 words, and the CER reaches 3.99% on the validation set. More information about the Royal Danish Library and its projects can be found here:

https://www.kb.dk/en/

Danish Fraktur 19th century

Model name: Danish Fraktur SB 19th century v.2.3

Creator: Poul Steen

This model is based on more than 500 pages (about 34 000 words) from the Royal Danish Court & State Calendar and Danish High Court Proceedings of the 19th century. The CER reaches 1.59%.

Danish Handwriting 1881-1913

Model name: Gjentofte 1881-1913 Denmark 1000 epochs

Creator: Gentofte Community Archive Transkribus Team

This model is based on protocols (minutes) from meetings of the locally elected community council. They were written in turn by the council members during the meetings and are of varying quality, with several corrections and additions inserted between the lines. Some writers use non-standard abbreviations. The model was trained on more than 154 000 words, and the CER is 4.43%.

Devanagari Mixed 19th-20th century

Model name: Devanagari mixed M1

Creator: Heidelberg University Library

This model recognizes the South Asian Devanagari script. It is based on ca. 200 pages of late 19th and early 20th century books by the Indian Naval Kishore Press. The books were mainly printed in lead typesetting, but the training data also contain pages produced lithographically. The model is provided by Heidelberg University Library as part of the FID Asien project. Texts and data of the Naval Kishore Press can be accessed online here: https://digi.ub.uni-heidelberg.de/en/sammlungen/suedasien/navalkishore.html

Devanagari Nagara 19th century

Model name: Devanagari_nagara_M1

Creator: Heidelberg University Library

This model recognizes the South Asian Devanagari script. It is based on 65 pages of late 19th century books by the Indian Naval Kishore Press, all printed with the same type. The model is provided by Heidelberg University Library as part of the FID Asien project. Texts and data of the Naval Kishore Press can be accessed online here: https://digi.ub.uni-heidelberg.de/en/sammlungen/suedasien/navalkishore.html

Dutch Gothic Print 16th-18th century

Model name: Dutch_Gothic_Print

Creator: Entangled Histories (National Library Netherlands)

This model is based on printed texts in the Gothic font that was used in the Low Countries during the 16th, 17th and 18th century. The sources used for this model are books of ordinances, which contained the norms ('laws') of the time. The model is a result of one of the 2019 Researcher-in-Residence positions at the KB National Library of the Netherlands; the project was called 'Entangled Histories'. The model was trained on about 51 100 words, and the CER on the validation set is 1.71%. For more information on the background of the model and how to cite it, please visit: https://lab.kb.nl/dataset/entangled-histories-ordinances-low-countries

Dutch Handwriting

Model name: NAN/NHA_GT_M3+

Creator: National Archives Netherlands

The digitisation team around Liesbeth Keyser at the National Archives of the Netherlands is working hard on creating training data for its collections in order to prepare HTR processing on a large scale. As a first result, a model based on 475 769 words is now available to Transkribus users. The model shows a Character Error Rate of 7.48% on the training set and 6.15% on the validation set. It is based on the careful transcription of dozens of different handwritings and comprises scans from the Incoming Documents from the Dutch East India Company (Overgekomen Brieven en Papieren van de VOC) of the National Archives of the Netherlands and of 19th century notarial deeds from the Noord-Hollands Archief. The model is named NAN/NHA_GT_M3+. Enjoy!

Dutch late 17th century

Model name: Dutch Margaretha Turnor 17th Century

Creator: The Utrecht Archives

This is the first model created by the Utrecht Archives. It is based on a thousand letters of Margaretha Turnor, who wrote to her husband during the late 17th century. She managed the castle of Amerongen while her husband worked abroad as a diplomat for the Dutch Republic. Her letters provide an insight into family life in the Dutch Republic as well as the political situation in the country. The model was trained on about 36 000 words, and the CER on the validation set is 1.83%.

Dutch Notarial 18th century

Model name: Dutch Notarial Model 18th Century

Creator: City Archives of Amsterdam

This is the first general 18th century model created by the City Archives of Amsterdam. It is based on thousands of scans from a total of 15 different notaries who worked in Amsterdam during the 18th century. For every notary except Van Hoorn and Van Esterwege, 10 scans are included in the validation set (2671 scans for training, 130 for validation). The model was trained on about 623 000 words, and the CER is 5.27% on the validation set.

Dutch Poetry 1603-1636

Model name: Dutch poetry 1603-1636

Creator: Bram Caers

The model was trained on an extensive manuscript of early modern poetry, written in several hands (one of which is the most important), using different types of writing and special layouts (e.g. chronograms). The author of the manuscript is a rhetorician (vernacular poet) from Mechelen, in present-day Belgium, active in the first decades of the seventeenth century. The training was based on over 51 000 words (more than 200 folios of text), and the CER is 4.78%.

Dutch Romantype Print 16th-19th century

Model name: Dutch_Romantype_Print

Creator: Entangled Histories project (National Archives Netherlands)

This model is based on printed texts in the Roman-type fonts that were used in the Low Countries during the late 16th, 17th, 18th and 19th century. Some pages may contain (properly transcribed) Gothic font, and French and Latin texts were included so that words in those languages are transcribed more or less correctly when they occur. The sources used for this model are books of ordinances, which contained the norms ('laws') of the time. The model was trained on about 88 000 words, and the CER on the validation set is 1.17%. The model is a result of one of the 2019 Researcher-in-Residence positions at the KB National Library of the Netherlands; the project was called 'Entangled Histories'. For more information on the background of the model and how to cite it, please visit: https://lab.kb.nl/dataset/entangled-histories-ordinances-low-countries

English Handwriting 18th-19th century

Model name: English Writing M1

Creator: University College London – Bentham project

This model was trained on over 50,000 words from papers written by the English philosopher Jeremy Bentham (1748–1832) and his secretaries. In the best cases, it generates an output where around 95 per cent of characters on similar pages from the Bentham collection are transcribed correctly by the programme. More info about the Bentham project can be found here:

Finnish 19th century

Model name: NAF Court Records M10

Creator: National Archives Finland

This model is based on Renovated District Court Records (Fi: Kihlakunnanoikeuksien renovoidut tuomiokirjat, Swe: Häradsrätternas renoverade domböcker) from the years 1809-1870. The model's training set consists of 2841 double pages and the validation set of 100 double pages. Since there were many (dozens of) scribes, the material combines many different handwritings.

The Ground Truth material was picked from 58 different court districts across Finland. Most of the Ground Truth is in Swedish, but there is also some Finnish, since from the 1850s some court districts started to write court records in Finnish. Renovated District Court Records are split into two series: Main Records and Notification Records. This model includes mostly Notification Records. Nevertheless, the model also works well with Main Records. This model was created as part of the READ project at the National Archives of Finland (NAF). It has been used to transcribe the Notification Records from the years 1809-1870 (all districts). As a result, a search interface has been implemented where you can perform full-text searches and browse automatically transcribed documents. The search interface and more information can be found at: www.transkribus.eu/r/kws

French 18th Century Print

Model name: French_18thC_Print

Creator: Entangled Histories project (National Library Netherlands)

This model is based on printed texts in French (Roman-type font) that were used in Flanders (Low Countries) during the 18th century. The sources used for this model are books of ordinances, which contained the norms ('laws') of the time. The model is a result of one of the 2019 Researcher-in-Residence positions at the KB National Library of the Netherlands; the project was called 'Entangled Histories'. The training set contains about 38 500 words, and the CER on the validation set is 0.65%. The books used for this specific model were provided by the Bodleian Library Oxford (RECUEIL DES ÉDITS, DÉCLARATIONS, LETTRES-PATENTES, &c. ENREGISTRÉS AU PARLEMENT DE FLANDRES). For more information on the background of the model and how to cite it, please visit: https://lab.kb.nl/dataset/entangled-histories-ordinances-low-countries

French 17th Century Print

Model name: Parallèle des Anciens et des Modernes M2

Creator: Project: Un choc de modernité : Anciens et Modernes au tournant des XVIIe et XVIIIe siècles

This model is based on a French text printed at the end of the 17th century: Parallèle des Anciens et des Modernes by Charles Perrault (1688-1697, publisher: Jean-Baptiste Coignard). It was trained for a digital edition as part of the project "Un choc de modernité : Anciens et Modernes au tournant des XVIIe et XVIIIe siècles" (IHRIM UMR 5317). The model was trained on more than 65 000 words, and the CER is 2.70%.

French and Latin Chancery documents

Model name: HIMANIS Chancery M1+

Creator: HIMANIS project

As part of the HIMANIS project (led by D. Stutzmann, C. Kermorvant & E. Vidal), the text edition provided by P. Guérin and encoded in TEI by the Ecole nationale des Chartes (http://corpus.enc.sorbonne.fr/actesroyauxdupoitou/) and the one by J. Viard were aligned at line level and used to train this comprehensive model for French and Latin Chancery documents. The training set includes about 666 000 words, and the CER reaches 5.33% on the validation set. More information on the project can be found at: http://himanis.huma-num.fr/himanis/

French Livre Rouge

Model name: LaMOP-Livre_Rouge_1

Creator: Paris University

This model is based on the book "Y//3 Livre Rouge, Châtelet de Paris (11..-1790)" (Archives Nationales de France). It was released by Hugo Regazzi (Universite Paris 1/LaMOP), Pierre Brochard (CNRS/LaMOP) and Julie Claustre (Universite Paris 1/LaMOP). All the data and pictures can be found at: https://gitlab.huma-num.fr/lamop/htr/blob/master/Livre_Rouge-Archives_Nationales/README.txt

The model was trained on 20 000 words, and the error rate is 8%.

German Fraktur 19th-20th century

Model name: ONB_Newseye_GT_M1+

Creator: Austrian National Library and NewsEye project

Thanks to the Library Labs of the Austrian National Library and the NewsEye project, we are happy to announce the release of a free model that can read German Fraktur documents, especially from the 19th and 20th century, at a convincing quality that outperforms most standard OCR engines. The model is based on training data from the ANNO collection of the Austrian National Library and comprises 442 141 words. It shows a CER of 1.55% on the training set and 1.65% on the test set without any dictionary support. Note: the model is trained on German-language documents. It will provide less convincing results for other languages, such as Swedish or Finnish Fraktur. However, models for these languages are also in preparation and may be released in the coming months. The Fraktur model is available to every registered user in Transkribus and is called ONB_Newseye_GT_M1+. Have fun!

German Fraktur 18th-20th century

Model name: NZZ Gold Standard M1+

Creator: University of Zurich

The model is based on 167 title pages from the Neue Zürcher Zeitung (NZZ) covering the years 1780 to 1940. The model was trained on about 273 400 words, and the CER on the validation set is 0.45% (every 10th page was used for the validation set). The model is provided by the Computational Linguistics Group (Simon Clematide, Philip Ströbel) of the University of Zurich within the framework of the Impresso project. https://impresso-project.ch/

German Kurrent and Sütterlin 17th-20th century

Model name: German Kurrent M1+

Creator: Transkribus Team, University of Innsbruck

This is a global model that recognizes German Kurrent, Sütterlin and Fraktur scripts from the 17th to the 20th century. The training data set includes nearly 500 000 words, and the CER on the validation set is 5.29%.

Latin (Greek, German, English, Italian) 16th-18th century

Model name: Noscemus GM v1

Creator: Noscemus project (University of Innsbruck)

The Noscemus general model is able to read printed Latin text, especially from the 16th, 17th and 18th century. The model was released by Stefan Zathammer and is based on training data from the Digital Sourcebook of the Noscemus project. The model is tailored towards transcribing (Neo-)Latin texts set in Antiqua-based typefaces, but it is also able, to a certain degree, to handle Greek words and words set in (German) Fraktur. The training data comprise 170 658 words and 27 296 lines; the model shows a CER of 0.87% on the training set and 0.92% on the validation set.

Russian Church Slavonic

Model name:

• Combined_Full_VKS_2

• VMC_Test_4+

Creator: Achim Rabus (University of Freiburg)

Prof. Achim Rabus from the University of Freiburg has released two specialized models that are able to read Russian Church Slavonic. The first model is called VMC_Test_4+. Its training data consist of parts of the Russian Church Slavonic Great Reading Menology (16th century). The model is tailored towards transcribing Cyrillic semi-uncial script from the 16th century. Character Error Rates for the training data are 3.72% and for the validation set 3.92% and for the validation set 3.82%.

The second model is called Combined_Full_VKS_2. Its training data consist of parts of the Russian Church Slavonic Great Reading Menology (16th century), the Old Church Slavonic Codex Suprasliensis (11th century), and the 11th century manuscript of the Catecheses of Cyril of Jerusalem. This is a generic model suitable for transcribing a variety of Old Cyrillic script styles, including uncial and semi-uncial. Character Error Rates are 4.42% for the training data and 3.92% for the validation set.

Achim has written a detailed report about his usage of Transkribus. Though it deals with Church Slavonic, it is definitely interesting for other users as well. Thanks a lot!

Swedish 17th century

The model “Jaemtlands_domsagasM1+” is trained on 5946 pages (ca. 491 300 words) from court books from Jämtland county in Sweden (Jämtlands läns domsaga) from the years 1647-1688. The books are the original ones, written by different local writers on location, not the copies that were written later and sent to the royal court in Stockholm (“renoverade domböcker”). The texts are written in Swedish. The transcripts used are not 100% true to the original spelling: some abbreviations are spelled out (for example r:dr = riksdaler), and there are also a few remarks in brackets in the transcripts. The CER is 6.32%.

Credits

We would like to thank our users who have made it possible to publicise these models.