READ-COOP SCE European Cooperative with limited liability
Public Models in Transkribus
Last update of this guide: 17/03/2020
This document gives an overview of the publicly available models that Transkribus offers so far. For each model you will find a short description of the training material, the languages the model is useful for, and who created and trained it. We are working on making more models available to Transkribus users, so that they can benefit from the network effect and save work and time.
The models in this document are listed in alphabetical order. The abbreviation “CER” in this overview stands for “Character Error Rate”: the percentage of characters that the neural network transcribed incorrectly.
Download the Transkribus Expert Client, or make sure you are using the latest version:
- https://transkribus.eu/
Consult the Transkribus Wiki for further information and other How to Guides:
- https://transkribus.eu/wiki/
Transkribus and the technology behind it are made available via the following projects and sites:
- https://read.transkribus.eu/
- https://transcriptorium.eu/
- https://github.com/transkribus/
Contact - The Transkribus Team: [email protected]
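To make the CER figures quoted in this overview more concrete, here is a minimal Python sketch of how a character error rate can be computed. It assumes the common definition based on edit distance (insertions, deletions and substitutions, divided by the length of the correct transcription); this guide does not state how Transkribus itself calculates the value, and the function names `levenshtein` and `character_error_rate` are hypothetical, chosen for illustration only.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    # previous[j] = distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the hypothesis
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER in percent, assuming the edit-distance definition
    (an assumption, not taken from the Transkribus guide)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ground_truth = "ordinance"   # correct transcription
    recognised = "ordinanee"     # automatic transcription with one wrong character
    print(f"CER: {character_error_rate(ground_truth, recognised):.2f}%")
```

Under this definition, a CER of 1.17% means that roughly one character in 85 needs correcting, while a model that transcribes about 95 per cent of characters correctly has a CER of about 5%.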
This model is based on printed texts in the Roman-type fonts that were used in the Low Countries during the late 16th, 17th, 18th and 19th centuries. Some pages may contain (properly transcribed) Gothic type, and French and Latin texts have been included so that words in those languages are transcribed (more or less) correctly when they occur. The sources used for this model are books of ordinances, which contained the norms ('laws') of the time. The model was trained on about 88 000 words and the CER on the validation set is 1.17%. This model is a result of one of the KB National Library of the Netherlands Researcher-in-Residence positions of 2019. The project was called 'Entangled Histories'. For more information regarding the background of the model and how to cite it, please visit: https://lab.kb.nl/dataset/entangled-histories-ordinances-low-countries
English Handwriting 18th-19th century
Model name: English Writing M1
Creator: University College London – Bentham project
This model was trained on over 50,000 words from papers written by the English philosopher Jeremy Bentham (1748–1832) and his secretaries. In the best cases, around 95 per cent of the characters on similar pages from the Bentham collection are transcribed correctly, which corresponds to a CER of about 5%. More info about the Bentham project can be found here:
Finnish 19th century
Model name: NAF Court Records M10
Creator: National Archives Finland
This model is based on Renovated District Court Records (Fi: Kihlakunnanoikeuksien renovoidut
tuomiokirjat, Swe: Häradsrätternas renoverade domböcker) from the years 1809-1870. The model's
training set consists of 2841 double-pages and the validation set of 100 double-pages. Since dozens
of scribes were involved, the material combines many different handwritings.
The Ground Truth material was picked from 58 different court districts across Finland. Most of the
Ground Truth is in Swedish, but there is also some Finnish, since from the 1850s some of the court
districts started to write Court Records in Finnish. Renovated District Court Records are split into two
series: Main Records and Notification Records. This model includes mostly Notification Records.
Nevertheless, the model also works well with Main Records. This model was created as part of the
READ project at the National Archives of Finland (NAF). It has been used to transcribe the Notification
Records from the years 1809-1870 (all districts). As a result, a search interface has been implemented
where you can perform full-text searches and browse automatically transcribed documents. The
search interface and more information can be found at: www.transkribus.eu/r/kws
French 18th Century Print
Model name: French_18thC_Print
Creator: Entangled Histories project (National Library Netherlands)
This model is based on printed texts in French (Romantype Font) that were used in Flanders (Low Countries) during the 18th century. The sources used for this model are books of ordinances, which contained the norms ('laws') of the time. This model is a result of one of the KB National Library of the Netherlands Researcher-in-Residence positions of 2019. The project was called 'Entangled Histories'. The training set contains about 38 500 words and the CER on the validation set is 0.65%. The books used for this specific model were provided by the Bodleian Library Oxford (RECUEIL DES ÉDITS, DÉCLARATIONS, LETTRES-PATENTES, &c. ENREGISTRÉS AU PARLEMENT DE FLANDRES). For more information regarding the background of the model and how to cite it, please visit: https://lab.kb.nl/dataset/entangled-histories-ordinances-low-countries
French 17th Century Print
Model name: Parallèle des Anciens et des Modernes M2
Creator: Project: Un choc de modernité : Anciens et Modernes au tournant des XVIIe et XVIIIe siècles
This model is based on a printed text in French from the end of the 17th century: Parallèle des Anciens et des Modernes by Charles Perrault (1688-1697, publisher: Jean-Baptiste Coignard).
It was trained for a digital edition as part of the project "Un choc de modernité : Anciens et Modernes au tournant des XVIIe et XVIIIe siècles" (IHRIM UMR 5317). The model was trained on more than 65 000 words and the CER is 2.70%.
French and Latin Chancery documents
Model name: HIMANIS Chancery M1+
Creator: HIMANIS project
As part of the HIMANIS project (led by D. Stutzmann, C. Kermorvant & E. Vidal), the text edition
provided by P. Guérin and encoded in TEI by the Ecole nationale des Chartes
(http://corpus.enc.sorbonne.fr/actesroyauxdupoitou/) and the one by J. Viard were aligned at line
level and used to train this comprehensive model for French and Latin Chancery documents. The
training set includes about 666 000 words and the CER goes down to 5.33% on the validation set.
More information on the project can be found at: http://himanis.huma-num.fr/himanis/