D6.3.2: Evaluation of Public DIA, HTR & KWS Platforms
Tim Causer (UCL), Silvia Arango (ULCC), Rory McNicholl (ULCC), Günter Mühlberger (UIBK), Philip Kahle (UIBK), Sebastian Colutto (UIBK)
Distribution: Public
tranScriptorium, ICT Project 600707, Deliverable 6.3.2
December 31, 2015
Project funded by the European Community under the Seventh Framework Programme for Research and Technological Development
Project ref no. ICT-600707
Project acronym TranScriptorium
Project full title tranScriptorium
Instrument STREP
Thematic Priority ICT-2011.8.2 ICT for access to cultural resources
Start date / duration 01 January 2013 / 36 Months
Distribution Public
Contractual date of delivery December 31, 2015
Actual date of delivery January 9, 2016
Date of last update January 9, 2016
Deliverable number 6.3.2
Deliverable title Evaluation of Public DIA, HTR and KWS Platforms
Type Report
Status & version Final
Number of pages 38
Contributing WP(s) 6
WP / Task responsible UCL
Other contributors UIBK, ULCC
Internal reviewer Joan Andreu Sánchez
Author(s) Tim Causer, Silvia Arango, Rory McNicholl, Günter Mühlberger, Philip Kahle, Sebastian Colutto
EC project officer Jose María del Águila
Keywords
The partners in tranScriptorium are:
Universitat Politècnica de València - UPVLC (Spain)
University of Innsbruck - UIBK (Austria)
National Center for Scientific Research “Demokritos” - NCSR (Greece)
University College London - UCL (UK)
Institute for Dutch Lexicology - INL (Netherlands)
University of London Computer Centre - ULCC (UK)

For copies of reports, updates on project activities and other tranScriptorium related information, contact:
The tranScriptorium Project Co-ordinator
Joan Andreu Sánchez, Universitat Politècnica de València
Camí de Vera s/n. 46022 València, Spain
[email protected]
Phone (34) 96 387 7358 - (34) 699 348 523

Copies of reports and other material can also be accessed via the project’s homepage: http://www.transcriptorium.eu/
Executive Summary
This document covers the design of the two versions of the Transcription Graphical User Interface (TrGUI), and their integration with the technologies developed as part of the tranScriptorium project.
The TrGUI is divided in this report into two scenarios: i) the crowdsourcing platform, referred to here as TSX; and ii) the TrGUI itself, referred to here as Transkribus.
This report describes and evaluates the development of the TSX platform (T6.1), a lightweight crowdsourcing client built upon the Transkribus infrastructure. It also evaluates transcripts produced by users of TSX, and compares them with transcripts produced in the course of the Transcribe Bentham initiative. Finally, conclusions are drawn about the potential advantages and disadvantages of introducing a crowdsourcing platform which incorporates HTR technology, together with the other technologies developed during the tranScriptorium programme.
The report also describes and evaluates the development of the Transkribus platform for content providers (T6.3).
Table of Contents
Executive Summary
1. Introduction
1.1. Background
1.2. WP6 Tasks and Status
2. Crowdsourcing HTR: TSX
2.1. TSX: rationale and development
2.2. TSX: administrative workflow
3. HTR at Content Provider Portals
3.1. Summary
3.2. Evaluation
4. Crowdsourcing HTR: evaluation of TSX
4.1. Transcribe Bentham: context and background data
4.2. Comparison of Transcribe Bentham and TSX
4.3. TSX statistics, and user interactions
4.4. Word Error Rate
4.5. Cost-efficiency of crowdsourced transcription
5. Conclusion
Table of Figures
Figure 2.1: visualization of how TSX is integrated with Transkribus
Figure 2.1.1: TSX beta version: front page
Figure 2.1.3: TSX beta version: transcription interface
Figure 2.1.4: TSX: current version
Figure 4.1.1: Transcribe Bentham quality-control workflow
Figure 4.1.2: Volume of work carried out by users for Transcribe Bentham, 1 October 2012 to 27 June 2014, showing the overall data, and for both iterations of the Transcription Desk
Figure 4.2.1: Outline comparison of the quality of transcripts submitted via the Transcribe Bentham Transcription Desk (overall, and during Period B), with those submitted via TSX
Figure 4.2.2: Time spent checking transcripts submitted using a) the first iteration of the Transcription Desk, 1 Oct 2012 to 14 July 2013 (blue); b) the second iteration of the Transcription Desk, 15 July 2013 to 27 June 2014 (red); and c) TSX
Figure 4.2.3: Errors per thousand words, comparing transcripts submitted using the Transcription Desk and TSX
Figure 4.2.4: Changes made to the text of transcripts, prior to approval, submitted using: a) the first iteration of the Transcription Desk, 1 October 2012 to 14 July 2013 (blue); b) the second iteration of the Transcription Desk, 15 July 2013 to 27 June 2014 (red); c) TSX (green)
Figure 4.2.5: the final version of TSX (under development), with a WYSIWYG interface
Figure 4.3.1: Top ten countries from which TSX was accessed, showing the percentage of overall active sessions
Figure 4.3.1: key findings from data pertaining to user interactions with TSX
Figure 4.5.1: average cost of checking transcripts submitted using the Transcription Desk and TSX, when checked by three grades of staff
Figure 4.5.2: potential cost-avoidance offered by Transcribe Bentham
Figure 4.5.3: cost-avoidance potentially offered by TSX, assuming that 61,110 manuscript pages were transcribed by users via TSX
1. Introduction
In this section, we present some background information about the tranScriptorium project,
along with some details of Work Package 6 (WP6). In doing so, we also elaborate on the
objectives of each of WP6’s tasks.
1.1 Background
The tranScriptorium Project aims to develop innovative, efficient and cost-effective solutions
for the indexing, searching and full transcription of historical handwritten document images,
using modern, holistic HTR technology. The project will turn HTR into a mature technology by
addressing the following objectives:
1. Enhancing HTR technology for efficient transcription.
2. Bringing the HTR technology to users: individual researchers with experience in
handwritten document transcription, and volunteers who collaborate in large transcription
projects.
3. Integrating the HTR results in public web portals: the outcomes of
the tranScriptorium tools will be attached to the published handwritten document images.
1.2 WP6 Tasks and Status
WP6 consists of the following tasks and objectives [1]:
T6.1: User Needs (UIBK, ULCC. Led by UCL)
User needs were analysed for the two scenarios considered in tranScriptorium:
- Crowdsourced transcription
- Content providers (archives and libraries), and how these institutions can support scholarly and public users
The full report of these evaluations can be found in D6.1. Though this task has been completed,
feedback from users was continually sought and acted upon during the remainder of the
tranScriptorium programme, in order to ensure that the platforms developed continued to meet
the needs of their users. Please see subsequent sections for discussion of how this ongoing
feedback impacted on the development of TSX.
T6.2: The Crowdsourcing Platform (UPVLC, NCSR, UCL, INL, ULCC. Led by ULCC)
The task covers the design, development, implementation and testing of solutions for
incorporating the DIA and HTR technology into a crowdsourced transcription platform.
Initial prototypes were based around a customised version of the MediaWiki-based
‘Transcription Desk’ platform, developed for the Transcribe Bentham initiative.
Following further development, user feedback, and testing, TSX, a lightweight client
integrated into the Transkribus infrastructure, was instead developed.
Manuscript material suitable for crowdsourcing was selected during the course of Task
2.1. These images and word graphs were uploaded to TSX for a period of beta testing,
and to ensure the full functionality of the platform prior to public launch. Modifications
and improvements were made in the light of testers’ recommendations and feedback.
For public launch, a further 1,500 manuscripts from UCL’s Bentham Papers were made
available for crowdsourced transcription. They were first uploaded to the Transkribus
server, and there subjected to semi-automated document image analysis (DIA) in order
to identify baselines. Obtaining baselines is a prerequisite for accurate HTR. Word
graphs were applied to the baselined images, providing users with transcription
scenarios which incorporated HTR support. (See Section 4.2 for a full description of this
workflow).
T6.3: Crowdsourcing HTR (UCL, ULCC. Led by UCL)
TSX, the HTR crowdsourcing interface, was launched to the public in March 2015, later than
originally envisaged in the tranScriptorium proposal. This was owing to staffing changes at
ULCC during the second year of the programme, and a subsequent redevelopment of the
platform. (See D6.2.1, and D6.2.2).
The running and evaluation of crowdsourced transcription on TSX has consisted of the
following components:
- The day-to-day running of the crowdsourcing interface.
- Provision of training materials for users to explain the platform, and how to use HTR technology, if so desired.
- DIA of manuscript images and correction of automatically-generated baselines.
- Gathering feedback from users.
- Supporting users.
- Quality control of submitted transcripts.
- Publicising the project.
- Evaluation of the TSX platform more generally.
- Evaluation of the potential of incorporating HTR into a crowdsourced transcription initiative.
T6.4: HTR at Content Provider Portals (UPVLC, UIBK, NCSR, INL. Led by UIBK)
Based on the concept created in year 1 of the project (described in D6.2.2) UIBK extended the
original approach and developed a comprehensive platform (Transkribus) which meets the
needs of content providers in three main ways:
1. As foreseen in the DoW, the HTR technology can be integrated into a Content Provider
Platform with a minimum of effort: first, by exploiting the export formats offered by the
platform; and second, by using the web services for accessing all documents of the
platform via a standardized interface (a sketch of such a web-service call follows this list).
2. In addition to the original concept, Content Providers are also able to upload their own
documents to the platform and to process them via the Crowd-Sourcing interface TSX
which is described in detail in this report.
3. In order to support Content Providers in managing the processing of documents the
complete technological basis for this task was built and comprises user and document
management, the integration of HTR, DIA and KWS services into the platform and an
expert tool for managing and supervising the whole process.
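To illustrate the web-service route mentioned in point 1, the TypeScript sketch below shows how a content provider system might retrieve a document from the Transkribus server over HTTP. The base URL reflects the public Transkribus server, but the endpoint path, the session handling, and the response shape are assumptions for illustration, not a documented contract.

// Illustrative sketch: fetching a document from the Transkribus server via
// its REST interface. Endpoint path and session handling are assumptions.
const BASE_URL = "https://transkribus.eu/TrpServer/rest";

async function fetchDocument(sessionId: string, collectionId: number, docId: number): Promise<unknown> {
  // Hypothetical endpoint returning document metadata plus page/transcript references.
  const response = await fetch(`${BASE_URL}/collections/${collectionId}/${docId}/fulldoc`, {
    headers: { Cookie: `JSESSIONID=${sessionId}` },
  });
  if (!response.ok) {
    throw new Error(`Transkribus request failed: ${response.status}`);
  }
  return response.json();
}

A content provider could call such a function from its own portal back-end, then map the returned metadata onto its local catalogue records.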
T6.5: Evaluation (UCL, UIBK, ULCC. Led by UCL)
TSX and the HTR crowdsourcing were evaluated using both quantitative and qualitative metrics,
allowing for conclusions to be drawn about the potential benefits of introducing HTR
technology into a crowdsourced transcription project, and the cost-effectiveness of doing so.
The key statistics recorded to carry out this evaluation were:
1. The number of transcripts worked on by users.
2. The number of alterations made to the text of submitted transcripts before being
accepted by expert checkers.
3. The number of alterations made to the TEI mark-up of submitted transcripts before
being accepted by expert checkers.
4. The Word Error Rate of each transcript.
5. The time spent by an expert checking and accepting each transcript.
6. The time spent by the user in transcribing the manuscript.
7. User interactions with TSX.
In summary, this report evaluates the benefits and current functionalities of the tranScriptorium
HTR tools within the complementary Transkribus and TSX platforms.
Transkribus, the Content Provider Platform, was developed by UIBK, NCSR, and UPVLC. It is
intended as a tool for expert users (professional transcribers, scholars, archivists), through
which content is uploaded and exported by this user group.
TSX, the crowdsourcing platform, was developed by ULCC, UCL, UPVLC, and UIBK. It takes
advantage of the Transkribus infrastructure, allowing expert users to straightforwardly expose
their documents to non-specialist users, namely the general public. It supports users of varying
levels of transcription skills and expertise, via a simplified though still sophisticated interface,
and allows users to take advantage of HTR technology in their work. These users work with
specific precompiled collections.
The two user interfaces also differ in their nature. Transkribus requires a download and
presents the user with rich features for transcribing, annotating, tagging, and applying DIA tools
to uploaded documents in a restricted access environment. TSX, meanwhile, is an open-access
web-based client, acting as an overlay to Transkribus. It has transcription functionalities which
are open to any potential user, after registration.
Both platforms, together, cover the following functions:
a. Transcription from image
b. Initiate HTR of image (available at TD)
c. Correction of existing transcription from HTR or other transcriber
d. Interactive transcription (CATTI)
e. Suggestions from lexicon and/or LM and/or word graph
f. User management and access control
g. Uploading data (import)
h. Export and conversion to distribution formats
i. Manual DIA and line segmentation
j. Correction of DIA and line segmentation
k. Interactive and/or manual DIA and line segmentation
l. Initiate training of HTR
From the list above, TSX currently presents full functionality in categories a, c, e, and h, and
partial functionality in category d. The remaining functionalities are present in the Transkribus
administrative infrastructure.
2. Crowdsourcing HTR: TSX
The crowdsourcing platform was initially conceived of as a customized version of the
MediaWiki-based ‘Transcription Desk’ (TD) platform, which was itself developed by ULCC for
UCL’s award-winning Transcribe Bentham initiative. For a full description of the TD-based
prototypes developed for crowdsourcing with HTR, please see D6.2.1, and D6.2.2.
However, a TD-based solution ultimately proved ineffective both for implementing
the various aspects of the HTR technology and for delivering them to users. Local document
and transcription management meant that there was a significant overhead in integrating
HTR outputs into a TD-based solution. ULCC instead developed the lightweight,
fully customizable crowdsourcing platform known as TSX, which serves as an overlay to
Transkribus assets, and accesses UIBK’s Transkribus server to manage resources. As a result,
TSX is able to utilize the standard forms of metadata used across the project, whether
relating to manuscript images, transcriptions, or document- and user-management metadata.
This has the notable added advantage of significantly easing the integration of the
crowdsourcing platform with other tools, both now and in future projects and initiatives.
TSX utilises three key resources sourced from the Transkribus server.
1. The page image. This is presented to the user in a zoomable panel, using the Raphael.js
image library (http://raphaeljs.com/).
2. The transcript area. A transcript (if available) and DIA output are encoded in PAGE XML. This
is retrieved from the Transkribus server and processed so that the transcript is presented
in an editable text area (managed using the CodeMirror text editor,
https://codemirror.net). Polygon co-ordinates from the PAGE XML are used to highlight
the line currently in focus in the transcription area.
3. The wordgraph. A pre-processed wordgraph is imported for each line of the
manuscript. This provides users with a full best-hypothesis transcript, or word and/or
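To make points 1 and 2 concrete, the following TypeScript sketch shows one plausible way of turning a PAGE XML line polygon into a highlight on the page image with Raphael.js. The function names, element id, image file name and dimensions are hypothetical; only the PAGE XML structure and the Raphael calls reflect the actual formats and libraries named above.

// Sketch: parse line polygons from PAGE XML and highlight the focused line
// on the page image with Raphael.js. Names and dimensions are hypothetical.
declare const Raphael: any; // provided globally by the Raphael.js script

function parseLinePolygons(pageXml: string): string[] {
  const doc = new DOMParser().parseFromString(pageXml, "application/xml");
  // Each <TextLine> in PAGE XML carries a <Coords points="x1,y1 x2,y2 ..."> polygon.
  return Array.from(doc.getElementsByTagName("TextLine")).map(line => {
    const coords = line.getElementsByTagName("Coords")[0];
    return coords?.getAttribute("points") ?? "";
  });
}

// Convert "x1,y1 x2,y2 ..." into an SVG path string that Raphael can draw.
function pointsToPath(points: string): string {
  const pairs = points.trim().split(/\s+/);
  return "M" + pairs.map(p => p.replace(",", " ")).join("L") + "Z";
}

function highlightLine(paper: any, polygons: string[], lineIndex: number): void {
  paper.clear();
  paper.image("page.jpg", 0, 0, 1200, 1800); // hypothetical image and size
  paper.path(pointsToPath(polygons[lineIndex]))
       .attr({ stroke: "#c00", fill: "#ff0", "fill-opacity": 0.2 });
}

// Usage: const paper = Raphael("image-panel", 1200, 1800);
//        highlightLine(paper, parseLinePolygons(xmlString), 0);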
Overall, on average a transcript submitted by a user via the Transcription Desk required 3
alterations to its text, and 5 to its TEI mark-up, before it was approved by a Transcribe Bentham
staff member. It took, on average, 207 seconds (3 minutes and 27 seconds) to check and
approve a transcript. (See Figure 4.1.2).
An improved version of the Transcription Desk was introduced on 15 July 2013. The key change
to the platform was the introduction of a tabbed user interface, designed to assist users in better
understanding the working of the TEI mark-up, and thereby reduce the number of errors they
made when applying it to their transcripts. As most of the time spent checking transcripts is
expended on the TEI mark-up, it was also hoped that the quality-control process would become
more efficient as a result of there being fewer TEI mark-up errors in the transcripts.3 Owing to
the introduction of this second iteration of the Transcription Desk, it is helpful to
divide the overall recording period into two separate periods, namely i) 1 October 2012 to 14
July 2013, or Period A, in which users transcribed using the first iteration of the Transcription
Desk; and ii) 15 July 2013 to 27 June 2014, or Period B, in which users transcribed using the
second iteration of the Transcription Desk. (See Figure 4.1.2).
As can be seen in Figure 4.1.2, there are two significant differences between Period A and
Period B, largely owing to the improvements made in the second iteration of the Transcription
Desk. First, the average time in which a transcript was checked and approved was reduced from
364 seconds (6 minutes and 4 seconds) to 141 seconds (2 minutes and 21 seconds). Second, and
directly connected to this improved efficiency, was a halving of the average number of
alterations required to the TEI mark-up of each transcript. Period B represents what we might
consider the ‘state-of-the-art’ when it comes to Transcribe Bentham, and data from this period
will therefore be considered as the most relevant point of comparison with transcripts
submitted via TSX.
3 For a full description of this improved iteration of the Transcription Desk, see T. Causer and M. Terras,
‘“Many hands make light work. Many hands together make merry work”: Transcribe Bentham and
crowdsourcing manuscript collections’, in Crowdsourcing Our Cultural Heritage, ed. M. Ridge (Ashgate,
2014), pp. 57–88.
Period | Avg. words per transcript (excl. mark-up) | Avg. words per transcript (incl. mark-up) | Avg. time checking and approving a transcript (seconds) | Avg. no. of alterations to text | Avg. no. of alterations to TEI mark-up
1/10/12—27/6/14 (Overall) | 271 | 371 | 207 | 3 | 5
1/10/12—14/7/13 (Period A) | 325 | 456 | 364 | 4 | 8
15/7/13—27/6/14 (Period B) | 248 | 336 | 141 | 3 | 4
Figure 4.1.2: Volume of work carried out by users for Transcribe Bentham, 1 October 2012 to 27 June 2014, showing the overall data, and for both iterations of the Transcription Desk.
4.2. Comparison of Transcribe Bentham and TSX
As can be seen in Figure 4.2.1, the checking of transcripts submitted using TSX was more
efficient, on average, than when checking transcripts submitted using either version of the
Transcription Desk. A greater percentage of TSX transcripts (72%) took from 31 to 180 seconds
to check than in either the first iteration (20%) of the Transcription Desk, or the second (60%).
However, no TSX transcripts were checked in 30 seconds or less.
Overall, it took an average of 129 seconds (2 minutes and 9 seconds) to check a TSX transcript.
It was also slightly quicker on average to check a TSX transcript than one submitted using the
second iteration of the Transcription Desk (141 seconds, or 2 minutes and 21 seconds), when
Transcribe Bentham was at its most efficient. It is particularly noteworthy that TSX transcripts
could be checked more quickly than those submitted using the Transcription Desk, despite the
former requiring a greater number of alterations to their text, and often a greater number of
alterations to their TEI mark-up, before being approved than the latter.
The key factor in the efficiency of the quality-control process for TSX transcripts was, in the
first instance, the segmentation of the images into lines. In Transcribe Bentham, transcripts are
entered into a plain-text box and the individual transcriber, to a great extent, decides upon how
they will lay out their transcripts, with the TEI mark-up being a particular complicating factor.
Some users, for instance, add line-break tags at the end of each line, e.g.
<p>The day before yesterday arrived here the 4 <add>Newcastle</add> people viz. 1 The
millwright<lb/>
2 the Joiner, <del>3 The Heckler</del> <add>4 The Sailor</add> and with them Roebuck the
Gardener<lb/>
and his female companion. Notman<hi rend="superscript">'s</hi> four acquaintance I like
exceedingly<lb/>
Transcripts laid out in this manner are much easier to check against the original manuscript.
Other users, typically more experienced, add their TEI mark-up in-line with the text, e.g.
<p>The day before yesterday arrived here the 4 <add>Newcastle</add> people viz. 1 The
millwright<lb/> 2 the Joiner, <del>3 The Heckler</del> <add>4 The Sailor</add> and with them
Roebuck the Gardener<lb/> and his female companion. Notman<hi rend="superscript">'s</hi> four
acquaintance I like exceedingly<lb/>
These latter transcripts are rather more challenging to check quickly. In TSX, the line
segmentation ensures that the user knows precisely what to transcribe for each particular line,
and checking transcripts on a line-by-line basis is a much more straightforward task for a
project Administrator. Another factor which contributes to the more efficient checking of TSX
transcripts is that they were, on average, shorter in length than those submitted using the
Transcription Desk. (See Figure 4.2.1 for comparison). TSX transcripts were an average of 204
words in length (not including the TEI mark-up), and 225 in length (including the TEI mark-up),
and this will, in part, have contributed to the more efficient checking time.
Platform | No. of transcripts | Avg. time spent checking a transcript | Avg. no. of changes to text | Avg. no. of changes to mark-up
Transcription Desk, 1/10/12—27/6/14 (Overall) | 4,364 | 207 seconds | 3 | 5
Transcription Desk, 1/10/12—14/7/13 (Period A) | 1,288 | 364 seconds | 4 | 8
Transcription Desk, 15/7/13—27/6/14 (Period B) | 3,076 | 141 seconds | 3 | 4
TSX | 101 | 129 seconds | 6 | 7
Fig. 4.2.1: Outline comparison of the quality of transcripts submitted via the Transcribe Bentham Transcription Desk (overall, and during Period B), with those submitted via TSX
Figure 4.2.2: Time spent checking transcripts submitted using a) the first iteration of the Transcription Desk, 1 Oct 2012 to 14 July 2013 (blue); b) the second iteration of the Transcription Desk, 15 July 2013 to 27 June 2014 (red); and c) TSX
[Bar chart: x-axis, checking time in seconds (bands from 1–30 up to 2,501–4,000); y-axis, percentage of transcripts; series: 1st iteration, 2nd iteration, TSX.]
One measure of the quality of user transcripts is the number of alterations made by the
Administrator; the fewer the alterations made, the greater the quality of the transcripts. By this
metric, at first glance TSX transcripts do not compare all that favourably with transcripts
submitted using the Transcription Desk. (See Figure 4.2.1). Overall, TSX transcripts required 6
alterations to their text before being approved by an Administrator. Though this is an excellent
standard, the data regarding the quality of TSX transcripts may be slightly distorted by the
presence of 10 transcripts which required from 17 to 42 alterations each to their text, typically
where the user had failed to transcribe a portion of the manuscript (most commonly
pencil marginalia). If these ten transcripts are excised from the data, then the average number
of alterations required to the text of a TSX transcript drops to 3.
Taking the number of errors per thousand words, TSX transcripts seem to be of significantly
lesser quality than those submitted using the Transcription Desk (though the fact that users
made only 30 errors per thousand words still indicates that TSX transcripts are of a very high
quality). However, removing the ten TSX transcripts requiring from 17 to 42 alterations to their
text causes the error rate to drop to 15 errors per thousand words. This finding highlights not
only the distortion to the data caused by these ten transcripts, and the need to gather more
data, but also the fact that TSX transcripts are of a comparable standard to those submitted
using the Transcription Desk.
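For illustration, the TSX error rates are consistent with dividing the average number of alterations by the average TSX transcript length of 204 words, excluding mark-up (see Section 4.2); the formula below is implied by, rather than stated in, the figures:

\[
\frac{6}{204} \times 1000 \approx 29.4 \quad\text{and}\quad \frac{3}{204} \times 1000 \approx 14.7,
\]

matching the reported rates of 30 and 15 errors per thousand words respectively.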
Platform | Errors in text (per thousand words) | Errors in text plus TEI mark-up (per thousand words)
Transcription Desk (overall) | 11 | 13
Transcription Desk (1st version) | 13 | 18
Transcription Desk (2nd version) | 10 | 10
TSX | 30 | 30
Figure 4.2.3: Errors per thousand words, comparing transcripts submitted using the Transcription Desk and TSX
Figure 4.2.4: Changes made to the text of transcripts, prior to approval, submitted using: a) the first iteration of the Transcription Desk, 1
October 2012 to 14 July 2013 (blue); b) the second iteration of the Transcription Desk, 15 July 2013 to 27 June 2014 (red); c) TSX (green)
TSX transcripts also required an average of 7 alterations to their TEI mark-up before being
approved. The most common errors were in users failing to add structural mark-up such as
headings or paragraph tags, or placing them incorrectly. It would appear that users assume,
thanks to the segmentation of the manuscript images into lines, that such mark-up is
superfluous. Users did, however, add TEI mark-up to indicate features such as deleted or
underlined text, though interlineations found on their own lines often did not have addition tags
applied to them.
The tranScriptorium consortium has concluded that it is undesirable for TSX users to add TEI
mark-up to transcripts (which are stored on the Transkribus server). The final version of TSX
will therefore incorporate a What-You-See-Is-What-You-Get (WYSIWYG) interface, where the
mark-up is hidden from view. Issues surrounding the time spent checking the TEI mark-up may
shortly be rendered academic. Moreover, if the project Administrator does not have to check
TEI mark-up for accuracy, they can then concentrate on ensuring that the text is accurate and
the efficiency of the quality-control process as a whole will be further increased. Streamlining
the quality-control process is a particularly important consideration if others are to be
convinced of the practicalities of utilizing Transkribus and TSX for crowdsourcing. Checking
the text of a transcript is task enough for project administrators to deal with, and the final
version of TSX, and Transkribus, will meet these needs.
Figure 4.2.5: the final version of TSX (under development), with a WYSIWYG interface. Note the deletion of ‘prophet’ at the end of line nineteen, highlighted in the manuscript image.
4.3. TSX statistics, and user interactions
Data derived from Google Analytics shows that from 20 March to 6 December 2015, there were
4,228 active sessions on TSX, by 3,451 individual users. TSX has, therefore, attracted a great deal
of attention.4 This attention has not, however, been converted into a great number of registered
users, as only 74 individuals appear to have signed up to TSX. It is, moreover, difficult to tell
exactly how many users have registered with TSX, as registering with Transkribus also
automatically registers a user with TSX. There have also been technical issues with TSX which
may have frustrated some users. The Google Analytics report also reveals that 71% of all users
accessing TSX have done so using Mac OS, with which TSX is currently incompatible.
TSX has been accessed from 98 countries around the world. The top ten countries from which
TSX was accessed were as follows:
Country from which TSX was accessed | Percentage of overall active sessions (4,228)
United States | 28.3
Unknown location | 19.7
United Kingdom | 12.2
China | 4.2
Spain | 3.2
Japan | 2.8
Austria | 2.7
Russia | 2.7
Germany | 2.5
South Korea | 2.0
Figure 4.3.1: Top ten countries from which TSX was accessed, showing the percentage of overall active sessions
To record user interactions with TSX, ULCC produced a script which records data for each user
session and sends it to a .CSV file. The key pieces of data in question are:
4 For example, see M. Ridge, ‘How an ecosystem of machine learning and crowdsourcing could help you’,
http://www.openobjects.org.uk/2015/08/ecosystem-machine-learning-crowdsourcing/, last accessed 9 December
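As an indication of the shape of such a script, a minimal Node.js sketch that appends one CSV row per session is given below. The field names are hypothetical placeholders rather than the fields actually recorded by ULCC.

// Sketch of a per-session logger that appends one CSV row per user session.
// Field names are hypothetical placeholders, not ULCC's actual schema.
import { appendFileSync, existsSync, writeFileSync } from "fs";

interface SessionRecord {
  sessionId: string;
  userId: string;
  pageId: string;
  secondsSpent: number;
  htrSuggestionsUsed: boolean;
}

const CSV_PATH = "tsx-sessions.csv";
const HEADER = "sessionId,userId,pageId,secondsSpent,htrSuggestionsUsed\n";

export function logSession(record: SessionRecord): void {
  // Write the header once, when the file is first created.
  if (!existsSync(CSV_PATH)) {
    writeFileSync(CSV_PATH, HEADER);
  }
  const row = [
    record.sessionId,
    record.userId,
    record.pageId,
    record.secondsSpent,
    record.htrSuggestionsUsed,
  ].join(",");
  appendFileSync(CSV_PATH, row + "\n");
}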
bodies—ever governed by budgets and bottom lines—are to be persuaded to support such
(potentially) valuable initiatives.
UCL has analysed the cost-efficiency of Transcribe Bentham in great detail. Any such analysis
must first take into account the £600,000 or so invested in Transcribe Bentham by the Arts and
Humanities Research Council and the Andrew W. Mellon Foundation.
About £192,000 of this money was spent on digitising the Bentham Papers, and about £80,000
on software development. The remainder was spent on storage, equipment, and academic
salaries. So, while establishing and developing Transcribe Bentham did not come cheaply, the
investment is likely to pay off in the long term, as will be subsequently discussed. Moreover,
institutions wishing to crowdsource transcription of their own material can now take advantage
of the freely-accessible code for the Transcription Desk, a tried-and-tested platform for
collaborative transcription. This could help to significantly mitigate start-up costs, although
implementation and customisation of the Transcription Desk would require a certain level of
resources. Utilising Transkribus and TSX to launch and manage a crowdsourcing project, hosted
on the Transkribus infrastructure, would also negate the need to pay for installing and
running a local solution, further reducing the costs of such a programme.
Transcribe Bentham, and crowdsourced transcription more generally, can offer significant cost-
avoidance potential. This cost avoidance can best be seen when comparing the cost of
researchers transcribing manuscripts, against the cost of researchers checking volunteer-
submitted transcripts. It is estimated that around 100,000 page transcripts will be required
before the UCL and British Library Bentham Papers are fully transcribed. If a Senior Research
Associate (UCL Grade 8, national spine point 38) were employed to transcribe the estimated
61,110 manuscript pages requiring transcription as of 30 September 2014, this would cost a
minimum of £1,121,063, including on-costs (that is, National Insurance and superannuation
contributions, and so the total cost of employing a Senior Research Associate). This is on the
assumption that it would take an average of 45 minutes to transcribe a manuscript, and at an
average cost of £18.35 per transcript. It also assumes that a funding body or bodies would be
willing to provide money purely to fund transcription for a number of years which is, to say the
least, a forlorn hope.
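As a rough consistency check on these figures:

\[
61{,}110 \ \text{pages} \times \pounds 18.35 \ \text{per page} \approx \pounds 1{,}121{,}368,
\]

in line with the quoted minimum of £1,121,063; £18.35 for 45 minutes of work likewise implies a fully-loaded staff cost of roughly £24.50 per hour.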
As noted in Figure 4.1.2, by the end of Period B it took an average of 141 seconds to
check and approve a transcript. This works out at around £0.97 of a Senior Research Associate’s
time, including on-costs. If the checking task were delegated to a Transcription Assistant (UCL
Grade 5 Professional Services staff, national spine-point 15), then the cost of checking the
average Period B transcript would be approximately £0.52, including on-costs.5 If hourly-paid
graduate students (UCL Grade 4, Professional Services staff, national spine point 11)6 were
given the task, then the average Period B transcript could be checked for about £0.44. These
calculations do, of course, assume that the people at each of these grades have appropriate
levels of experience and expertise, and that it would take them the same amount of time to
check the average transcript. These are, then, ‘best case’ scenarios, as it may be that an hourly-
paid graduate student might take a little longer to check a transcript than either a Transcription
Assistant or a Senior Research Associate.
As a TSX transcript can be checked more quickly than one submitted using the Transcription
Desk (129 seconds for the former, 141 seconds for the latter), the average cost of
checking a TSX transcript is slightly lower. Checking a TSX transcript would take £0.88 of a
Senior Research Associate’s time (including on-costs), £0.47 of a Transcription Assistant’s time
(including on-costs), and £0.40 of an hourly-paid graduate student’s time.
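These figures are, in effect, the Period B costs scaled by the ratio of average checking times; for example, for a Senior Research Associate:

\[
\pounds 0.97 \times \frac{129}{141} \approx \pounds 0.89,
\]

in line with the £0.88 quoted above (the small difference is rounding), and the Transcription Assistant and graduate-student figures scale the same way.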
If we make the assumption that the 61,110 manuscript pages requiring transcription were
transcribed by users through TSX, and were then checked by staff at the three levels, then the
cost-avoidance potential is also slightly greater than that offered by Transcribe Bentham.
However, all of these calculations assume that the staff checking the transcripts also check the
TEI mark-up; the elimination of this task from the quality-control process will further reduce
the average checking time per transcript, and translate into further cost-avoidance.
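Using only the figures quoted above, and assuming Senior Research Associate rates for both transcription and checking, the scale of this cost avoidance is roughly:

\[
61{,}110 \times (\pounds 18.35 - \pounds 0.88) \approx \pounds 1{,}067{,}600.
\]

This is an order-of-magnitude sketch only; Figure 4.5.3 gives the precise figures.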
It should be noted that the above discussion, and the figures in this section, concerning the
average cost of checking a transcript and the overall cost-avoidance potential of Transcribe
Bentham, do not take into account the management of users, the maintenance of the
Transcription Desk, publicity, the updating of project statistics, or the generation of TEI XML
versions of the transcripts (a manual process in Transcribe Bentham). A number of these
processes will become automated using the Transkribus and TSX infrastructure, such as the
facility to automatically export TEI XML versions of transcripts from Transkribus.
Transcripts checked by | Avg. cost of checking a transcript (Transcribe Bentham) | Avg. cost of checking a transcript (TSX)
Senior Research Associate | £0.97 | £0.88
Transcription Assistant | £0.52 | £0.47
Hourly-paid graduate student | £0.44 | £0.40
Figure 4.5.1: average cost of checking transcripts submitted using the Transcription Desk and TSX, when checked by three grades of staff
5 A Transcription Assistant would, typically, be a graduate student.
6 On-costs are not applicable to hourly-paid staff.