Interactive Handwritten Text Recognition and Indexing of Historical Documents: the tranScriptorium Project

Alejandro H. Toselli
[email protected]
Pattern Recognition and Human Language Technology Research Center
Universitat Politècnica de València, Spain
June 2015

Index
1 Handwritten Text Recognition (HTR) and Indexing
2 The tranScriptorium Project
3 Selected Handwriting Datasets
4 Interactive HTR: Transcription Demonstration
5 HTR and Interactive-Predictive HTR Results
6 Handwritten Text Images Indexing: Search Demonstration
7 Handwritten Text Images Indexing and Search Results
8 Conclusion

A.H. Toselli – PRHLT/UPV
The tranScriptorium Project
http://www.transcriptorium.eu
• STREP of the FP7 in the ICT for Learning and Access to Cultural Resources challenge (1 January 2013 to 31 December 2015)
• tranScriptorium aims to develop innovative, efficient and cost-effective solutions for the indexing, search and full transcription of historical handwritten document images, using modern, holistic Handwritten Text Recognition technology
1. Enhancing HTR technology for efficient transcription
2. Bringing the HTR technology to users
3. Integrating the HTR results in public web portals
[Logos: supported by the EU; Cultural Heritage partners]
Interactive HTR: CATTI operation example
x: [handwritten text line image]

STEP-0  p:
        s ≡ w:  antiguas cuidadelas que en el Castillo sus llamadas
        p′:     antiguas cuidadelas que en el Castillo sus llamadas

STEP-1  κ:      antiguos cuidadelas que en el Castillo sus llamadas
        p:      antiguos ciudadanos que en el Castillo sus llamadas
        s:      antiguos ciudadanos que en el Castillo sus llamadas
        p′:     antiguos ciudadanos que en el Castillo sus llamadas

STEP-2  κ:      antiguos ciudadanos que en Castilla sus llamadas
        p:      antiguos ciudadanos que en Castilla se llamaban
        s:      antiguos ciudadanos que en Castilla se llamaban
        p′:     antiguos ciudadanos que en Castilla se llamaban

FINAL   κ:      antiguos ciudadanos que en Castilla se llamaban #
        p ≡ T:  antiguos ciudadanos que en Castilla se llamaban
Post-editing WER: 6/7 (86%)
Interactive WSR: 2/7 (29%, assuming a whole-word correction in step-1)
Estimated effort reduction: 1 − 29/86 (66%)
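The three figures above can be checked with a short script. This is a minimal sketch, not the project's evaluation tool: the function name `edit_distance` is made up for illustration, the WER is taken as the word-level edit distance between the first automatic hypothesis and the final transcript, and the WSR uses the two user corrections from the example. Note that the slide's 66% comes from dividing the rounded percentages (29/86); the unrounded ratio is 1 − 2/6, about 67%.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,              # drop the hypothesis word h
                            curr[j - 1] + 1,          # insert the reference word r
                            prev[j - 1] + (h != r)))  # match or substitute
        prev = curr
    return prev[-1]


first_hypothesis = "antiguas cuidadelas que en el Castillo sus llamadas".split()
reference = "antiguos ciudadanos que en Castilla se llamaban".split()

# Post-editing effort: every error of the first hypothesis is fixed by hand.
wer = edit_distance(first_hypothesis, reference) / len(reference)
# Interactive effort: two whole-word corrections (steps 1 and 2 of the example).
wsr = 2 / len(reference)
effort_reduction = 1 - wsr / wer

print(f"WER {wer:.0%}, WSR {wsr:.0%}, effort reduction {effort_reduction:.0%}")
```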
Interactive HTR: Transcription Demonstration
• It is just a “demo” → not intended for real operation (other systems do that)
• Everything is real. No tricks to make the demo look better than it really is
• Web client-server architecture: Web browser front-end, back-end server providing off-line HTR-CATTI
• Off-line HTR-CATTI decoder based on word graphs
• Three tasks:
– BENTHAM: 10K words open vocabulary
– AUSTEN: 78K words external, open vocabulary from Bentham texts
20K words external, open vocabulary from Austen texts
HTR and Interactive-Predictive HTR
HTR current state-of-the-art:
• Segmentation-free approach: no explicit segmentation of text images into words or characters is required
• The basic input unit is a handwritten text line image
• Statistical modeling at different perception levels:
– Optical (character shape), using Hidden Markov Models (HMMs)
– Lexical, by means of finite-state character representation of words
– Syntactical, based on statistical language models, such as N-grams
Interactive-predictive framework: rather than full transcription automation, the system assists the human transcriber
• Combines HTR efficiency with the accuracy of human experts, leading to cost-effective perfect transcripts
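The interactive-predictive loop can be illustrated with a toy simulation. This is only a sketch under strong assumptions: the demonstrator decodes on word graphs, whereas here a tiny hand-made n-best list stands in for the word graph, and its scores, the function names `best_suffix` and `simulate_session`, and the simulated user (who always fixes the first wrong word) are all invented for illustration.

```python
def best_suffix(prefix, nbest):
    """Return the continuation of the best-scoring hypothesis that starts with prefix."""
    consistent = [(score, words) for words, score in nbest
                  if words[:len(prefix)] == prefix]
    if not consistent:
        return []  # nothing compatible with the validated prefix
    _, words = max(consistent, key=lambda item: item[0])
    return words[len(prefix):]


def simulate_session(reference, nbest):
    """Count the corrections a user needs to reach the reference transcript."""
    prefix, corrections = [], 0
    while True:
        hypothesis = prefix + best_suffix(prefix, nbest)
        if hypothesis == reference:
            return corrections
        # Find the first position where the hypothesis disagrees with the reference.
        i = 0
        while (i < len(hypothesis) and i < len(reference)
               and hypothesis[i] == reference[i]):
            i += 1
        if i >= len(reference):
            return corrections + 1  # only extra trailing words remain to delete
        prefix = reference[:i + 1]  # user validates up to i and corrects word i
        corrections += 1


# Hypothetical n-best list (scores are made up for the example).
NBEST = [
    ("antiguas cuidadelas que en el Castillo sus llamadas".split(), 0.5),
    ("antiguos ciudadanos que en el Castillo sus llamadas".split(), 0.3),
    ("antiguos ciudadanos que en Castilla se llamaban".split(), 0.2),
]
REFERENCE = "antiguos ciudadanos que en Castilla se llamaban".split()

print(simulate_session(REFERENCE, NBEST))
```

With this made-up list the simulated user needs two corrections, which is the same count as in the CATTI operation example; a real word graph encodes far more alternatives than an n-best list of three.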
Handwritten Text Images Indexing and Search
• There are massive text image collections out there, but their textual content remains practically inaccessible
• If perfect or sufficiently accurate text image transcripts were available, image textual content could be straightforwardly indexed for plaintext textual access
• But fully automatic transcription results lack the level of accuracy needed for useful text indexing and search purposes
• And manual or even interactive-predictive assisted transcription is prohibitively expensive for massive image collections
• Good news: indexing and search can be directly implemented on the images themselves, without explicitly resorting to any image transcripts, as we will see now
Handwritten Text Images Indexing and Search: Demonstration
• It is just a “demo” → not (yet) intended for real operation. But everything is real: no tricks to make the demo look better than it really is
• Line-level indexing according to the precision-recall trade-off model: rather than exact searching, search is carried out with a confidence threshold, specified by the user as part of the query in order to meet the required precision-recall trade-off
• Word confidence scores are based on pixel-level probabilities and computed for line-shaped regions. Spotted word positions are marked only approximately
• Two tasks:
– AUSTEN: Trained on Austen (50p), 20K words open vocabulary. Demo on the whole “Juvenile volume The Third” (128 pages)
– PLANTAS: Trained on Plantas (224p), 21K words open vocabulary. Demo on Volume I (about 1 000 pages)
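A confidence-thresholded query over a line-level index can be sketched as follows. This is a minimal illustration, not the demonstrator's code: the index layout (word mapped to a list of line identifiers with confidences), the function names `build_index` and `search`, and the sample entries are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(spots):
    """Group (word, line_id, confidence) triples by word, best confidence first."""
    index = defaultdict(list)
    for word, line_id, confidence in spots:
        index[word.lower()].append((line_id, confidence))
    for entries in index.values():
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return index


def search(index, word, threshold):
    """Return the lines whose confidence for word reaches the user's threshold."""
    return [(line_id, confidence)
            for line_id, confidence in index.get(word.lower(), [])
            if confidence >= threshold]


# Hypothetical index entries: (word, line identifier, confidence in [0, 1]).
SPOTS = [
    ("matter", "page012-line03", 0.92),
    ("matter", "page040-line11", 0.35),
    ("matter", "page077-line01", 0.08),
    ("letter", "page012-line04", 0.61),
]
INDEX = build_index(SPOTS)

print(search(INDEX, "matter", 0.5))   # high threshold: fewer, more reliable hits
print(search(INDEX, "matter", 0.05))  # low threshold: more hits, more false alarms
```

Raising the threshold favours precision and lowering it favours recall, which is the trade-off the user controls as part of the query.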
Indexing and Search for Handwritten Text Images: Pixel-level Posteriorgram
[Figure: text image X and its pixel-level posteriorgram P]
Pixel-level posterior probabilities P for a text image X and word v = “matter”.
An accurate, contextual (n-gram based) word classifier was used to compute P. This helped to achieve very low posteriors in a region of X around (i=100, j=200), where a very similar word, “matters”, is written.
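One simple way to turn a pixel-level posteriorgram into a confidence score for a line-shaped region is to take the highest posterior found inside the region. The sketch below assumes that choice; the slides do not say which aggregation is used, and the function name `line_confidence`, the row-range representation of a line and the toy values are all made up for illustration.

```python
def line_confidence(posteriorgram, top, bottom):
    """Maximum pixel posterior over image rows top..bottom-1 (assumed aggregation)."""
    return max((p for row in posteriorgram[top:bottom] for p in row), default=0.0)


# Toy posteriorgram for one query word: 4 image rows by 5 columns.
P = [
    [0.01, 0.02, 0.01, 0.00, 0.00],
    [0.02, 0.70, 0.85, 0.10, 0.00],  # the word is likely written here
    [0.00, 0.03, 0.02, 0.01, 0.00],
    [0.00, 0.01, 0.04, 0.02, 0.01],  # a similar word, kept low by the classifier
]

print(line_confidence(P, 0, 2))  # region covering the first two rows
print(line_confidence(P, 2, 4))  # region covering the last two rows
```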
Results on tranScriptorium Data Sets
Average Precision (AP), Mean Average Precision (MAP) and Recall-Precision curves
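For reference, these measures can be computed from ranked search results as sketched below. This is a generic illustration with invented data, not the project's evaluation script: the function names are hypothetical, AP is taken as the mean of the precision values at the ranks of the relevant items, and MAP as the mean of the per-query APs.

```python
def average_precision(ranked_relevance, n_relevant):
    """AP of one ranked result list; n_relevant counts all relevant items."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this recall point
    return total / n_relevant if n_relevant else 0.0


def mean_average_precision(queries):
    """Mean of per-query APs; queries is a list of (ranked_relevance, n_relevant)."""
    scores = [average_precision(ranking, n) for ranking, n in queries]
    return sum(scores) / len(scores)


# Invented results: True marks a line that really contains the query word,
# listed in decreasing order of confidence.
QUERIES = [
    ([True, False, True, True, False], 3),
    ([False, True], 1),
]

print(average_precision(*QUERIES[0]))
print(mean_average_precision(QUERIES))
```

Sweeping the confidence threshold over the same ranked lists and recording precision and recall at each cut-off gives the points of a Recall-Precision curve.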
Conclusions
• Automatic or assisted handwritten text transcription and fully automatic indexing are now becoming perfectly feasible
• Models trained for a given collection can provide quite useful performance on images from other similar collections, without the need for (re-)training
• Several demonstrators have been implemented and made publicly available for first-hand experience in real use; see: http://transcriptorium.eu/demonstrations