IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. IMPACT Interoperability and Evaluation Framework Clemens Neudecker, National Library of the Netherlands IMPACT Demo Day, British Library 12/11/11
BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework
Slides from Clemens Neudecker's presentation on the IMPACT Interoperability and Evaluation Framework within the IMPACT project at the British Library Demo-day on the 12th July 2011.
Transcript
OCR: A multitude of challenges…
I. OCR challenges (gothic fonts, bleed-through, warping, etc.)
OCR: A multitude of challenges…
II. Language challenges (spelling variants, inflection, and many more!)
Example: historical variants of the Dutch word ‘wereld’ (world):
- Java 6
- Generic Web Service Wrapper
- Apache Ant/Maven
- Apache Tomcat/httpd
- Apache Axis2
- Apache Synapse
- Taverna Workflow Engine
IMPACT Evaluation Framework: Dataset
- approx. 5 TB raw data (images, text files, metadata) and growing
- Ground truth transcriptions
- Evaluation modules
Components I: IIF Enterprise Service Bus
Receives (SOAP) requests from users and distributes the load to the available worker nodes
Main effects: process parallelization, load distribution, fail-over
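The load distribution and fail-over behaviour described on this slide can be illustrated with a minimal sketch. It is not the IIF or Apache Synapse implementation: the `RoundRobinDispatcher` class, the worker functions and the use of `ConnectionError` to signal a dead node are all assumptions made for illustration, and parallel execution is left out.

```python
from typing import Callable, List

Worker = Callable[[str], str]


class NoWorkerAvailable(RuntimeError):
    """Raised when every worker node failed to handle a request."""


class RoundRobinDispatcher:
    """Hypothetical dispatcher: rotates over worker nodes and skips failing ones."""

    def __init__(self, workers: List[Worker]) -> None:
        if not workers:
            raise ValueError("at least one worker is required")
        self._workers = workers
        self._next = 0  # index of the worker that gets the next request

    def dispatch(self, request: str) -> str:
        count = len(self._workers)
        for offset in range(count):
            index = (self._next + offset) % count
            try:
                result = self._workers[index](request)
            except ConnectionError:
                continue  # fail over: try the next worker node
            self._next = (index + 1) % count  # load distribution: rotate
            return result
        raise NoWorkerAvailable(request)


# Illustrative worker nodes (assumptions, not real IMPACT services).
def worker_a(request: str) -> str:
    return f"A:{request}"


def broken_worker(request: str) -> str:
    raise ConnectionError("node is down")


def worker_b(request: str) -> str:
    return f"B:{request}"


dispatcher = RoundRobinDispatcher([worker_a, broken_worker, worker_b])
results = [dispatcher.dispatch(f"page-{i}") for i in range(4)]
print(results)
```

Requests alternate between the two healthy nodes, and the broken node is skipped each time its turn comes up.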
Framework integration
Easy-to-use generic command line wrapper (open source)
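The idea of a generic command line wrapper can be sketched roughly as follows. This is not the IMPACT wrapper itself, which exposes tools as web services. It only shows the assumed core step of filling a command template with parameters and running the tool. The functions `build_command` and `run_tool` and the template syntax are hypothetical.

```python
import shlex
import subprocess
import sys
from typing import Dict, List


def build_command(template: str, params: Dict[str, str]) -> List[str]:
    """Split the template into tokens first, then fill in the placeholders.

    Substituting after splitting keeps a value with spaces (e.g. a file
    name) as a single argument.
    """
    return [token.format(**params) for token in shlex.split(template)]


def run_tool(template: str, params: Dict[str, str], timeout: float = 60.0) -> str:
    """Run the wrapped tool and return its standard output."""
    completed = subprocess.run(
        build_command(template, params),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
        timeout=timeout,
    )
    return completed.stdout


# Illustrative "tool": the Python interpreter upper-casing its argument.
output = run_tool(
    "{python} -c {script} {input}",
    {
        "python": sys.executable,
        "script": "import sys; print(sys.argv[1].upper())",
        "input": "page one",
    },
)
print(output.strip())
```

In such a design, a new tool needs only a template and a parameter list rather than new code.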
Workflow development
- OCR workflow = data pipeline
- Building blocks = processing steps (nodes)
- Integration = interaction between nodes (mashup)
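The view of an OCR workflow as a data pipeline built from processing nodes can be sketched in a few lines. The node names (`binarise`, `recognise`, `post_correct`) and the dictionary used as the document are placeholders chosen for illustration. They do not correspond to actual IMPACT tools or Taverna constructs.

```python
from functools import reduce
from typing import Callable, Dict, List

Document = Dict[str, object]
Node = Callable[[Document], Document]


def binarise(doc: Document) -> Document:
    # Placeholder for an image binarisation step.
    return {**doc, "binarised": True}


def recognise(doc: Document) -> Document:
    # Placeholder for an OCR engine; it depends on the previous node's output.
    if not doc.get("binarised"):
        raise ValueError("recognise expects a binarised image")
    return {**doc, "text": "  Wereld "}


def post_correct(doc: Document) -> Document:
    # Placeholder for language-based post-correction.
    return {**doc, "text": str(doc["text"]).strip().lower()}


def run_pipeline(nodes: List[Node], doc: Document) -> Document:
    """Feed the output of each node into the next one."""
    return reduce(lambda current, node: node(current), nodes, doc)


result = run_pipeline([binarise, recognise, post_correct], {"image": "page-0001.tif"})
print(result["text"])
```

Because every node takes and returns the same document type, nodes can be reordered or swapped without changing the pipeline runner.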
Workflow management
- Web 2.0 style registry: myExperiment
- Local client: Taverna Workbench
- Web client: project website
- API: SOAP/REST
Community
- Web 2.0 style workflow registry
- Community of experts
- Sharing of resources
- Knowledge exchange
- A central meeting point for users and researchers
Components II: Dataset
Database and front end, hosted at the PRIMA research group at the University of Salford, School of Computing, United Kingdom
- more than 500,000 images from digital libraries
- more than 50,000 ground truth representations
- up to 10,000 direct access calls per month
- 4 TB of space and growing
Dataset
Access to a representative and annotated dataset of significant size, with metadata, ground truth and search facilities
Evaluation features
- Text-based comparison of result with ground truth, using the Levenshtein distance method
- Layout-based comparison of result with ground truth, using the Page Analysis and Ground Truth Elements (PAGE) Framework
Example: [figure]
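The text-based comparison can be illustrated with a standard dynamic-programming Levenshtein distance. The sketch below is not the IMPACT evaluation module. The `character_accuracy` helper in particular uses one common normalisation (one minus edit distance divided by ground truth length), which is an assumption rather than the project's documented metric.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion from a
                    current[j - 1] + 1,      # insertion into a
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Assumed normalisation: 1 - distance / ground truth length, floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    distance = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - distance / len(ground_truth))


# A historical spelling treated as an OCR result against the modern form.
print(levenshtein("werelt", "wereld"))
print(round(character_accuracy("werelt", "wereld"), 3))
```

Keeping only the previous row of the distance matrix limits memory use to the length of the shorter dimension, which helps for page-length texts.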
The PAGE Format Framework
Two-level architecture:
– root structure
– task-specific sub-formats
Separate XML Schema definitions
Format identification via namespaces
Mapping of
– dependencies
– process chains
– alternative processing steps
Linking via IDs
Processing results or ground truth (e.g. binarisation, dewarping, page content)
[Diagram: PAGE root (XML) and three PAGE gts (XML) files]
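Identifying a format by its XML namespace can be sketched with the Python standard library. The namespace URIs, element names and the `KNOWN_FORMATS` table below are invented for illustration. They are not the real PAGE schema namespaces.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical namespace URIs; the real PAGE schemas define their own.
KNOWN_FORMATS = {
    "http://example.org/page/root": "root structure",
    "http://example.org/page/gts": "ground truth",
}


def namespace_of(xml_text: str) -> Optional[str]:
    """Return the namespace URI of the document's root element, if any."""
    tag = ET.fromstring(xml_text).tag  # ElementTree writes it as "{uri}local"
    if tag.startswith("{"):
        return tag[1:tag.index("}")]
    return None


def identify_format(xml_text: str) -> str:
    """Map the root namespace to a format name, or 'unknown'."""
    return KNOWN_FORMATS.get(namespace_of(xml_text) or "", "unknown")


gts_document = '<Document xmlns="http://example.org/page/gts"><Page id="p1"/></Document>'
print(identify_format(gts_document))
```

A dispatcher built this way can choose the matching sub-format schema before validating or processing a file.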
Ground-Truthing Tools
- Aletheia
- FineReader
- PAGE Exporter
- GT Validator
- GT Normalizer
Profile ‘Full Text Recognition’
Evaluation for general text recognition
Measure weights: Merge, Allowable Merge, Split, Allowable Split, Miss, Partial Miss, Misclassification, False Detection
Region type weights: Text, Image, Graphic, Chart, Table, Separator, Maths, Noise
Weight values, in the order they appear on the slide: 1.5, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5
Measures – Segmentation Errors
[Figure: ground truth compared with segmentation result, illustrating Merge, Split, Miss, Partial Miss and Misclassification errors; region labels: Paragraph, Caption]
OCR Accuracy