Optical Layout Recognition (OLR) From unstructured to structured newspaper data Claus Gravenhorst, CCS Content Conversion Specialists GmbH ENP information day, Paris, November 27, 2014
Page 1: Presentation of Claus Gravenhorst, BnF Information Day

Optical Layout Recognition (OLR)

From unstructured to structured newspaper data

Claus Gravenhorst, CCS Content Conversion Specialists GmbH

ENP information day, Paris, November 27, 2014

Page 2: Presentation of Claus Gravenhorst, BnF Information Day

Agenda

• About CCS

• General OLR-workflow for mass digitization

• Layout and structure analysis

• ENP OLR workflow

• Quality assurance

• Output – METS/ALTO package

• Use of structural data – Access and presentation

Page 3: Presentation of Claus Gravenhorst, BnF Information Day

About CCS

• CCS Content Conversion Specialists GmbH (Hamburg), as technical project partner, will provide its expertise and docWorks technology to set up and operate a mass digitization workflow that creates high-quality structured content from 2 million scanned newspaper pages provided by 5 library partners

• Page volume:

BnF = 1,000 k, NLE = 500 k, SUB HH = 480 k, NLF = 90 k, SBB = 10 k
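
As a quick arithmetic check, the per-partner volumes can be summed and compared with the stated total of about 2 million pages. The sketch below only restates the figures from the slide (in thousands of pages); the variable names are illustrative and not part of the original presentation.

```python
# Page volumes per library partner, in thousands of pages (figures from the slide).
page_volume_k = {
    "BnF": 1000,
    "NLE": 500,
    "SUB HH": 480,
    "NLF": 90,
    "SBB": 10,
}

# Sum of all partner volumes, still in thousands of pages.
total_k = sum(page_volume_k.values())
print(f"Total: {total_k} k pages")  # Total: 2080 k pages

# Share of the total contributed by each partner, in percent.
shares = {name: round(100 * k / total_k, 1) for name, k in page_volume_k.items()}
print(shares)
```

The sum is 2,080 k pages, which is consistent with the rounded figure of 2 million pages.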

• The distributed OLR workflow enables the contribution of project partners (content providers) to the integrated quality assurance process

• CCS is also contributing to the specification of the ENMAP metadata model

Page 4: Presentation of Claus Gravenhorst, BnF Information Day

General workflow for mass digitization

[Workflow diagram; only its labels survive in this transcript: Scanning (robot, book, document and microfilm scanners); Image; Imaging; Conversion; Layout Analysis; OCR; ISR; QA + Correction; Automated QA; Manual QA (in-house, near-shore, off-shore, multiple locations); Manual QA (in-house, near-shore); Delivery QA (random); Reject Condition; Re-Scan; Final Output; Metadata; Database; Repository; Z39.50 Metadata; Item Tracking (Document UID, Barcode); Check in; Check out]

Page 5: Presentation of Claus Gravenhorst, BnF Information Day

Layout and structure analysis

• Layout analysis based on a "bottom-up" approach

• General rule system enables recognition of words, text lines, text blocks, columns and classification of text blocks, illustrations, advertisements, tables and the following page types:

- title page (the title page of an issue)

- content page (a page that consists of content/text only)

- illustration page (a page that has at least one illustration)

- advertisement page (a page that contains adverts only)

• Structure analysis through classification of headlines and grouping of zones into articles (incl. article continuation)
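
The bottom-up idea and the page-type definitions on this slide can be illustrated with a minimal sketch. It is not the docWorks rule system: the `Word` class, the `y_tolerance` threshold, the zone-type strings and the precedence of the page-type rules are all assumptions made for illustration. Pages that match none of the four definitions (for example text mixed with adverts) are left as "unclassified" because the slide does not say how they are handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical word box: text plus pixel coordinates of its top-left corner."""
    text: str
    x: int
    y: int
    width: int
    height: int


def group_words_into_lines(words: List[Word], y_tolerance: int = 5) -> List[List[Word]]:
    """Bottom-up step: attach each word to the first line whose leading word
    has a top edge within y_tolerance pixels; otherwise start a new line."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        for line in lines:
            if abs(line[0].y - word.y) <= y_tolerance:
                line.append(word)
                break
        else:
            lines.append([word])
    # Order the words of every line from left to right.
    return [sorted(line, key=lambda w: w.x) for line in lines]


def classify_page(zone_types: List[str], is_issue_title_page: bool) -> str:
    """Apply the four page-type definitions from the slide.
    The order in which the rules are tried is an assumption."""
    kinds = set(zone_types)
    if is_issue_title_page:
        return "title page"
    if kinds and kinds <= {"advertisement"}:
        return "advertisement page"
    if "illustration" in kinds:
        return "illustration page"
    if kinds and kinds <= {"text"}:
        return "content page"
    return "unclassified"


words = [
    Word("Recognition", 70, 101, 90, 12),
    Word("Layout", 10, 100, 55, 12),
    Word("OLR", 10, 130, 30, 12),
]
lines = group_words_into_lines(words)
print([[w.text for w in line] for line in lines])  # [['Layout', 'Recognition'], ['OLR']]
print(classify_page(["text", "illustration"], False))  # illustration page
```

A real system would go on to merge lines into text blocks and columns in the same way, using further geometric and typographic cues.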

Page 6: Presentation of Claus Gravenhorst, BnF Information Day

ENP OLR workflow | Conversion without scanning

[Workflow diagram; only its labels survive in this transcript: Material location; Conversion facility; Delivery (Digital Image, Metadata); Doc Delivery; Inspection / Automatic QA; Reject; MD Recording; Conversion; Return (Digital Object)]

Page 7: Presentation of Claus Gravenhorst, BnF Information Day

Possible conversion scenarios

A) Conversion at library (on-site)

B) Conversion off-shore at CCS data center, final QA at the library via internet transfer (remote QA solution)

C) Conversion off-shore at CCS, final QA at the library by backup shipment

Page 8: Presentation of Claus Gravenhorst, BnF Information Day

Scenario B | Remote QA at library

[Architecture diagram; only its labels survive in this transcript: Offshore Processing @ CCS (Storage, IN, OUT, POOL, dW Share, Master); Internet; QA on-site @ Library (Storage, POOL, dW Share, RQA); INPUT; OUTPUT (METS, ALTO)]

Page 9: Presentation of Claus Gravenhorst, BnF Information Day

Quality assurance

• @ CCS | Automated markup and basic manual correction:

- Headlines, illustrations, tables, captions, advertisements, etc.

- Article segmentation and grouping of zones into articles (incl. continuation)

• @ Content Provider (Library)

Recommended:

- Zoning: correct classification of blocks as "text" or "illustration"

- Article segmentation: correct identification of headlines/text blocks/captions

- Grouping: correct grouping of blocks (text, illustration) to articles

- Metadata: correct title, issue date and issue number

Optional:

- Page types: correct page types

- Page numbers: correct page sequence

- OCR: perform text correction of specific zones (e.g. headlines, captions)
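
Some of the recommended checks lend themselves to automated pre-screening before a human operator reviews an issue. The following is a minimal sketch under stated assumptions: the data model (a mapping of block IDs to types, articles as lists of block IDs, a metadata dictionary with ISO-formatted `issue_date`) and the function name `check_issue` are hypothetical and do not describe the docWorks QA tooling.

```python
from collections import Counter
from datetime import date
from typing import Dict, List


def check_issue(blocks: Dict[str, str],
                articles: List[List[str]],
                metadata: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems found in one newspaper issue."""
    problems: List[str] = []

    # Grouping: every text or illustration block should belong to exactly one article.
    usage = Counter(block_id for article in articles for block_id in article)
    for block_id, block_type in blocks.items():
        if block_type in ("text", "illustration") and usage[block_id] != 1:
            problems.append(
                f"block {block_id} is in {usage[block_id]} articles (expected 1)")

    # Articles must not reference blocks that do not exist.
    for block_id in usage:
        if block_id not in blocks:
            problems.append(f"article references unknown block {block_id}")

    # Metadata: title, issue date and issue number must be present.
    for field in ("title", "issue_date", "issue_number"):
        if not metadata.get(field):
            problems.append(f"missing metadata: {field}")

    # Assumption: the issue date is stored as an ISO string such as "1914-08-01".
    issue_date = metadata.get("issue_date")
    if issue_date:
        try:
            date.fromisoformat(issue_date)
        except ValueError:
            problems.append(f"invalid issue_date: {issue_date}")

    return problems


problems = check_issue(
    blocks={"b1": "text", "b2": "illustration", "b3": "text"},
    articles=[["b1", "b2"]],
    metadata={"title": "Example Gazette", "issue_date": "1914-08-01", "issue_number": ""},
)
print(problems)
# ['block b3 is in 0 articles (expected 1)', 'missing metadata: issue_number']
```

Checks like these can only flag structural inconsistencies. Whether a block really is a headline or belongs to a given article still needs the manual review described above.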

Page 10: Presentation of Claus Gravenhorst, BnF Information Day

Output | METS/ALTO package

• METS/ALTO metadata schemas to describe the structured digital output object

• A newspaper issue processed in docWorks is converted into one METS XML file. It reflects the whole physical and logical structure, manages all links to the image files and the related ALTO XML files. ALTO is based on a standardized page description schema and contains all information of a page (print space, margins, coordinates, OCR results).

• Benefits of structural markup:

- better browsing and more precise text search

- better access and display on tablet and mobile devices

- automated article classification and clustering through data/text mining and linguistic technologies

- user engagement for manual online text correction, article classification, annotation, building personal collections, etc.

- sharing articles via social media platforms like Facebook, Twitter, etc.

_______________

METS = Metadata Encoding and Transmission Standard

ALTO = Analyzed Layout and Text Object
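
To make the page-level content of ALTO more concrete, the sketch below reads OCR words and their coordinates from a small hand-written ALTO fragment using Python's standard library. The sample XML is illustrative and is not docWorks output; it assumes the ALTO version 2 namespace, and real files may use a different ALTO version and carry many more attributes.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Assumption: the file uses the ALTO v2 namespace.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Hand-written sample with one text block, one line and two words.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2400" HEIGHT="3400">
      <PrintSpace HPOS="100" VPOS="120" WIDTH="2200" HEIGHT="3100">
        <TextBlock ID="TB1" HPOS="100" VPOS="120" WIDTH="900" HEIGHT="60">
          <TextLine HPOS="100" VPOS="120" WIDTH="900" HEIGHT="60">
            <String CONTENT="Optical" HPOS="100" VPOS="120" WIDTH="300" HEIGHT="60"/>
            <SP WIDTH="20" HPOS="400" VPOS="120"/>
            <String CONTENT="Layout" HPOS="420" VPOS="120" WIDTH="280" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def extract_words(alto_xml: str) -> List[Tuple[str, float, float, float, float]]:
    """Return (text, hpos, vpos, width, height) for every String element."""
    root = ET.fromstring(alto_xml)
    words = []
    for s in root.iterfind(".//alto:String", ALTO_NS):
        words.append((
            s.get("CONTENT"),
            float(s.get("HPOS")),
            float(s.get("VPOS")),
            float(s.get("WIDTH")),
            float(s.get("HEIGHT")),
        ))
    return words


words = extract_words(SAMPLE_ALTO)
print(words)
# [('Optical', 100.0, 120.0, 300.0, 60.0), ('Layout', 420.0, 120.0, 280.0, 60.0)]
```

Word coordinates of this kind are what allow a presentation system to highlight search hits directly on the page image.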

Page 11: Presentation of Claus Gravenhorst, BnF Information Day

Access and Presentation (I)

• Sample presentation system (Veridian)

• Browse by date, title

• Text search

• Article hit list

• Word highlighting

Page 12: Presentation of Claus Gravenhorst, BnF Information Day

Access and Presentation (II)

• Issue

• Table of contents

Page 13: Presentation of Claus Gravenhorst, BnF Information Day

Access and Presentation (III)

• Text & image view

• User text correction

• Article clipping

• Print article

• Distribute via email and social media platforms

Page 14: Presentation of Claus Gravenhorst, BnF Information Day

Thank you for your attention!

[email protected]

www.europeana-newspapers.eu