On the Two Sides of the Pond By Hans-Jörg Lieder, Head of the Department of Bibliographic Services – Union Catalogue of Serials Staatsbibliothek zu Berlin - Preußischer Kulturbesitz; Dr. Katalin Radics, Distinguished Librarian; Librarian of the West European Collections and Classics Young Research Library, University of California, Los Angeles
Page 1: On the two sides of the pond

On the Two Sides of the Pond

By Hans-Jörg Lieder,

Head of the Department of Bibliographic Services – Union Catalogue of Serials Staatsbibliothek zu Berlin - Preußischer Kulturbesitz;

Dr. Katalin Radics, Distinguished Librarian; Librarian of the West European Collections and Classics

Young Research Library, University of California, Los Angeles

Page 2: On the two sides of the pond

UNIQUE EUROPEAN MATERIALS – HELD IN A US LIBRARY

Page 3: On the two sides of the pond

Partnership between the UCLA Library and Staatsbibliothek zu Berlin

Page 4: On the two sides of the pond

Newspapers on the way to discoloring and disintegration

Storage facility of the University of California Libraries on the UCLA campus

Page 5: On the two sides of the pond

- Leaflets 13”x18.5” or 33cm x 47cm

- Imprint indicating the title, date, the number of the issue; warning

- Published four or five times a day

Page 6: On the two sides of the pond

UCLA stamps including receiving dates

Packed in wrapping paper probably after 1940, packages of 700-800 sheets

No documentation (ordering or receiving records) in the library archives; no correspondence

Normal serial subscription scheme (?)

Very minimal cataloging record – very low use

Page 7: On the two sides of the pond

Towards a Weeding Decision

Brittle condition

Check for other holdings in California, US and World libraries

OCLC – no other holdings at the time of checking

Nine 1938 issues at BNF

No holding at the German National Library (Deutsche Nationalbibliothek)

Contact with head of Zeitungsabteilung, Staatsbibliothek – no holding in Germany

UNIQUE!!! Decision: keep and preserve the UCLA holdings.

Page 8: On the two sides of the pond

Keep and Preserve

9600 pages

1936-1940 with gaps

Acid-free boxes

The most fragile pages in mylar

Page 9: On the two sides of the pond

Digitization Project

Funding for digitization

Highest quality resolution: 600 dpi RGB

Add minimal metadata

Page 10: On the two sides of the pond

Title: Deutsches Nachrichtenbüro. 5 Jahrg., Nr. 1581, 1938 October 1, Erste Morgen-Ausgabe

Alt ID: 3813183_1938-10-01_1581 [Local]

AltTitle: Erste Morgen-Ausgabe [Descriptive]; Deutsches Nachrichtenbüro [Descriptive]

Date: October 1, 1938 [Publication]; 1938-10-01 [Normalized]

Format: 1 p. [Extent]

Language: ger

Name: University of California, Los Angeles. Library. Dept. of Special Collections [Repository]

Type: newspapers [Genre]; text [Type Of Resource]
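The Alt ID in the record above looks like three underscore-separated parts: a record number, the normalized issue date, and the issue number. A minimal Python sketch of splitting such an identifier follows. That reading of the pattern is an assumption drawn from this one example, and the names `IssueId` and `parse_alt_id` are hypothetical, not part of any UCLA system.

```python
from datetime import date
from typing import NamedTuple


class IssueId(NamedTuple):
    """Parts of a local Alt ID (hypothetical structure, inferred from one record)."""
    record: str        # assumed: catalog record number
    issued: date       # normalized publication date
    issue_number: int  # assumed: running issue number ("Nr.")


def parse_alt_id(alt_id: str) -> IssueId:
    """Split an ID of the assumed form <record>_<YYYY-MM-DD>_<issue number>."""
    record, iso_date, number = alt_id.split("_")
    return IssueId(record, date.fromisoformat(iso_date), int(number))


issue = parse_alt_id("3813183_1938-10-01_1581")
print(issue.issued.isoformat(), issue.issue_number)
```

Parsed this way, the date and issue number could serve as sort keys for a date-based browse list.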

Page 11: On the two sides of the pond

Digitized copies: part of UCLA Digital Library at

http://digital2.library.ucla.edu/ -- freely accessible

Searchable only by date

More sophisticated searching capability needed – day by day chronicle of the Third Reich for a short period of time

- events

- names

- institutions, etc.

Deutsches Nachrichtenbüro – December 5, 1933 – network of 36 local services (Landesdienste)

Page 12: On the two sides of the pond

Indexing needed

Fraktur – major problem

Transliteration into Latin characters

OCR (Optical Character Recognition) – has to be done in Germany

Looking for a German Partner

Page 13: On the two sides of the pond

Not a problem … here we are!

Page 14: On the two sides of the pond

… but who are “we”?

• Project: Europeana Newspapers: http://www.europeana-newspapers.eu/

• 18 partners from 12 countries

• Tasks:

• Provide OCR for 18 million pages

• Provide OLR for 2 million pages

• Provide NER experimentally in assorted languages

• Provide best practice recommendations for newspaper metadata

• Provide quality prediction tools

• Aggregate content and make it available to TEL and Europeana

OCR = Optical Character Recognition

OLR = Optical Layout Recognition

NER = Named Entity Recognition

Page 15: On the two sides of the pond

A Dance of Acronyms: UCLA, SBB and CCS

UCLA sent data on hard drive

SBB

• Checked data for correctness and moved images into directory structure

• Sent data to CCS in Hamburg for OCR and OLR

CCS (Content Conversion Specialists)

• Created full texts per article

• Stuck data in NZ web service for preliminary presentation purposes

SBB

• Will perform QA of OCR and OLR results

• Will provide all data to UCLA for further use

• Will present data in ZEFYS, its own newspaper portal; to the Deutsche Digitale Bibliothek; to TEL (The European Library) and to Europeana

Page 16: On the two sides of the pond

Layout and structure analysis: recognition of words, text lines, text blocks, columns and classification of text blocks, illustrations, advertisements, tables and the following page types:

- title page (the title page of an issue)

- content page (a page that consists of content/text only)

- illustration page (a page that has at least one illustration)

- advertisement page (a page that contains adverts only)

Structure analysis through classification of headlines and grouping of zones into articles (incl. article continuation)

Page 17: On the two sides of the pond

ENP OLR workflow | Conversion without scanning

[Workflow diagram. Recoverable labels: Material location; Conversion facility; Digital Image / Metadata Delivery; Doc Delivery; Inspection / Automatic QA; Reject; Conversion / MD Recording; Digital Object Return]

Page 18: On the two sides of the pond

Quality assurance @ CCS | Automated markup and basic manual correction:

- headlines, illustrations, tables, captions, advertisements, etc.

- article segmentation and grouping of zones into articles (incl. continuation)

@ Content Provider (Library)

Recommended:

- Zoning: correct classification of blocks as „text“ or „illustration“

- Article segmentation: correct identification of headlines/text blocks/captions

- Grouping: correct grouping of blocks (text, illustration) to articles

- Metadata: correct title, issue date and issue number

Optional:

- Page types: correct page types

- Page numbers: correct page sequence

- OCR: perform text correction of specific zones (e.g. headlines, captions)

Page 19: On the two sides of the pond

Output | METS/ALTO package

METS/ALTO metadata schemas to describe the structured digital output object

A newspaper issue processed in docWorks is converted into one METS XML file. It reflects the whole physical and logical structure, manages all links to the image files and the related ALTO XML files. ALTO is based on a standardized page description schema and contains all information of a page (print space, margins, coordinates, OCR results).
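As a rough illustration of how OCR results sit inside an ALTO page file, the sketch below reads the `CONTENT` attribute of each `String` element and joins the words per `TextLine`. The sample XML is invented for this example and is not docWorks output. The function name `text_lines` is hypothetical, and matching on local tag names (ignoring the namespace) is a simplification chosen here because ALTO namespaces differ between schema versions.

```python
import xml.etree.ElementTree as ET

# Invented sample page; real ALTO files carry many more attributes and blocks.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="TB1">
      <TextLine>
        <String CONTENT="Erste" HPOS="10" VPOS="20" WIDTH="50" HEIGHT="12"/>
        <SP/>
        <String CONTENT="Morgen-Ausgabe" HPOS="65" VPOS="20" WIDTH="140" HEIGHT="12"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def text_lines(alto_xml: str) -> list[str]:
    """Return the recognized text of each TextLine, words joined by spaces."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local(element.tag) == "TextLine":
            words = [child.get("CONTENT", "")
                     for child in element
                     if local(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


lines = text_lines(SAMPLE_ALTO)
print(lines)
```

The `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes are the word coordinates mentioned above; a viewer could use them to highlight search hits on the page image.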

Benefits of structural markup:

- better browsing and more precise text search

- better access and display on tablet and mobile devices

- automated article classification and clustering through data/text mining and linguistic technologies

- user engagement for manual online text correction, article classification, annotation, building personal collections, etc.

- sharing articles via social media platforms like Facebook, Twitter, etc.

METS = Metadata Encoding and Transmission Standard

ALTO = Analyzed Layout and Text Object