On the Two Sides of the Pond By Hans-Jörg Lieder, Head of the Department of Bibliographic Services – Union Catalogue of Serials Staatsbibliothek zu Berlin - Preußischer Kulturbesitz; Dr. Katalin Radics, Distinguished Librarian; Librarian of the West European Collections and Classics Young Research Library, University of California, Los Angeles
UNIQUE EUROPEAN MATERIALS – HELD IN A US LIBRARY
Partnership between the UCLA Library and Staatsbibliothek zu Berlin
Newspapers on the way to discoloring and disintegration
Storage facility of the University of California Libraries on the UCLA campus
- Leaflets 13”x18.5” or 33cm x 47cm
- Imprint indicating the title, date, the number of the issue; warning
- Published four or five times a day
UCLA stamps including receiving dates
Packed in wrapping paper probably after 1940, packages of 700-800 sheets
No documentation (ordering or receiving records) in the library archives; no correspondence
Normal serial subscription scheme (?)
Very minimal cataloging record – very low use
Towards a Weeding Decision
Brittle condition
Check for other holdings in California, US and World libraries
OCLC – no other holdings at the time of checking
Nine 1938 issues at BNF
No holding at the German National Library (Deutsche Nationalbibliothek)
Contact with head of Zeitungsabteilung, Staatsbibliothek – no holding in Germany
UNIQUE!!! Decision: keep and preserve the UCLA holdings.
Keep and Preserve
9600 pages
1936-1940 with gaps
Acid-free boxes
The most fragile pages in mylar
Digitization Project
Funding for digitization
Highest quality resolution: 600 dpi RGB
Add minimal metadata
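The figures on these slides (600 dpi RGB, leaflets of 13” x 18.5”, 9600 pages) allow a rough estimate of the raw image volume. The sketch below assumes 8 bits per colour channel, uncompressed images, and scans cropped exactly to the sheet; none of these is stated in the slides, so the result is an order-of-magnitude guide only.

```python
DPI = 600
SHEET_INCHES = (13.0, 18.5)   # leaflet size given on the slides
BYTES_PER_PIXEL = 3           # assumption: 8 bits per RGB channel
PAGE_COUNT = 9600


def uncompressed_bytes(width_in: float, height_in: float,
                       dpi: int, bytes_per_pixel: int) -> int:
    """Raw size of one scan, ignoring file headers and compression."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


per_page = uncompressed_bytes(*SHEET_INCHES, DPI, BYTES_PER_PIXEL)
total = per_page * PAGE_COUNT
print(f"{per_page / 1e6:.0f} MB per page, {total / 1e12:.2f} TB in total")
```

Under these assumptions a single page is roughly 260 MB and the whole collection about 2.5 TB, which is consistent with shipping the data on a hard drive rather than over a network.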
Title: Deutsches Nachrichtenbüro. 5 Jahrg., Nr. 1581, 1938 October 1, Erste Morgen-Ausgabe
Alt ID: 3813183_1938-10-01_1581 [Local]
AltTitle: Erste Morgen-Ausgabe [Descriptive]; Deutsches Nachrichtenbüro [Descriptive]
Date: October 1, 1938 [Publication]; 1938-10-01 [Normalized]
Format: 1 p. [Extent]
Language: ger
Name: University of California, Los Angeles. Library. Dept. of Special Collections [Repository]
Type: newspapers [Genre]; text [Type Of Resource]
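The Alt ID in this sample record appears to combine three parts separated by underscores: a local record number, the normalized date, and the issue number. A minimal sketch of splitting it, assuming that pattern holds for every issue (the slides show only this one example, and the names `IssueId` and `parse_alt_id` are hypothetical):

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IssueId:
    record_id: str      # assumed: local record number shared by all issues
    issue_date: date    # normalized publication date
    issue_number: int   # running issue number ("Nr.")


def parse_alt_id(alt_id: str) -> IssueId:
    """Split an Alt ID of the assumed form <record>_<YYYY-MM-DD>_<issue>."""
    record_id, iso_date, issue = alt_id.split("_")
    return IssueId(record_id, date.fromisoformat(iso_date), int(issue))


issue = parse_alt_id("3813183_1938-10-01_1581")
print(issue.issue_date.isoformat(), issue.issue_number)
```

Parsing the identifier this way would let issues be sorted or filtered by date even where only minimal metadata exists.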
More sophisticated searching capability needed – day by day chronicle of the Third Reich for a short period of time
- events
- names
- institutions etc.
Deutsches Nachrichtenbüro – December 5, 1933 – network of 36 local services (Landesdienste)
Indexing needed
Fraktur – major problem
Transliteration into Latin characters
OCR (Optical Character Recognition) – has to be done in Germany
Looking for a German Partner
Not a problem … here we are!
… but who are “we”?
• Project: Europeana Newspapers: http://www.europeana-newspapers.eu/
• 18 partners from 12 countries
• Tasks:
• Provide OCR for 18 million pages
• Provide OLR for 2 million pages
• Provide NER experimentally in assorted languages
• Provide best practice recommendations for newspaper metadata
• Provide quality prediction tools
• Aggregate content and make it available to TEL and Europeana
OCR = Optical Character Recognition
OLR = Optical Layout Recognition
NER = Named Entity Recognition
A Dance of Acronyms: UCLA, SBB and CCS
UCLA sent data on hard drive
SBB
• Checked data for correctness and moved images into directory structure
• Sent data to CCS in Hamburg for OCR and OLR
CCS (Content Conversion Specialists)
• Created full texts per article
• Stuck data in NZ web service for preliminary presentation purposes
SBB
• Will perform QA of OCR and OLR results
• Will provide all data to UCLA for further use
• Will present data in ZEFYS, its own newspaper portal; to the Deutsche Digitale Bibliothek; to TEL (The European Library) and to Europeana
Layout and structure analysis
Recognition of words, text lines, text blocks and columns, and classification of text blocks, illustrations, advertisements, tables and the following page types:
- title page (the title page of an issue)
- content page (a page that consists of content/text only)
- illustration page (a page that has at least one illustration)
- advertisement page (a page that contains adverts only)
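The four page types can be read as simple rules over the block types found on a page. The sketch below is one possible reading, not the logic of the OCR software: the block labels, the `is_first_page` flag used to detect a title page, the order in which the rules are checked, and the fallback to "content page" for mixed pages are all assumptions made for illustration.

```python
from typing import Iterable


def classify_page(block_types: Iterable[str], is_first_page: bool = False) -> str:
    """Assign one of the four page types from the block types on a page."""
    kinds = set(block_types)
    if is_first_page:
        # Assumption: the first page of an issue is its title page.
        return "title page"
    if kinds and kinds <= {"advertisement"}:
        # Adverts only, nothing else on the page.
        return "advertisement page"
    if "illustration" in kinds:
        # At least one illustration.
        return "illustration page"
    # Simplification: everything else is treated as a content page.
    return "content page"


print(classify_page(["text", "illustration", "text"]))
```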
Structure analysis through classification of headlines and grouping of zones into articles
- headlines, illustrations, tables, captions, advertisements, etc.
- article segmentation and grouping of zones into articles (incl. continuation)
@ Content Provider (Library)
Recommended:
- Zoning: correct classification of blocks as "text" or "illustration"
- Article segmentation: correct identification of headlines/text blocks/captions
- Grouping: correct grouping of blocks (text, illustration) into articles
- Metadata: correct title, issue date and issue number
Optional:
- Page types: correct page types
- Page numbers: correct page sequence
- OCR: perform text correction of specific zones (e.g. headlines, captions)
Output | METS/ALTO package
METS/ALTO metadata schemas to describe the structured digital output object
A newspaper issue processed in docWorks is converted into one METS XML file. It reflects the whole physical and logical structure and manages all links to the image files and the related ALTO XML files. ALTO is based on a standardized page description schema and contains all information of a page (print space, margins, coordinates, OCR results).
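To illustrate what "coordinates, OCR results" means in practice, here is a minimal sketch that reads recognized words and their positions from an ALTO-style page. The sample XML is hand-written for this example and heavily simplified (real ALTO files declare a namespace and carry many more elements and attributes); the words and coordinates are invented, not taken from the scans. The helper names `local_name` and `extract_words` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Simplified, hand-written sample; values are illustrative only.
SAMPLE_ALTO = """<alto>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Deutsches" HPOS="400" VPOS="300" WIDTH="900" HEIGHT="120"/>
            <SP/>
            <String CONTENT="Nachrichtenbüro" HPOS="1350" VPOS="300" WIDTH="1500" HEIGHT="120"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Drop an XML namespace prefix: '{uri}String' becomes 'String'."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml: str) -> List[Tuple[str, int, int]]:
    """Return (word, x, y) for every String element, in document order."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            words.append((
                element.get("CONTENT", ""),
                int(float(element.get("HPOS", "0"))),
                int(float(element.get("VPOS", "0"))),
            ))
    return words


words = extract_words(SAMPLE_ALTO)
print(" ".join(word for word, _, _ in words))
```

Matching on the local element name keeps the sketch independent of the ALTO schema version, since the namespace URI differs between versions. Because each word carries its position, a search hit can be highlighted directly on the page image.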
Benefits of structural markup:
- better browsing and more precise text search
- better access and display on tablet and mobile devices
- automated article classification and clustering through data/text mining and linguistic technologies
- user engagement for manual online text correction, article classification, annotation, building personal collections, etc.
- sharing articles via social media platforms like Facebook, Twitter, etc.
METS = Metadata Encoding and Transmission Standard