From pixels and minds to the mathematical knowledge in digital library Petr Sojka, Masaryk University, Brno Jiří Rákosník, Institute of Mathematics AS CR, Praha
Mar 31, 2015
Motivation for DML
the number of new papers grows faster every year
Zentralblatt MATH: 2 711 559 items indexed, 53 481 items added in 2008
MathSciNet: 2 329 742 items indexed, 80 000 new items yearly
Maths relies more than other sciences on past literature
50 % of current references point to literature 15 years old
25 % point 25 years back
Number of references in Collection of Computer Science bibliographies
Publish or perish
“If [in 2600] you stacked all the new books being published next to each other, you would have to move at ninety miles an hour just to keep up with the end of the line. Of course, by 2600 new artistic and scientific work will come in electronic forms, rather than as physical books and paper. Nevertheless, if the exponential growth continued, there would be ten papers a second in my kind of theoretical physics, and no time to read them.”
Stephen Hawking
Motivation for DML-CZ
NUMDAM Numérisation de documents anciens mathématiques
ERAM The Jahrbuch Project – Electronic Research Archive for Mathematics (1868–1942): “Jahrbuch über die Fortschritte der Mathematik”
JSTOR archives of over one thousand academic journals across the humanities, social sciences, and sciences, as well as select monographs
EMANI electronic mathematical archiving network (Cornell, SUB Göttingen, MathDoc, Tsinghua University Library)
RusDML Russian DML (2 000 000 pages of papers in journals covered by Zentralblatt MATH)
…
DML-CZ Digital Mathematical Library of mathematical literature published in the Czech Republic and Slovakia
The occasion
R&D programme Information Society funded by the Academy of Sciences
project DML-CZ: Czech Digital Mathematics Library, 2005–2009
Partners
Institute of Mathematics AS CR, Praha (J. Rákosník) – coordinator, material selection, copyright, mathematical supervision
Institute of Computer Science, Masaryk University, Brno (M. Bartošek, P. Kovář, M. Šárfy, V. Krejčíř) – content management system, metadata Q/A, long-term archiving
Faculty of Informatics MU, Masaryk University, Brno (P. Sojka) – formats and tools, technical coordination, information retrieval, indexing
Faculty of Mathematics and Physics, Charles University, Praha (O. Ulrych, J. Veselý) – harvesting and adjusting metadata
Library AS CR, Praha (M. Lhoták, M. Duda, A. Ryšánková) – document scanning, adjustment and OCR in the Digitization Centre Jenštejn
The aim
journals for mathematical research and education including Mathematica Slovaca
conference proceedings
monographs, textbooks
altogether about 200 000 pages
Journals
Title retro (scan) retro-born
Czechoslovak Mathematical Journal 1951-1991 1992-2008
Aplikace Matematiky / Applications of Mathematics 1956-1993 1994-2008
Archivum Mathematicum, Brno 1965-1991 1992-2007
Commentationes Mathematicae Universitatis Carolinae 1960-1990 1991-2008
Kybernetika 1965-1997 1998-2008
Časopis pro pěstování matematiky a fysiky 1872-1950
Časopis pro pěstování matematiky 1951-1990
Mathematica Bohemica 1991-2008
Acta Univ. Palackianae Olomucensis. Mathematica 1960-2008
Acta Mathematica et Informatica Univ. Ostraviensis 1993-2003
Acta Mathematica Univ. Ostraviensis 2004-2008
Mathematica Slovaca 1951-2008
Matematika-Fyzika-Informatika 1991-2008
Pokroky matematiky, fyziky a astronomie 1956-2008
Journals - pilot part launched on 11th June 2008
Workflow overview
Preparation
selection of titles – quality of content, historical value
preparation – acquisition of documents for scanning, content survey
copyright – negotiation with publishers or authors
Scanning
parameters – 600 dpi, 4-bit depth
scanning facilities – Digibook RGB 10000, A1 color book scanner, and two Zeutschel OS 7000 book scanners, A2 B/W
software – BookRestorer to make the scanned pages uniform (white space around text body, …); Sirius system for archival storage of scans (put on CDs as TIFFs)
Optical Character Recognition
text OCR – two-phase DML-OCR implemented with ABBYY FineReader SDK 8.1
errors in maths reading → methods for separation of text OCR and mathematics OCR
maths OCR – Infty system (Suzuki et al., Japan):
layout analysis
character recognition
structure analysis of math. expressions
manual error correction
multilayer PDF with several OCR layers (text, math in TeX, math in MathML or OMDoc)
99 %+ accuracy for text, 96 %+ for mathematics
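The slides do not say how the accuracy figures were measured. One common convention, sketched here purely as an assumption, is character-level accuracy: one minus the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Character accuracy relative to the reference text (assumed metric)."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))
```

Under this convention, one wrong character in a four-character reference gives an accuracy of 0.75; a 99 % figure would mean about one error per hundred reference characters.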
Metadata and Image Enhancement/Processing
metadata standards – choice of standards (DC, MODS, METS are supported by DSpace)
metadata acquisition – Zbl/MR, OCR tagging, (retyping)
image enhancements – TIFF, PDF, JBIG2 compression as a measure of quality
semantic processing – document markup enhancement, document classification, citation linking, document clustering, indexing
References and fulltexts are metadata as well; English titles and MSC are mandatory. OAI-PMH export.
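The rule that English titles and MSC codes are mandatory lends itself to an automated check. The following is a minimal sketch, not the project's actual validation: the field names `title_en` and `msc` and the regular expression for MSC codes (forms such as `46E35`, `46Exx`, `46-XX`, `46-01`) are assumptions made for illustration.

```python
import re

# Assumed shape of an MSC code: two digits, then either "-" plus two
# digits or "XX", or an uppercase letter plus two digits or "xx".
MSC_PATTERN = re.compile(r"^\d{2}(-(\d{2}|XX)|[A-Z](\d{2}|xx))$")


def missing_mandatory(record: dict) -> list:
    """Return the names of mandatory fields that are absent or malformed."""
    problems = []
    # Hypothetical field name for the English title.
    if not record.get("title_en", "").strip():
        problems.append("title_en")
    # Hypothetical field name for the list of MSC codes.
    codes = record.get("msc", [])
    if not codes or not all(MSC_PATTERN.match(code) for code in codes):
        problems.append("msc")
    return problems
```

A record passes when the returned list is empty; anything else can be routed back to manual metadata correction.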
Metadata Editor
metadata creation & DL integration
developed in Brno for DML-CZ
web-based application
web interface, suite of scripts, files in directories, internal database
Storage, indexing
space – multiple OCR, multiple attribute layers (lemmas, reviewer comments, semantic classifications, etc.); storing and indexing all of this for all mathematical literature published so far poses no problem
software – client/server architecture, Lucene indexing software (OSS)
Document Markup Enhancement Methods
context-dependent mapping from visual to logical markup
algorithms of language identification (bi-gram, tri-gram based, paragraph or even sentence level)
document classification, metrics, ontology construction, comparison with AMS 2000 classification
semiautomatic bibliography markup and metrics, global mathematics citation index, “MathRank”
document clustering (for visualization, …), identification of near duplicates
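The tri-gram based language identification mentioned above can be illustrated as follows. This is a toy sketch under stated assumptions, not the DML-CZ implementation: each language profile is built from a single made-up sample sentence, and a text is assigned to the profile with the highest cosine similarity of character tri-gram counts. A real system would train on much larger corpora.

```python
from collections import Counter
from math import sqrt

# Tiny illustrative training samples (assumptions, not project data).
SAMPLES = {
    "en": "the proof of the theorem follows from the lemma and the definition of the space",
    "cs": "důkaz věty plyne z lemmatu a z definice prostoru funkcí na množině",
    "de": "der Beweis des Satzes folgt aus dem Lemma und der Definition des Raumes",
}


def trigrams(text: str) -> Counter:
    """Count character tri-grams of lowercased, space-normalised text."""
    normalized = " " + " ".join(text.lower().split()) + " "
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two tri-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def identify_language(text: str) -> str:
    """Return the language whose profile is most similar to the text."""
    grams = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(grams, PROFILES[lang]))
```

Because the function works on any string, it can be applied per paragraph or per sentence, which matches the granularity listed on the slide.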
Presentation
delivery – customised digital library system DSpace (open source, created at MIT) for final articles delivery, search; Manakin interface
planned visualization techniques – “lost in hyperspace fear”, visualization of document clustering, Visual Browser (different user's eyes)
Delivery
web portal – unique and persistent URLs: Digital Object Identifier DOI (PURL, URN? …)
interfaces to other services – OAI-PMH harvesting, BibTeX export, Googlebot optimization
indexing, search relevance – Lucene, customized for maths (Experiments with Manatee and EDBM-2 (Zbl, NUMDAM))?
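From the harvester's side, OAI-PMH harvesting amounts to an HTTP request with a `verb` parameter and an XML response. The sketch below builds a `ListRecords` request for Dublin Core (`oai_dc`) metadata and pulls titles out of a response. The base URL `https://example.org/oai` and the sample record are placeholders invented for illustration; no real DML-CZ endpoint is assumed.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and by Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def extract_titles(response_xml):
    """Collect all dc:title values from a ListRecords response."""
    root = ET.fromstring(response_xml)
    titles = []
    for record in root.iter(OAI_NS + "record"):
        for title in record.iter(DC_NS + "title"):
            titles.append(title.text)
    return titles


# Made-up response with a single placeholder record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example article title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    print(extract_titles(SAMPLE_RESPONSE))
```

A full harvester would also follow `resumptionToken` elements to page through large result sets; that step is omitted here.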
Further problems and questions
paper classification
automated MSC experiment
automated MSC learning
metadata from born-digital documents
search
OCR systems
OCR XML postprocessing
maths OCR
Possibilities