DML–CZ: asks and bids
Jiří Rákosník, Institute of Mathematics AS CR, Praha
Towards a European Virtual Library in Mathematics, Santiago de Compostela, 13.3.2009
DML–CZ, a brief description
Digital Mathematics Library of relevant mathematical literature published in the territory of the Czech Republic and Slovakia
Funding: R&D programme Information Society of the Academy of Sciences, 2005–2009
Partners
Institute of Mathematics AS CR, Praha (J. Rákosník) – coordinator, material selection, copyright, mathematical supervision
Institute of Computer Science, Masaryk University, Brno (M. Bartošek, P. Kovář, M. Šárfy, V. Krejčíř) – content management system, metadata Q/A, long-term archiving
Faculty of Informatics, Masaryk University, Brno (P. Sojka) – formats and tools, technical coordination, information retrieval, indexing
Faculty of Mathematics and Physics, Charles University, Praha (O. Ulrych, J. Veselý) – harvesting and adjusting metadata
Library AS CR, Praha (M. Lhoták, M. Duda, A. Ryšánková) – document scanning, graphical adjustment and OCR in the Digitization Centre Jenštejn
The scope
journals for mathematical research and education
conference proceedings
monographs, textbooks
altogether more than 200 000 pages
Journals
Title: years, by digitization type (retro (scan), retro-born-digital, born-digital)
Czechoslovak Mathematical Journal: 1951–1991, 1992–2008
Aplikace Matematiky / Applications of Mathematics: 1956–1993, 1994–2008
Archivum Mathematicum, Brno: 1965–1991, 1992–2007
Commentationes Mathematicae Universitatis Carolinae: 1960–1990, 1991–2008
Kybernetika: 1965–1997, 1998–2008
Časopis pro pěstování matematiky a fysiky: 1872–1950
Časopis pro pěstování matematiky: 1951–1990
Mathematica Bohemica: 1991, 1992–2008
Acta Univ. Palackianae Olomucensis. Mathematica: 1960–2003, 2004–2008
Acta Mathematica et Informatica Univ. Ostraviensis: 1993–2003
Acta Mathematica Univ. Ostraviensis: 2004–2008
Mathematica Slovaca: 1951–2008
Matematika–Fyzika–Informatika: 1991–2005, 2006–2009
Pokroky matematiky, fyziky a astronomie: 1956–2005, 2006–2009
Year: 2008, 2009, 2010–
Pages: 106 000, 133 000, 30 000+
Proceedings
Title: volumes
Equadiff: 11
Toposym: 10
Asymptotic Statistics: 4
Winter School Abstract Analysis: 33
Nonlinear Analysis, Function Spaces, Applications: 8
Function Spaces, Differential Operators, Nonlinear Analysis: 6
…
Year: 2008, 2009, 2010–
Pages: 7 750, 6 900
Monographs
Title: volumes
Bernard Bolzano Collection: 21
From the collection of The Royal Czech Society for Sciences: 15
Other monographs: 2
Year: 2008, 2009, 2010–
Pages: 4 500, 1 000
Content
multilingual: Czech, Slovak, Russian, English, German, French, Italian
text, drawings, photographs (B&W)
maths, physics, chemistry, education, reviews, personalia, politics
Inspiration
GDZ: technology for scanning, text adjustment, OCR
Cellule MathDoc, NUMDAM: DML, document enhancement, presentation, services
Scanning
parameters – 600 dpi, 4-bit depth
scanning facilities – Digibook RGB 10000, A1 color book scanner, and two Zeutschel OS 7000 book scanners, A2 B/W
software – BookRestorer to make the scanned pages uniform (graphical adjustment, white space around the text body, etc.)
Sirius system for archival storage of scans (put on CDs as TIFFs)
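To give a rough sense of what these scanning parameters mean for archival storage, the uncompressed size of a single page can be estimated from resolution and bit depth. This is only an illustrative calculation: the page dimensions (170 × 240 mm) and the function name are assumptions, not figures from the project.

```python
MM_PER_INCH = 25.4

def uncompressed_page_bytes(width_mm, height_mm, dpi=600, bits_per_pixel=4):
    """Estimate the raw (uncompressed) size of one scanned page in bytes."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes

# Hypothetical journal page of 170 x 240 mm (assumed, not from the slides).
size = uncompressed_page_bytes(170, 240)
print(f"{size / 1e6:.1f} MB per page before compression")
```

Actual TIFF files will usually be smaller, since the estimate ignores any compression applied when the scans are stored.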
Optical Character Recognition
text – OCR by two-phase DML-OCR implemented with ABBYY FineReader SDK 8.1
errors in maths reading → methods for separation of text OCR and mathematics OCR
maths – Infty system (Suzuki et al., Japan)
layout analysis, character recognition, structure analysis of math. expressions, manual error correction
PDF with one OCR layer, multilayer PDF with several OCR layers (text, math in TeX, math in MathML or OMDoc)
99 %+ accuracy for text, 96 %+ for mathematics
Metadata and image enhancement/processing
metadata standards – choice of standards (DC, MODS, METS are supported by DSpace), Unicode with TeX → possible conversion to MathML, maths standards rather than librarians’ standards
metadata acquisition – Zbl/MR, OCR tagging, (retyping)
image enhancements – TIFF, PDF, JBIG2 compression as a measure of quality
semantic processing – document markup enhancement, document classification, citation linking, document clustering, indexing
references and fulltexts as part of metadata, English titles and MSC mandatory
OAI-PMH export trying to follow miniDML, T. Fischer etc.
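A minimal sketch of how one article record could be serialized as unqualified Dublin Core in the `oai_dc` container that OAI-PMH uses. The namespaces are the standard `oai_dc` and Dublin Core ones; the helper name `build_oai_dc`, the dictionary keys, and the sample values are hypothetical and do not describe the actual DML-CZ export. The validation step mirrors the rule that English titles and MSC codes are mandatory.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)

def build_oai_dc(record):
    """Serialize a metadata dict as an oai_dc XML string.

    Raises ValueError when the English title or the MSC codes are missing.
    """
    if not record.get("title_en"):
        raise ValueError("English title is mandatory")
    if not record.get("msc"):
        raise ValueError("at least one MSC code is mandatory")

    root = ET.Element(f"{{{OAI_DC_NS}}}dc")

    def add(tag, text, lang=None):
        element = ET.SubElement(root, f"{{{DC_NS}}}{tag}")
        element.text = text
        if lang:
            element.set(XML_LANG, lang)

    add("title", record["title_en"], lang="en")
    for author in record.get("authors", []):
        add("creator", author)
    for code in record["msc"]:
        add("subject", code)
    if record.get("year"):
        add("date", str(record["year"]))
    return ET.tostring(root, encoding="unicode")

# Hypothetical sample record (all values made up for illustration).
xml_text = build_oai_dc({
    "title_en": "On a sample theorem",
    "authors": ["Novak, Jan"],
    "msc": ["46E35"],
    "year": 1975,
})
print(xml_text)
```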
Metadata Editor
metadata creation & DL integration
developed in Brno for DML-CZ
web-based application: web interface, suite of scripts, files in directories, internal database
Metadata Editor
input data loading, articles building, metadata editing, references processing, verification, PDF compilation, export to DML-CZ
Indexing, storage
indexing – multiple OCR, multiple attribute layers (lemmas, reviewer comments, semantic classifications, etc.)
space – no problem to store and index that for all mathematics literature so far
software – client/server architecture, Lucene indexing software (OSS)
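The idea of indexing several attribute layers per document can be illustrated with a toy inverted index keyed by (layer, term). This is a plain-Python sketch of the concept only, not the Lucene API; the class name, layer names, and sample documents are made up.

```python
from collections import defaultdict

class LayeredIndex:
    """Toy inverted index with one posting set per (layer, term) pair."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, layer, text):
        """Index the whitespace-separated terms of `text` under `layer`."""
        for term in text.lower().split():
            self._postings[(layer, term)].add(doc_id)

    def search(self, layer, query):
        """Return ids of documents containing every query term in `layer`."""
        terms = query.lower().split()
        if not terms:
            return set()
        postings = [self._postings.get((layer, t), set()) for t in terms]
        return set.intersection(*postings)

index = LayeredIndex()
index.add("doc1", "ocr_text", "Sobolev spaces and embeddings")
index.add("doc1", "lemmas", "sobolev space embedding")
index.add("doc2", "ocr_text", "Orlicz spaces")

print(sorted(index.search("ocr_text", "spaces")))
print(sorted(index.search("lemmas", "space embedding")))
```

Keeping the layer in the key lets the same query be run against the raw OCR text or against the lemmatized layer without the two interfering.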
Presentation
delivery – customised digital library system DSpace (open source, created at MIT) for final articles delivery, search; Manakin interface planned
visualization techniques – “lost in hyperspace fear”, visualization of document clustering, Visual Browser (different user's eyes)
Delivery
web portal – unique and persistent URLs: PURL
interfaces to other services – OAI-PMH harvesting (necessary to set up the content for OAI-PMH), BibTeX export, Googlebot optimization of metadata
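As an illustration of the BibTeX export interface, the sketch below renders one metadata record as a BibTeX `@article` entry. The function name `to_bibtex`, the dictionary keys, the citation-key scheme, and the sample record (including its URL) are all assumptions made for the example, not the format DML-CZ actually emits.

```python
def to_bibtex(record):
    """Render a metadata dict as a BibTeX @article entry (illustrative only)."""
    first_surname = record["authors"][0].split(",")[0].strip()
    key = f"{first_surname}{record['year']}".replace(" ", "")
    fields = [
        ("author", " and ".join(record["authors"])),
        ("title", record["title"]),
        ("journal", record["journal"]),
        ("year", str(record["year"])),
        ("url", record.get("url", "")),
    ]
    lines = [f"@article{{{key},"]
    # Skip fields with empty values so the entry stays valid.
    lines += [f"  {name} = {{{value}}}," for name, value in fields if value]
    lines.append("}")
    return "\n".join(lines)

# Hypothetical record; the persistent URL is a placeholder.
entry = to_bibtex({
    "authors": ["Novak, Jan", "Svoboda, Petr"],
    "title": "On a sample theorem",
    "journal": "Czechoslovak Mathematical Journal",
    "year": 1975,
    "url": "https://example.org/purl/1234",
})
print(entry)
```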
Further problems and questions
paper classification
automated MSC experiment
automated MSC learning
metadata from born-digital documents
search
OCR systems
OCR XML postprocessing
maths OCR
Bids
Metadata Editor
Applications for classification of publications
Document markup enhancement
Algorithms of language identification (bi-gram, tri-gram based, paragraph or even sentence level)
Measuring mathematical similarity of publications
OCR experience (possibly capacity)
Adjusted metadata of high fidelity
Experience (both good and bad) in workflow conduct
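The n-gram based language identification mentioned among the bids can be sketched as follows: build a character tri-gram frequency profile per language and pick the profile closest to the input by cosine similarity. This is a generic textbook approach, not the DML-CZ implementation; the training snippets are tiny made-up samples, and a usable system would need much larger corpora.

```python
from collections import Counter
import math

def ngram_profile(text, n=3):
    """Relative frequencies of character n-grams in `text`."""
    padded = f" {' '.join(text.lower().split())} "
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()}

def cosine(p, q):
    """Cosine similarity of two sparse frequency profiles."""
    dot = sum(w * q.get(g, 0.0) for g, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def identify_language(text, profiles, n=3):
    """Return the language whose profile is most similar to `text`."""
    sample = ngram_profile(text, n)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))

# Made-up training snippets; far too small for real use.
training = {
    "en": "the theorem is proved in the next section and the proof uses the following lemma",
    "de": "der Satz wird im nächsten Abschnitt bewiesen und der Beweis benutzt das folgende Lemma",
    "fr": "le théorème est démontré dans la section suivante et la preuve utilise le lemme suivant",
}
profiles = {lang: ngram_profile(text) for lang, text in training.items()}

print(identify_language("the proof of the lemma", profiles))
```

The same function works at paragraph or sentence level; shorter inputs simply give less reliable profiles.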
Asks
Interlinking system (the EuDML core?)
Effective system for adjusting and standardizing scanned pages
Metadata standards and metadata conversion/export tools
Unified authority base, journal name abbreviations, …
Effective maths OCR
Asks
Coordinated effort/support in copyright issues
Directive 2001/29/EC on the harmonisation of certain aspects of copyright and related rights in the information society
Green Paper Copyright in the Knowledge Economy COM(2008) 466/3
Fifth Freedom in the single market: free movement of knowledge and innovation
ENCES (European Network for Copyright in support of Education and Science) http://www.ences.eu
moving wall
supporting Open Access activities
Asks
Document markup enhancement
context-dependent mapping from visual to logical markup
document classification, metrics, ontology construction, comparison with MSC 2000 classification
semiautomatic bibliography markup and metrics, global mathematics citation index, “MathRank”
document clustering (for visualization, …), identification of plagiarism
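The slides do not define “MathRank”. Assuming it is meant as a PageRank-style score computed over a citation graph, the sketch below shows how such a score could be obtained by power iteration. The function name, the damping factor of 0.85, and the three-paper graph are illustrative assumptions.

```python
def citation_rank(citations, damping=0.85, iterations=50):
    """PageRank-style scores for a citation graph.

    `citations` maps each paper to the list of papers it cites.
    """
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    if not papers:
        return {}
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for paper in papers:
            cited = citations.get(paper, [])
            if cited:
                share = damping * rank[paper] / len(cited)
                for target in cited:
                    new_rank[target] += share
            else:
                # A paper that cites nothing spreads its weight evenly.
                share = damping * rank[paper] / n
                for target in papers:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Made-up graph: papers A and B both cite paper C.
scores = citation_rank({"A": ["C"], "B": ["C"], "C": []})
for paper, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(paper, round(score, 3))
```

A global mathematics citation index would supply the graph; the ranking itself depends only on who cites whom.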
Mathematician’s expectations
Reliability – rate of correspondence with the original document, persistency
Search – multilingual, reliable identification of authors, interlinking with Zentralblatt and Mathematical Reviews