From pixels and minds to the mathematical knowledge in digital library Petr Sojka, Masaryk University, Brno Jiří Rákosník, Institute of Mathematics AS CR, Praha

Mar 31, 2015

Page 1:

From pixels and minds to the mathematical knowledge in digital library

Petr Sojka, Masaryk University, Brno

Jiří Rákosník, Institute of Mathematics AS CR, Praha

Page 2:

Motivation for DML

the number of new papers is growing faster and faster

Zentralblatt MATH: 2 711 559 items indexed, 53 481 items added in 2008

MathSciNet: 2 329 742 items indexed, 80 000 new items yearly

Page 3:

Motivation for DML

Maths relies more than other sciences on past literature

50 % of current references point to literature 15 years old

25 % point 25 years back

Number of references in Collection of Computer Science bibliographies

Page 4:

Publish or perish

“If [in 2600] you stacked all the new books being published next to each other, you would have to move at ninety miles an hour just to keep up with the end of the line. Of course, by 2600 new artistic and scientific work will come in electronic forms, rather than as physical books and paper. Nevertheless, if the exponential growth continued, there would be ten papers a second in my kind of theoretical physics, and no time to read them.”

Stephen Hawking

Page 5:

Motivation for DML-CZ

NUMDAM Numérisation de documents anciens mathématiques

ERAM The Jahrbuch Project – Electronic Research Archive for Mathematics (1868–1942): “Jahrbuch über die Fortschritte der Mathematik”

JSTOR archives of over one thousand academic journals across the humanities, social sciences, and sciences, as well as select monographs

EMANI electronic mathematical archiving network (Cornell, SUB Göttingen, MathDoc, Tsinghua University Library)

RusDML Russian DML (2 000 000 pages of papers in journals covered by Zentralblatt MATH)

…

DML-CZ Digital Mathematical Library of mathematical literature published in the Czech Republic and Slovakia

Page 6:

The occasion

R&D programme Information Society funded by the Academy of Sciences

project DML-CZ: Czech Digital Mathematics Library, 2005–2009

Page 7:

Partners

Institute of Mathematics AS CR, Praha (J. Rákosník) – coordinator, material selection, copyright, mathematical supervision

Institute of Computer Science, Masaryk University, Brno (M. Bartošek, P. Kovář, M. Šárfy, V. Krejčíř) – content management system, metadata Q/A, long-term archiving

Faculty of Informatics MU, Masaryk University, Brno (P. Sojka) – formats and tools, technical coordination, information retrieval, indexing

Faculty of Mathematics and Physics, Charles University, Praha (O. Ulrych, J. Veselý) – harvesting and adjusting metadata

Library AS CR, Praha (M. Lhoták, M. Duda, A. Ryšánková) – document scanning, adjustment and OCR in the Digitization Centre Jenštejn

Jenštejn

Page 8:

The aim

journals for mathematical research and education, including Mathematica Slovaca

conference proceedings

monographs, textbooks

altogether about 200 000 pages

Page 9:

Journals

Title retro (scan) retro-born

Czechoslovak Mathematical Journal 1951-1991 1992-2008

Aplikace Matematiky / Applications of Mathematics 1956-1993 1994-2008

Archivum Mathematicum, Brno 1965-1991 1992-2007

Commentationes Mathematicae Universitatis Carolinae 1960-1990 1991-2008

Kybernetika 1965-1997 1998-2008

Časopis pro pěstování matematiky a fysiky 1872-1950

Časopis pro pěstování matematiky 1951-1990

Mathematica Bohemica 1991-2008

Acta Univ. Palackianae Olomucensis. Mathematica 1960-2008

Acta Mathematica et Informatica Univ. Ostraviensis 1993-2003

Acta Mathematica Univ. Ostraviensis 2004-2008

Mathematica Slovaca 1951-2008

Matematika-Fyzika-Informatika 1991-2008

Pokroky matematiky, fyziky a astronomie 1956-2008

Page 10:

Journals - pilot part launched on 11th June 2008


Page 11:

Workflow overview

Page 12:

Preparation

selection of titles – quality of content, historical value

preparation – acquisition of documents for scanning, content survey

copyright – negotiation with publishers or authors

Page 13:

Scanning

parameters – 600 dpi, 4-bit depth

scanning facilities – Digibook RGB 10000, A1 color book scanner, and two book scanners Zeutschel OS 7000, A2 B/W

software – BookRestorer to make the scanned pages uniform (white space around text body, …); Sirius system for archival storage of scans (put on CDs as TIFFs)
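
To get a feel for what the scanning parameters above mean for archival storage, the following sketch estimates the uncompressed size of a single scanned page. The 600 dpi resolution and 4-bit depth are taken from the slide; the function name and the page dimensions passed to it are assumptions for illustration only, since the slides do not state page formats.

```python
MM_PER_INCH = 25.4


def raw_page_bytes(width_mm, height_mm, dpi=600, bits_per_pixel=4):
    """Estimate the uncompressed size in bytes of one scanned page.

    width_mm and height_mm are hypothetical page dimensions; dpi and
    bits_per_pixel default to the scanning parameters from the slide.
    """
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to whole bytes.
    return (total_bits + 7) // 8


if __name__ == "__main__":
    # Assumed B5-sized journal page (176 mm x 250 mm), illustration only.
    size = raw_page_bytes(176, 250)
    print(f"about {size / 1e6:.1f} MB per page before compression")
```

Multiplying the per-page figure by the planned page count gives a rough upper bound on raw storage, before any TIFF or JBIG2 compression is applied.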

Page 14:

Optical Character Recognition

text OCR by two-phase DML-OCR implemented with ABBYY FineReader SDK 8.1

errors in maths reading → methods for separation of text OCR and mathematics OCR

maths: Infty system (Suzuki et al., Japan)

layout analysis

character recognition

structure analysis of math. expressions

manual error correction

multilayer PDF with several OCR layers (text, math in TeX, math in MathML or OMDoc)

99 %+ accuracy for text, 96 %+ for mathematics
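
The slides do not say how these accuracy figures were measured. One common convention, assumed here purely for illustration, is character accuracy: one minus the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The function names below are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Character accuracy of ocr_output against a trusted reference text."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))
```

Under this assumed measure, one wrong character in a 100-character line corresponds to 99 % accuracy.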

Page 15:

Metadata and Image Enhancement/Processing

metadata standards – choice of standards (DC, MODS, METS are supported by DSpace)

metadata acquisition – Zbl/MR, OCR tagging, (retyping)

image enhancements – TIFF, PDF, jbig2 compression as a measure of quality

semantic processing – document markup enhancement, document classification, citation linking, document clustering, indexing

References and fulltexts are metadata as well; English titles and MSC mandatory. OAI-PMH export.
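
As an illustration of what a Dublin Core (DC) record for export might look like, here is a minimal sketch that builds an `oai_dc` element with Python's standard library. The two namespace URIs are the standard ones for unqualified Dublin Core in OAI-PMH; the function name, the choice of fields, and the "MSC:" prefix in `dc:subject` are assumptions, not the project's actual schema. The check at the top mirrors the rule that English titles and MSC codes are mandatory.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)


def dublin_core_record(title_en, creators, msc_codes, language):
    """Serialize a hypothetical article description as an oai_dc record."""
    if not title_en or not msc_codes:
        raise ValueError("English title and at least one MSC code are required")
    root = ET.Element(f"{{{OAI_DC}}}dc")
    ET.SubElement(root, f"{{{DC}}}title").text = title_en
    for name in creators:
        ET.SubElement(root, f"{{{DC}}}creator").text = name
    for code in msc_codes:
        # Storing MSC codes in dc:subject is an assumed convention.
        ET.SubElement(root, f"{{{DC}}}subject").text = f"MSC: {code}"
    ET.SubElement(root, f"{{{DC}}}language").text = language
    return ET.tostring(root, encoding="unicode")
```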

Page 16:

Metadata Editor

metadata creation & DL integration

developed in Brno for DML-CZ

web-based application

web interface

suite of scripts

files in directories

internal database

Page 17:

Storage, indexing

space – multiple OCR, multiple attribute layers (lemmas, reviewer comments, semantic classifications, etc.); storing and indexing all of this for the entire mathematical literature to date poses no problem

software – client/server architecture, Lucene indexing software (OSS)
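
The idea of indexing several attribute layers per document can be sketched with a toy inverted index. This is not the Lucene API and not the project's implementation; the class, its methods, and the layer names are hypothetical and only illustrate how one document can be searchable through more than one layer.

```python
from collections import defaultdict


class LayeredIndex:
    """Toy inverted index keyed by (layer, term); illustration only."""

    def __init__(self):
        self._postings = defaultdict(set)  # (layer, term) -> set of doc ids

    def add(self, doc_id, layer, text):
        """Index whitespace-separated, lowercased terms of text in a layer."""
        for term in text.lower().split():
            self._postings[(layer, term)].add(doc_id)

    def search(self, layer, query):
        """Return ids of documents containing every query term in the layer."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self._postings.get((layer, terms[0]), set()))
        for term in terms[1:]:
            result &= self._postings.get((layer, term), set())
        return result
```

For example, a query for "space" against a lemma layer can match an article whose surface text only contains "spaces".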

Page 18:

Document Markup Enhancement Methods

context-dependent mapping from visual to logical markup

algorithms of language identification (bi-gram, tri-gram based, paragraph or even sentence level)

document classification, metrics, ontology construction, comparison with AMS 2000 classification

semiautomatic bibliography markup and metrics, global mathematics citation index, “MathRank”

document clustering (for visualization, …), identification of near duplicates
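
Tri-gram based language identification, mentioned in the list above, can be sketched as follows. This is a generic character tri-gram model with add-one smoothing, not the algorithm actually used in DML-CZ; the function names are hypothetical, and in practice the profiles would be trained on much larger samples of each language than a single sentence.

```python
import math
from collections import Counter


def trigrams(text):
    """Character tri-grams of lowercased text, padded at both ends."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Build one tri-gram count profile per language.

    samples maps a language code to training text for that language.
    """
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}


def identify(text, profiles):
    """Return the language whose profile gives text the highest log-likelihood."""
    best_lang, best_score = None, -math.inf
    for lang, counts in profiles.items():
        total = sum(counts.values())
        vocab = len(counts) + 1  # one extra slot for unseen tri-grams
        score = sum(math.log((counts[g] + 1) / (total + vocab))
                    for g in trigrams(text))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```

Because the model scores short character windows, it can be applied at paragraph or even sentence level, which matters for articles that mix languages in titles, abstracts, and body text.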

Page 19:

Presentation

delivery – customised digital library system DSpace (open source, created at MIT) for final articles delivery, search; Manakin interface

planned visualization techniques – “lost in hyperspace fear”, visualization of document clustering, Visual Browser (different user's eyes)

Page 20:

Delivery

web portal – unique and persistent URLs: Digital Object Identifier DOI (PURL, URN? …)

interfaces to other services – OAI-PMH harvesting, BibTeX export, Googlebot optimization

indexing, search relevance – Lucene, customized for maths (Experiments with Manatee and EDBM-2 (Zbl, NUMDAM))?
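
A BibTeX export of the kind listed above could look roughly like the following sketch. The function name, the fixed `@article` entry type, and the sample field names are assumptions for illustration; the real export would need to escape special characters and map the library's own metadata fields.

```python
def to_bibtex(key, fields):
    """Render a dict of field names and values as a BibTeX @article entry.

    Fields with empty values are skipped. No escaping is done, so this
    is only a sketch of the output format.
    """
    lines = [f"@article{{{key},"]
    for name, value in fields.items():
        if value:
            lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical record, not taken from the library.
    print(to_bibtex("example2008", {
        "author": "A. Author",
        "title": "An Example Title",
        "journal": "Example Journal",
        "year": "2008",
    }))
```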

Page 21:

Further problems and questions

paper classification

automated MSC experiment

automated MSC learning

metadata from born-digital documents

search

OCR systems

OCR XML postprocessing

maths OCR

Page 22:

Possibilities