DML–CZ: asks and bids Jiří Rákosník, Institute of Mathematics AS CR, Praha Towards a European Virtual Library in Mathematics, Santiago de Compostela, 13.3.2009

Dec 18, 2015


DML–CZ: asks and bids

Jiří Rákosník, Institute of Mathematics AS CR, Praha

Towards a European Virtual Library in Mathematics, Santiago de Compostela, 13.3.2009


DML–CZ, a brief description

Digital Mathematics Library consisting of relevant mathematical literature published in the domain of the Czech Republic and Slovakia

Funding: R&D programme Information Society of the Academy of Sciences, 2005–2009


Partners

Institute of Mathematics AS CR, Praha (J. Rákosník) – coordinator, material selection, copyright, mathematical supervision

Institute of Computer Science, Masaryk University, Brno (M. Bartošek, P. Kovář, M. Šárfy, V. Krejčíř) – content management system, metadata Q/A, long-term archiving

Faculty of Informatics MU, Masaryk University, Brno (P. Sojka) – formats and tools, technical coordination, information retrieval, indexing

Faculty of Mathematics and Physics, Charles University, Praha (O. Ulrych, J. Veselý) – harvesting and adjusting metadata

Library AS CR, Praha (M. Lhoták, M. Duda, A. Ryšánková) – document scanning, graphical adjustment and OCR in the Digitization Centre Jenštejn


The scope

journals for mathematical research and education

conference proceedings

monographs, textbooks

altogether more than 200 000 pages


Journals

Title: periods covered (retro (scan) / retro-born-digital / born-digital)

Czechoslovak Mathematical Journal: 1951–1991, 1992–2008
Aplikace Matematiky / Applications of Mathematics: 1956–1993, 1994–2008
Archivum Mathematicum, Brno: 1965–1991, 1992–2007
Commentationes Mathematicae Universitatis Carolinae: 1960–1990, 1991–2008
Kybernetika: 1965–1997, 1998–2008
Časopis pro pěstování matematiky a fysiky: 1872–1950
Časopis pro pěstování matematiky: 1951–1990
Mathematica Bohemica: 1991, 1992–2008
Acta Univ. Palackianae Olomucensis. Mathematica: 1960–2003, 2004–2008
Acta Mathematica et Informatica Univ. Ostraviensis: 1993–2003
Acta Mathematica Univ. Ostraviensis: 2004–2008
Mathematica Slovaca: 1951–2008
Matematika–Fyzika–Informatika: 1991–2005, 2006–2009
Pokroky matematiky, fyziky a astronomie: 1956–2005, 2006–2009

timeline: 2008, 2009, 2010–

pages: 106 000, 133 000, 30 000+


Proceedings

Title: volumes

Equadiff: 11
Toposym: 10
Asymptotic Statistics: 4
Winter School Abstract Analysis: 33
Nonlinear Analysis, Function Spaces, Applications: 8
Function Spaces, Differential Operators, Nonlinear Analysis: 6

timeline: 2008, 2009, 2010–

pages: 7 750, 6 900


Monographs

Title: volumes

Bernard Bolzano Collection: 21
From the collection of The Royal Czech Society for Sciences: 15
Other monographs: 2

timeline: 2008, 2009, 2010–

pages: 4 500, 1 000


Content

multilingual: Czech, Slovak, Russian, English, German, French, Italian

text, drawings, photographs (B&W)

maths, physics, chemistry, education, reviews, personalia, politics


Inspiration

GDZ: technology for scanning, text adjustment, OCR

Cellule MathDoc, NUMDAM: DML, document enhancement, presentation, services


Scanning

parameters – 600 dpi, 4-bit depth

scanning facilities – Digibook RGB 10000, A1 color book scanner and two book scanners Zeutschel OS 7000, A2 B/W

software – BookRestorer to make the scanned pages uniform (graphical adjustment, white space around the text body etc.)

Sirius system for archival storage of scans (put on CDs as TIFFs)


Optical Character Recognition

text OCR by two-phase DML-OCR implemented with ABBYY FineReader SDK 8.1

errors in maths reading → methods for separation of text OCR and mathematics OCR

maths: Infty system (Suzuki et al., Japan) – layout analysis, character recognition, structure analysis of mathematical expressions, manual error correction

PDF with one OCR layer, multilayer PDF with several OCR layers (text, math in TeX, math in MathML or OMDoc)

99 %+ accuracy for text, 96 %+ for mathematics
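
The slide does not say how the accuracy figures were measured. As an illustration only, a common character-level measure compares the OCR output with a hand-corrected reference text using edit distance. The sketch below assumes that definition; the function names are hypothetical and are not part of DML-OCR, FineReader or Infty.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Assumed measure: 1 - edit distance / reference length, clamped at 0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))
```

Under this assumed measure, 99 % accuracy corresponds to roughly one character error per 100 characters of reference text.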


Metadata and image enhancement/processing

metadata standards – choice of standards (DC, MODS, METS are supported by DSpace), Unicode with TeX → possible conversion to MathML, maths standards rather than librarians’ standards

metadata acquisition – Zbl/MR, OCR tagging, (retyping)

image enhancements – TIFF, PDF, jbig2 compression as a measure of quality

semantic processing – document markup enhancement, document classification, citation linking, document clustering, indexing

references and fulltexts as part of metadata, English titles and MSC mandatory

OAI-PMH export trying to follow miniDML, T. Fischer etc.
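
To make the DC option concrete, here is a minimal sketch of how one article record could be serialized as unqualified Dublin Core in the oai_dc container used for OAI-PMH export. It is not the DML-CZ export code: the record values (title, author, MSC code) are invented for illustration, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for Dublin Core elements and the oai_dc container.
DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"


def build_oai_dc(record: dict) -> ET.Element:
    """Build an <oai_dc:dc> element from a mapping of DC field -> list of values."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field, values in record.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{field}")
            child.text = value
    return root


# Hypothetical record; every value here is illustrative, not taken from DML-CZ.
example = {
    "title": ["An illustrative article title"],
    "creator": ["Doe, Jane"],
    "subject": ["46E35"],  # an MSC code used as the subject, as an example
    "language": ["eng"],
}
xml_text = ET.tostring(build_oai_dc(example), encoding="unicode")
```

Repeatable fields such as creator or subject are modelled as lists, since Dublin Core allows each element to occur several times.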


Metadata Editor

metadata creation & DL integration

developed in Brno for DML-CZ

web-based application – web interface, suite of scripts, files in directories, internal database


Metadata Editor

input data loading

articles building

metadata editing

references processing

verification

pdf-compilation

export to DML-CZ


[Figure labels: pages to be excluded; article 1; article 2]


Indexing, storage

indexing – multiple OCR, multiple attribute layers (lemmas, reviewer comments, semantic classifications, etc.)

space – no problem to store and index that for all mathematics literature so far

software – client/server architecture, Lucene indexing software (OSS)


Presentation

delivery – customised digital library system DSpace (open source, created at MIT) for final articles delivery, search; Manakin interface planned

visualization techniques – “lost in hyperspace fear”, visualization of document clustering, Visual Browser (different user's eyes)


Delivery

web portal

unique and persistent URLs: PURL

interfaces to other services

OAI-PMH harvesting – necessary to set up the content for OAI-PMH

BibTeX export

Googlebot optimization of metadata
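
For readers unfamiliar with OAI-PMH harvesting, the sketch below shows how a harvester would form a ListRecords request against a repository. The base URL is a placeholder, not the DML-CZ endpoint. The argument names (verb, metadataPrefix, set, from) are those defined by the OAI-PMH 2.0 protocol. No network call is made.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None,
                     from_date: Optional[str] = None) -> str:
    """Compose an OAI-PMH ListRecords request URL (request building only)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec    # restrict to one set, e.g. a journal
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; a real harvester would use the repository's base URL.
url = list_records_url("https://example.org/oai", from_date="2009-01-01")
```

A repository can only answer such requests usefully once its records are exposed in at least the mandatory oai_dc format, which is one reading of the slide's remark that the content has to be set up for OAI-PMH.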


Further problems and questions

paper classification

automated MSC experiment

automated MSC learning

metadata from born-digital documents

search

OCR systems

OCR XML postprocessing

maths OCR


Bids

Metadata Editor

Applications for classification of publications

Document markup enhancement

algorithms of language identification (bi-gram, tri-gram based, paragraph or even sentence level)

Measuring mathematical similarity of publications

OCR experience (possibly capacity)

Adjusted metadata of high fidelity

Experience (both good and bad) in workflow conduct
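
The tri-gram based language identification mentioned above can be illustrated with a small sketch. This is not the DML-CZ algorithm. It assumes character trigram frequency profiles compared by cosine similarity, and it uses three one-sentence training samples invented for the example. A usable classifier would need far larger training texts per language.

```python
import math
from collections import Counter


def trigram_profile(text: str) -> Counter:
    """Count character trigrams of the lower-cased, space-padded text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(p: Counter, q: Counter) -> float:
    """Cosine similarity of two trigram count vectors."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = (math.sqrt(sum(c * c for c in p.values()))
            * math.sqrt(sum(c * c for c in q.values())))
    return dot / norm if norm else 0.0


# Toy training samples (assumptions for illustration only).
SAMPLES = {
    "en": "the theorem is proved in the following section of this paper",
    "cs": "věta je dokázána v následující části tohoto článku",
    "de": "der Satz wird im folgenden Abschnitt dieser Arbeit bewiesen",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def identify_language(text: str) -> str:
    """Return the language whose trigram profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

The same function can be applied to a whole paragraph or to a single sentence. Shorter inputs yield fewer trigrams and therefore less reliable decisions.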


Asks

Interlinking system (the EuDML core?)

Effective system for adjusting and standardizing scanned pages

Metadata standards and metadata conversion/export tools

Unified authority base, journal names abbreviations, …

Effective maths OCR


Asks

Coordinated effort/support in copyright issues

Directive 2001/29/EC on the harmonisation of certain aspects of copyright and related rights in the information society

Green Paper Copyright in the Knowledge Economy COM(2008) 466/3

Fifth Freedom in the single market: free movement of knowledge and innovation

ENCES (European Network for Copyright in support of Education and Science) http://www.ences.eu

moving wall

supporting Open Access activities


Asks

Document markup enhancement

context dependent mapping from visual to logical markup

document classification, metrics, ontology construction, comparison with MSC 2000 classification

semiautomatic bibliography markup and metrics, global mathematics citation index, “MathRank”

document clustering (for visualization, …), identification of plagiarism


Mathematician’s expectations

Reliability – rate of correspondence with the original document, persistency

Search – multilingual, reliable identification of authors, interlinking with Zentralblatt and Mathematical Reviews


Mathematician’s expectations

Copyright – free access / reasonable moving wall

User friendly services – citations export in BibTeX/AmsTeX format, interlinking between repositories, unified layout design

Sustainable development
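
As an illustration of the citation export listed under user friendly services, the following sketch turns a metadata record into a BibTeX entry. The helper name, citation key and field values are hypothetical, and the sketch does no escaping of TeX special characters, which a real exporter would need.

```python
def to_bibtex(key: str, fields: dict, entry_type: str = "article") -> str:
    """Format one record as a BibTeX entry; values are wrapped in braces as-is."""
    lines = [f"@{entry_type}{{{key},"]
    for name, value in fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical record used only to show the output shape.
entry = to_bibtex("doe2009", {"author": "Doe, Jane", "year": "2009"})
```

Keeping the formatter independent of the metadata store means an export in another format, such as AmsTeX, could be written as a second function over the same record.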