1
CS 502: Computing Methods for Digital Libraries
Lecture 28
Current work in preservation
2
Administration
Review class
• Tuesday, 12:20. Room to be announced on web site "Notices".
• Format: questions (by you) and answers (by me).
Laptops
• Return before examination. Bring receipt to examination.
Examination
• Part 1: 5 questions, 1.5-hour time limit
• Part 2: nomad experiment questionnaire, no time limit
3
Education and research
Digital libraries in a state of flux:
• Much of this class has described material that is still experimental
• Cornell people and our colleagues are actively involved in many aspects
This class:
• Recent activities in preservation of materials on the web
• Some of my recent work
4
Some light reading
William Y. Arms, "Preservation of scientific serials: three current examples." Journal of Electronic Publishing, 5(2), December 1999. http://www.press.umich.edu/jep/05-02/arms.html
William Y. Arms, "Economic models for open-access publishing." iMP, March 2000. http://www.cisp.org/imp/march_2000/03_00arms.htm
5
Preservation of serials
September 1999 -- Workshop chaired by Deanna Marcum, Don Waters, Cliff Lynch
Issues in preserving online journals for 100 years
Invited paper by William Arms
"Preservation of Scientific Serials: Three Current Examples"
• ACM Digital Library
• Internet RFC Series
• D-Lib Magazine
Motivated by realization that early preservation work may be tackling the wrong problem
6
Publisher's role in preservation
Life cycle of electronic publication
1. Active management by publisher
2. Long-term preservation by another organization
Overall observation
• The length of #1 may be very short or hundreds of years
• The most vulnerable time is the transition between #1 and #2
Preservation discussions have emphasized #2 (e.g., 5-level model)
7
ACM Digital Library
Organizational
• ACM is a stable organization that considers the Digital Library one of its principal assets
Rights
• ACM either owns copyright or has full preservation rights
Technical
• Complex: relational database (schema), SGML (DTD), rendering software, private metadata system
• Strong computing department
Replication
• No independent mirrors
8
Internet RFC Series
Organizational
• Complex relationship between Internet Society (ISOC), Internet Engineering Task Force (IETF) and RFC editor. Currently actively managed, but no long-term commitment
• Secretariat & RFC editor -- income from meetings & grants
Rights
• ISOC and IETF have very broad rights
Technical
• Simple: text only (a few PostScript)
Replication
• Several independent mirrors
9
D-Lib Magazine
Organizational
• Published by CNRI, reliant on grants.
Rights
• Authors own rights in articles. CNRI owns rights in other materials.
Technical
• Simple: uses basic web technology.
• Used for experiments in DOIs, XML metadata, etc.
Replication
• Several independent mirrors
10
Approaches to preservation of the web
Partnership with publishers
Publishers and libraries as partners
Selective collection of open access web
Librarianship in a new domain
Bulk collection of open access web
Automatic librarianship
11
Partnerships with publishers
Library of Congress and UMI
• US theses and dissertations
American Physical Society and Cornell University
• Journals in physics
Elsevier Science
• Policy statement on archiving
12
Approaches to preservation of the web
Partnership with publishers
Publishers and libraries as partners
Selective collection of open access web
Librarianship in a new domain
Bulk collection of open access web
Automatic librarianship
Cornell and Library of Congress
13
Selective preservation
Selection of web sites
Example: National Library of Australia
• national importance
• multiple versions (print and online)
• authority and research value
14
Selection of web sites
Pragmatic considerations
• technical complexity
-- not all standards are good
• frequency of making copies
• COST
Librarianship in a new domain
15
Catalogs and indexes
Example: CORC
• simple standard using Dublin Core
• tools for creating records
• COST
Librarianship in a new domain
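As a rough illustration of what a simple Dublin Core record might look like, the sketch below builds one with Python's standard library. The element names (title, creator, date, identifier) and the namespace belong to the Dublin Core element set. The `make_record` helper and the enclosing `record` element are hypothetical, and this is not CORC's actual record format. The sample values come from the reading list earlier in this lecture.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def make_record(fields):
    """Build an XML element holding one Dublin Core child per (name, value) pair.

    `make_record` and the wrapping <record> element are illustrative
    assumptions, not part of CORC or of any Dublin Core specification.
    """
    record = ET.Element("record")
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


record = make_record([
    ("title", "Preservation of scientific serials: three current examples"),
    ("creator", "William Y. Arms"),
    ("date", "1999-12"),
    ("identifier", "http://www.press.umich.edu/jep/05-02/arms.html"),
])

# Serialize the record so it can be stored or exchanged as text.
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Even a record this small takes human effort to create and check, which is one reason COST appears on the slide.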
16
Bulk collection: automatic librarianship
Volumes of information are too great for human selection, indexing and management
Examples:
• Kulturarw3 -- National Library of Sweden
• Internet Archive -- Brewster Kahle
Automatic methods are used to collect, organize and provide access
17
Automatic librarianship
Collection
Example: Internet Archive
• Collecting open access web since 1996
• Complete sweep of web approximately once a month
• HTML pages only
• 14 terabytes of data (soon all online)
• access for researchers using Unix tools
• 7 people
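To make the idea of an automatic sweep that keeps HTML pages only more concrete, here is a minimal sketch of a breadth-first collector. It is not the Internet Archive's crawler. The `sweep` function, the `fetch` callback, and the in-memory `FAKE_WEB` are all assumptions made for illustration. A real collector would also need politeness delays, robots.txt handling, and storage far beyond a dictionary.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def sweep(seed, fetch, limit=1000):
    """Breadth-first sweep from `seed`, keeping HTML pages only.

    `fetch(url)` is a caller-supplied function (an assumption of this sketch)
    returning (content_type, body), or None if the URL cannot be retrieved.
    """
    archive = {}
    seen = {seed}
    queue = deque([seed])
    while queue and len(archive) < limit:
        url = queue.popleft()
        result = fetch(url)
        if result is None:
            continue
        content_type, body = result
        if not content_type.startswith("text/html"):
            continue  # skip images, PDFs and other non-HTML material
        archive[url] = body
        parser = LinkExtractor()
        parser.feed(body)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return archive


# A tiny in-memory stand-in for the web, so the sketch runs without a network.
FAKE_WEB = {
    "http://example.org/": ("text/html", '<a href="a.html">A</a> <a href="logo.gif">logo</a>'),
    "http://example.org/a.html": ("text/html", '<a href="/">home</a> <a href="b.html#top">B</a>'),
    "http://example.org/b.html": ("text/html", "<p>No links here.</p>"),
    "http://example.org/logo.gif": ("image/gif", "GIF89a"),
}

collected = sweep("http://example.org/", FAKE_WEB.get)
print(sorted(collected))
```

Passing `fetch` in as a parameter keeps the traversal logic separate from network access, so the same loop can be exercised against a test fixture or a real HTTP client.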
18
Automatic librarianship
Indexing
Examples:
• ResearchIndex
19
Legal issues
Legal position of archives that download open access materials is unclear
• Preservation is in the national interest
• See the discussion in The Digital Dilemma (National Academy of Sciences, 1999)
• Crucial factor is economic impact on copyright owners
• Library of Congress has no special position except via copyright deposit
• U.S. Copyright Office has offered to help clarify the position
20
Current activities
Selection: guidelines and prototypes
• Library of Congress working group
• Political web sites
Tools
• Web site mirroring
• Web site profiler (M.Eng. project)
Copyright
• Ad hoc working group (Deanna Marcum, Bill Arms)
21
CS 502: Computing Methods for Digital Libraries
THE END