Jan 14, 2016
1 herbert van de sompel
CS 502 Computing Methods for Digital Libraries
Cornell University – Computer Science
Herbert Van de Sompel
Lecture 5: A research perspective on Digital Libraries
DL Ancestry
[Figure 3: Digital Library System Ancestry. A timeline from 1991 to 1994 showing CORE, DLI, Physics e-Print Server, UCSTRI, CS-TR, NCSTRL, WATERS, LTRS (TRSkit), NTRS, STELAR and ADS, with a "Current Status" column that marks seven systems as "Still In Use". Annotations in the figure: "Still operational, but no longer being developed."; "Has also branched into many sub-fields of Physics, as well as Mathematics and Chemistry."; "Other databases spun off (Physics / Geophysics, Space Instrumentation)".]
URLs to some of these DLs
• ADS: http://adswww.harvard.edu/
• NCSTRL: http://www.ncstrl.org
• UCSTRI: http://www.cs.indiana.edu:800/cstr/cover.html
• arXiv: http://arXiv.org
• LTRS: http://techreports.larc.nasa.gov/ltrs/
• NTRS: http://techreports.larc.nasa.gov/cgi-bin/NTRS
DL Architectural Review
Assumptions made in this perspective
– things start with TCP/IP connectivity
– distribute full content (reports, software, etc.)
  • not only metadata
DL Architecture History approach 1
1. Build a special client and server (generally using Motif/X11, Tcl/Tk, etc.), and use TCP/IP as the transport protocol only
• pros: rich functionality
• cons: high development cost, client distribution problem
• observation: many of these projects spent more time building the interfaces, protocols, searching, etc. than populating their DL!
DL Architecture History approach 2
2. Use standard protocols built upon TCP/IP: SMTP, FTP, Gopher, WAIS, HTTP, etc.
• con: less functionality (restricted by protocol)
• pros: less development cost, uses commonly available clients
• observation: this approach is now the most common
• The DLs in the DL Ancestry figure (slide 2) fit into this category
Early TCP/IP DLs
• A very old one: IETF, http://www.ietf.org/
– Internet RFCs
– Very first TCP/IP DL?
Early TCP/IP DLs
• Netlib
– http://www.netlib.org/
– begun in 1985, distributing mathematical software via e-mail (SMTP)
– other access methods and protocols added (ftp, X11 client, http)
Netlib 1995
Netlib 2001
Los Alamos arXiv
• Physics pre-print server
– http://xxx.lanl.gov/ == http://arXiv.org
– begun in 1991 as an e-mail service to exchange TeX source of pre-prints in high energy physics
– ftp, http access added shortly afterwards
– Now THE communication channel in Physics
– Started by Paul Ginsparg
Characteristics of early TCP/IP, non-HTTP DLs
• Useful – could get the “thing” that you were looking for
• Constrained by transport protocol
– SMTP, FTP, etc. interface inherently “clunky”
– Higher level services such as searching, sophisticated browsing, etc. difficult to implement
• Small scale
– would the same systems work well if the holdings went from 100’s or 1000’s to millions?
Characteristics of early TCP/IP, HTTP DLs
• Initial HTTP implementations / conversions pretty much provided incremental steps in DL improvement
– a “nice” ftp interface, maybe with better searching and browsing
– but the nature of the DLs changed little
• LTRS is an example of an HTTP DL that is really: FTP + Searching (WAIS) + Browsing
– http://techreports.larc.nasa.gov/ltrs/
• Also check out the user interface of http://arXiv.org
Early TCP/IP, HTTP DLs
• But HTTP is a very general transport protocol, and it is possible to build even higher level protocols on top of it
• Combine this with the expressive HTTP client (web browser), and there is a lot of potential
• Dienst
– (http://www.ncstrl.org/Dienst/htdocs/Info/protocol4.html)
– builds an actual DL protocol on top of HTTP
  • 1994: the first to do so?
• Open Archives Initiative: metadata harvesting protocol on top of HTTP
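To make "a protocol on top of HTTP" concrete, here is a minimal sketch of how a harvesting request can be expressed as an ordinary HTTP GET URL. The `verb` and `metadataPrefix` parameter names follow the published OAI metadata harvesting protocol; the base URL and the helper name `harvest_url` are hypothetical and not taken from the slides.

```python
from urllib.parse import urlencode


def harvest_url(base_url, verb, **arguments):
    """Build a GET URL: the protocol verb and its arguments travel as query parameters."""
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


# Hypothetical repository address, used for illustration only.
url = harvest_url(
    "http://repository.example.org/oai",
    "ListRecords",
    metadataPrefix="oai_dc",
)
print(url)
```

The design point is that HTTP only carries the request; the meaning of the verb and its arguments is defined by the DL protocol layered above it.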
Sophistication increases, tracks meet
[Figure: sophistication plotted against time for two tracks, a research track and a library automation track. Labelled stages: ftp / gopher; http (LTRS, e-print, Netlib, etc.); http (Dienst).]
A Framework for Distributed Digital Object Services
Kahn/Wilensky Framework [Kahn 1995]
• 1995
• A high level document
• Almost a definition of key concepts, terminologies, … for next generation DLs
• Foundation for a research discipline?
• Not detailed enough to be a real architecture.
• Architecture is independent of the type of data stored in the DL
KWF: key terms
• digital object (do)
– A do is a data structure that contains
  • Digital data; data is typed (cf. MIME)
  • Persistent Key Metadata; especially handle
  • Other metadata (for instance Terms and Conditions)
• handle
– a handle is a unique, persistent name for a do
• repository
– The place where do’s live
– Has unique global name
• Repository Access Protocol (RAP)
– To deposit/access do’s in repositories
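The key terms above can be pictured as a small data structure. This is only a sketch under stated assumptions: the field names, the handle string and the use of a MIME type as the type label are illustrative choices, since KWF itself does not prescribe a concrete representation.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    handle: str      # key metadata: unique, persistent name of the do
    data: bytes      # the digital data itself
    data_type: str   # type label (the slides compare typing to MIME)
    other_metadata: dict = field(default_factory=dict)  # e.g. terms and conditions


# Hypothetical values, for illustration only.
report = DigitalObject(
    handle="example.authority/tr-001",
    data=b"contents of a technical report",
    data_type="application/postscript",
    other_metadata={"terms_and_conditions": "read: anyone"},
)
print(report.handle, report.data_type)
```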
KWF: flow
[Figure: KWF flow diagram. Read as a sequence, its labels say:]
• An originator makes a digital object, which consists of data and key-metadata (the handle)
• The handle comes from a handle generator
• The do can go in a repository, at which point the do becomes a stored do
• The repository registers the do’s handle with a handle server, at which point the do becomes a registered do
• The repository keeps a properties record per do (key metadata: handle; other metadata: terms and conditions) and a transaction record per do
• Do’s are accessed/deposited in repositories by means of the Repository Access Protocol
• What the client receives as a result of an access to a do is a dissemination
Digital objects
• do = data + key-metadata
– data is typed; core types include:
  • bit-sequence / set-of-bit-sequences
  • digital-object / set-of-digital-objects
  • handle / set-of-handles
– other types can be defined, and registered with a global type registry
  • definition and registration left undefined
  • roughly similar to MIME
– key-metadata includes handle
– possibly other metadata (left undefined in KWF)
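As a rough illustration of the typing idea, the sketch below seeds a registry with the six core type names listed on this slide and lets further names be added. KWF leaves definition and registration undefined, so the `TypeRegistry` class, its methods and the extra type name are purely hypothetical.

```python
# Core type names as listed on the slide.
CORE_TYPES = {
    "bit-sequence", "set-of-bit-sequences",
    "digital-object", "set-of-digital-objects",
    "handle", "set-of-handles",
}


class TypeRegistry:
    """Hypothetical stand-in for the global type registry."""

    def __init__(self):
        self._types = set(CORE_TYPES)

    def register(self, type_name):
        self._types.add(type_name)

    def is_known(self, type_name):
        return type_name in self._types


registry = TypeRegistry()
registry.register("technical-report")  # an invented, non-core type
print(registry.is_known("handle"), registry.is_known("technical-report"))
```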
Digital objects
• Composite do’s:
– a do with data of type digital-object
– non-composite do’s are elemental do’s
– composite do’s can, for instance, be used to collect similar works together
  • the composite do then contains a do for each work of Shakespeare...
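The composite/elemental distinction maps naturally onto a recursive structure. The following is a minimal sketch, assuming a composite do simply holds a list of other do's as its data; the class layout, handles and sample texts are illustrative and not part of KWF.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class DigitalObject:
    handle: str
    data: Union[bytes, List["DigitalObject"]]  # a list of do's makes this a composite do

    def is_composite(self) -> bool:
        return isinstance(self.data, list)


def elemental(do):
    """Yield every elemental do reachable from `do`, depth first."""
    if do.is_composite():
        for part in do.data:
            yield from elemental(part)
    else:
        yield do


# Hypothetical handles, for illustration only.
hamlet = DigitalObject("example/hamlet", b"text of Hamlet")
macbeth = DigitalObject("example/macbeth", b"text of Macbeth")
works = DigitalObject("example/shakespeare", [hamlet, macbeth])
print([do.handle for do in elemental(works)])
```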
Changing digital objects
• Mutable do’s can be changed once placed in a repository
– key-metadata cannot be changed
– the do’s handle never changes!
• Immutable do’s cannot be changed once placed in a repository
– however, they can be deleted
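The rules for mutable and immutable do's can be sketched as checks inside a toy repository. All names here (`Repository`, `replace_data`, `ImmutableObjectError`) are invented for the illustration; the only behaviour taken from the slide is that data of a mutable do may change, the handle never changes, and an immutable do can still be deleted.

```python
class ImmutableObjectError(Exception):
    """Raised when a change is attempted on an immutable do."""


class Repository:
    def __init__(self):
        self._store = {}  # handle -> (data, mutable flag)

    def deposit(self, handle, data, mutable):
        self._store[handle] = (data, mutable)

    def replace_data(self, handle, new_data):
        _, mutable = self._store[handle]
        if not mutable:
            raise ImmutableObjectError(handle)
        # Only the data changes; the handle (key-metadata) stays the same.
        self._store[handle] = (new_data, mutable)

    def delete(self, handle):
        # Deletion is allowed for mutable and immutable do's alike.
        del self._store[handle]

    def data(self, handle):
        return self._store[handle][0]

    def holds(self, handle):
        return handle in self._store


repo = Repository()
repo.deposit("example/draft", b"version 1", mutable=True)
repo.deposit("example/final", b"published", mutable=False)
repo.replace_data("example/draft", b"version 2")
```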
Handles
• Guest lecture by Professor Arms 02/19
Repositories
• A network accessible storage system in which digital objects may be stored for possible subsequent access or retrieval
• A stored do is a do that resides in a repository
• A registered do is a do that the repository has registered with a handle server
– storing and registering can be the same or different processes
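The difference between a stored do and a registered do can be shown with two small classes. This is a sketch, assuming the handle server simply maps a handle to the name of the repository that registered it; the class and method names are hypothetical.

```python
class HandleServer:
    def __init__(self):
        self._locations = {}  # handle -> name of the registering repository

    def register(self, handle, repository_name):
        self._locations[handle] = repository_name

    def resolve(self, handle):
        return self._locations.get(handle)  # None if the handle is not registered


class Repository:
    def __init__(self, name, handle_server):
        self.name = name
        self.handle_server = handle_server
        self.stored = {}  # handle -> data

    def store(self, handle, data, register=True):
        """Store a do; registering can happen in the same step or later."""
        self.stored[handle] = data
        if register:
            self.register(handle)

    def register(self, handle):
        self.handle_server.register(handle, self.name)


server = HandleServer()
repo = Repository("repo-a", server)

repo.store("example/one", b"first", register=False)  # stored, not yet registered
before = server.resolve("example/one")
repo.register("example/one")                          # now a registered do
after = server.resolve("example/one")
print(before, after)
```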
Repositories
• A repository keeps a properties record for each do
– contains key-metadata and any other metadata the repository chooses to keep
• A do may have a transaction record associated with it in a repository
Repository Access Protocol
• “Protocol” may be misleading; it’s really just the concept for a protocol
• RAP is designed to be simple; higher level services should come from other protocols
• KWF defines 3 basic operation classes:
– ACCESS_DO [metadata; key-metadata, digital object]
  • A dissemination of a do is the result of a request to access a do
– DEPOSIT_DO [metadata; key-metadata, digital object]
– ACCESS_REF
  • this is a means to tell the world about other ways (protocols) to access do’s in the repository.
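Because RAP is only a concept, any code for it is necessarily an interpretation. The sketch below models the three operation classes as methods on an in-memory repository that also keeps a properties record and a transaction record per do. Method names, record layouts and the advertised service entry are assumptions made for the illustration.

```python
class Repository:
    def __init__(self, name, other_services=None):
        self.name = name
        self._objects = {}       # handle -> data
        self._properties = {}    # handle -> properties record
        self._transactions = {}  # handle -> transaction record (list of events)
        self._other_services = list(other_services or [])

    def deposit_do(self, handle, data, metadata=None):
        """DEPOSIT_DO: place a do in the repository."""
        self._objects[handle] = data
        self._properties[handle] = {"handle": handle, **(metadata or {})}
        self._transactions[handle] = ["DEPOSIT_DO"]

    def access_do(self, handle):
        """ACCESS_DO: the returned value plays the role of a dissemination."""
        self._transactions[handle].append("ACCESS_DO")
        return {
            "handle": handle,
            "data": self._objects[handle],
            "properties": dict(self._properties[handle]),
        }

    def access_ref(self):
        """ACCESS_REF: list other ways (protocols) to reach do's in this repository."""
        return list(self._other_services)

    def transaction_record(self, handle):
        return list(self._transactions[handle])


# Hypothetical repository and service description.
repo = Repository("repo-a", other_services=[{"protocol": "HTTP", "location": "http://repository.example.org/"}])
repo.deposit_do("example/tr-001", b"report", {"terms_and_conditions": "read: anyone"})
dissemination = repo.access_do("example/tr-001")
print(dissemination["properties"], repo.transaction_record("example/tr-001"))
```

Keeping the interface this small mirrors the slide's point that higher level services (search, browsing) belong in other protocols, which ACCESS_REF can advertise.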
Terms and Conditions
• TC are attached to:
– each do
– each dissemination
– each repository
• TC are a precondition for any operation on the above
• Repositories are responsible for enforcing TC
Terms and Conditions
[Figure 1 from 95 TR-1593: a repository, a digital object and a dissemination, each with its own terms and conditions; the digital object and the dissemination each hold data. The boxes are connected by associations labelled with multiplicities 1 and N.]
Digital Objects: Terms and Conditions
• Set by originator and/or repository
• Can be arbitrarily complex, but generally consist of:
– permissions: read, write, etc.
– authentication - person, group, etc.
– payment
– 3rd party intervention (possibly in support of the above)
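One way to read "terms and conditions are a precondition for any operation" is as a check that must pass at every level before the repository acts. The sketch below covers only the permissions part of the list above, using group membership as a stand-in for authentication. The dictionary layout, the function names and the rule that an operation not mentioned in a set of terms is unrestricted are all assumptions, not something KWF specifies.

```python
def allows(terms, operation, groups):
    """Check one set of terms: does it allow `operation` for any of `groups`?"""
    allowed_groups = terms.get(operation)
    if allowed_groups is None:
        return True  # assumption: operations not mentioned are unrestricted
    return bool(allowed_groups & groups)


def may_perform(operation, groups, *terms_by_level):
    """The operation goes ahead only if the terms at every level allow it."""
    return all(allows(terms, operation, groups) for terms in terms_by_level)


# Hypothetical terms for a repository and for one do held in it.
repository_terms = {"read": {"public"}, "write": {"staff"}}
object_terms = {"read": {"public"}, "write": {"author"}}

print(may_perform("read", {"public"}, repository_terms, object_terms))
print(may_perform("write", {"staff"}, repository_terms, object_terms))
```

Payment and third-party intervention are left out of the sketch; they would need state beyond a simple membership test.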
Readings
• Kahn, R. & Wilensky, R. 1995. A Framework for Distributed Digital Object Services. http://WWW.CNRI.Reston.VA.US/home/cstr/arch/k-w.html
• Arms, W.Y. 1995. Key Concepts in the Architecture of the Digital Library. In: D-Lib Magazine. http://www.dlib.org/dlib/July95/07arms.html
• VanHeyningen, M. 1994. The Unified Computer Science Technical Report Index: Lessons in indexing diverse resources. http://www.cs.indiana.edu/ucstri/paper/paper.html