Cornell CS 502 Identifiers and Types CS 502 – 20020205 Carl Lagoze – Cornell University
Identifiers and Types

Jan 06, 2016

Page 1: Identifiers and Types

Cornell CS 502

Identifiers and Types

CS 502 – 20020205

Carl Lagoze – Cornell University

Page 2: Identifiers and Types

Identity Change Persistence

• Paradox: reality contains things that persist and change over time
  – Heraclitus and Plato: can you step into the same river twice?
  – Ship of Theseus: over the years, the Athenians replaced each plank in the original ship of Theseus as it decayed, thereby keeping it in good repair. Eventually, there was not a single plank left of the original ship. So, did the Athenians still have one and the same ship that used to belong to Theseus?

Page 3: Identifiers and Types

Identity Change Persistence

Page 4: Identifiers and Types

Identifiers

• Provide a key or handle linking abstract concepts to physical or perceptible entities

• Provide us with a necessary figment of persistence

• They are perhaps the one essential and common form of metadata

• Why bother?
  – Finding things
  – Referring to things
  – Asserting ownership over things

Page 5: Identifiers and Types

I have lots of identifiers

• Carl Jay Lagoze, Dad, Hey you
• 123-456-7890 (SSN)
• 1234-5678-1234-1234 (Visa Card)
• FZBMLH (US Airways locator on Jan 31 flight to San Diego)

Page 6: Identifiers and Types

Identifier Issues

• Location independence
• Global uniqueness
• Persistent across time
• Human vs. machine generation
• Machine resolution
• Administration (centralized vs. decentralized)
• Intrinsic semantics
• Type specific

Page 7: Identifiers and Types

Two common pre-digital identifiers

• ISBN (International Standard Book Number)
  – Uniquely identifies every monograph (book)
  – One ISBN for each format
    • HP & SS hardback 0590353403
    • HP & SS softcover 059035342X
  – Number is semantically meaningful (components)
  – International administration (>150 countries)
• ISSN (International Standard Serial Number)
  – Uniquely identifies every serial (not issue or volume)
  – Semantically meaningless
  – International administration
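
One of the meaningful components of a ten-character ISBN is its final check digit. The sketch below applies the standard modulus-11 check to the two ISBNs on this slide. The function name is illustrative, and the input is assumed to be unhyphenated.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Return True if a 10-character ISBN passes the modulus-11 check."""
    if len(isbn) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn):
        if char == "X" and position == 9:
            value = 10  # 'X' stands for 10 and is only legal as the check digit
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value  # weights run 10, 9, ..., 1
    return total % 11 == 0


print(is_valid_isbn10("0590353403"))  # hardback from the slide -> True
print(is_valid_isbn10("059035342X"))  # softcover from the slide -> True
```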

Page 8: Identifiers and Types

URI: Uniform Resource Identifier

• Generic syntax for identifiers of resources
• Defined by RFC 2396
• Syntax: <scheme>://<authority><path>?<query>
  – Scheme
    • Defines semantics of remainder of URI
    • ftp, gopher, http, mailto, news, telnet
  – Authority
    • Authority governing namespace for remainder of URI
    • Typically Internet-based server
  – Path
    • Identification of data within scope of authority
  – Query
    • String of information to be interpreted by authority
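
The four components can be seen by splitting a URI with Python's standard library. This is only an illustration: the host an.org appears later in these slides, while the path and query string here are made up.

```python
from urllib.parse import urlsplit

# Hypothetical URI; only the host name comes from the slides.
parts = urlsplit("http://an.org/docs/index.html?lang=en")

print(parts.scheme)  # http
print(parts.netloc)  # an.org  (the authority)
print(parts.path)    # /docs/index.html
print(parts.query)   # lang=en
```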

Page 9: Identifiers and Types

Why is RFC 2396 so big?

• Character encodings
• Partial and relative URIs
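
Both topics can be tried out with Python's standard library. The base URI and file names below are invented for illustration; the sketch shows a relative reference being resolved against a base and a space being percent-encoded.

```python
from urllib.parse import quote, urljoin

base = "http://an.org/docs/index.html"  # hypothetical base URI

# Relative references are resolved against the base URI.
sibling = urljoin(base, "figures/ship.png")
parent = urljoin(base, "../about.html")

# Characters outside the allowed set are percent-encoded.
escaped = quote("ship of Theseus.html")

print(sibling)  # http://an.org/docs/figures/ship.png
print(parent)   # http://an.org/about.html
print(escaped)  # ship%20of%20Theseus.html
```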

Page 10: Identifiers and Types

URL: Uniform Resource Locator

• String representation of the location for a resource that is available via the Internet
• Uses URI syntax
• The scheme defines the access (protocol) method; the client uses it to determine which protocol to “speak”.
  – http://an.org/index.html - open socket to an.org on port 80 and issue a GET for index.html
  – ftp://an.org/index.html - open socket to an.org on port 21, open ftp session, issue ftp get for index.html….
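
A rough sketch of how a client might turn the scheme into a connection plan. No network traffic is sent. The function name and the returned tuple are assumptions for illustration, and the FTP login exchange is left out; RETR is the protocol command behind an ftp client's "get".

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "ftp": 21}


def connection_plan(url: str) -> tuple[str, int, str]:
    """Return (host, port, retrieval command) implied by the URL's scheme."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    if parts.scheme == "http":
        command = f"GET {parts.path or '/'} HTTP/1.0"
    else:
        # After logging in, an FTP client retrieves a file with RETR.
        command = f"RETR {parts.path.lstrip('/')}"
    return parts.hostname, port, command


print(connection_plan("http://an.org/index.html"))
print(connection_plan("ftp://an.org/index.html"))
```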

Page 11: Identifiers and Types

URL Issues

• Persistence
• Location dependence
• Valid only at the item level
  – What about works, expressions, manifestations?
• Multiple resolution
  – “get the one that is cheapest, most reliable, most recent, most appropriate for my hardware, etc.”
• Non-digital resources?
• Disconnection from the entity

Page 12: Identifiers and Types

URC – Uniform Resource Characteristic (Catalog)

• Failed but interesting effort
  – Multiple resolution
  – Describe resource by its characteristics
• Provide adequate bundled information about a resource (metadata) to create an identification block for any given resource (including locations)
  – Exactly what is the common set of characteristics for describing different types of resources?
  – Where are these characteristics stored?
• Robust URLs – Berkeley
  – Characteristic of document or metadata is computed automatically via fingerprint of its content.
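
The slide does not say how the Berkeley work computes its fingerprint. As a stand-in, the sketch below hashes case- and whitespace-normalized text with SHA-256. This is an assumption chosen for illustration, not the Robust URLs method; the point is only that the identifying characteristic is derived from the content itself and not from its location.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Hypothetical fingerprint: SHA-256 over normalized text."""
    normalized = " ".join(text.lower().split())  # fold case, collapse whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


original = "Can you step into the same river twice?"
moved_copy = "Can  you step into\nthe same river twice?"  # same words, new layout

print(content_fingerprint(original) == content_fingerprint(moved_copy))  # True
```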

Page 13: Identifiers and Types

URN – Uniform Resource Name

• “globally unique, persistent names”
• Independence from location and location methods

<URN> ::= "urn:" <NID> ":" <NSS>

• NID: namespace identifier
• NSS: namespace-specific string
• examples:
  • urn:ISSN:1234-5678
  • urn:isbn:9044107642
  • urn:doi:10.1000/140
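
The grammar above can be applied with a two-way split on the colon. A minimal sketch, assuming only the simple three-part form shown on this slide; the function name is illustrative, and the NID is lower-cased because namespace identifiers are case-insensitive.

```python
def parse_urn(urn: str) -> tuple[str, str]:
    """Split 'urn:<NID>:<NSS>' into (NID, NSS)."""
    pieces = urn.split(":", 2)  # the NSS may itself contain colons
    if len(pieces) != 3 or pieces[0].lower() != "urn" or not pieces[1] or not pieces[2]:
        raise ValueError(f"not a URN: {urn!r}")
    return pieces[1].lower(), pieces[2]


for example in ("urn:ISSN:1234-5678", "urn:isbn:9044107642", "urn:doi:10.1000/140"):
    print(parse_urn(example))
```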

Page 14: Identifiers and Types

Handles: Names for Internet Resources

• Naming system for location-independent, persistent names

• http://www.handle.net

The resource named by a Handle can be:

• A library item
• A collection of library items
• A catalog record
• A computer
• An e-mail address
• A public key for encryption
• etc., etc., etc. ....

Page 15: Identifiers and Types

Syntax of Handles

<naming_authority>/<locally_unique_string>

or

hdl:<naming_authority>/<locally_unique_string>

Examples

10.1234/1995.02.12.16.42.21;9 (date-time stamp)

cornell.cs/cstr-94.45 (mnemonic name)

loc/a43v-8940cgr (random string)
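
A small sketch of the handle syntax on this slide: an optional hdl: prefix, a naming authority, a slash, and a locally unique string. The function name is illustrative, and splitting at the first slash is an assumption based only on the three examples shown.

```python
def parse_handle(handle: str) -> tuple[str, str]:
    """Split a handle into (naming authority, locally unique string)."""
    if handle.lower().startswith("hdl:"):
        handle = handle[4:]  # drop the optional scheme prefix
    naming_authority, separator, local_name = handle.partition("/")
    if not separator or not naming_authority or not local_name:
        raise ValueError(f"not a handle: {handle!r}")
    return naming_authority, local_name


print(parse_handle("10.1234/1995.02.12.16.42.21;9"))
print(parse_handle("cornell.cs/cstr-94.45"))
print(parse_handle("hdl:loc/a43v-8940cgr"))
```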

Page 16: Identifiers and Types

Example of a Handle and its Data Used to Identify Two Locations

Handle                   Data type   Handle data
loc.ndlp.amrlp/123456    URL         http://www.loc.gov/.....
                         RAP         loc/repository-1r4589
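
One way to picture this record is as a handle mapped to a list of typed values, from which a client selects the type it understands. The dictionary layout and function name are assumptions for illustration, not the Handle System's actual storage format; the elided URL is kept exactly as it appears on the slide.

```python
# Hypothetical in-memory model of the handle record shown on the slide.
handle_records = {
    "loc.ndlp.amrlp/123456": [
        ("URL", "http://www.loc.gov/....."),  # elided on the slide, kept as-is
        ("RAP", "loc/repository-1r4589"),
    ],
}


def values_of_type(handle: str, data_type: str) -> list[str]:
    """Return every value of the requested data type stored for a handle."""
    return [data for kind, data in handle_records.get(handle, []) if kind == data_type]


print(values_of_type("loc.ndlp.amrlp/123456", "RAP"))  # ['loc/repository-1r4589']
```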

Page 17: Identifiers and Types

Use of Handles in a Digital Library

[Diagram labels: Repository, Handle System, Search System, User interface]

Page 18: Identifiers and Types

Scalability and Caching

[Diagram labels: Client, Caching Server, Handle Servers, Hash, Cache, Hash table]
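
Only the labels of this diagram survive, so the following is one plausible reading and not a description of the actual Handle System: a hash of the handle picks one of several handle servers, and a caching layer remembers earlier answers. The server names, the choice of SHA-256, and the class are all hypothetical.

```python
import hashlib

HANDLE_SERVERS = ["server-0", "server-1", "server-2", "server-3"]  # hypothetical


def pick_server(handle: str) -> str:
    """Hash the handle so the same handle always maps to the same server."""
    digest = hashlib.sha256(handle.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(HANDLE_SERVERS)
    return HANDLE_SERVERS[index]


class CachingResolver:
    """Remember earlier lookups so repeated requests skip the handle servers."""

    def __init__(self, lookup):
        self._lookup = lookup
        self._cache = {}
        self.misses = 0

    def resolve(self, handle: str):
        if handle not in self._cache:
            self.misses += 1  # first request goes through to the lookup
            self._cache[handle] = self._lookup(handle)
        return self._cache[handle]


resolver = CachingResolver(pick_server)
first = resolver.resolve("cornell.cs/cstr-94.45")
second = resolver.resolve("cornell.cs/cstr-94.45")
print(first == second, resolver.misses)  # True 1
```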

Page 19: Identifiers and Types

Replication for Performance and Reliability

Example: the Global Handle System

Washington, DC

Los Angeles, CA

Page 20: Identifiers and Types

Global and Local Handle Servers

[Diagram labels: Global, Local Handle Servers]

Page 21: Identifiers and Types

Ways to Resolve Handles

I. Resolution by Program

Any program can resolve Handles by sending standard format messages to the Handle System.

A set of procedures, with Java and C versions, is available to link into application programs. They are known as the Handle Client Library.

Page 22: Identifiers and Types

Ways to Resolve Handles

II. Web Browsers

Browsers modified to recognize Handles. This requires installation of a Handle Extension.

1. Whenever the browser expects a URL, it will recognize "hdl:".

2. The Handle is passed to the Handle System, where it is resolved and a data item of type "URL" is returned.

Handle Extensions for Netscape and Internet Explorer are available for most versions of Windows.

Page 23: Identifiers and Types

Ways to Resolve Handles

III. Proxies

Any Web browser can resolve Handles, even with no extension, via a proxy. For example, the following URL can be used to resolve the Handle loc.ndlp.amrlp/3a16616:

http://hdl.handle.net/loc.ndlp.amrlp/3a16616
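
Building the proxy URL is plain string concatenation. A minimal sketch, assuming the hdl.handle.net proxy address shown above; the function name is illustrative, and handles containing characters that need URL escaping are not dealt with here.

```python
PROXY_PREFIX = "http://hdl.handle.net/"  # proxy address from the slide


def proxy_url(handle: str) -> str:
    """Turn a handle (with or without 'hdl:') into a proxy URL any browser can follow."""
    if handle.lower().startswith("hdl:"):
        handle = handle[4:]
    return PROXY_PREFIX + handle


print(proxy_url("loc.ndlp.amrlp/3a16616"))
# http://hdl.handle.net/loc.ndlp.amrlp/3a16616
```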

Page 24: Identifiers and Types

Proxy Resolution

[Diagram labels: WWW browser, HTTP server, Resource, Handle System, hdl.handle.net, Proxy server; arrows labelled "URL to Proxy", "URL", "URL"]

Page 25: Identifiers and Types

OCLC's Persistent URL (PURL)

• A PURL is a URL
  -> Is fully compatible with today's Internet browsers
  -> Users need no special software
• Has some of the desirable features of URNs
• Lacks some desirable features of URNs
  -> Resolves only to a URL
  -> Does not support multiple resolution
• Developed by OCLC
• Software openly available

http://www.purl.org

Page 26: Identifiers and Types

PURL Syntax

• A PURL is a URL.

• PURL resolvers use standard http redirects to return the actual URL.

http://purl.oclc.org/keith/home

  protocol: http
  resolver address: purl.oclc.org
  name: /keith/home
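
A sketch of what a PURL resolver does, with no real HTTP involved: look up the (resolver address, name) pair and answer with a redirect. The table contents and target URL are invented, and 302 is used here simply as one standard redirect status.

```python
from urllib.parse import urlsplit

# Hypothetical resolver table: (resolver address, name) -> current URL.
PURL_TABLE = {
    ("purl.oclc.org", "/keith/home"): "http://example.org/~keith/",
}


def resolve_purl(purl: str) -> tuple[int, dict[str, str]]:
    """Return (status code, headers) as a resolver might for this PURL."""
    parts = urlsplit(purl)
    target = PURL_TABLE.get((parts.netloc, parts.path))
    if target is None:
        return 404, {}
    return 302, {"Location": target}  # the browser then follows Location


print(resolve_purl("http://purl.oclc.org/keith/home"))
```

Because the resolver address is part of the lookup key, the same name registered at a different resolver is a different PURL.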

Page 27: Identifiers and Types

PURL Namespaces

A PURL provides a local (not global) namespace

http://purl.oclc.org/keith/home

is different from

http://purl.stanford.edu/keith/home

Page 28: Identifiers and Types

OCLC PURL Resolution

[Diagram labels: WWW browser, PURL server, PURL database, HTTP server, Resource; arrows labelled "PURL", "URL", "URL"]

Page 29: Identifiers and Types

Why haven’t URNs caught on?

• Complexity of systems
• One size does not fit all: special purpose URN schemes have been successful, e.g., PubMed ID, Astrophysics BibCode
• No guarantee of persistence: longevity is an organizational, not a technical, issue
• Requires well-regulated administrative systems
• Absence of “killer” applications, although reference linking is emerging

Page 30: Identifiers and Types

Types: Not all data and content is the same

• Format or Genre
  – How you sense it
  – What you can do with it
  – E.g., audio, video, map, book
• Type
  – What you need to process it
  – What is its bit layout
• Compression or encoding

Page 31: Identifiers and Types

Multipurpose Internet Mail Extensions

• RFC 822: defines the textual format of email messages
• RFCs 2045-2049: extend textual email to allow
  – Character sets other than US-ASCII
  – Extensible set of non-ASCII types for message bodies
  – Definition of multi-part mail (attachments)
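
Python's standard email package can build such a message. The sketch below creates a text body containing a non-ASCII character and adds one attachment, which turns the message into a multipart one. The subject, body text, file name, and attachment bytes are all made up.

```python
from email.message import EmailMessage

message = EmailMessage()
message["Subject"] = "Lecture notes"
message.set_content("Notes attached. See you at the caf\u00e9.")  # non-ASCII body text

# Adding an attachment converts the message to multipart/mixed.
message.add_attachment(
    b"%!PS-Adobe-3.0\n",
    maintype="application",
    subtype="postscript",
    filename="notes.ps",
)

part_types = [part.get_content_type() for part in message.iter_parts()]
print(message.get_content_type())  # multipart/mixed
print(part_types)                  # ['text/plain', 'application/postscript']
```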

Page 32: Identifiers and Types

MIME Types

• Two part type hierarchy
  – Top level type
    • text
    • audio
    • video
    • image
    • application
    • multipart
  – Examples
    • text/plain
    • image/gif
    • application/postscript
• Extensions are handled by IANA
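
The two-part hierarchy is easy to take apart in code. A minimal sketch: the function name is illustrative, and the set of top-level types is limited to the six listed on this slide, so anything else is rejected.

```python
# Top-level types as listed on the slide (not the full IANA registry).
TOP_LEVEL_TYPES = {"text", "audio", "video", "image", "application", "multipart"}


def split_mime_type(mime_type: str) -> tuple[str, str]:
    """Split 'type/subtype' into its two parts, checking the top-level type."""
    top_level, separator, subtype = mime_type.strip().lower().partition("/")
    if not separator or not subtype or top_level not in TOP_LEVEL_TYPES:
        raise ValueError(f"unrecognized MIME type: {mime_type!r}")
    return top_level, subtype


for example in ("text/plain", "image/gif", "application/postscript"):
    print(split_mime_type(example))
```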

Page 33: Identifiers and Types

MIME in HTTP (Content Negotiation)

• Accept in request-header
  – Accept: text/plain; q=0.5, text/html, text/x-dvi; q=0.8, text/xml
    • text/html and text/xml are preferred (no q value means q=1), then text/x-dvi (q=0.8), then text/plain (q=0.5)
• Content-Type in response-header
  – Content-Type: text/html
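
In HTTP a media type with no q parameter has quality 1, and higher q means more preferred. The sketch below ranks the Accept header from this slide on that basis. It is a simplified parser for illustration (the function name is made up, and wildcards and other parameters are ignored).

```python
def rank_accept(header: str) -> list[tuple[str, float]]:
    """Return (media type, q) pairs from an Accept header, most preferred first."""
    ranked = []
    for item in header.split(","):
        media_type, *params = [piece.strip() for piece in item.split(";")]
        quality = 1.0  # default when no q parameter is given
        for param in params:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                quality = float(value)
        ranked.append((media_type, quality))
    # sorted() is stable, so types with equal q keep their header order.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


ranked = rank_accept("text/plain; q=0.5, text/html, text/x-dvi; q=0.8, text/xml")
print(ranked)
# [('text/html', 1.0), ('text/xml', 1.0), ('text/x-dvi', 0.8), ('text/plain', 0.5)]
```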