ResourceSync – Herbert Van de Sompel NISO Forum, September 24 2012, Denver, CO ResourceSync: Web-Based Resource Synchronization Herbert Van de Sompel Los Alamos National Laboratory @hvdsomp ResourceSync is funded by The Sloan Foundation & JISC #resourcesync
NISO Forum, Denver, September 24, 2012: ResourceSync: Web-Based Resource Synchronization
ResourceSync: Web-Based Resource Synchronization. Also for Data. Herbert Van de Sompel, Digital Library Researcher, Los Alamos National Laboratory, and Co-chair of NISO’s ResourceSync Working Group
Web applications frequently leverage resources made available by remote Web servers. As resources are created, updated, or deleted these applications face challenges to remain in lockstep with the server’s change dynamics. Several approaches exist to help meet this challenge for use cases where “good enough” synchronization is acceptable. But when strict resource coverage or low synchronization latency is required, commonly accepted Web-based solutions remain elusive. Motivated by the need to synchronize resources for applications in the realm of cultural heritage and research communication, the National Information Standards Organization (NISO) and the Open Archives Initiative (OAI) have launched the ResourceSync project that aims at designing an approach for resource synchronization that is aligned with the web architecture and that has a fair chance of adoption by different communities. The presentation will discuss some motivating use cases and will provide a perspective on the resource synchronization problem that results from ResourceSync project discussions. It will provide an overview of the ongoing thinking regarding an approach to address the challenges and will pay special attention to aspects that are relevant for the synchronization of data.
Transcript
ResourceSync – Herbert Van de Sompel NISO Forum, September 24 2012, Denver, CO
ResourceSync: Web-Based
Resource Synchronization
Herbert Van de Sompel Los Alamos National Laboratory
@hvdsomp
ResourceSync is funded by The Sloan Foundation & JISC #resourcesync
ResourceSync Core Team – NISO & OAI
Los Alamos National Laboratory & OAI: Martin Klein, Robert Sanderson, Herbert Van de Sompel
Cornell University & OAI: Bernhard Haslhofer, Simeon Warner
Old Dominion University & OAI: Michael L. Nelson
University of Michigan & OAI: Carl Lagoze
NISO: Todd Carpenter, Nettie Lagace, Peter Murray
ResourceSync Technical Group
• Manuel Bernhardt, Delving B.V.
• Kevin Ford, Library of Congress
• Richard Jones, JISC
• Graham Klyne, JISC
• Stuart Lewis, JISC
• David Rosenthal, LOCKSS
• Christian Sadilek, Red Hat
• Shlomo Sanders, Ex Libris, Inc.
• Sjoerd Siebinga, Delving B.V.
• Ed Summers, Library of Congress
• Jeff Young, OCLC Online Computer Library Center
ResourceSync
ResourceSync: What & Why?
Problem Perspective & Conceptual Approach
Possible Technical Choices
Q&A
Synchronize What?
• Web resources – things with a URI that can be dereferenced and are cache-able (no dependency on underlying OS, technologies etc.)
• Small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources)
• That change slowly (weeks/months) or quickly (seconds), and where latency needs may vary
• Focus on needs of research communication and cultural heritage organizations, but aim for generality
Why?
… because lots of projects and services are doing synchronization but have to resort to ad-hoc, case by case, approaches!
• Project team involved with projects that need this
• Experience with OAI-PMH: widely used in repos but
  o XML metadata only
  o Attempts at synchronizing actual content via OAI-PMH (complex object formats, dc:identifier) not successful
  o Web technology has moved on since 1999
• Devise a shared solution for data, metadata, linked data?
Use Cases – The Basics
Use Cases - More
Use Case: arXiv Mirroring
• 1M article versions, ~800/day created or updated at 8 PM US Eastern Time
• Metadata and full-text for each article
• Accuracy important
• Want low barrier for others to use
• Look for more general solution than current homebrew mirroring (running with minor modifications since 1994!) and occasional rsync (filesystem layout specific, auth issues)
Use Case: DBpedia Live Duplication
• Average of 2 updates per second
• Want low latency => need a push technology
ResourceSync
ResourceSync: What & Why?
Problem Perspective & Conceptual Approach
Possible Technical Choices
Q&A
ResourceSync Problem
• Consideration:
  o Source (server) A has resources that change over time: they get created, modified, deleted
  o Destination (servers) X, Y, and Z leverage (some) resources of Source A.
• Problem:
  o Destinations want to keep in step with the resource changes at source A: resource synchronization.
• Goal:
  o Design an approach for resource synchronization aligned with the Web Architecture that has a fair chance of adoption by different communities.
  o The approach must scale better than recurrent HTTP HEAD/GET on resources.
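To make the scaling concern concrete, the back-of-the-envelope sketch below compares request counts for one daily synchronization pass, using the arXiv figures quoted in the mirroring use case (about 1M article versions, roughly 800 created or updated per day). The function names and the assumption that all changes fit in a single change list are illustrative only; they are not taken from any ResourceSync specification.

```python
def polling_requests(total_resources: int) -> int:
    """Recurrent polling: at least one HTTP HEAD per resource, changed or not."""
    return total_resources


def change_list_requests(changed_resources: int, list_fetches: int = 1) -> int:
    """Change-driven sync: fetch the change list, then GET only what changed.

    list_fetches=1 is an assumption (one document lists all recent changes).
    """
    return list_fetches + changed_resources


TOTAL = 1_000_000  # article versions held by the source (arXiv use case)
CHANGED = 800      # approximate creations/updates per day (arXiv use case)

polling = polling_requests(TOTAL)
change_driven = change_list_requests(CHANGED)
ratio = polling / change_driven

print(f"polling: {polling} requests per pass")
print(f"change-driven: {change_driven} requests per pass")
print(f"polling needs roughly {round(ratio)}x more requests")
```

Under these assumptions polling is dominated by the size of the collection, while the change-driven pass is dominated by the number of changes, which is why the gap widens as a collection grows but changes slowly.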
Destination: 3 Basic Synchronization Needs
1. Baseline synchronization – A destination must be able to perform an initial load or catch-up with a source
- avoid out-of-band setup
2. Incremental synchronization – A destination must have some way to keep up-to-date with changes at a source
- subject to some latency; minimal: create/update/delete
- allow catching up after the destination has been offline
3. Audit – A destination should be able to determine whether it is synchronized with a source
- subject to some latency
Source Capability 1: Describing Content
In order to advertise the resources that a source wants destinations to know about, it may describe them:
o Publish an inventory of resource URIs and possibly associated metadata
  - Destination GETs the Content Description
  - Destination GETs listed resources by their URI
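As a rough illustration of how a destination might read such an inventory, the sketch below assumes the Content Description is a Sitemap-style XML document in the standard sitemaps.org namespace. The example URIs, dates, and the `parse_inventory` helper are hypothetical, and a real destination would first GET the document over HTTP rather than use an inline string.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical inventory; URIs and timestamps are made up for illustration.
INVENTORY = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.org/res/1</loc>
    <lastmod>2012-09-01T12:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.org/res/2</loc>
    <lastmod>2012-09-20T08:30:00Z</lastmod>
  </url>
</urlset>
"""


def parse_inventory(xml_text: str) -> Dict[str, Optional[str]]:
    """Map each listed resource URI to its last-modified value (or None)."""
    root = ET.fromstring(xml_text)
    inventory: Dict[str, Optional[str]] = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc:
            inventory[loc.strip()] = lastmod.strip() if lastmod else None
    return inventory


inventory = parse_inventory(INVENTORY)
for uri, lastmod in inventory.items():
    # A destination would now GET each URI to obtain the resource itself.
    print(uri, lastmod)
```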
Source Capability 2: Communicating Change Events
In order to achieve lower latency, a source may communicate about changes to its resources:
o 2.1. Change Set: Publish a list of recent change events (create, update, delete resource)
  - Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.
o 2.2. Push Change Set: Push a list of recent change events (create, update, delete resource) towards (a) destination(s)
  - Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.
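The destination-side behavior described here is the same whether the Change Set is pulled or pushed. A minimal sketch follows, assuming a hypothetical `ChangeEvent` shape with only a URI and a change type; the `fetch` callable stands in for an HTTP GET so the example runs without a network.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class ChangeEvent:
    """One entry in a Change Set (hypothetical shape, not from a spec)."""
    uri: str
    kind: str  # "create", "update", or "delete"


def apply_change_set(
    events: Iterable[ChangeEvent],
    local_copy: Dict[str, bytes],
    fetch: Callable[[str], bytes],
) -> None:
    """Apply events in order, mutating the destination's local copy."""
    for event in events:
        if event.kind in ("create", "update"):
            local_copy[event.uri] = fetch(event.uri)  # GET created/updated resource
        elif event.kind == "delete":
            local_copy.pop(event.uri, None)           # remove deleted resource
        else:
            raise ValueError(f"unknown change type: {event.kind!r}")


# Stand-in for the source; a real destination would issue HTTP GETs.
source = {
    "http://example.org/res/1": b"version 2",
    "http://example.org/res/3": b"brand new",
}
destination = {
    "http://example.org/res/1": b"version 1",
    "http://example.org/res/2": b"to be removed",
}
change_set = [
    ChangeEvent("http://example.org/res/1", "update"),
    ChangeEvent("http://example.org/res/2", "delete"),
    ChangeEvent("http://example.org/res/3", "create"),
]

apply_change_set(change_set, destination, fetch=source.__getitem__)
print(destination == source)
```

Deleting an already-absent resource is treated as a no-op here, so replaying a Change Set after an interruption does not fail; that is a design choice of this sketch, not a stated requirement.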
Source Capability 3: Providing Access to Versions
In order to allow a destination to catch up with missed changes, a source may support:
o 3.1. Historical Change Sets: Provide access to change events that occurred prior to the ones listed in the current Change Set
o 3.2. Historical Content: Provide access to prior resource versions
Source Capability 4: Transferring Content
By default, content is transferred in response to a GET issued by a destination against a URI of a source’s resource. But a source may support additional mechanisms:
o 4.1. Dump: Publish a package of resource representations and necessary metadata
  - Destination GETs the Dump
  - Destination unpacks the Dump
o 4.2. Alternate Content Transfer: Support alternative mechanisms to optimize getting content (see later)
Source: Advertise Capabilities
A source needs to advertise the capabilities it supports to allow a destination to discover them
• Some capabilities may be provided by a third party, not the source itself
o e.g. Historical Change Sets, Historical Content
o But the source should still make those third party capabilities discoverable
  - trust
ResourceSync
ResourceSync: What & Why?
Problem Perspective & Conceptual Approach
Possible Technical Choices
Q&A
ResourceSync: A Framework of Capabilities
• Modular framework allowing selective deployment of capabilities
• A Source selects which capabilities to support in order to meet local and community needs
• A Source’s Capabilities can be discovered via capability descriptions
BY REFERENCE!
BY VALUE!
• About the changed resource:
  o URI
  o Information relevant for audit, e.g. fixity, size, mime type
  o Further information to aid accessing the resource (see later)
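Fixity and size information of this kind is what would let a destination audit its copy without re-downloading everything. The sketch below assumes fixity is expressed as a SHA-256 digest (the slides do not name an algorithm) and that the advertised metadata is already available as a dictionary; `describe` and `audit` are hypothetical helper names, and mime type is left out for brevity.

```python
import hashlib
from typing import Dict, List, Tuple


def describe(content: bytes) -> Dict[str, object]:
    """Audit metadata for one representation: size in bytes and a digest."""
    return {"size": len(content), "sha256": hashlib.sha256(content).hexdigest()}


def audit(
    advertised: Dict[str, Dict[str, object]],
    local_copy: Dict[str, bytes],
) -> Tuple[List[str], List[str]]:
    """Return (stale, unexpected) URI lists.

    stale: advertised by the source but missing or different locally.
    unexpected: held locally but no longer advertised by the source.
    """
    stale = sorted(
        uri for uri, meta in advertised.items()
        if uri not in local_copy or describe(local_copy[uri]) != meta
    )
    unexpected = sorted(uri for uri in local_copy if uri not in advertised)
    return stale, unexpected


# Hypothetical data: what the source advertises versus what the destination holds.
advertised = {
    "http://example.org/res/1": describe(b"version 2"),
    "http://example.org/res/3": describe(b"brand new"),
}
local_copy = {
    "http://example.org/res/1": b"version 1",  # outdated
    "http://example.org/res/2": b"left over",  # no longer at the source
}

stale, unexpected = audit(advertised, local_copy)
print("stale:", stale)
print("unexpected:", unexpected)
```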
Change Communication – Push Change Sets
• Use a push technology to convey changes
• Express changes using same Sitemap-style document
  o A Change Set in this case might convey only one change event
• Possible technology: XMPP PubSub
<XMPP PubSub Intermezzo>
XMPP Publish-Subscribe: Client to Subscription Service, Subscription Service to Client(s) communication
• One of the XMPP (Extensible Messaging and Presence Protocol) extensions http://xmpp.org/extensions/xep-0060.html
• Apple Notifications based on XMPP PubSub
• Both client and server tools widely available
</XMPP PubSub Intermezzo>
[Figure: Source, Destination, and PubSub Server]
Change Communication Memory
• Publication of one or more Change Sets that convey historical (rather than recent) changes
• All historical Change Sets use same Sitemap-style document
• Same approach irrespective of whether pull or push is used for Change Communication
Resource Transfer
• Resources are obtained in bulk by obtaining a Dump
• An individual resource is, by default, obtained by dereferencing a resource’s URI listed in:
  o Sitemap
  o Change Set
• Alternative access mechanisms are introduced to obtain an individual resource:
  o From a mirror site
  o Access to diff with previous version instead of access to the entire changed resource
  o Resource version
Resource Memory
• Requires a (short or long term) archive of resource versions
• Access to specific version can be expressed as an alternative access mechanism in e.g. Change Set.
  o Via a link to a version resource that is the result of the change expressed in the Change Set
  o Via a link to a Memento TimeGate that supports access to all available prior versions
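For the TimeGate option, the Memento protocol has the client state the desired point in time in an `Accept-Datetime` request header, and the TimeGate answers by pointing to the best-matching prior version. The sketch below only builds such a request and does not send it; the TimeGate URI and the `timegate_request` helper are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def timegate_request(timegate_uri: str, when: datetime) -> Request:
    """Build (but do not send) a GET asking a TimeGate for the version near `when`."""
    if when.tzinfo is None:
        raise ValueError("`when` must be timezone-aware")
    # Memento expects an HTTP-date in GMT, e.g. "Mon, 24 Sep 2012 12:00:00 GMT".
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(timegate_uri, headers={"Accept-Datetime": http_date}, method="GET")


# Hypothetical TimeGate URI for an original resource used later in the talk.
request = timegate_request(
    "http://example.org/timegate/http://dbpedia.org/resource/France",
    datetime(2012, 9, 24, 12, 0, 0, tzinfo=timezone.utc),
)

# urllib stores header names as "Accept-datetime"; HTTP header names are
# case-insensitive, so this is equivalent on the wire.
print(request.get_method(), request.full_url)
print(request.get_header("Accept-datetime"))
```

A destination catching up after downtime could issue one such request per resource it needs at a given past state, then follow the redirect the TimeGate returns.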
<Memento Intermezzo>
http://www.mementoweb.org/
Original Resources and Mementos
Bridge from Present to Past
Bridge from Past to Present
Memento Framework
Original Resource: http://lanlsource.lanl.gov/pics/picoftheday.png
Memento Framework
Time Travel across Versions of a Picture of the Day
Movie at: http://www.mementoweb.org/demo/picoftheday.mov
Original Resource: http://dbpedia.org/resource/France
Memento Framework
Time-Series Analysis across DBpedia Versions
Data collected through HTTP Navigation
Paper at http://arxiv.org/abs/1003.3661
</Memento Intermezzo>
http://www.mementoweb.org/
ResourceSync Timeline
• August 2012
  o First draft spec shared for feedback with ResourceSync team
• September 2012
  o Problem Statement paper in D-Lib Magazine
  o In-person meeting of ResourceSync Team
• October 2012
  o Revise spec, conduct experiments
  o Solicit broad feedback
• December 2012
  o Finalize specification (?)
Pointers
• First ResourceSync draft spec (do not implement!): http://www.openarchives.org/rs/0.1/resourcesync
• ResourceSync Simulator code on github: http://github.org/resync/simulator
• NISO ResourceSync workspace: http://www.niso.org/workrooms/resourcesync/
• Memento: http://mementoweb.org
ResourceSync: Get the Sticker!
Herbert Van de Sompel Los Alamos National Laboratory
@hvdsomp
ResourceSync is funded by The Sloan Foundation & JISC