Top Banner
Mind the gap! Reflections on the state of repository data harvesting Simeon Warner (Cornell University) http://orcid.org/0000-0002-7970-7855
23

Mind the gap! Reflections on the state of repository data harvesting

Jan 29, 2018

Download

Technology

Simeon Warner
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Mind the gap! Reflections on the state of repository data harvesting

Mind the gap!

Reflections on the state of repository data harvesting

Simeon Warner (Cornell University) http://orcid.org/0000-0002-7970-7855

Page 2: Mind the gap! Reflections on the state of repository data harvesting

Long long ago, when XML was hard,

Unicode was merely one possible character set,

a big hard drive was 10GB, and HotBot & AltaVista had a new competitor...

Page 3: Mind the gap! Reflections on the state of repository data harvesting

... it was1999 and the UPS meeting in Santa Fe aimed to

“... identify technologies to stimulate the adoption of the concept of [Open

Access] author self-archived systems in scholarly communication; theorize a framework for the integration of e-

print services in the academic document system ...”

https://www.openarchives.org/meetings/SantaFe1999/ups-invitation-ori.htm

Page 4: Mind the gap! Reflections on the state of repository data harvesting

Thus was born OAI-PMH

v1.0 2001, v1.1 2002, v2.0 2003

Page 5: Mind the gap! Reflections on the state of repository data harvesting

OAI-PMH was great! •  It works •  Scales to millions of items •  Easy to implement (good s/w libraries) •  XML, which brought UTF-8 (hurrah!) •  Widely deployed, stable since 2003 (v2.0) •  Registries & validators •  Community & documentation

Page 6: Mind the gap! Reflections on the state of repository data harvesting

BASE harvests >5000 sources

>112M documents

Page 7: Mind the gap! Reflections on the state of repository data harvesting
Page 8: Mind the gap! Reflections on the state of repository data harvesting

BUT... •  Not RESTful •  Repository-centric •  XML metadata only •  Metadata is wrapped •  Dynamic set membership bug

Page 9: Mind the gap! Reflections on the state of repository data harvesting

"Currently, OAI-PMH is the only behavior that is uniformly exposed by

most repositories.

[But], its focus on metadata, its pull-based paradigm, and its technological roots that date back to the web of the nineties put it at odds with ... current

web technologies." COAR Next Generation Repositories

http://comment.coar-repositories.org/2-next-generation-repositories/

Page 10: Mind the gap! Reflections on the state of repository data harvesting

Photo by drivethrucafe CC BY-SA https://www.flickr.com/photos/128758398@N07/15836296662

Page 11: Mind the gap! Reflections on the state of repository data harvesting

Google Scholar is great, but

not the answer

Page 12: Mind the gap! Reflections on the state of repository data harvesting

Replacement with no gap New approach must: •  Meet existing OAI-PMH use cases •  Support content as well as metadata •  Scale better •  Follow web standards •  Be modern, developer friendly

Page 13: Mind the gap! Reflections on the state of repository data harvesting

Push-me pull-you

many items / sources low latency / efficiency => push/notification

modest size low barrier => pull

Page 14: Mind the gap! Reflections on the state of repository data harvesting

Conclusion v1

We, the repository community, need to discuss and agree on a new approach to

harvesting

Page 15: Mind the gap! Reflections on the state of repository data harvesting

ResourceSync ANSI/NISO Z39.99-2017 Sitemaps + •  multiple sets •  fixity •  links •  changes only •  dumps

Page 16: Mind the gap! Reflections on the state of repository data harvesting

+ Notifications (Push) PubSubHubbub WebSub •  low latency •  efficiency

Page 17: Mind the gap! Reflections on the state of repository data harvesting

CORE >6000 journals

>2400 repositories >77M articles

(>6M full text)

metadata + content

Page 18: Mind the gap! Reflections on the state of repository data harvesting

Slide from Petr Knoth / CORE – DPLAfest 2017 presentation -- https://goo.gl/vz3zuJ

Tested with resync client. 20 x 25MB sitemaps,

1M items ✔

Page 19: Mind the gap! Reflections on the state of repository data harvesting

IIIF & Europeana •  500,000,000+ IIIF resources – how to

find them? •  JSON-LD documents and related web

pages •  Europeana experiments with NLW and

UCD o  ResourceSync, Sitemaps and native

structures

Page 20: Mind the gap! Reflections on the state of repository data harvesting

Hyku & DPLA •  Extension of HydraSamvera codebase

to provide in-the-box repository •  Native ResourceSync support

o  Both resource lists and change lists •  Successful harvesting tests with DPLA

o  Desire for resource dumps and change dumps for efficiency

(see new report: http://hydrainabox.projecthydra.org/2017/06/22/resourcesync.html )

Page 21: Mind the gap! Reflections on the state of repository data harvesting

Conclusion v2

We, the repository community, should

agree on & transition to ResourceSync as the

new approach to harvesting

Page 22: Mind the gap! Reflections on the state of repository data harvesting

Repository prescription •  Metadata and content should be web

resources o  stable URIs, follow web standards, not hidden

behind query interfaces •  Support ResourceSync as the primary

harvesting interface o  OAI-PMH as secondary where necessary

•  Distinguish and relate metadata and content entries

Page 23: Mind the gap! Reflections on the state of repository data harvesting

That’s all folks @zimeon [email protected]