Reference Rot and Linked Data: Threat and Remedy. PRELIDA, 18/19th October 2014. Funded by the Andrew W. Mellon Foundation. Peter Burnhill, EDINA, University of Edinburgh, for the Hiberlink Team at University of Edinburgh & LANL Research Library
Reference Rot and Linked Data: Threat and Remedy

Nov 22, 2014

Delivered by Peter Burnhill, Director of EDINA, at the PRELIDA Consolidation and Dissemination workshop on 17/18 October 2014 (http://prelida.eu/consolidation-workshop).

Summary: The web changes over time, and significant reference rot inevitably occurs. Web archiving delivers only a 50% chance of success. So in addition to the original URI, the link should be augmented with temporal context to increase robustness.
Transcript
Page 1: Reference Rot and Linked Data: Threat and Remedy

Reference Rot and Linked Data: Threat and Remedy

PRELIDA, 18/19th October 2014

Funded by the Andrew W. Mellon Foundation

Peter Burnhill EDINA, University of Edinburgh

for the Hiberlink Team at University of Edinburgh & LANL Research Library

Page 2: Reference Rot and Linked Data: Threat and Remedy

The Project Team 2013 – 2015, funded by the

Andrew W. Mellon Foundation

• Los Alamos National Laboratory:

Research Library: Martin Klein, [Rob Sanderson], Harihar Shankar, Herbert Van de Sompel

• University of Edinburgh:

Language Technology Group: Beatrice Alex, Claire Grover, Colin Matheson, Richard Tobin, [Ke “Adam” Zhou]

EDINA*: Neil Mayo, Muriel Mewissen (Project Manager), Tim Stickland, Richard Wincewicz, Peter Burnhill

* Centre for Service Delivery & Digital Expertise

Page 3: Reference Rot and Linked Data: Threat and Remedy

1. Social Science Research Council [now ESRC, UK]

– ‘Scientific Officer’

2. Scottish Education Data Archive, 1979 – 1984/1987

– Survey statistician: school leavers, YTS, 16-19 cohort surveys; demand for HE

3. Edinburgh University Data Library, 1984 to present

– President of IASSIST, 1997 – 2001: social science data professionals

4. ESRC Regional Research Laboratory for Scotland, 1986 – 1990

– Co-director, early days of Geographical Information Systems (GIS)

– Member of Data Task Force, UK Inter-Agency Global Env. Change

5. Graduate School, Faculty of Social Science, UofEd, 1987 – 1997

– Senior Lecturer (p/t), teaching quantitative/survey methods

– Director of RAPID: ESRC Research Activity & Publications Information Database

6. EDINA national data centre, 1995/6 to present

– Director: set-up and continuous development; Jisc-funded UK national services

7. UK Digital Curation Centre (DCC), 2003/04 – 2004/05

– Director for set-up & definition of ‘data curation + digital preservation’

8. CLOCKSS Founder & Board Member / LOCKSS deployment

Data Manufacturing

Data Brokering

funding Data & use of Data

Spatial Data & MetaData

Page 4: Reference Rot and Linked Data: Threat and Remedy

licence to use

Ensuring researchers, students and their teachers have

ease and continuing access to online resources used for scholarship

“ease” “continuing”

P.Burnhill, Edinburgh 2009

usability

open

preservation

restricted

access to content & services

security & integrity of medium

replication

usability of format

back content

Semantics

Page 5: Reference Rot and Linked Data: Threat and Remedy

Buckland: thinking about Digital Libraries

mix of the document tradition (signifying objects & their use)

& the computation tradition (applying algorithmic, logical, mathematical, and mechanical techniques to information management)

“Both traditions are needed. Information Science is rooted in part in humanities and qualitative social sciences. The landscape of Information Science is complex. An ecumenical view is needed.” – M. Buckland, Journal of the American Society for Information Science, 50, 1999

2 (non-convergent) mentalities, Document-ness & Computation

+ a third dimension, the domain of application:

• Academic discipline – if we do this for ourselves

• Business area – if we do this for use beyond …

Page 6: Reference Rot and Linked Data: Threat and Remedy

Related Activity by Partners

• Los Alamos National Laboratory Research Library:

• Memento • ResourceSync

• http://www.niso.org/workrooms/resourcesync/

• University of Edinburgh / Informatics / Language Technology Group:

• Text mining / Edinburgh Parser

• University of Edinburgh / Jisc / EDINA:

• CLOCKSS / LOCKSS • Keepers Registry

• https://www.era.lib.ed.ac.uk/handle/1842/6682

Page 7: Reference Rot and Linked Data: Threat and Remedy

Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/

But online articles in the Scholarly Record are not in the custody of Libraries, nor on their digital shelves.

Top level Problem: We would like to assume that our libraries are ensuring that online e-journal content is being kept safe

Page 8: Reference Rot and Linked Data: Threat and Remedy

Evidence from <thekeepers.org> is worrying!

The Keepers Registry aggregates what is being kept by the (10) leading archiving agencies (CLOCKSS, Portico, national libraries etc.), for serials issued with an ISSN

① ‘Ingest Ratio’ = titles being ingested by one or more Keepers / ‘online serials’ in ISSN Register

= 23,268 / 136,965 [in March 2014] => 17%

* We do not know about 83% of e-serials having ISSN *

‘KeepSafe Ratio’ = ingest by 3+ Keepers = 9,652 / 136,965 => 7%

② Title Lists of 3 US research libraries (Columbia, Cornell & Duke),

checked in 2011/12: ‘Ingest Ratio’ = 22% to 28%; c.75% unknown fate

③ User-centric Evidence, UK usage in 2012, UK OpenURL Router logs

=> over two thirds, 68% (36,326 titles), held by none!
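
The two ratios under ① are plain proportions of the ISSN Register count. A minimal sketch that reproduces the arithmetic from the figures quoted on the slide; the constant and function names are illustrative, not taken from the Keepers Registry itself:

```python
# Figures quoted on the slide (Keepers Registry, March 2014).
ONLINE_SERIALS_IN_ISSN_REGISTER = 136_965
INGESTED_BY_ONE_OR_MORE_KEEPERS = 23_268
INGESTED_BY_THREE_OR_MORE_KEEPERS = 9_652


def ratio_pct(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


ingest_ratio = ratio_pct(INGESTED_BY_ONE_OR_MORE_KEEPERS,
                         ONLINE_SERIALS_IN_ISSN_REGISTER)
keepsafe_ratio = ratio_pct(INGESTED_BY_THREE_OR_MORE_KEEPERS,
                           ONLINE_SERIALS_IN_ISSN_REGISTER)

print(f"Ingest Ratio:   {ingest_ratio:.0f}%")
print(f"KeepSafe Ratio: {keepsafe_ratio:.0f}%")
print(f"Fate unknown:   {100 - ingest_ratio:.0f}%")
```

Rounded to whole percentages, this gives the 17%, 7% and 83% stated above.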

Page 9: Reference Rot and Linked Data: Threat and Remedy

Memento: the "Time Travel for the Web" protocol

http://mementoweb.org/

• an interoperable approach to access web archives (IETF RFC 7089)

• adopted by all major public archives worldwide, including the Internet Archive.

• Memento for Chrome http://bit.ly/memento-for-chrome

• This protocol underpins the work being done in Hiberlink
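
As a rough illustration of how a client talks to a Memento-compliant archive under RFC 7089: it asks a TimeGate for the original URI and states the desired date in an `Accept-Datetime` header, and the response's `Link` header points at mementos. The sketch below only builds the request and parses a `Link` header; it makes no network call. The TimeGate base URL is an assumption (substitute the TimeGate of whichever archive or aggregator you use), and the function names are illustrative.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# ASSUMPTION: a public TimeGate base URL; replace with the TimeGate of the
# Memento-compliant archive or aggregator you actually use.
TIMEGATE_BASE = "http://timetravel.mementoweb.org/timegate/"


def timegate_request(original_uri: str, when: datetime) -> tuple[str, dict[str, str]]:
    """Build the URL and headers for a datetime-negotiation request.

    `when` should be a timezone-aware datetime; it is converted to GMT
    because Accept-Datetime carries an HTTP-date.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_BASE + original_uri, {"Accept-Datetime": http_date}


# One <uri>; params entry of a Link header (params run up to the next "<").
LINK_RE = re.compile(r'<([^>]+)>\s*;([^<]*)')


def memento_links(link_header: str) -> list[tuple[str, str]]:
    """Extract (URI, datetime) pairs for links whose rel includes 'memento'."""
    found = []
    for uri, params in LINK_RE.findall(link_header):
        rel = re.search(r'rel="([^"]*)"', params)
        dt = re.search(r'datetime="([^"]*)"', params)
        if rel and dt and "memento" in rel.group(1).split():
            found.append((uri, dt.group(1)))
    return found
```

The returned URL and headers can be handed to any HTTP client; the regular-expression parser is deliberately simple and is not a full implementation of the `Link` header grammar.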

Page 10: Reference Rot and Linked Data: Threat and Remedy

Now, about Reference Rot & Linked Data …

1. Some definitions

• What is Reference Rot?

• What may be special about Linked Data?

2. Evoking metaphor

• The moment / snapshot / memento

• Flash-freezing to avoid or to stop the rot (of fruit on vine)

3. Evidence of Threat of Reference Rot

4. Devising Remedy for Reference Rot

• Proposals for intervention: plug-ins & infrastructural solutions

5. Next Steps: how to take this work forward?

Page 11: Reference Rot and Linked Data: Threat and Remedy

Reference Rot = Link Rot + Content Drift

“when links to web resources no longer point to what they once did”

Investigating Reference Rot in Web-Based Scholarly Communication

Page 12: Reference Rot and Linked Data: Threat and Remedy

Link Rot

Page 13: Reference Rot and Linked Data: Threat and Remedy

+ Content Drift: What is at end of URI has changed, or gone!

[Figure: snapshots of http://dl00.org as it appeared in 2000, 2004, 2005 and 2008]

(a) Dynamic content, as values on a webpage change over time

(b) Static content, but very different (often unrelated) web pages

Page 14: Reference Rot and Linked Data: Threat and Remedy

What of Linked Data?

One or more sets of 3 linked URIs: conversation or statements for the long term? As time passes, so the content at the end of each of those URIs will suffer:

Reference Rot = Link Rot + Content Drift

“when links to web resources no longer point to what they once did”

“Adding eScience Assets to the Data Web”, Herbert Van de Sompel, Carl Lagoze, Michael L. Nelson, Simeon Warner, Robert Sanderson, Pete Johnston. Proceedings of Linked Data on the Web (LDOW2009) Workshop, [v1] Thu, 11 Jun 2009 15:33:37 GMT http://arxiv.org/abs/0906.2135v1
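
To make the point concrete: a Linked Data statement is a set of three URIs, and each one is independently exposed to rot. The sketch below is only an illustration under stated assumptions. The `Triple` type, the `rotted_positions` helper and all example.org URIs are hypothetical, and `is_live` stands in for whatever liveness check is actually used.

```python
from typing import Callable, NamedTuple


class Triple(NamedTuple):
    """One Linked Data statement: three URIs, each of which can rot."""
    subject: str
    predicate: str
    obj: str


def rotted_positions(triple: Triple, is_live: Callable[[str], bool]) -> list[str]:
    """Return the names of the positions whose URI no longer resolves."""
    return [name for name, uri in triple._asdict().items() if not is_live(uri)]


# Hypothetical statement and a hypothetical set of URIs that still resolve.
statement = Triple(
    subject="http://example.org/site/42",
    predicate="http://example.org/vocab/locatedIn",
    obj="http://example.org/place/edinburgh",
)
still_live = {"http://example.org/site/42", "http://example.org/place/edinburgh"}

print(rotted_positions(statement, lambda uri: uri in still_live))
```

Here the predicate URI has gone, so the statement can no longer be fully interpreted even though its subject and object still resolve.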

Page 15: Reference Rot and Linked Data: Threat and Remedy

Example: ‘mark up’ archaeological site record (metadata)

Page 16: Reference Rot and Linked Data: Threat and Remedy
Page 17: Reference Rot and Linked Data: Threat and Remedy

RDF graph: Article & Supplementary Data http://www.emeraldinsight.com/fig/0350570303002.png

1. Build and publish as metadata in XML format to be found on the web

2. Publishing text and data/multimedia content in XML will delight researchers

• Researchers want to access ‘article as data’, via computational algorithm

Page 18: Reference Rot and Linked Data: Threat and Remedy

What we are doing in Hiberlink

1. Creating evidence on extent of ‘Reference Rot’

– Main focus has been on references (& URIs) made in Journal Articles

• Incl. reference rot in Supreme Court judgments, with Harvard Law Library & Perma.cc

– ETD2014 was opportunity to look at Reference Rot & the e-Thesis

– PRELIDA is opportunity to look at impact on Linked Data

2. Understanding the preparation/publication/ingest workflow(s)

– Identifying opportunity for productive intervention

3. Prototypes for pro-active archiving to enable remedy

– Embedding such ‘solutions’ in existing tools & infrastructure

4. Raising awareness & seeking collaborative actions

…. through events like this

Page 19: Reference Rot and Linked Data: Threat and Remedy

Empirical evidence on the Threat of Reference Rot

Large-scale analyses: Journal Articles & E-Theses

Page 20: Reference Rot and Linked Data: Threat and Remedy

Methodology: to discover answer to 2 questions

i. Do those links (URIs) still work? Is the URI on the ‘Live Web’?

• Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’

Page 21: Reference Rot and Linked Data: Threat and Remedy

Methodology: to discover answer to 2 questions

i. Do those links (URIs) still work? Is the URI on the ‘Live Web’?

• Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’

ii. Is there a ‘Memento’ of that reference in the ‘Archived Web’?

Memento: a prior version, what the Original Resource was like at some time in the past.
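
Question (i) can be sketched as a small redirect-following loop. This is an illustration of the rule stated above, not the Hiberlink code: `check_live` and the `fetch` callable are assumptions. `fetch` is expected to issue a single HTTP request without following redirects and return the status code plus any `Location` header.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

# fetch(uri) -> (status_code, Location header or None); supplied by the caller.
Fetch = Callable[[str], tuple[int, Optional[str]]]

MAX_REDIRECTS = 50  # limit stated on the slide


def check_live(uri: str, fetch: Fetch, max_redirects: int = MAX_REDIRECTS):
    """Follow redirects from uri, recording the HTTP transaction chain.

    Returns (chain, live): chain is a list of (uri, status) pairs and live
    is True only if the final response has a 2XX status code.
    """
    chain = []
    current = uri
    for _ in range(max_redirects + 1):  # first request plus up to N redirects
        status, location = fetch(current)
        chain.append((current, status))
        if 300 <= status < 400 and location:
            current = urljoin(current, location)  # Location may be relative
            continue
        return chain, 200 <= status < 300
    return chain, False  # redirect limit exhausted without a final answer
```

Passing `fetch` in as a parameter keeps the rule itself testable offline; in practice it would wrap an HTTP client with automatic redirect handling switched off.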

Page 22: Reference Rot and Linked Data: Threat and Remedy

A Measure of Reference Rot: Are those references available? [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

Less than two-thirds of those links lead to live content

         Live on Web    Not Found on ‘Live Web’    All
Count    29,122         16,860                     45,982
%        63.3           36.7                       100

1st Order Indicator of ‘Reference Rot’: more than one third of references to the Web subject to ‘rot’

(After up to 50 redirects)

Page 23: Reference Rot and Linked Data: Threat and Remedy

References in Citations Rot over Time: URIs cease to exist on the live Web

[excluding 0s&1s: a few theses are unaffected; a few are ruined]

We can’t stop that process of rot: Web content changes over time,

Reference Rot is an inevitable function of time

Number of months elapsed from Date Thesis Defended until date archives checked (June 2014)

Page 24: Reference Rot and Linked Data: Threat and Remedy

Searching for ‘Datetime’ Mementos of content in ‘Archived Web’ [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

%                       Live on Web    Not found on ‘Live Web’    All
Found to be Archived                                              47.6
Not Found                                                         52.4
All                                                               100

There seems a 50:50 chance that referenced content is in the ‘Archived Web’.

Some content is being ‘co-incidentally harvested’ by routine web archiving.

=> half of those references are at ‘risk of loss’

Page 25: Reference Rot and Linked Data: Threat and Remedy

‘Incidental Archiving’ is constant over time (This is an ‘upper bound estimate’, independent of age of e-thesis)

We can improve upon this ‘50:50 chance’ by pro-actively archiving what we cite

Page 26: Reference Rot and Linked Data: Threat and Remedy

We already have ‘Lost Content’ for References to Web [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

%                       Live on Web    Not found on ‘Live Web’    All
Found to be Archived    29.3           18.3                       47.6
Not Found               34.0           18.4                       52.4
All                     63.3           36.7                       100

18.4% ‘not live & not found in archive’: judged to be lost forever

34% ‘live’ & ‘not in archive’: at risk of loss

NB: The 34% ‘at risk’ could be saved by pro-active archiving
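
The cross-tabulation can be checked and read off with a few lines of code. A minimal sketch using the percentages from the slide: the labels ‘lost’ and ‘at risk of loss’ follow the slide, while the two labels for archived references are illustrative wording of our own.

```python
# Percentages from the slide, keyed by (live on web?, found in archive?).
cells = {
    (True, True): 29.3,
    (True, False): 34.0,
    (False, True): 18.3,
    (False, False): 18.4,
}


def status(live: bool, archived: bool) -> str:
    """Classify a reference; the two 'archived' labels are illustrative."""
    if archived:
        return "archived, still live" if live else "archived, gone from live web"
    return "at risk of loss" if live else "lost"


# Row and column totals should reproduce the marginals on the slide.
live_total = round(cells[(True, True)] + cells[(True, False)], 1)
archived_total = round(cells[(True, True)] + cells[(False, True)], 1)

for (live, archived), pct in cells.items():
    print(f"{pct:5.1f}%  {status(live, archived)}")
print(f"live on web: {live_total}%  found in archive: {archived_total}%")
```

The totals come out at 63.3% live and 47.6% archived, matching the ‘All’ row and column.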

Page 27: Reference Rot and Linked Data: Threat and Remedy

Hiberlink Next Phase: in-depth study of Content Drift

But demonstrated that problem exists & is severe

• The Web changes over time: significant reference rot occurs

• Routine Web Archiving delivers no better than a 50:50 chance of having co-incidentally archived what you referenced

- and probably much less chance when we check extent of content drift

- Not (yet) studied impact on Linked Data but expect similar

Page 28: Reference Rot and Linked Data: Threat and Remedy

“Researchers need to know when information on a viewed page has changed.”

“Authors of long-shelf-life material want to be sure that their links will still work far into the future.”

Jonathan Zittrain, Larry Lessig and Kendra Albert report that

• Harvard Law Review: 75% of links are dead

• top 1% Impact Factor Journals

10% of links dead just 15 months after publication

• US Supreme Court decisions

29% of links dead

49% of links do not point to the original target

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2329161

Page 29: Reference Rot and Linked Data: Threat and Remedy

Devising Remedy for Reference Rot for Linked Data?

Page 30: Reference Rot and Linked Data: Threat and Remedy

Seek pro-active ‘transactional archiving’ solutions

– focus on what is regarded by authors as important

a) Understand the preparation/publication workflow – identifying where there can be productive intervention

b) Devise prototypes for pro-active archiving – writing & implementing code!

c) Propose/test infrastructure for temporal referencing – supporting & using the Memento protocol

Where possible, we wish to embed ‘solutions’ in existing tools & infrastructure

Strategy for Making Remedy

Page 31: Reference Rot and Linked Data: Threat and Remedy

3 workflows in scholarly statement

The extended length of stages in workflows magnifies reference rot & its effect, as referenced content on the web rots over time

① Preparation -> Study -> Compose -> (Review) -> Submission

② Publication -> (Editorial) Examination -> (Revision) -> Acceptance -> Issue

③ Post-Publication -> Deposit/Ingest -> Provide/Access -> Use

Identify the best opportunities for Intervention to make Remedy, to ‘flash-freeze’, either to avoid reference rot or to ‘stop the rot’

What are the key workflows for the manufacture, release and use of Linked Data?

Page 32: Reference Rot and Linked Data: Threat and Remedy

3 workflows in Linked Data

What is it that changes over time: concepts, assigned attributes; why and on what timescale?

① Manufacture -> Create -> (Review) -> Prepare to publish/release/commit

② Authority: Release -> (Editorial) Examination -> (Revision) -> Acceptance

③ Use: Curate -> Deposit/Ingest -> Provide/Access -> Use

Identify the best opportunities for Intervention to make Remedy, to ‘flash-freeze’, either to avoid reference rot or to ‘stop the rot’

What are the key workflows for the manufacture, release and use of Linked Data?

Page 33: Reference Rot and Linked Data: Threat and Remedy

1. Hiberlink Plug-in - for pro-active ‘transactional’ archiving

– At the time of authoring (ie manufacture)

2. Missing Link - re-factoring the HTML link

– By which one annotates with {DateTime; location of archived copy/ies}

3. HiberActive - a system for actively archiving references

– Designed to ‘stop the rot’, a lossy 2nd best to ‘transactional archiving’

LANL: Martin Klein, Harihar Shankar, Herbert Van de Sompel

UoEd EDINA: Neil Mayo, Tim Stickland, Richard Wincewicz

‘Work in progress’ to effect Remedy

Hiberlink, ETD2014, Leicester UK, July 25th 2014

Page 34: Reference Rot and Linked Data: Threat and Remedy

For use during authoring [manufacture] of information object & before final issue, but also before ingest by ‘library’ (& maybe for repair by ‘library’ …)

Hiberlink Plug-in [for Zotero]

① Triggers archiving of referenced web content

② Returns DateTime URI for archived content
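
The two steps of the plug-in can be sketched as a single function. This is not the plug-in's code, only an illustration of the idea: `archive_reference`, `ArchivedReference` and the `submit` callable are hypothetical names. `submit` stands in for whatever submission call a real web archive exposes: it takes the cited URI and returns the URI of the new snapshot.

```python
from datetime import datetime, timezone
from typing import Callable, NamedTuple, Optional


class ArchivedReference(NamedTuple):
    original_uri: str      # the URI as cited
    archived_uri: str      # where the archived copy lives
    archived_at: datetime  # when the snapshot was requested (UTC)


def archive_reference(
    uri: str,
    submit: Callable[[str], str],
    now: Optional[datetime] = None,
) -> ArchivedReference:
    """Step 1: trigger archiving of uri. Step 2: return datetime and archived URI.

    `submit` is a HYPOTHETICAL stand-in for an archive's submission call.
    """
    requested_at = now or datetime.now(timezone.utc)
    archived_uri = submit(uri)
    return ArchivedReference(uri, archived_uri, requested_at)
```

Keeping the original URI alongside the snapshot URI and datetime means the reference still points at the live resource while recording what it looked like when cited.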

Page 35: Reference Rot and Linked Data: Threat and Remedy

1. Hiberlink Plug-in - to enable pro-active archiving

2. Missing Link - re-factor the HTML link that is returned

‘Work in progress’ to effect Remedy (2)

a) Take simple URI - to French National Library (say)

b) Augment Link with a set of Datetime & location pairs

Prepared by: Herbert Van de Sompel, Martin Klein, Robert Sanderson - Los Alamos National Laboratory; Michael Nelson - Old Dominion University

http://mementoweb.org/missing-link/
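
Steps a) and b) amount to emitting an ordinary HTML link with extra attributes carrying the date of linking and the location(s) of archived copies. A minimal sketch follows; the attribute names `data-versiondate` and `data-versionurl` are an assumption made for illustration (consult the Missing Link document above for the actual proposal), and the URLs in the usage example are placeholders.

```python
from datetime import date
from html import escape


def augment_link(href: str, text: str, linked_on: date, snapshot_uris: list[str]) -> str:
    """Return an <a> element that keeps the original URI and adds temporal context.

    ASSUMPTION: attribute names are illustrative; several snapshot URIs are
    joined with spaces in a single attribute.
    """
    attrs = [
        f'href="{escape(href)}"',
        f'data-versiondate="{linked_on.isoformat()}"',
        f'data-versionurl="{escape(" ".join(snapshot_uris))}"',
    ]
    return f'<a {" ".join(attrs)}>{escape(text)}</a>'


print(augment_link(
    "http://www.bnf.fr/",
    "BnF",
    date(2014, 10, 18),
    ["http://archive.example/20141018/http://www.bnf.fr/"],
))
```

A browser ignores the extra attributes, so the link still works as before; tools that understand them can fall back to a snapshot when the original has rotted.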

Page 36: Reference Rot and Linked Data: Threat and Remedy

1. Hiberlink Plug-in - to enable pro-active archiving

2. Missing Link - re-factoring the HTML link

First two approaches support ‘perfect scenario’:

• All authors archive all their cited URIs

• e.g. (but not exclusively) with Hiberlink / Zotero

3. HiberActive

– Enables repositories to ‘stop the rot’ by actively archiving those references in e-theses

– A notification hub, a component for the infrastructure

• testing workflow with ResourceSync, CORE & external archive programme

‘Work in progress’ to effect Remedy (3)

Page 37: Reference Rot and Linked Data: Threat and Remedy

• The Web changes over time: significant reference rot inevitably occurs (as a function of time)

• Web Archiving delivers only c.50:50 chance of success of co-incidentally archiving what you referenced

• Link by means of the original URI, at time of manufacture

• But then …. Augment the link with temporal context, to increase robustness of link to referenced content

o Date of linking

o URI of archived snapshot(s)

• Then again, maybe this is all about archiving to support citation and not really about ‘preservation’, but it does assist continuity of access

Summary

Page 38: Reference Rot and Linked Data: Threat and Remedy

Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/

Multi-level Problem: Digital Shelving for The Research Object; First Order References; Second Order References; ….

Simple Statements [with URIs]

1st Order References [with URIs]

Complex Research Objects {URIs}

1st Order References {URI}

2nd Order References {URI}

2nd Order References [with URIs]

“Digital information is best preserved by replicating it [on digital shelving] at multiple archives run by autonomous organizations”

B. Cooper and H. Garcia-Molina (2002)

Page 39: Reference Rot and Linked Data: Threat and Remedy

Next Steps: how to take this work forward, to ensure URI/references don’t rot?

• Need to move from the ‘incidental Web archiving’ of cited URIs to pro-active archiving, by makers of Linked Data & by repositories?

• Engage with these Hiberlink remedies

• The Hiberlink Plug-in for Zotero / HiberActive

Email: [email protected]

Subject: Hiberlink ETD

Page 40: Reference Rot and Linked Data: Threat and Remedy

Thank you, Questions welcome

& check: http://hiberlink.org/news.html

http://hiberlink.org #hiberlink

Funded by the Andrew W. Mellon Foundation

Email: [email protected]