How much preservation do I get if I do absolutely nothing?
Using the Web Infrastructure for Digital Preservation
Michael L. Nelson, Frank McCown, Joan A. Smith, Martin KleinOld Dominion University
Norfolk VA, USA
{mln,fmccown,jsmit,mklein}@cs.odu.edu
Media Production Berlin 2006
Berlin, Germany
December 8, 2006
Research supported in part by NSF, Library of Congress and Andrew Mellon Foundation
• How much digital preservation of websites is afforded by lazy preservation?
– Can we reconstruct entire websites from the WI?
– What factors contribute to the success of website reconstruction?
– Can we predict how much of a lost website can be recovered?
– How can the WI be utilized to provide preservation of server-side components?
Warrick: Crawling the Crawlers
• Is website reconstruction from the WI feasible?
– Web repositories: Google (G), MSN (M), Yahoo (Y), Internet Archive (IA)
– Reconstructed 24 websites
• How long do search engines keep cached content after it is removed?
SE Caching Experiment
• Create HTML, PDF, and image files
• Place files on 4 web servers
• Remove files on a regular schedule
• Examine web server logs to determine when each page is crawled and by whom
• Query each search engine daily using a unique identifier to see if they have cached the page or image
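The daily querying step above can be sketched in code. This is a minimal illustration of the bookkeeping only: the engine list matches the study, but the query format, the test identifier, and all function names are assumptions, and the real experiment queried each engine's live interface rather than simulating the result.

```python
# Sketch of the daily cache-monitoring loop: each test page embeds a unique
# identifier, and once a day we ask each search engine whether a cached copy
# containing that identifier still exists, logging the answer.
from datetime import date

ENGINES = ["google", "msn", "yahoo"]  # engines monitored in the experiment

def make_query(engine: str, unique_id: str) -> str:
    """Build a per-engine lookup key for a page's unique identifier.
    (Placeholder format; the real experiment queried each engine's interface.)"""
    return f"{engine}:{unique_id}"

def record_observation(log: list, engine: str, unique_id: str,
                       cached: bool, day: date) -> None:
    """Append one (day, engine, id, cached?) observation to the experiment log."""
    log.append({"day": day.isoformat(), "engine": engine,
                "id": unique_id, "cached": cached})

# Example: log one day's observations for a single test page.
log = []
for engine in ENGINES:
    # In the real experiment `cached` comes from querying the engine;
    # here it is hard-coded purely to show the logging structure.
    record_observation(log, engine, "odu-test-8f3a", cached=True,
                       day=date(2006, 12, 8))
```

Plotting when `cached` flips from True to False, per engine, gives the cache-persistence curves the experiment was after.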
Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites. D-Lib Magazine, February 2006, 12(2)
3. Store most recently cached version or canonical version
4. Parse HTML for links to other resources
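Step 4 can be sketched with the standard library alone. This is not Warrick's actual parser — a real reconstructor also resolves relative URLs against the page's base and filters out off-site links — but it shows how link extraction feeds the queue of resources to recover.

```python
# Minimal link extractor: collect href/src values from a cached HTML page
# so further resources can be queued for recovery.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href/src attribute values from tags such as <a> and <img>."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

page = '<html><body><a href="about.html">About</a><img src="logo.png"></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # -> ['about.html', 'logo.png']
```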
How Much Did We Reconstruct?
[Diagram: “lost” web site (resources A, B, C, D, E, F) beside the reconstructed web site (A, B′, C′, G, E, F). Annotations: missing link to D; points to old resource G; F can’t be found.]
Reconstruction Diagram
[Chart: per-resource outcome of the sample reconstruction — identical 50%, changed 33%, missing 17%, added 20%]
Websites to Reconstruct
• Reconstruct 24 sites in 3 categories:
1. small (1-150 resources)
2. medium (150-499 resources)
3. large (500+ resources)
• Use Wget to download the current website
• Use Warrick to reconstruct it
• Calculate the reconstruction vector
Results
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
• Passive, “piggybacking”
• History list of receiver domains
– not maintained; history pointer off » duplicates
– maintained; history pointer on » no duplicates
• Granularity filter for emails
– every Gth email will be processed
SMTP Results
[Chart: results with no history pointer vs. with history pointer, G = 1]
Summary
• Shared Infrastructure Preservation provides a communications channel with unknown, future trading partners
– SMTP approach is only feasible for “advertising” the existence of the repository
– NNTP approach is promising for holding content
• Lazy Preservation has been used to restore several dozen websites
– but is it an archival strategy? Depends on your tolerance for risk
– prediction: search engines will see preservation as a business