Digital Preservation Research at Old Dominion University
Justin F. Brunelle
The MITRE Corporation
Old Dominion University
(And hopefully MITRE, soon)
Why are we listening?
• Overview of the problem
• BRIEF introduction to ODU WSDL group research
• Memento
• I’ll be skipping around, so don’t hesitate to interrupt me
Digital Preservation
• Using the past Web
– Focus of our research
• Temporal Browsing
– Sessions in the past
• Recovering Lost Pages
– Is it really gone?
• 404s
– How to fix broken links?
1. Same URI maps to the same or very similar content at a later time
– time A: U1 → C1; time B: U1 → C1
2. Same URI maps to different content at a later time
– time A: U1 → C1; time B: U1 → C2
3. Different URI maps to the same or very similar content at the same or a later time
– time A: U1 → C1; time B: U1 → 404, U2 → C1
4. The content cannot be found at any URI
– time A: U1 → C1; time B: U1 → ??
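The four cases above can be sketched as a small classifier. This is only an illustration: the function name `classify_change` and the `web_now` dictionary are hypothetical, and exact string equality stands in for "same or very similar content".

```python
from typing import Dict, Optional


def classify_change(uri: str, content_then: str,
                    web_now: Dict[str, Optional[str]]) -> int:
    """Return the case number (1-4) for a URI that held content_then at time A.

    web_now maps each URI known at time B to its content; None means a 404.
    Assumption: content is compared with ==, a stand-in for "very similar".
    """
    current = web_now.get(uri)
    if current == content_then:
        return 1  # same URI, same content
    if current is not None:
        return 2  # same URI, different content
    # The URI is gone (404): look for the old content under another URI.
    for other_uri, other_content in web_now.items():
        if other_uri != uri and other_content == content_then:
            return 3  # different URI, same content
    return 4  # content cannot be found at any URI


print(classify_change("U1", "C1", {"U1": None, "U2": "C1"}))
```

The sketch checks the cases in order, so a page that is unchanged at U1 and also copied to U2 is reported as case 1.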
Change on the Web
Time to Talk About Saving Everything?
• Dinner for one or two costs more than a 1 TB disk
• Wikis have popularized versioning
• Cool URIs (http://www.w3.org/Provider/Style/URI.html) are widely adopted, e.g.:
– http://news.yahoo.com/s/ap/20100920/ap_on_el_se/us_alaska_senate
– http://d.yimg.com/a/p/ap/20100918/capt.67567dbc0a874b689f0b4a5c392f379c-67567dbc0a874b689f0b4a5c392f379c-0.jpg
– http://d.yimg.com/a/p/afp/20100918/thumb.photo_1284846332993-1-0.jpg
• Also related projects with cool URI / permalink focus:
– http://www.citability.org/
– http://data.gov/
– http://data.gov.uk/
Fortress Model
• Get a lot of money
• Buy lots of storage
• Hire lots of people
• “Look upon my archive ye Mighty, and despair!”
Alternate Methods
• Lazy Preservation (McCown)
– “How much preservation do I get if I do absolutely nothing?”
• Just-In-Time Preservation (Klein)
– Wait for it to disappear, then find a “good ‘nuff” version
• Shared Infrastructure Preservation
– Push content to sites that might preserve it
• arXiv.org, IA, WebCite…
• Server Enhanced Preservation
– Create archival-ready resources
And Soon…
• Social Preservation
– Preserving resources using 3rd party Web Services
– Repository for OAI-ORE ReMs
– Social network feel
– Lazy-esque, server-side reconstruction
But I digress…
• Few years away…
• Preliminary research
• And now back to the prior research…
Web Infrastructure (McCown, 2007)
WayBack Machine
http://web.archive.org/web/*/http://www.thecribs.com/
http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://www.thecribs.com/
From these we can create time-based:
• indexes
• IDF values
• PageRank
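As a rough illustration of a time-based IDF value, the sketch below computes a smoothed inverse document frequency using only snapshots captured on or before a cutoff date. The function name `idf_at`, the snapshot structure, and the smoothing formula are assumptions made for this example, not the method used in the research.

```python
import math
from datetime import datetime
from typing import Iterable, Set, Tuple


def idf_at(term: str,
           snapshots: Iterable[Tuple[datetime, Set[str]]],
           cutoff: datetime) -> float:
    """Smoothed IDF of term over snapshots captured on or before cutoff."""
    docs = [terms for when, terms in snapshots if when <= cutoff]
    if not docs:
        raise ValueError("no snapshots on or before the cutoff")
    # Document frequency: how many of the selected snapshots contain the term.
    df = sum(1 for terms in docs if term in terms)
    # Assumption: add-one smoothing so that df == 0 does not divide by zero.
    return math.log((1 + len(docs)) / (1 + df))


# Hypothetical snapshots: (capture datetime, set of terms on the page).
snaps = [
    (datetime(2000, 1, 1), {"web", "archive"}),
    (datetime(2005, 1, 1), {"web"}),
    (datetime(2010, 1, 1), {"memento", "web"}),
]
print(idf_at("archive", snaps, datetime(2006, 1, 1)))
```

Moving the cutoff changes which snapshots count, so the same term can have a different IDF at different points in time.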
Batch Recovery For Sites
http://warrick.cs.odu.edu/
Free limo rides for life?!
Reconstruction Diagram
added 20%
identical 50%
changed 33%
missing 17%
Real-Time Recovery for URIs
Synchronicity - www.cs.odu.edu/~mklein/
Memento wants to make navigating the Web’s Past Easy
http://www.mementoweb.org
http://groups.google.com/group/memento-dev
What are you talking about?
• Uniform Resource Identifier (URI) ~= URL
• Resource:
– <HTML>
• Representation
W3C Web Architecture: Resource – URI – Representation
• URI identifies Resource
• Representation represents Resource
• Dereferencing the URI yields the Representation
W3C Web Architecture: Resource – URI – Representation (with content negotiation)
• URI identifies Resource
• Representation 1 and Representation 2 each represent Resource
• Dereferencing the URI with content negotiation selects one Representation
Resources
Resources have Representations
Resources have Representations that Change over Time
Only the Current Representation is Available from a Resource
Old Representations are Lost Forever
Finding Archived Resources
Go to http://www.archive.org/ and search for http://cnn.com
On http://web.archive.org/web/*/http://cnn.com, select desired datetime
Archived Resources
• http://web.archive.org/web/20010911203610/http://www.cnn.com/ is an archived resource for http://cnn.com
– Sep 11 2001, 20:36:10 UTC
• http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 is an archived resource for http://en.wikipedia.org/wiki/September_11_attacks
– Dec 20 2001, 4:51:00 UTC
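The Internet Archive URI above embeds its capture time as a 14-digit timestamp. A minimal sketch of reading it back, assuming the `http://web.archive.org/web/<YYYYMMDDhhmmss>/<original URI>` layout seen in this example (the helper name `parse_wayback_uri` is hypothetical):

```python
from datetime import datetime, timezone
from typing import Tuple

WAYBACK_PREFIX = "http://web.archive.org/web/"


def parse_wayback_uri(uri_m: str) -> Tuple[datetime, str]:
    """Split an archived URI into (capture datetime in UTC, original URI)."""
    if not uri_m.startswith(WAYBACK_PREFIX):
        raise ValueError("not a web.archive.org/web/ URI")
    # Everything up to the first "/" after the prefix is the timestamp.
    stamp, _, original = uri_m[len(WAYBACK_PREFIX):].partition("/")
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return when, original


when, original = parse_wayback_uri(
    "http://web.archive.org/web/20010911203610/http://www.cnn.com/"
)
print(when.isoformat(), original)
```

The Wikipedia `oldid` URI carries no datetime in the URI itself, which is one reason a uniform protocol-level datetime is useful.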
Navigating Archived Resources
• http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 is an archived resource for http://en.wikipedia.org/wiki/September_11_attacks
– Dec 20 2001, 4:51:00 UTC
• Its “Pentagon” link leads to the current http://en.wikipedia.org/wiki/The_Pentagon
Current and Past Web are Not Integrated
• Current and Past Web based on same technology.
• But, going from Current to Past Web is a matter of (manual) discovery.
• Memento wants to make going from Current to Past Web a (HTTP) protocol matter.
• Memento wants to integrate Current And Past Web.
One Memento HTTP Navigation
Memento HTTP Flow
(R = original resource, G = TimeGate, M = Memento, B = TimeBundle)
• HEAD R, Accept-Datetime → Link G
• GET G, Accept-Datetime → 302 M, Vary, TCN, Link R, B, M
• GET M, Accept-Datetime → 200, Content-Datetime, Link R, B, M
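The three-step flow can be sketched end to end without touching the network. Everything here is illustrative: `fake_server` is a hypothetical stand-in that returns canned responses modeled on this talk's cnn.com example, and the Link handling only reads the first `<...>` target.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str]]  # (status code, headers)

URI_R = "http://cnn.com/"
URI_G = "http://web.archive.org/web/timegate/http://cnn.com"
URI_M = "http://web.archive.org/web/20010911203610/http://www.cnn.com"


def fake_server(method: str, uri: str, accept_datetime: str) -> Response:
    """Canned responses; accept_datetime is accepted but not interpreted."""
    if uri == URI_R:
        return 200, {"Link": f'<{URI_G}>; rel="timegate"'}
    if uri == URI_G:
        return 302, {"Location": URI_M, "TCN": "choice",
                     "Vary": "negotiate, accept-datetime"}
    if uri == URI_M:
        return 200, {"Content-Datetime": "Tue, 11 Sep 2001 20:36:10 GMT"}
    return 404, {}


def find_memento(uri_r: str, accept_datetime: str,
                 send: Callable[[str, str, str], Response] = fake_server
                 ) -> Tuple[str, str]:
    """Follow R -> G -> M and return (URI-M, Content-Datetime)."""
    # Step 1: HEAD on the original resource to discover its TimeGate.
    _, headers = send("HEAD", uri_r, accept_datetime)
    link = headers.get("Link", "")
    if 'rel="timegate"' not in link:
        raise LookupError("no TimeGate advertised")
    uri_g = link[link.index("<") + 1:link.index(">")]
    # Step 2: GET on the TimeGate, which redirects to a Memento.
    status, headers = send("GET", uri_g, accept_datetime)
    if status != 302:
        raise LookupError("TimeGate did not redirect")
    uri_m = headers["Location"]
    # Step 3: GET on the Memento itself.
    status, headers = send("GET", uri_m, accept_datetime)
    if status != 200:
        raise LookupError("Memento not available")
    return uri_m, headers["Content-Datetime"]


result = find_memento(URI_R, "Tue, 11 Sep 2001 20:35:00 GMT")
print(result)
```

Passing a different `send` callable would let the same three steps run against a real HTTP client.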
Scenario
• cnn.com includes Link to TimeGate at Internet Archive
• URI-R on one server, URI-G & URI-M on another
Memento HTTP Flow: URI-R
HEAD R, Accept-Datetime
HEAD http://cnn.com/ HTTP/1.1
Host: cnn.com
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
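A small sketch of building this request, mainly to show how a datetime becomes the HTTP-date used in Accept-Datetime. The helper name `head_request` is hypothetical; it only builds the text and sends nothing.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def head_request(uri_r: str, host: str, when: datetime) -> str:
    """Build the raw HEAD request text with an Accept-Datetime header.

    Assumption: `when` is timezone-aware; it is converted to UTC first.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    lines = [
        f"HEAD {uri_r} HTTP/1.1",
        f"Host: {host}",
        f"Accept-Datetime: {http_date}",
        "Connection: close",
        "",  # blank line ends the header block
        "",
    ]
    return "\r\n".join(lines)


req = head_request("http://cnn.com/", "cnn.com",
                   datetime(2001, 9, 11, 20, 35, 0, tzinfo=timezone.utc))
print(req)
```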
Memento HTTP Flow: Success – URI-R
Link G
HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:02:12 GMT
Server: Apache
Link: <http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"
Content-Length: 255
Connection: close
Content-Type: text/html; charset=iso-8859-1
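A client has to pull the TimeGate URI out of the Link header. The sketch below is a simplified parser, not a full implementation of the Link header grammar: it assumes every parameter value is double-quoted, as in the responses shown in this talk. The names `parse_link_header` and `find_rel` are hypothetical.

```python
import re
from typing import Dict, List, Optional

# One link: <target> followed by zero or more ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value: str) -> List[Dict[str, str]]:
    """Return one dict per link, with the target under the key "uri"."""
    links = []
    for target, params in LINK_RE.findall(value):
        entry = {"uri": target}
        entry.update(PARAM_RE.findall(params))
        links.append(entry)
    return links


def find_rel(value: str, rel: str) -> Optional[str]:
    """Return the first target whose rel attribute contains rel, else None."""
    for link in parse_link_header(value):
        if rel in link.get("rel", "").split():
            return link["uri"]
    return None


header = '<http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"'
print(find_rel(header, "timegate"))
```

Because quoted values are matched as a unit, commas inside a `datetime="..."` value do not split a link in two.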
Memento HTTP Flow: URI-G
GET G, Accept-Datetime
GET http://web.archive.org/web/timegate/http://cnn.com HTTP/1.1
Host: web.archive.org
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
Memento HTTP Flow: Success – URI-G
302 M, Vary, Link R, B, M
HTTP/1.1 302 Found
Date: Thu, 21 Jan 2010 00:06:50 GMT
Server: Apache
TCN: choice
Vary: negotiate, accept-datetime
Location: http://web.archive.org/web/20010911203610/http://www.cnn.com
Link: <http://cnn.com/>; rel="original",
 <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
 <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Tue, 15 Sep 2000 11:28:26 GMT",
 <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
 <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
 <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
Content-Length: 0
Connection: close
Content-Type: text/plain; charset=UTF-8
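How a TimeGate chooses where to redirect is up to the archive. One plausible policy, assumed here purely for illustration, is to pick the Memento whose datetime is closest to the requested Accept-Datetime. The function name `closest_memento` is hypothetical, and the datetimes below are read from the 14-digit timestamps in the example URIs.

```python
from datetime import datetime, timezone
from typing import Dict


def closest_memento(mementos: Dict[str, datetime], requested: datetime) -> str:
    """Return the URI-M whose datetime is nearest to the requested datetime."""
    if not mementos:
        raise LookupError("no Mementos known for this resource")
    return min(mementos, key=lambda uri: abs(mementos[uri] - requested))


utc = timezone.utc
known = {
    "http://web.archive.org/web/20000915112826/http://www.cnn.com":
        datetime(2000, 9, 15, 11, 28, 26, tzinfo=utc),
    "http://web.archive.org/web/20010911203610/http://www.cnn.com":
        datetime(2001, 9, 11, 20, 36, 10, tzinfo=utc),
    "http://web.archive.org/web/20080708093433/http://www.cnn.com":
        datetime(2008, 7, 8, 9, 34, 33, tzinfo=utc),
}
chosen = closest_memento(known, datetime(2001, 9, 11, 20, 35, 0, tzinfo=utc))
print(chosen)
```

With the requested datetime from this example, the policy lands on the same URI as the Location header above.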
Memento HTTP Flow: URI-M
GET M, Accept-Datetime
GET http://web.archive.org/web/20010911203610/http://www.cnn.com HTTP/1.1
Host: web.archive.org
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
Memento HTTP Flow: Success – URI-M
200, Content-Datetime, Link R, B, M
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Archive-Orig-Accept-Ranges: bytes
…
Content-Type: text/html;charset=utf-8
Content-Length: 23364
Date: Thu, 21 Jan 2010 00:09:40 GMT
Content-Datetime: Tue, 11 Sep 2001 20:36:10 GMT
Link: <http://cnn.com/>; rel="original",
 <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
 <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Tue, 15 Sep 2000 11:28:26 GMT",
 <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
 <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
 <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
Connection: close
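The Memento rarely matches the requested datetime exactly, so a client may want to know how far off it is. A minimal sketch, using the two header values from this example (the helper name `datetime_offset_seconds` is hypothetical):

```python
from email.utils import parsedate_to_datetime


def datetime_offset_seconds(accept_datetime: str, content_datetime: str) -> float:
    """Seconds from the requested datetime to the archived one.

    Positive means the Memento was captured after the requested moment.
    """
    requested = parsedate_to_datetime(accept_datetime)
    archived = parsedate_to_datetime(content_datetime)
    return (archived - requested).total_seconds()


offset = datetime_offset_seconds(
    "Tue, 11 Sep 2001 20:35:00 GMT",  # Accept-Datetime sent by the client
    "Tue, 11 Sep 2001 20:36:10 GMT",  # Content-Datetime returned for URI-M
)
print(offset)
```

Here the Memento was captured 70 seconds after the requested moment.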
What does it all mean?
• Cutting edge technology
• Existing Infrastructure
• Redefining Web surfing
• MAJOR “real world” implications
Closing Thoughts
• Preservation not for privileged priesthood
– http://doi.acm.org/10.1145/1592761.1592794
– http://booktwo.org/notebook/wikipedia-historiography/
• No more hoary stories about format obsolescence:
– http://blog.dshr.org/2010/09/reinforcing-my-point.html
• Don't desiccate resources; leave them on the web
• Endless metadata is not preservation…
• Archiving as branded service, not infrastructure
– http://blog.dshr.org/2010/06/jcdl-2010-keynote.html
Acknowledgements
• Slides borrowed from:
• Dr. Michael L. Nelson:
– http://www.slideshare.net/phonedude/my-point-of-view-michael-l-nelson-web-archiving-cooperative
– http://www.slideshare.net/phonedude/review-of-web-archiving
– http://www.slideshare.net/phonedude/memento-time-travel-for-the-web
• Martin Klein:
– http://www.slideshare.net/phonedude/synchronicity-justintime-discovery-of-lost-web-pages