Website Reconstruction using the Web Infrastructure
Frank McCown
http://www.cs.odu.edu/~fmccown/
Doctoral Consortium
June 11, 2006
Web Infrastructure

HTTP 404

Cost of Preservation
[Figure: preservation approaches plotted by publisher's cost (time, equipment, knowledge) against coverage of the Web, ranging from client-view to server-view techniques: browser caches, Furl/Spurl, SE caches, web archives (e.g., Hanzo:web), LOCKSS, TTApache, iPROXY, InfoMonitor, and filesystem backups]
Research Questions
- How much digital preservation of websites is afforded by lazy preservation?
- Can we reconstruct entire websites from the WI?
- What factors contribute to the success of website reconstruction?
- Can we predict how much of a lost website can be recovered?
- How can the WI be utilized to provide preservation of server-side components?
Prior Work
Is website reconstruction from WI feasible? Web repository: G,M,Y,IA Web-repository crawler: Warrick Reconstructed 24 websites
How long do search engines keep cached content after it is removed?
9
Timeline of SE Resource Acquisition and Release
(tca: time the resource is cached; tr: time it is removed from the web server; tcr: time it is removed from the cache)
- Vulnerable resource – not yet cached (tca is not defined)
- Replicated resource – available on the web server and in the SE cache (tca < current time < tr)
- Endangered resource – removed from the web server but still cached (tr < current time < tcr)
- Unrecoverable resource – missing from both the web server and the cache (tca < tcr < current time)
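The four states above reduce to a small decision rule over the timestamps. A minimal Python sketch (the function and names are illustrative, not from Warrick):

```python
from enum import Enum

class State(Enum):
    VULNERABLE = "vulnerable"        # not yet cached
    REPLICATED = "replicated"        # on the web server and in the SE cache
    ENDANGERED = "endangered"        # removed from the server, still cached
    UNRECOVERABLE = "unrecoverable"  # gone from both server and cache

def classify(now, t_ca=None, t_r=None, t_cr=None):
    """Classify a resource by the timeline above.

    t_ca: time the SE cached the resource (None if never cached)
    t_r:  time the resource was removed from the web server
    t_cr: time the SE purged its cached copy
    """
    if t_ca is None or now < t_ca:
        return State.VULNERABLE
    if t_cr is not None and t_cr <= now:
        return State.UNRECOVERABLE
    if t_r is not None and t_r <= now:
        return State.ENDANGERED
    return State.REPLICATED
```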
Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites, D-Lib Magazine, 12(2), February 2006.
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
How Much Did We Reconstruct?
[Figure: a "lost" website with resources A-F beside its reconstruction. A and E are recovered; B and C come back as older versions B' and C'; the link to D is missing and instead points to an old resource G; F cannot be found.]
Reconstruction Diagram
- added: 20%
- identical: 50%
- changed: 33%
- missing: 17%
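For illustration, figures like those above can be computed by comparing the original and reconstructed site maps. A sketch, assuming each site is a mapping from URL to content hash (using the original site's size as the denominator for "added" is a modeling choice, not the paper's definition):

```python
def reconstruction_stats(original, reconstructed):
    """original, reconstructed: dicts mapping URL -> content hash.

    Returns the fractions of the original site that came back
    identical, changed, or missing, plus the fraction of extra
    resources recovered that were not in the original ("added",
    e.g. stale pages still held by a repository).
    """
    identical = sum(1 for u, h in original.items()
                    if reconstructed.get(u) == h)
    changed = sum(1 for u in original
                  if u in reconstructed and reconstructed[u] != original[u])
    missing = len(original) - identical - changed
    added = sum(1 for u in reconstructed if u not in original)
    n = len(original) or 1
    return {"identical": identical / n, "changed": changed / n,
            "missing": missing / n, "added": added / n}
```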
Results
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
Warrick Milestones
- www2006.org – first lost website reconstructed (Nov 2005)
- DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)
- www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006)
- Internet Archive officially “blesses” Warrick (mid Mar 2006)¹

¹ http://frankmccown.blogspot.com/2006/03/warrick-is-gaining-traction.html
Proposed Work
- How lazy can we afford to be?
  - Find the factors influencing the success of website reconstruction from the WI
  - Perform a search-engine cache characterization
  - Inject server-side components into the WI for complete website reconstruction
- Improving the Warrick crawler
  - Evaluate different crawling policies
  - Develop a web-repository API for inclusion in Warrick
Factors Influencing Website Recoverability from the WI
- A previous study did not find a statistically significant relationship between recoverability and website size or PageRank
- Methodology:
  - Sample a large number of websites from dmoz.org
  - Perform several reconstructions over time using the same policy
  - Download the sites several times over time to capture their change rates
Evaluation
- Use statistical analysis to test the following factors: size, makeup, path depth, PageRank, and change rate
- Create a predictive model: how much of my lost website can I expect to get back?
SE Cache Characterization
- Web characterization is an active field, but search-engine caches have never been characterized
- Methodology:
  - Randomly sample URLs from four popular search engines: Google, MSN, Yahoo, Ask
  - Access the cached version if present
  - Download the live version from the Web
  - Examine HTTP headers and page content
  - Attempt to access various resource types (PDF, Word, PS, etc.) in each SE cache
Evaluation
- Compute the ratio of indexed to cached resources
- Find the types, sizes, and ages of cached resources
- Do the HTTP Cache-Control directives ‘no-cache’ and ‘no-store’ stop resources from being cached?
- How prevalent is the NOARCHIVE meta tag for keeping HTML pages from being cached?
- Compare the different SE caches
- How much of the Web is cached by SEs? What is the overlap with the Internet Archive?
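The Cache-Control question above can be tested by pairing each sampled URL's live headers with its observed cache status. A hedged sketch of the header check only (the decision rule is what the directives ask for, not observed SE behavior; whether SEs actually honor it is the open question):

```python
def se_may_cache(headers):
    """Would a directive-honoring search engine cache this response?

    headers: dict of HTTP response headers, keys assumed lowercased
    by the caller. Checks the Cache-Control directives plus the
    X-Robots-Tag form of the NOARCHIVE convention (the HTML meta-tag
    form would require parsing the page body).
    """
    cc = headers.get("cache-control", "").lower()
    directives = {d.strip() for d in cc.split(",")}
    if "no-cache" in directives or "no-store" in directives:
        return False
    if "noarchive" in headers.get("x-robots-tag", "").lower():
        return False
    return True
```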
Marshall TR Server – running EPrints
We can recover the missing page and PDF, but what about the services?
Recovery of Web Server Components
- Recovering the client-side representation is not enough to reconstruct a dynamically produced website
- How can we inject the server-side functionality into the WI?
- Web repositories like HTML:
  - Canonical versions are stored by all web repos
  - It is text-based
  - Comments can be inserted without changing the appearance of the page
Injection Techniques
Inject entire server file into HTML comments Divide server file into parts and insert parts
into HTML comments Use erasure codes to break a server file into
chunks and insert the chunks into HTML comments of different pages
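As an illustration of the second and third techniques, the sketch below splits a server file into indexed chunks hidden in HTML comments and reassembles it from recovered pages. It uses plain splitting rather than a real erasure code (e.g., Reed-Solomon, where any r of n chunks would suffice, this version needs all n), and the WARRICK-CHUNK marker is invented for the example:

```python
import base64

MARK = "WARRICK-CHUNK"  # hypothetical marker, not an actual Warrick format

def inject(server_file: bytes, pages: list, n: int) -> list:
    """Split server_file into n base64 chunks and append each, as an
    HTML comment, to one of the given pages (cycling through them).
    Returns the list of n modified page copies."""
    data = base64.b64encode(server_file).decode()
    size = -(-len(data) // n)  # ceiling division
    out = []
    for i in range(n):
        chunk = data[i * size:(i + 1) * size]
        comment = f"<!-- {MARK} {i}/{n} {chunk} -->"
        out.append(pages[i % len(pages)] + "\n" + comment)
    return out

def recover(pages: list) -> bytes:
    """Scan recovered pages for chunk comments and reassemble the file."""
    chunks = {}
    for page in pages:
        for line in page.splitlines():
            line = line.strip()
            if line.startswith(f"<!-- {MARK}"):
                _, _, idx, chunk, _ = line.split(" ")
                i, _ = idx.split("/")
                chunks[int(i)] = chunk
    data = "".join(chunks[i] for i in sorted(chunks))
    return base64.b64decode(data)
```

Base64 keeps the payload free of characters (like `--`) that would terminate an HTML comment early.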
Recover Server File from WI
Evaluation
Find the most efficient values for n and r (chunks created/recovered)
Security Develop simple mechanism for selecting files that
can be injected into the WI Address encryption issues
Reconstruct an EPrints website with a few hundred resources
Recent Work
- URL canonicalization
- Crawling policies: naïve, knowledgeable, and exhaustive
- Reconstructed 24 websites with each policy
- Found that the exhaustive and knowledgeable policies are significantly more efficient at recovering websites
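One plausible way to compare the policies is recovered resources per repository request. This is an assumed stand-in for the paper's efficiency measure, not its exact definition, and the tallies below are invented for illustration:

```python
def efficiency_ratio(recovered: int, requests: int) -> float:
    """Recovered resources per web-repository request: a lazier policy
    that issues fewer queries for the same recovery scores higher."""
    return recovered / requests if requests else 0.0

# Hypothetical tallies from reconstructing one site under each policy
runs = {
    "naive":         {"recovered": 40, "requests": 400},
    "knowledgeable": {"recovered": 40, "requests": 120},
    "exhaustive":    {"recovered": 42, "requests": 160},
}
ranked = sorted(runs, key=lambda p: efficiency_ratio(**runs[p]), reverse=True)
```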
[Figure: histogram of efficiency-ratio bins (0.1–1.0) versus frequency for the naïve, knowledgeable, and exhaustive policies]
Frank McCown and Michael L. Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006, To appear.
Warrick API
- The API should provide a clear and flexible interface to web repositories
- Goals:
  - Shield Warrick from changes to the WI
  - Facilitate the inclusion of new web repositories
  - Minimize implementation and maintenance costs
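The goals above suggest a thin abstract interface that each repository adapter implements, so repository-specific quirks stay out of the crawler. A sketch (the method names are illustrative, not the actual Warrick API):

```python
from abc import ABC, abstractmethod
from typing import Optional

class WebRepository(ABC):
    """Abstraction layer between Warrick and one web repository
    (a search-engine cache or a web archive)."""

    @abstractmethod
    def lookup(self, url: str) -> Optional[dict]:
        """Return stored metadata for url, or None if it is not held."""

    @abstractmethod
    def retrieve(self, url: str) -> bytes:
        """Fetch the repository's stored copy of url."""

class InMemoryRepo(WebRepository):
    """Toy repository for exercising Warrick-style code offline."""

    def __init__(self, store: dict):
        self.store = store  # URL -> stored bytes

    def lookup(self, url):
        if url not in self.store:
            return None
        return {"url": url, "size": len(self.store[url])}

    def retrieve(self, url):
        return self.store[url]
```

Adding a new repository then means writing one small subclass, which is exactly the "minimize implementation and maintenance costs" goal.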
Evaluation
- The Internet Archive has endorsed the use of Warrick
- Make Warrick available on SourceForge
- Measure community adoption and modification
Risks and Threats
- Time required for enough resources to be cached
- Search-engine caching behavior may change at any time
- Repository antagonism: spam, cloaking

Timeline
Summary
When this work is completed, I will have…
- demonstrated and evaluated the lazy preservation technique
- provided a reference implementation
- characterized SE caching behavior
- provided a layer of abstraction on top of SE behavior (the API)
- explored how much we can store in the WI (server-side vs. client-side representations)
Thank You
Questions?