Website Reconstruction using the Web Infrastructure
Frank McCown
http://www.cs.odu.edu/~fmccown/
Doctoral Consortium
June 11, 2006
Web Infrastructure

HTTP 404

Cost of Preservation
[Figure: preservation approaches plotted by publisher's cost (time, equipment, knowledge) against coverage of the Web, ranging from client-view to server-view techniques: browser caches, Furl/Spurl, SE caches, web archives (e.g., Hanzo:web), LOCKSS, TTApache, iPROXY, InfoMonitor, and filesystem backups]
Research Questions
- How much digital preservation of websites is afforded by lazy preservation?
- Can we reconstruct entire websites from the WI?
- What factors contribute to the success of website reconstruction?
- Can we predict how much of a lost website can be recovered?
- How can the WI be utilized to provide preservation of server-side components?
Prior Work
Is website reconstruction from WI feasible? Web repository: G,M,Y,IA Web-repository crawler: Warrick Reconstructed 24 websites
How long do search engines keep cached content after it is removed?
9
Timeline of SE Resource Acquisition and Release
(tca: time the resource is cached; tr: time it is removed from the web server; tcr: time it is removed from the cache)
- Vulnerable resource – not yet cached (tca is not defined)
- Replicated resource – available on the web server and in the SE cache (tca < current time < tr)
- Endangered resource – removed from the web server but still cached (tr < current time < tcr)
- Unrecoverable resource – missing from both the web server and the cache (tca < tcr < current time)
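The four states above reduce to a small decision rule over the timestamps. A minimal Python sketch (the function and names are illustrative, not from Warrick):

```python
from enum import Enum

class State(Enum):
    VULNERABLE = "vulnerable"        # not yet cached
    REPLICATED = "replicated"        # on the web server and in the SE cache
    ENDANGERED = "endangered"        # removed from the server, still cached
    UNRECOVERABLE = "unrecoverable"  # gone from both server and cache

def classify(now, t_ca=None, t_r=None, t_cr=None):
    """Classify a resource by the timeline above.

    t_ca: time the SE cached the resource (None if never cached)
    t_r:  time the resource was removed from the web server
    t_cr: time the SE purged its cached copy
    """
    if t_ca is None or now < t_ca:
        return State.VULNERABLE
    if t_cr is not None and t_cr <= now:
        return State.UNRECOVERABLE
    if t_r is not None and t_r <= now:
        return State.ENDANGERED
    return State.REPLICATED
```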
Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites, D-Lib Magazine, 12(2), February 2006.
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
How Much Did We Reconstruct?
[Figure: a "lost" website with resources A-F beside its reconstruction. A and E are recovered; B and C come back as older versions B' and C'; the link to D is missing and instead points to an old resource G; F cannot be found.]
Reconstruction Diagram
- added: 20%
- identical: 50%
- changed: 33%
- missing: 17%
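For illustration, figures like those above can be computed by comparing the original and reconstructed site maps. A sketch, assuming each site is a mapping from URL to content hash (using the original site's size as the denominator for "added" is a modeling choice, not the paper's definition):

```python
def reconstruction_stats(original, reconstructed):
    """original, reconstructed: dicts mapping URL -> content hash.

    Returns the fractions of the original site that came back
    identical, changed, or missing, plus the fraction of extra
    resources recovered that were not in the original ("added",
    e.g. stale pages still held by a repository).
    """
    identical = sum(1 for u, h in original.items()
                    if reconstructed.get(u) == h)
    changed = sum(1 for u in original
                  if u in reconstructed and reconstructed[u] != original[u])
    missing = len(original) - identical - changed
    added = sum(1 for u in reconstructed if u not in original)
    n = len(original) or 1
    return {"identical": identical / n, "changed": changed / n,
            "missing": missing / n, "added": added / n}
```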
Results
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
Warrick Milestones
- www2006.org – first lost website reconstructed (Nov 2005)
- DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)
- www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006)
- Internet Archive officially “blesses” Warrick (mid Mar 2006)¹

¹ http://frankmccown.blogspot.com/2006/03/warrick-is-gaining-traction.html
Proposed Work
- How lazy can we afford to be?
  - Find the factors influencing the success of website reconstruction from the WI
  - Perform a search-engine cache characterization
  - Inject server-side components into the WI for complete website reconstruction
- Improving the Warrick crawler
  - Evaluate different crawling policies
  - Develop a web-repository API for inclusion in Warrick
Factors Influencing Website Recoverability from the WI
- A previous study did not find a statistically significant relationship between recoverability and website size or PageRank
- Methodology:
  - Sample a large number of websites from dmoz.org
  - Perform several reconstructions over time using the same policy
  - Download the sites several times over time to capture their change rates
Evaluation
- Use statistical analysis to test the following factors: size, makeup, path depth, PageRank, and change rate
- Create a predictive model: how much of my lost website can I expect to get back?
SE Cache Characterization
- Web characterization is an active field, but search-engine caches have never been characterized
- Methodology:
  - Randomly sample URLs from four popular search engines: Google, MSN, Yahoo, Ask
  - Access the cached version if present
  - Download the live version from the Web
  - Examine HTTP headers and page content
  - Attempt to access various resource types (PDF, Word, PS, etc.) in each SE cache
Evaluation
- Compute the ratio of indexed to cached resources
- Find the types, sizes, and ages of cached resources
- Do the HTTP Cache-Control directives ‘no-cache’ and ‘no-store’ stop resources from being cached?
- How prevalent is the NOARCHIVE meta tag for keeping HTML pages from being cached?
- Compare the different SE caches
- How much of the Web is cached by SEs? What is the overlap with the Internet Archive?
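The Cache-Control question above can be tested by pairing each sampled URL's live headers with its observed cache status. A hedged sketch of the header check only (the decision rule is what the directives ask for, not observed SE behavior; whether SEs actually honor it is the open question):

```python
def se_may_cache(headers):
    """Would a directive-honoring search engine cache this response?

    headers: dict of HTTP response headers, keys assumed lowercased
    by the caller. Checks the Cache-Control directives plus the
    X-Robots-Tag form of the NOARCHIVE convention (the HTML meta-tag
    form would require parsing the page body).
    """
    cc = headers.get("cache-control", "").lower()
    directives = {d.strip() for d in cc.split(",")}
    if "no-cache" in directives or "no-store" in directives:
        return False
    if "noarchive" in headers.get("x-robots-tag", "").lower():
        return False
    return True
```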
Marshall TR Server – running EPrints
We can recover the missing page and PDF, but what about the services?
Recovery of Web Server Components
- Recovering the client-side representation is not enough to reconstruct a dynamically produced website
- How can we inject the server-side functionality into the WI?
- Web repositories like HTML:
  - Canonical versions are stored by all web repos
  - It is text-based
  - Comments can be inserted without changing the appearance of the page
Injection Techniques
Inject entire server file into HTML comments Divide server file into parts and insert parts
into HTML comments Use erasure codes to break a server file into
chunks and insert the chunks into HTML comments of different pages
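As an illustration of the second and third techniques, the sketch below splits a server file into indexed chunks hidden in HTML comments and reassembles it from recovered pages. It uses plain splitting rather than a real erasure code (e.g., Reed-Solomon, where any r of n chunks would suffice, this version needs all n), and the WARRICK-CHUNK marker is invented for the example:

```python
import base64

MARK = "WARRICK-CHUNK"  # hypothetical marker, not an actual Warrick format

def inject(server_file: bytes, pages: list, n: int) -> list:
    """Split server_file into n base64 chunks and append each, as an
    HTML comment, to one of the given pages (cycling through them).
    Returns the list of n modified page copies."""
    data = base64.b64encode(server_file).decode()
    size = -(-len(data) // n)  # ceiling division
    out = []
    for i in range(n):
        chunk = data[i * size:(i + 1) * size]
        comment = f"<!-- {MARK} {i}/{n} {chunk} -->"
        out.append(pages[i % len(pages)] + "\n" + comment)
    return out

def recover(pages: list) -> bytes:
    """Scan recovered pages for chunk comments and reassemble the file."""
    chunks = {}
    for page in pages:
        for line in page.splitlines():
            line = line.strip()
            if line.startswith(f"<!-- {MARK}"):
                _, _, idx, chunk, _ = line.split(" ")
                i, _ = idx.split("/")
                chunks[int(i)] = chunk
    data = "".join(chunks[i] for i in sorted(chunks))
    return base64.b64decode(data)
```

Base64 keeps the payload free of characters (like `--`) that would terminate an HTML comment early.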
Recover Server File from WI
Evaluation
Find the most efficient values for n and r (chunks created/recovered)
Security Develop simple mechanism for selecting files that
can be injected into the WI Address encryption issues
Reconstruct an EPrints website with a few hundred resources
Recent Work
- URL canonicalization
- Crawling policies: naïve, knowledgeable, and exhaustive
- Reconstructed 24 websites with each policy
- Found that the exhaustive and knowledgeable policies are significantly more efficient at recovering websites
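One plausible way to compare the policies is recovered resources per repository request. This is an assumed stand-in for the paper's efficiency measure, not its exact definition, and the tallies below are invented for illustration:

```python
def efficiency_ratio(recovered: int, requests: int) -> float:
    """Recovered resources per web-repository request: a lazier policy
    that issues fewer queries for the same recovery scores higher."""
    return recovered / requests if requests else 0.0

# Hypothetical tallies from reconstructing one site under each policy
runs = {
    "naive":         {"recovered": 40, "requests": 400},
    "knowledgeable": {"recovered": 40, "requests": 120},
    "exhaustive":    {"recovered": 42, "requests": 160},
}
ranked = sorted(runs, key=lambda p: efficiency_ratio(**runs[p]), reverse=True)
```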
[Figure: histogram of efficiency-ratio bins (0.1–1.0) versus frequency for the naïve, knowledgeable, and exhaustive policies]
Frank McCown and Michael L. Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006, To appear.
Warrick API
- The API should provide a clear and flexible interface to web repositories
- Goals:
  - Shield Warrick from changes to the WI
  - Facilitate the inclusion of new web repositories
  - Minimize implementation and maintenance costs
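The goals above suggest a thin abstract interface that each repository adapter implements, so repository-specific quirks stay out of the crawler. A sketch (the method names are illustrative, not the actual Warrick API):

```python
from abc import ABC, abstractmethod
from typing import Optional

class WebRepository(ABC):
    """Abstraction layer between Warrick and one web repository
    (a search-engine cache or a web archive)."""

    @abstractmethod
    def lookup(self, url: str) -> Optional[dict]:
        """Return stored metadata for url, or None if it is not held."""

    @abstractmethod
    def retrieve(self, url: str) -> bytes:
        """Fetch the repository's stored copy of url."""

class InMemoryRepo(WebRepository):
    """Toy repository for exercising Warrick-style code offline."""

    def __init__(self, store: dict):
        self.store = store  # URL -> stored bytes

    def lookup(self, url):
        if url not in self.store:
            return None
        return {"url": url, "size": len(self.store[url])}

    def retrieve(self, url):
        return self.store[url]
```

Adding a new repository then means writing one small subclass, which is exactly the "minimize implementation and maintenance costs" goal.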
Evaluation
- The Internet Archive has endorsed the use of Warrick
- Make Warrick available on SourceForge
- Measure community adoption and modification
Risks and Threats
- Time required for enough resources to be cached
- Search-engine caching behavior may change at any time
- Repository antagonism: spam, cloaking

Timeline
Summary
When this work is completed, I will have…
- demonstrated and evaluated the lazy preservation technique
- provided a reference implementation
- characterized SE caching behavior
- provided a layer of abstraction on top of SE behavior (the API)
- explored how much we can store in the WI (server-side vs. client-side representations)
Thank You
Questions?