Factors Affecting Website Reconstruction from the Web Infrastructure Frank McCown, Norou Diawara, and Michael L. Nelson Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007 Vancouver, BC June 20, 2007
Jan 18, 2016
Factors Affecting Website Reconstruction from the Web Infrastructure
Frank McCown, Norou Diawara, and Michael L. Nelson
Old Dominion University
Computer Science Department
Norfolk, Virginia, USA
JCDL 2007
Vancouver, BC
June 20, 2007
2
Outline
• Web-repository crawling with Warrick
• How successful is a reconstruction?
• Reconstruction experiment
• Significant findings
3
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
4
Crawling the Crawlers
World Wide Web
Repo 1, Repo 2, ..., Repo n
Web crawling
Repo
Web-repository crawling
5
6
7
Cached Image
Cached PDF
http://www.fda.gov/cder/about/whatwedo/testtube.pdf
MSN version Yahoo version Google version
canonical
10
• McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.
• McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM/IEEE JCDL 2007.
• McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.
• McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.
Available at http://warrick.cs.odu.edu/
11
13
Measuring the Difference
(rc, rm, ra) = (changed, missing, added)
Apply Recovery Vector for each resource
Compute Difference Vector for website
14
Some Difference Vectors
D = (changed, missing, added)
(0,0,0) – Perfect recovery
(1,0,0) – All resources are recovered but changed
(0,1,0) – All resources are lost
(0,0,1) – All recovered resources are at new URIs
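The step from per-resource recovery vectors to a website-level difference vector can be sketched in code. This is a minimal sketch, not the authors' implementation. The status labels, the function name, and the normalization are assumptions: changed and missing are divided by the number of original resources, and added by the number of recovered resources, which the slides do not spell out.

```python
# Hypothetical status labels, each mapped to a recovery vector (rc, rm, ra).
RECOVERY_VECTORS = {
    "identical": (0, 0, 0),  # recovered unchanged
    "changed": (1, 0, 0),    # recovered, but differs from the original
    "missing": (0, 1, 0),    # original resource not recovered
    "added": (0, 0, 1),      # recovered resource with no original counterpart
}


def difference_vector(statuses):
    """Aggregate per-resource recovery vectors into (changed, missing, added).

    Assumed normalization: changed and missing over the original resource
    count, added over the recovered resource count.
    """
    vectors = [RECOVERY_VECTORS[status] for status in statuses]
    changed = sum(v[0] for v in vectors)
    missing = sum(v[1] for v in vectors)
    added = sum(v[2] for v in vectors)
    identical = len(vectors) - changed - missing - added

    original = identical + changed + missing
    recovered = identical + changed + added

    # Guard against empty sets so the ratios stay defined.
    d_changed = changed / original if original else 0.0
    d_missing = missing / original if original else 0.0
    d_added = added / recovered if recovered else 0.0
    return (d_changed, d_missing, d_added)


# Example: one resource in each state.
print(difference_vector(["identical", "changed", "missing", "added"]))
```

Under these assumptions a site whose resources all come back unchanged yields (0, 0, 0), and a site with nothing recovered yields (0, 1, 0).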
15
How Much Change is a Bad Thing?
Lost Recovered
17
Assigning Penalties
(Pc, Pm, Pa)
Penalty Adjustment
Apply to each resource or to the difference vector
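The slide does not show how the penalties combine with a difference vector. One plausible reading, offered only as an assumption, is a weighted sum in which each penalty scales its own component. The function name and the example weights below are hypothetical.

```python
def penalty_adjustment(difference, penalties):
    """Weighted sum of difference-vector components (assumed formula).

    difference: (changed, missing, added) fractions
    penalties:  (Pc, Pm, Pa) weights chosen by whoever judges the recovery
    """
    d_changed, d_missing, d_added = difference
    p_changed, p_missing, p_added = penalties
    return p_changed * d_changed + p_missing * d_missing + p_added * d_added


# Hypothetical weights: a changed resource costs half as much as a missing
# one, and added resources cost nothing.
print(penalty_adjustment((0.5, 0.25, 0.0), (0.5, 1.0, 0.0)))
```

Setting a penalty to zero expresses that the corresponding kind of difference is considered harmless for a given site.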
18
Defining Success
success = 1 – dm
Equivalent to percent of recovered resources
0 1
Less successful
More successful
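The success measure above depends only on the missing component of the difference vector. A minimal sketch, with a hypothetical function name and an added range check that is not part of the slides:

```python
def success(difference):
    """Return 1 - dm, where dm is the missing fraction of the difference vector."""
    _, d_missing, _ = difference
    if not 0.0 <= d_missing <= 1.0:
        raise ValueError("missing fraction must lie in [0, 1]")
    return 1.0 - d_missing


# Example: a quarter of the original resources were not recovered.
print(success((0.2, 0.25, 0.1)))
```

Because changed and added resources do not enter the formula, a site recovered entirely in altered form still scores 1 under this definition.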
19
Reconstruction Experiment
• 300 websites chosen randomly from Open Directory Project (dmoz.org)
• Crawled and reconstructed each website every week for 14 weeks
• Examined change rates, age, decay, growth, recoverability
20
Success of website recovery each week
*On average, we recovered 61% of a website in any given week.
21
Recovery of Textual Resources
22
Recovery by TLD
23
Birth and Decay
24
Recovery of HTML Resources
25
Recovery by Age
26
Statistics for Repositories
27
Which Factors Are Significant?
• External backlinks
• Internal backlinks
• Google’s PageRank
• Hops from root page
• Path depth
• MIME type
• Query string params
• Age
• Resource birth rate
• TLD
• Website size
• Size of resources
28
Mild Correlations
• Hops and
  – website size (0.428)
  – path depth (0.388)
• Age and # of query params (-0.318)
• External links and
  – PageRank (0.339)
  – Website size (0.301)
  – Hops (0.320)
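The slide does not name the coefficient behind these values. Assuming it is Pearson's r, the computation for one pair of factors can be sketched as follows. The data points are made up for illustration and are not taken from the experiment.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical per-resource measurements (illustrative only).
hops = [1, 2, 3, 4, 5]
path_depth = [1, 1, 2, 2, 4]
print(pearson_r(hops, path_depth))
```

Values near 0.3 to 0.4, as on the slide, indicate that the factors move together only loosely, which is why they are described as mild.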
29
Regression Analysis
• No surprises: all variables are significant, but the overall model explains only about half of the observations
• Three most significant variables: PageRank, hops and age (R-squared = 0.1496)
31
Conclusions
• Most of the sampled websites were relatively stable
  – One third of the websites never lost a single resource
  – Half of the websites never added any new resources
• The typical website can expect to get back 61% of its resources if it were lost today (77% textual, 42% images and 32% other)
• How to improve recovery from WI? Improve PageRank, decrease number of hops to resources, create stable URLs
32
Thank You
Frank McCown
http://www.cs.odu.edu/~fmccown/
Sorry, Dad… You lost me in the first two minutes.
33
Injecting Server Components into Crawlable Pages
Erasure codes
HTML pages
Recover at least m blocks
34
Database
Perl script
config
Static files (HTML files, PDFs, images, style sheets, JavaScript, etc.)
Web Infrastructure
Web Infrastructure
Web Server
Dynamic page
Recoverable
Not Recoverable