Factors Affecting Website Reconstruction from the Web Infrastructure
Frank McCown, Norou Diawara, and Michael L. Nelson
Old Dominion University, Computer Science Department
Norfolk, Virginia, USA
JCDL 2007, Vancouver, BC, June 20, 2007



Outline

• Web-repository crawling with Warrick
• How successful is a reconstruction?
• Reconstruction experiment
• Significant findings

Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg

Crawling the Crawlers

[Diagram labels: World Wide Web; Repo1, Repo2, ..., Repon; Web crawling; Repo; Web-repository crawling]

Cached Image

Cached PDF

http://www.fda.gov/cder/about/whatwedo/testtube.pdf

MSN version Yahoo version Google version

canonical

• McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.

• McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM IEEE JCDL 2007.

• McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.

• McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.

Available at http://warrick.cs.odu.edu/

Measuring the Difference

(rc, rm, ra) = (changed, missing, added)

Apply Recovery Vector for each resource

Compute Difference Vector for website

Some Difference Vectors

D = (changed, missing, added)

(0,0,0) – Perfect recovery

(1,0,0) – All resources are recovered but changed

(0,1,0) – All resources are lost

(0,0,1) – All recovered resources are at new URIs
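
As a rough illustration of how a difference vector could be computed, here is a minimal Python sketch. The slides do not give the normalization, so this is an assumption: changed and missing counts are divided by the number of resources in the original website, and the added count is divided by the number of recovered resources. The status labels and the function name `difference_vector` are hypothetical, chosen only for this example.

```python
from collections import Counter

# Hypothetical status labels for this sketch (not taken from the slides).
ORIGINAL_STATUSES = {"identical", "changed", "missing"}   # resources that existed in the original site
RECOVERED_STATUSES = {"identical", "changed", "added"}    # resources present in the reconstruction


def difference_vector(statuses):
    """Return (changed, missing, added) as fractions between 0 and 1.

    Assumed normalization: changed and missing are fractions of the
    original resources; added is a fraction of the recovered resources.
    """
    counts = Counter(statuses)
    unknown = set(counts) - (ORIGINAL_STATUSES | RECOVERED_STATUSES)
    if unknown:
        raise ValueError(f"unknown status labels: {sorted(unknown)}")

    n_original = sum(counts[s] for s in ORIGINAL_STATUSES)
    n_recovered = sum(counts[s] for s in RECOVERED_STATUSES)

    # Guard against empty sets so the function never divides by zero.
    d_changed = counts["changed"] / n_original if n_original else 0.0
    d_missing = counts["missing"] / n_original if n_original else 0.0
    d_added = counts["added"] / n_recovered if n_recovered else 0.0
    return (d_changed, d_missing, d_added)


# Four original resources: one recovered unchanged, one changed, two lost.
print(difference_vector(["identical", "changed", "missing", "missing"]))
# (0.25, 0.5, 0.0)
```

Under these assumptions the sketch reproduces the four example vectors above; the (0,0,1) case arises when every recovered resource is at a URI that was not in the original set.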

How Much Change is a Bad Thing?

Lost Recovered

Assigning Penalties

Penalty adjustment: (Pc, Pm, Pa)

Apply to each resource, or to the difference vector

Defining Success

success = 1 – dm

Equivalent to percent of recovered resources

Scale: 0 (less successful) to 1 (more successful)
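
The success measure can be worked through in a few lines of Python. This is only a sketch: the function names are hypothetical, and it assumes dm is the fraction of the original resources that are missing, which is what makes success equal to the fraction of resources recovered.

```python
def reconstruction_success(difference):
    """success = 1 - dm, where dm is the missing component of (changed, missing, added)."""
    _changed, missing, _added = difference
    if not 0.0 <= missing <= 1.0:
        raise ValueError("the missing component must lie between 0 and 1")
    return 1.0 - missing


def recovered_fraction(n_original, n_missing):
    """Fraction of the original resources that were recovered (changed or not)."""
    if n_original <= 0 or not 0 <= n_missing <= n_original:
        raise ValueError("need n_original > 0 and 0 <= n_missing <= n_original")
    return (n_original - n_missing) / n_original


# Eight original resources, two of them missing after reconstruction:
# dm = 2/8 = 0.25, so both views of success agree.
print(reconstruction_success((0.5, 0.25, 0.0)))  # 0.75
print(recovered_fraction(8, 2))                  # 0.75
```

Note that the changed and added components do not enter this measure, so a site recovered entirely in changed form still scores 1.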

Reconstruction Experiment

• 300 websites chosen randomly from Open Directory Project (dmoz.org)

• Crawled and reconstructed each website every week for 14 weeks

• Examined change rates, age, decay, growth, recoverability

Success of website recovery each week

*On average, we recovered 61% of a website in any given week.

Recovery of Textual Resources

Recovery by TLD

Birth and Decay

Recovery of HTML Resources

Recovery by Age

Statistics for Repositories

Which Factors Are Significant?

• External backlinks
• Internal backlinks
• Google’s PageRank
• Hops from root page
• Path depth
• MIME type
• Query string params
• Age
• Resource birth rate
• TLD
• Website size
• Size of resources

Mild Correlations

• Hops and
  – website size (0.428)
  – path depth (0.388)

• Age and # of query params (-0.318)

• External links and
  – PageRank (0.339)
  – Website size (0.301)
  – Hops (0.320)

Regression Analysis

• No surprises: all variables are significant, but overall model only explains about half of the observations

• Three most significant variables: PageRank, hops and age (R-squared = 0.1496)

Conclusions

• Most of the sampled websites were relatively stable
  – One third of the websites never lost a single resource
  – Half of the websites never added any new resources

• The typical website can expect to get back 61% of its resources if it were lost today (77% textual, 42% images and 32% other)

• How to improve recovery from WI? Improve PageRank, decrease number of hops to resources, create stable URLs

Thank You

Frank McCown

http://www.cs.odu.edu/~fmccown/

Sorry, Dad… You lost me in the first two minutes.

Injecting Server Components into Crawlable Pages

[Diagram labels: Erasure codes; HTML pages; Recover at least m blocks]

[Diagram labels: Web Server; Database; Perl script; config; Static files (HTML files, PDFs, images, style sheets, JavaScript, etc.); Dynamic page; Web Infrastructure; Recoverable; Not Recoverable]