How to Crawl the Web Junghoo Cho Hector Garcia-Molina Stanford University

Dec 22, 2015

Page 1: How to Crawl the Web Junghoo Cho Hector Garcia-Molina Stanford University.

How to Crawl the Web

Junghoo Cho and Hector Garcia-Molina

Stanford University

Page 2: What is a Crawler?

[Diagram: the crawler loop. Steps: init, get next url, get page, extract urls. Data: initial urls, to visit urls, visited urls, web pages. Pages are fetched from the web.]
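The loop in the diagram can be sketched in Python as follows. This is a minimal illustration, not the WebBase crawler: `fetch` and `extract_urls` are hypothetical callables supplied by the caller, and the toy in-memory `web` dictionary is an assumption that stands in for real HTTP requests.

```python
from collections import deque


def crawl(initial_urls, fetch, extract_urls, max_pages=100):
    """Basic crawl loop: init, get next url, get page, extract urls."""
    to_visit = deque(initial_urls)   # "to visit urls" (FIFO gives breadth-first order)
    visited = set()                  # "visited urls"
    pages = {}                       # "web pages" collected so far

    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()             # get next url
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                    # get page
        if page is None:                     # unreachable page: skip it
            continue
        pages[url] = page
        for link in extract_urls(page):      # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages


# Toy stand-in for the web: URL -> list of outgoing links (assumed data).
web = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}
pages = crawl(["a"], fetch=web.get, extract_urls=lambda page: page)
print(list(pages))
```

Swapping the FIFO queue for a priority queue changes the crawl order without touching the rest of the loop.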

Page 3: Crawling at Stanford

WebBase Project

BackRub Search Engine, PageRank

Google

New WebBase Crawler

Page 4: Crawling Issues (1)

Load at visited web sites

– robots.txt files

– space out requests to a site

– limit number of requests to a site per day

– limit depth of crawl

– multicast
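The first three measures in this list can be combined into one per-site check. The sketch below is an assumption-laden illustration: the class and method names are hypothetical, the defaults (10 seconds, 3,000 requests) are borrowed from the experimental setup described later in the deck, and the robots.txt text is assumed to have been downloaded already. It uses the standard-library `urllib.robotparser` module.

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit


class PolitenessPolicy:
    """Per-site limits: robots.txt rules, request spacing, daily cap."""

    def __init__(self, user_agent, min_delay=10.0, max_per_day=3000):
        self.user_agent = user_agent
        self.min_delay = min_delay        # seconds between requests to one site
        self.max_per_day = max_per_day    # request cap per site per day
        self.robots = {}                  # host -> RobotFileParser
        self.last_request = {}            # host -> time of last request
        self.count_today = {}             # host -> requests so far (caller resets daily)

    def set_robots(self, host, robots_txt):
        """Register the robots.txt text already downloaded for a host."""
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def may_fetch(self, url, now=None):
        """Return True if fetching url now respects all three limits."""
        now = time.monotonic() if now is None else now
        host = urlsplit(url).netloc
        parser = self.robots.get(host)
        if parser is not None and not parser.can_fetch(self.user_agent, url):
            return False
        if self.count_today.get(host, 0) >= self.max_per_day:
            return False
        last = self.last_request.get(host)
        return last is None or now - last >= self.min_delay

    def record_fetch(self, url, now=None):
        """Note that url was just requested, for spacing and the daily cap."""
        now = time.monotonic() if now is None else now
        host = urlsplit(url).netloc
        self.last_request[host] = now
        self.count_today[host] = self.count_today.get(host, 0) + 1
```

A crawler would call `may_fetch` before each request and `record_fetch` after it; limiting crawl depth would need an extra depth counter per URL, which is omitted here.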

Page 5: Crawling Issues (2)

Load at crawler

– parallelize

[Diagram: two copies of the crawler loop (init, get next url, get page, extract urls). The first is drawn with its data (initial urls, to visit urls, visited urls, web pages); the second is marked with a “?”.]
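The slide leaves open how parallel crawlers should share the URL lists. One common option, offered here only as an assumption and not as the authors' design, is to give each crawler process its own to-visit list and assign every host to exactly one process by hashing the host name. The function names are hypothetical.

```python
import zlib
from urllib.parse import urlsplit


def worker_for(url, num_workers):
    """Map a URL to a worker index by hashing its host name."""
    host = urlsplit(url).netloc.lower()
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(host.encode("utf-8")) % num_workers


def partition(urls, num_workers):
    """Split a URL list into one to-visit list per worker."""
    queues = [[] for _ in range(num_workers)]
    for url in urls:
        queues[worker_for(url, num_workers)].append(url)
    return queues
```

Because all URLs of one site land on the same worker, per-site request spacing can be enforced locally without coordination between workers.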

Page 6: Crawling Issues (3)

Scope of crawl

– not enough space for “all” pages

– not enough time to visit “all” pages

Solution: Visit “important” pages

[Diagram labels: visited pages, microsoft, microsoft.]

Page 7: Crawling Issues (4)

Incremental crawling

– how do we keep pages “fresh”?

– how do we avoid crawling from scratch?

Page 8: Crawl-ordering Experiment

Domain: pages within Stanford

Goal: Crawl HOT pages!

– Experiment 1: the pages with highest PageRank (top 10%)

– Experiment 2: the pages related to “admission”

Experiment:

– How many HOT pages have we collected after crawling, say, 20% of the Stanford domain?
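The measurement behind this experiment can be sketched on a toy graph. The code below is an illustration under assumptions: the graph is invented, and the "backlink" ordering is approximated by counting links seen from pages crawled so far, which may differ from how the authors evaluated it. `hot_coverage` computes the share of HOT pages collected after crawling a given fraction of pages.

```python
def crawl_by_backlinks(graph, seed):
    """Crawl order that always picks the frontier URL with the most
    backlinks seen so far (ties broken alphabetically)."""
    backlinks = {seed: 0}     # URL -> number of crawled pages linking to it
    crawled = []
    frontier = {seed}
    while frontier:
        url = min(frontier, key=lambda u: (-backlinks[u], u))
        frontier.remove(url)
        crawled.append(url)
        for link in graph.get(url, []):
            backlinks[link] = backlinks.get(link, 0) + 1
            if link not in crawled:
                frontier.add(link)
    return crawled


def hot_coverage(crawl_order, hot_pages, fraction):
    """Share of hot pages collected after crawling `fraction` of the order."""
    n = int(len(crawl_order) * fraction)
    found = sum(1 for url in crawl_order[:n] if url in hot_pages)
    return found / len(hot_pages)


# Invented toy graph: "hot" is linked from two pages, so it ranks early.
graph = {
    "home": ["a", "b", "c"],
    "a": ["hot"],
    "b": ["hot"],
    "c": ["d"],
    "d": [],
    "hot": ["home"],
}
order = crawl_by_backlinks(graph, "home")
breadth_first = ["home", "a", "b", "c", "hot", "d"]
print(hot_coverage(order, {"hot"}, 0.67), hot_coverage(breadth_first, {"hot"}, 0.67))
```

On this toy graph the backlink ordering reaches the hot page within the first four pages, while breadth-first order does not; real crawls need far larger graphs before such a comparison means anything.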

Page 9: Experiment 1: Top 10% PageRank pages

[Plot: % crawled HOT pages (y-axis, 0% to 100%) vs. % crawled pages (x-axis, 0% to 100%), one curve per ordering: PageRank, backlink, breadth, random.]

Page 10: Experiment 2: Admission-related Pages

Topic word: "admission"

[Plot: % crawled HOT pages (y-axis, 0% to 100%) vs. % crawled pages (x-axis, 0% to 100%), one curve per ordering: PageRank, backlink, breadth, random. Three curves are labeled “with topic bias”.]

Page 11: Web Evolution Experiment

How often does a web page change?

How long does it take for 50% of the web to change?

How do we model web changes?

Page 12: Experimental Setup

February 17 to June 24, 1999

270 sites visited (with permission)

– identified 400 sites with highest PageRank

– contacted administrators

720,000 pages collected

– 3,000 pages from each site daily

– start at root, visit breadth first (get new & old pages)

– ran only 9pm - 6am, 10 seconds between site requests

Page 13: How Often Does a Page Change?

Example: 50 visits to page, 5 changes

average change interval = 50/5 = 10 days

Is this correct?

[Timeline: page visits spaced 1 day apart, with page changes marked.]
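One reason the simple division may be off is that a page can change several times between two visits, yet each visit detects at most one change. The sketch below compares the slide's estimate with an alternative that assumes changes follow a Poisson process (the model the deck introduces later). The function names are hypothetical, and the second estimator is a standard maximum-likelihood calculation under that assumption, not necessarily the one the authors use.

```python
import math


def naive_interval(visits, changes_detected, visit_gap_days=1.0):
    """Slide's estimate: observation time divided by detected changes."""
    return visits * visit_gap_days / changes_detected


def poisson_interval(visits, changes_detected, visit_gap_days=1.0):
    """Estimate that allows for several changes between two visits.

    Under a Poisson model with rate lam, a visit detects a change with
    probability 1 - exp(-lam * gap), so lam = -ln(1 - X/n) / gap.
    Undefined when every visit detected a change (X == n).
    """
    detected_fraction = changes_detected / visits
    lam = -math.log(1.0 - detected_fraction) / visit_gap_days
    return 1.0 / lam


print(naive_interval(50, 5))      # 10.0 days, as on the slide
print(poisson_interval(50, 5))    # a little under 10 days
```

The gap between the two estimates grows as the page changes more often relative to the visit rate.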

Page 14: Average Change Interval

[Plot: fraction of pages (y-axis) for each average change interval.]

Page 15: Average Change Interval by Domain

[Plot: fraction of pages (y-axis) for each average change interval, broken down by domain.]

Page 16: Time for a 50% Change

[Plot: fraction of unchanged pages (y-axis) vs. days (x-axis).]

Page 17: Modeling Web Evolution

Poisson process with rate $\lambda$

$T$ is time to next event

$f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$
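A few consequences of this model are easy to compute. The sketch below assumes a single page with a known change rate `lam`; the function names and the example rate of one change per 10 days are illustrative choices, not values taken from the experiment.

```python
import math


def interval_density(t, lam):
    """Density of the time T to the next change: lam * exp(-lam * t)."""
    return lam * math.exp(-lam * t) if t > 0 else 0.0


def prob_changed_within(t, lam):
    """P(T <= t): probability that the page has changed after t time units."""
    return 1.0 - math.exp(-lam * t)


def time_for_half_to_change(lam):
    """Time by which a page with rate lam has changed with probability 1/2."""
    return math.log(2) / lam


lam = 1 / 10   # assumed example: one change every 10 days on average
print(prob_changed_within(10, lam))     # about 0.63
print(time_for_half_to_change(lam))     # about 6.9 days
```

Note that the median time to the next change (about 6.9 days here) is shorter than the mean interval of 10 days, because the exponential distribution is skewed toward short intervals.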

Page 18: Change Interval of Pages

[Plot: fraction of changes with given interval (y-axis) vs. interval in days (x-axis), for pages that change every 10 days on average, with a curve labeled “Poisson model”.]

Page 19: Change Metrics

Freshness

– Freshness of element $e_i$ at time $t$ is

$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$

[Diagram: elements $e_i$ on the web and their copies $e_i$ in the database.]

– Freshness of the database $S$ at time $t$ is

$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$

Page 20: Change Metrics

Age

– Age of element $e_i$ at time $t$ is

$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$

[Diagram: elements $e_i$ on the web and their copies $e_i$ in the database.]

– Age of the database $S$ at time $t$ is

$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$
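Both metrics can be computed directly from their definitions. In this sketch the `Element` record and its `changed_at` field are hypothetical: `changed_at` is assumed to hold the time at which the live page was modified after our copy was taken, or `None` if the copy is still up to date.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Local copy of a page; changed_at is None while the copy is up to date."""
    name: str
    changed_at: Optional[float] = None   # modification time of the live page


def element_freshness(e, t):
    """1 if the copy is up to date at time t, else 0."""
    return 1 if e.changed_at is None or e.changed_at > t else 0


def element_age(e, t):
    """0 if up to date at time t, else time elapsed since the modification."""
    if e.changed_at is None or e.changed_at > t:
        return 0.0
    return t - e.changed_at


def database_freshness(elements, t):
    """Average freshness over all N elements."""
    return sum(element_freshness(e, t) for e in elements) / len(elements)


def database_age(elements, t):
    """Average age over all N elements."""
    return sum(element_age(e, t) for e in elements) / len(elements)


db = [Element("e1"), Element("e2", changed_at=3.0),
      Element("e3", changed_at=8.0), Element("e4")]
print(database_freshness(db, 10.0), database_age(db, 10.0))   # 0.5 2.25
```

Freshness only records whether a copy is stale, while age also records for how long, which is why the two metrics can rank refresh policies differently.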

Page 21: Change Metrics

[Plot: $F(e_i)$ (between 0 and 1) and $A(e_i)$ (starting from 0) over time, with update and refresh events marked.]

Time averages:

$F(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(e_i; t)\, dt$

$F(S) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(S; t)\, dt$

similar for age...

Page 22: Questions

How often should we revisit pages to keep 80% of pages up-to-date?

How should we revisit pages when they change at different frequencies?

Page 23: Freshness vs. Revisit Frequency

$r = \lambda / f$ = average change frequency / average visit frequency
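The relationship between freshness and this ratio can be worked out for a simple case. The sketch assumes a page that changes as a Poisson process with rate λ and is revisited at fixed intervals of length 1/f; both the closed form below and the bisection helper (a hypothetical name) follow from those assumptions rather than from the slide itself. It also gives one answer to the earlier question about keeping 80% of pages up-to-date.

```python
import math


def freshness(r):
    """Time-averaged freshness of a page visited at fixed intervals.

    r = lam / f. Right after a visit the copy is fresh; it is still fresh
    s time units later with probability exp(-lam * s). Averaging over one
    visit interval 1/f gives (1 - exp(-r)) / r.
    """
    return 1.0 if r == 0 else (1.0 - math.exp(-r)) / r


def max_ratio_for(target, lo=1e-9, hi=50.0, steps=100):
    """Largest r whose freshness is still >= target, found by bisection."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if freshness(mid) >= target:
            lo = mid      # freshness decreases in r, so we can go higher
        else:
            hi = mid
    return lo


r = max_ratio_for(0.8)
print(round(r, 2))   # about 0.46
```

Under these assumptions, 80% freshness needs r of about 0.46, so pages must be visited a little more than twice as often as they change.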

Page 24: Age vs. Revisit Frequency

$r = \lambda / f$ = average change frequency / average visit frequency

= Age / time to refresh all N elements
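For a single page, the age curve can be derived in closed form. The derivation below is a sketch under stated assumptions (Poisson changes with rate $\lambda$, visits at fixed intervals $I = 1/f$); the symbols $I$, $s$, $u$, and $\bar{A}$ are introduced here for illustration and do not appear on the slide.

```latex
% Expected age s time units after a visit, where u < s is the time of
% the first change since that visit:
E[A(s)] = \int_0^s (s - u)\,\lambda e^{-\lambda u}\,du
        = s - \frac{1 - e^{-\lambda s}}{\lambda}

% Time average over one visit interval I = 1/f, with r = \lambda / f:
\bar{A} = \frac{1}{I}\int_0^I E[A(s)]\,ds
        = I\left(\frac{1}{2} - \frac{1}{r} + \frac{1 - e^{-r}}{r^2}\right)
```

Dividing $\bar{A}$ by the interval $I$ gives a dimensionless age that depends only on $r$, which matches the normalization by refresh time used on this slide.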

Page 25: Trick Question

Two-page database

e1 changes daily

e2 changes once a week

Can visit pages once a week

How should we visit pages?

– e1 e1 e1 e1 e1 e1 ...

– e2 e2 e2 e2 e2 e2 ...

– e1 e2 e1 e2 e1 e2 ... [uniform]

– e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]

– ?

[Diagram: e1 and e2 on the web, with their copies e1 and e2 in the database.]

Page 26: Proportional Often Not Good!

Visit fast-changing e1: get 1/2 day of freshness

Visit slow-changing e2: get 1/2 week of freshness

Visiting e2 is a better deal!
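The same conclusion can be checked numerically for the two-page database. This sketch assumes Poisson changes and visits at fixed intervals, for which the time-averaged freshness of a page is (1 - e^(-r)) / r with r = λ / f. The total budget of one visit per day is an arbitrary assumption made for illustration, not the budget stated on the previous slide.

```python
import math


def freshness(lam, f):
    """Time-averaged freshness of a page with Poisson change rate lam
    that is revisited at fixed intervals, f times per unit of time."""
    r = lam / f
    return (1.0 - math.exp(-r)) / r


lam1, lam2 = 1.0, 1.0 / 7.0    # changes per day: e1 daily, e2 weekly
budget = 1.0                   # assumed total visits per day, shared by both pages

# Uniform: both pages get the same visit frequency.
uniform = (freshness(lam1, budget / 2) + freshness(lam2, budget / 2)) / 2

# Proportional: visit frequency proportional to change rate.
total = lam1 + lam2
proportional = (freshness(lam1, budget * lam1 / total)
                + freshness(lam2, budget * lam2 / total)) / 2

print(uniform > proportional)   # True
```

Because the freshness function is convex in r, the uniform policy does at least as well as the proportional one for any budget under these assumptions, not just the one chosen here.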

Page 27: Selecting Optimal Refresh Frequency

• Analysis is complex

• Shape of curve is the same in all cases

• Holds for any distribution $g(\lambda)$

Page 28: Optimal Refresh Frequency for Age

• Analysis is also complex

• Shape of curve is the same in all cases

• Holds for any distribution $g(\lambda)$

Page 29: Comparing Policies

Policy         Freshness   Age
Proportional   0.12        400 days
Uniform        0.57        5.6 days
Optimal        0.62        4.3 days

Based on statistics from the experiment and a revisit frequency of once every month

Page 30: Crawler Types

In-place vs. shadow

Steady vs. batch

[Diagram: elements ei on the web and in the database, plus a shadow database.]

[Timeline: crawler on and crawler off periods over time.]
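The in-place vs. shadow distinction can be sketched with two small classes. This is one plausible reading, based on the labels “crawler’s collection” and “current collection” on later slides; the class and method names are hypothetical, and a real system would store pages on disk rather than in dictionaries.

```python
class InPlaceCollection:
    """Crawled pages replace the old copies immediately."""

    def __init__(self):
        self.current = {}             # what users see

    def store(self, url, page):
        self.current[url] = page      # visible to users right away

    def finish_crawl(self):
        pass                          # nothing to do: already up to date


class ShadowCollection:
    """Crawled pages go to a separate shadow space and become visible
    only when the whole crawl finishes."""

    def __init__(self):
        self.current = {}             # what users see
        self.shadow = {}              # what the crawler is building

    def store(self, url, page):
        self.shadow[url] = page       # not yet visible to users

    def finish_crawl(self):
        self.current = self.shadow    # swap in the new collection
        self.shadow = {}
```

With shadowing, freshly crawled pages wait until the swap before users see them, which is one way to understand why shadowing can lower freshness.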

Page 31: Comparison: Batch vs. Steady

[Plots: batch-mode in-place crawler and steady in-place crawler, with the periods when the crawler is running marked.]

Page 32: Shadowing Steady Crawler

[Plots: crawler’s collection and current collection, compared with the case without shadowing.]

Page 33: Shadowing Batch Crawler

[Plots: crawler’s collection and current collection, compared with the case without shadowing.]

Page 34: Experimental Data Freshness

            Steady   Batch
In-Place    0.88     0.88
Shadowing   0.77     0.86

• Pages change on average every 4 months

• Batch crawler works one week out of 4

[Overlaid slide annotations: 1, 2, 0.63, 0.50]

Page 35: Building an Incremental Crawler

Variable visit frequencies (but with care…)

Steady

In-place

improves freshness!

Also need:

– Page change frequency estimator

– Page replacement policy

– Page ranking policy (what are “important” pages?)

See papers....

Page 36: Summary

Selecting “important” pages:

Intuitive policy performs well

Keeping the collection fresh:

Intuitive policy does not perform well

Page 37: The End

Thank you for your attention...

For more information, visit

http://www-db.stanford.edu

Follow the “Searchable Publications” link, and search for author = “Cho”