WebBase and Stanford Digital Library Project, Junghoo Cho, Stanford University

Post on 21-Dec-2015

1

WebBase and Stanford Digital Library Project

Junghoo Cho

Stanford University

Technologies for Digital Libraries

Economic Weaknesses

Information Loss

Information Overload

Service Heterogeneity

Physical Barriers

• Interoperability

• Value Filtering

• Mobile Access

• IP Infrastructure

• Archival Repository

2

WebBase Architecture

[Figure: WebBase architecture diagram. Labeled components: WWW, Web Crawlers, Repository, Multicast Engine, Feature Repository, Retrieval Indexes, WebBase API, Clients]

3

4

What is a Crawler?

[Figure: crawler flowchart. Steps: init, get next url, get page, extract urls. Data: initial urls, to visit urls, visited urls, web pages. Pages are fetched from the web.]
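The loop in the crawler diagram can be sketched in a few lines of Python. This is only an illustration, not the WebBase crawler: `TOY_WEB`, `get_page`, `extract_urls`, and `crawl` are hypothetical names, and the "web" is an in-memory dictionary so that the example runs without network access.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> page text.
# A real crawler would fetch pages over HTTP instead.
TOY_WEB = {
    "a": "links: b c",
    "b": "links: c",
    "c": "links: a d",
    "d": "no outgoing urls",
}

def get_page(url, web):
    """Return the page text for url, or None if the page does not exist."""
    return web.get(url)

def extract_urls(page):
    """Toy link extractor: the whitespace-separated tokens after 'links:'."""
    match = re.search(r"links:\s*(.*)", page)
    return match.group(1).split() if match else []

def crawl(initial_urls, web):
    to_visit = deque(initial_urls)   # init: seed the "to visit urls" queue
    seen = set(initial_urls)         # every url that was ever queued
    visited = []                     # "visited urls", in visit order
    pages = {}                       # "web pages" collected so far
    while to_visit:
        url = to_visit.popleft()     # get next url (FIFO gives breadth-first order)
        page = get_page(url, web)    # get page
        visited.append(url)
        if page is None:
            continue
        pages[url] = page
        for link in extract_urls(page):  # extract urls
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited, pages

visited, pages = crawl(["a"], TOY_WEB)
print(visited)
```

Using a FIFO queue makes the visit order breadth-first; swapping in a priority queue would be one way to express a different crawl order.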

5

Crawling Issues (1)

Load at visited web sites

Load at crawlers

Scope of the crawl

6

Crawling Issues (2)

Maintaining pages “fresh”

– How does the web change over time?

– What does fresh page/database mean?

– How can we increase “freshness”?

7

Outline

Web evolution experiments

Freshness metrics

Crawling policy

8

Web Evolution Experiment

How often does a web page change?

What is the lifespan of a page?

How long does it take for 50% of the web to change?

9

Experimental Setup

February 17 to June 24, 1999

270 sites visited (with permission)

– identified 400 sites with highest “page rank”

– contacted administrators

720,000 pages collected

– 3,000 pages from each site daily

– start at root, visit breadth first (get new & old pages)

– ran only 9pm - 6am, 10 seconds between site requests

10

Page Change

Example: 50 visits to page, 5 changes

average change interval = 50/5 = 10 days

Is this correct?

[Figure: timeline of page changes against page visits made 1 day apart]
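One reason to doubt the 50/5 figure is that a daily visit can only reveal whether the page changed at least once since the previous visit, so several changes within one day count as a single change. The sketch below contrasts the naive estimate with one possible correction. The correction assumes changes follow a Poisson process (the model used later in the talk) and is not taken from the slides; the function names are hypothetical.

```python
import math

def naive_interval(visits, changes_detected, days_between_visits=1.0):
    """Observation period divided by the number of detected changes."""
    return visits * days_between_visits / changes_detected

def poisson_adjusted_interval(visits, changes_detected, days_between_visits=1.0):
    """Assumes Poisson changes: P(change seen at a visit) = 1 - exp(-rate * gap).

    Solving that for the rate, with the detection probability estimated as
    changes_detected / visits, gives rate = -ln(1 - p) / gap.
    """
    if changes_detected >= visits:
        raise ValueError("every visit saw a change; the rate cannot be estimated")
    p_detect = changes_detected / visits
    rate = -math.log(1.0 - p_detect) / days_between_visits
    return 1.0 / rate

naive = naive_interval(50, 5)
adjusted = poisson_adjusted_interval(50, 5)
print(f"naive: {naive:.2f} days, Poisson-adjusted: {adjusted:.2f} days")
```

Under this assumption the adjusted interval is shorter than the naive one, and the gap widens as the page changes more often relative to the visit frequency.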

11

Average Change Interval

[Figure: distribution of average change intervals; y-axis: fraction of pages]

12

Change Interval - By Domain

[Figure: distribution of change intervals by domain; y-axis: fraction of pages]

13

Modeling Web Evolution

Poisson process with rate $\lambda$

$T$ is time to next event

$f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$
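A small sketch of what the Poisson model implies, assuming a change rate written as `rate` (changes per day). The function names and the example rate of one change per 10 days are illustrative choices, not values from the slides.

```python
import math

def interval_pdf(t, rate):
    """Density of the time T to the next change: rate * exp(-rate * t) for t > 0."""
    return rate * math.exp(-rate * t) if t > 0 else 0.0

def prob_changed_within(t, rate):
    """P(T <= t): probability that the page has changed within t days."""
    return 1.0 - math.exp(-rate * t)

def time_until_fraction_changed(fraction, rate):
    """Time after which the given fraction of same-rate pages has changed."""
    return -math.log(1.0 - fraction) / rate

rate = 1 / 10  # hypothetical page that changes every 10 days on average
half_time = time_until_fraction_changed(0.5, rate)
print(f"half of such pages have changed after {half_time:.2f} days")
```

The last function is the single-rate version of the question "how long does it take for 50% of the web to change?"; a real collection mixes many rates, so this is only the building block.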

14

Change Interval of Pages

[Figure: for pages that change every 10 days on average; x-axis: interval in days; y-axis: fraction of changes with given interval; measured data compared with the Poisson model]

15

Change Metrics

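Freshness

– Freshness of page $e_i$ at time $t$ is

$$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$$

[Figure: page $e_i$ on the web and its copy $e_i$ in the database]

– Freshness of the database $S$ at time $t$ is

$$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$$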
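The freshness definition translates directly into code. In this sketch a page counts as up-to-date when the local copy's version label equals the live page's label; the version labels and function names are hypothetical stand-ins for whatever comparison a real system would use (for example a checksum).

```python
def page_freshness(local_version, live_version):
    """F(e_i; t): 1 if the local copy matches the live page, else 0."""
    return 1 if local_version == live_version else 0

def database_freshness(local_copies, live_pages):
    """F(S; t): average page freshness over the N pages in the database."""
    scores = [
        page_freshness(version, live_pages[url])
        for url, version in local_copies.items()
    ]
    return sum(scores) / len(scores)

# Hypothetical snapshot at one instant t: e1 has changed since it was copied.
live = {"e1": "v3", "e2": "v1", "e3": "v7", "e4": "v2"}
local = {"e1": "v2", "e2": "v1", "e3": "v7", "e4": "v2"}
freshness = database_freshness(local, live)
print(freshness)
```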

16

Change Metrics

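Age

– Age of page $e_i$ at time $t$ is

$$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$$

[Figure: page $e_i$ on the web and its copy $e_i$ in the database]

– Age of the database $S$ at time $t$ is

$$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$$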
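Age can be computed the same way as freshness. The sketch assumes that "modification time" means the time of the earliest live change that the local copy does not yet reflect, stored as `None` when the copy is up-to-date; that reading, and the names used, are assumptions made for illustration.

```python
def page_age(now, first_unseen_change):
    """A(e_i; t): 0 if up-to-date, else time elapsed since the unseen change."""
    if first_unseen_change is None:
        return 0.0
    return now - first_unseen_change

def database_age(now, pending_changes):
    """A(S; t): average page age over the N pages in the database."""
    ages = [page_age(now, changed_at) for changed_at in pending_changes.values()]
    return sum(ages) / len(ages)

# Hypothetical state at t = 10 days: e1 changed on day 4, e3 on day 9.
pending = {"e1": 4.0, "e2": None, "e3": 9.0, "e4": None}
age = database_age(10.0, pending)
print(age)
```

Unlike freshness, age keeps growing while a stale copy stays stale, so it penalizes pages that are left unrefreshed for a long time.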

17

Change Metrics

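[Figure: $F(e_i)$ (between 0 and 1) and $A(e_i)$ (starting at 0) plotted against time, with "update" and "refresh" events marked on the time axis]

Time averages:

$$F(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(e_i; t)\, dt$$

$$F(S) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(S; t)\, dt$$

similar for age...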
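For intuition, the time averages have a closed form in one simple setting. This setting is assumed here and not stated on this slide: the page changes as a Poisson process with rate `rate`, and the crawler refreshes it at a fixed interval `interval`. At time s after a refresh the copy is still fresh with probability exp(-rate * s), and averaging over one refresh cycle gives the expressions below. The numeric check uses a midpoint rule; all names are hypothetical.

```python
import math

def avg_freshness(rate, interval):
    """Time-averaged freshness: (1 - exp(-x)) / x with x = rate * interval."""
    x = rate * interval
    return (1.0 - math.exp(-x)) / x

def avg_age(rate, interval):
    """Time-averaged age: interval * (1/2 - 1/x + (1 - exp(-x)) / x**2)."""
    x = rate * interval
    return interval * (0.5 - 1.0 / x + (1.0 - math.exp(-x)) / x**2)

def avg_freshness_numeric(rate, interval, steps=10_000):
    """Midpoint-rule average of exp(-rate * s) over one refresh cycle."""
    ds = interval / steps
    total = sum(math.exp(-rate * (k + 0.5) * ds) for k in range(steps))
    return total * ds / interval

# Hypothetical page: one change per 10 days on average, refreshed every 30 days.
closed_form = avg_freshness(0.1, 30.0)
numeric = avg_freshness_numeric(0.1, 30.0)
print(f"freshness {closed_form:.4f} (numeric {numeric:.4f}), age {avg_age(0.1, 30.0):.2f} days")
```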

18

Trick Question

Two page database

e1 changes daily

e2 changes once a week

Can visit pages once a week

How should we visit pages?

– e1 e1 e1 e1 e1 e1 ...

– e2 e2 e2 e2 e2 e2 ...

– e1 e2 e1 e2 e1 e2 ... [uniform]

– e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]

– ?

[Figure: pages e1 and e2 on the web and their copies e1 and e2 in the database]

19

Proportional Often Not Good!

Visit fast-changing e1: get 1/2 day of freshness

Visit slow-changing e2: get 1/2 week of freshness

Visiting e2 is a better deal!
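The four candidate policies from the trick question can be compared numerically. The sketch rests on several assumptions that are not in the slides: changes are Poisson, each page is visited at evenly spaced intervals, the budget is read as one page visit per week in total, and a page's time-averaged freshness is (1 - exp(-x)) / x with x = change rate / visit rate. A page that is never visited is given freshness 0.

```python
import math

def avg_freshness(changes_per_week, visits_per_week):
    """Time-averaged freshness of one page under the assumptions above."""
    if visits_per_week == 0:
        return 0.0  # never refreshed: stale in the long run
    x = changes_per_week / visits_per_week
    return (1.0 - math.exp(-x)) / x

CHANGE_RATES = {"e1": 7.0, "e2": 1.0}  # e1 changes daily, e2 weekly

# Each policy splits the assumed budget of 1 visit per week between the pages.
POLICIES = {
    "only e1": {"e1": 1.0, "e2": 0.0},
    "only e2": {"e1": 0.0, "e2": 1.0},
    "uniform": {"e1": 0.5, "e2": 0.5},
    "proportional": {"e1": 7 / 8, "e2": 1 / 8},
}

def database_freshness(visit_rates):
    """Average of the per-page freshness values."""
    scores = [avg_freshness(CHANGE_RATES[p], visit_rates[p]) for p in CHANGE_RATES]
    return sum(scores) / len(scores)

results = {name: database_freshness(rates) for name, rates in POLICIES.items()}
for name, value in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:13s} {value:.3f}")
```

Under these assumptions the ranking comes out as only e2, then uniform, then proportional, then only e1, which agrees with the slide's point that spending visits on the slow-changing page is the better deal.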

20

Selecting Optimal Refresh Frequency

• Analysis is complex

• Shape of curve is the same in all cases

• Holds for any distribution g( )
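The curve from this slide did not survive in the transcript. As a rough numerical stand-in (not the analysis the slide refers to), the sketch below searches over ways to split a visit budget between two pages and keeps the split with the highest average freshness. It assumes Poisson changes, evenly spaced visits, and the per-page formula (1 - exp(-x)) / x with x = change rate / visit rate; `best_split` and its parameters are hypothetical.

```python
import math

def avg_freshness(change_rate, visit_rate):
    """Time-averaged freshness of one page; 0 if the page is never visited."""
    if visit_rate == 0:
        return 0.0
    x = change_rate / visit_rate
    return (1.0 - math.exp(-x)) / x

def best_split(change_rates, budget, steps=1000):
    """Grid search over (f1, f2) with f1 + f2 = budget, maximizing mean freshness."""
    rate1, rate2 = change_rates
    best_score, best_rates = -1.0, None
    for k in range(steps + 1):
        f1 = budget * k / steps
        f2 = budget - f1
        score = (avg_freshness(rate1, f1) + avg_freshness(rate2, f2)) / 2
        if score > best_score:
            best_score, best_rates = score, (f1, f2)
    return best_score, best_rates

# Rates in changes per week; budgets in page visits per week.
tight_score, tight_rates = best_split((7.0, 1.0), budget=1.0)
loose_score, loose_rates = best_split((7.0, 1.0), budget=10.0)
print("budget 1: ", tight_rates, round(tight_score, 3))
print("budget 10:", loose_rates, round(loose_score, 3))
```

With the tight budget the search puts every visit on the slow-changing page; with the larger budget it starts giving visits to the fast-changing page as well. A grid search only scales to a handful of pages, so it is meant as a way to see the effect, not as a method for a real collection.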

21

Optimal Refresh Frequency for Age

• Analysis is also complex

• Shape of curve is the same in all cases

• Holds for any distribution g( )

22

Comparing Policies

| Policy       | Freshness | Age      |
|--------------|-----------|----------|
| Proportional | 0.12      | 400 days |
| Uniform      | 0.57      | 5.6 days |
| Optimal      | 0.62      | 4.3 days |

Based on statistics from the experiment and a revisit frequency of once every month

23

Summary

Maintaining the collection fresh:

– Web evolution experiment

– Change metrics

– Optimal policy

Intuitive policy does not always perform well

– Should be careful in deciding revisit policy
