Internet and Data Management. Junghoo “John” Cho, UCLA Computer Science.

Dec 20, 2015

Page 1: Internet and Data Management

Junghoo “John” Cho

UCLA Computer Science

Page 2: Information Galore

[Figure: information sources labeled "Legacy database", "Plain text files", and "Biblio server"]

Page 3: Challenges: Too much information?

• Discovery
• Management
• Overload
• Access
• …

Page 4: Approaches

• Central caching and indexing
– Google, Excite, AltaVista
• Dynamic integration
– MySimon, BizRate

Page 5: Central Caching and Indexing

[Figure: diagram labeled "Central Index"]

Page 6: Challenges

• Page selection and download
– What page to download?
• Page and index update
– How to update pages?
• Page ranking
– What page is “important” or “relevant”?
• Scalability

Page 7: Dynamic Integration

[Figure: a Mediator above a row of Wrappers, one Wrapper per source (Source 1, Source 2, ..., Source n)]

Page 8: Challenges

• Heterogeneous sources
– Different data models: relational, object-oriented
– Different schemas and representations: “Keanu Reeves” or “Reeves, K.”, etc.
• Limited query capabilities
• Mediator caching
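The mediator and wrapper arrangement, together with the name-representation problem on this slide, can be sketched roughly as follows. This is only an illustration: the class names `Wrapper` and `Mediator`, the `Person` record, the two parsing functions, and the sample records other than the two spellings of "Reeves" are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Person:
    """Hypothetical common representation used by the mediator."""
    family: str
    initial: str


def parse_given_family(raw: str) -> Person:
    # Assumed source format: "Given Family", e.g. "Keanu Reeves".
    given, family = raw.rsplit(" ", 1)
    return Person(family=family, initial=given[0])


def parse_family_initial(raw: str) -> Person:
    # Assumed source format: "Family, G.", e.g. "Reeves, K.".
    family, given = (part.strip() for part in raw.split(",", 1))
    return Person(family=family, initial=given[0])


class Wrapper:
    """Translates one source's records into the common representation."""

    def __init__(self, records: List[str], parse: Callable[[str], Person]) -> None:
        self.records = records
        self.parse = parse

    def query(self, family: str) -> List[Person]:
        return [p for p in map(self.parse, self.records) if p.family == family]


class Mediator:
    """Sends a query to every wrapper and merges duplicate answers."""

    def __init__(self, wrappers: List[Wrapper]) -> None:
        self.wrappers = wrappers

    def query(self, family: str) -> List[Person]:
        merged = set()
        for wrapper in self.wrappers:
            merged.update(wrapper.query(family))
        return sorted(merged, key=lambda p: (p.family, p.initial))


# Made-up records; only the two "Reeves" spellings appear on the slide.
source_1 = Wrapper(["Keanu Reeves", "Jane Doe"], parse_given_family)
source_2 = Wrapper(["Reeves, K.", "Doe, J."], parse_family_initial)
mediator = Mediator([source_1, source_2])
print(mediator.query("Reeves"))
```

Both sources answer the query, but the mediator returns a single record because the wrappers map the two spellings to the same representation.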

Page 9: Outline of This Talk

• How can we keep pages fresh?
• How does the Web change?
• What do we mean by “fresh” pages?
• How should we refresh pages?

Page 10: Web Evolution Experiment

• How often does a Web page change?
• How long does a page stay on the Web?
• How long does it take for 50% of the Web to change?
• How do we model Web changes?

Page 11: Experimental Setup

• February 17 to June 24, 1999
• 270 sites visited (with permission)
– identified 400 sites with highest “PageRank”
– contacted administrators
• 720,000 pages collected
– 3,000 pages from each site daily
– start at root, visit breadth first (get new & old pages)
– ran only 9pm - 6am, 10 seconds between site requests
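The "start at root, visit breadth first" rule with a per-site page cap can be sketched as below. This is a minimal illustration over an in-memory link graph, not the crawler used in the experiment: the function name `breadth_first_order` and the toy site are made up, and the 10-second delay and the 9pm - 6am window are left out.

```python
from collections import deque
from typing import Dict, List


def breadth_first_order(links: Dict[str, List[str]], root: str, max_pages: int) -> List[str]:
    """Return up to max_pages pages in breadth-first order starting at root."""
    order: List[str] = []
    seen = {root}
    queue = deque([root])
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        # A real crawler would fetch the page here and pause between requests.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical site: each key is a page, each value the pages it links to.
toy_site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c", "/d"], "/c": [], "/d": []}
print(breadth_first_order(toy_site, "/", max_pages=4))
```

Because the same window of pages near the root is revisited every day, repeated runs pick up both pages that were already there and pages that have newly appeared.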

Page 12: Average Change Interval

[Figure: bar chart. x-axis: average change interval, with bins 1day, 1day-1week, 1week-1month, 1month-4months, 4months. y-axis: fraction of pages, 0.00 to 0.35]

Page 13: Change Interval – By Domain

[Figure: bar chart with one series per domain (com, netorg, edu, gov). x-axis: average change interval, with bins 1day, 1day-1week, 1week-1month, 1month-4months, 4months. y-axis: fraction of pages, 0 to 0.6]

Page 14: Modeling Web Evolution

• Poisson process with rate $\lambda$
• $T$ is time to next event
• $f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$
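As a small numeric illustration of the Poisson model, the density of the time to the next change and the probability of seeing a change within a given time can be computed directly. The function names are made up for this sketch, and the rate of one change per 10 days is just an example value.

```python
import math


def interval_density(rate: float, t: float) -> float:
    """Density of the time to the next change: rate * exp(-rate * t)."""
    return rate * math.exp(-rate * t)


def prob_change_within(rate: float, t: float) -> float:
    """Probability that the next change happens within time t (the CDF)."""
    return 1.0 - math.exp(-rate * t)


rate = 1 / 10  # example: a page that changes once every 10 days on average
for days in (1, 10, 30):
    print(days, round(prob_change_within(rate, days), 3))
```

Under this model the mean time between changes is 1/rate, and short intervals are always the most likely ones because the density decreases with t.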

Page 15: Change Interval of Pages

[Figure: for pages that change every 10 days on average. x-axis: interval in days. y-axis: fraction of changes with given interval. Overlaid curve: Poisson model]

Page 16: Change Metrics

• Freshness
– Freshness of element $e_i$ at time $t$ is

$$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$$

– Freshness of the database $S$ at time $t$ is

$$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$$

(Assume “equal importance” of pages)

[Figure: element $e_i$ on the web and its copy in the database]

Page 17: Change Metrics

• Age
– Age of element $e_i$ at time $t$ is

$$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$$

– Age of the database $S$ at time $t$ is

$$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$$

(Assume “equal importance” of pages)

[Figure: element $e_i$ on the web and its copy in the database]
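The freshness and age of a database at a single point in time can be computed as in the sketch below. It assumes that "modification time" means the time at which the web copy changed after the last refresh; the representation (a list with `None` for up-to-date pages) and the function names are choices made for this example only.

```python
from typing import List, Optional


def database_freshness(stale_since: List[Optional[float]]) -> float:
    """Fraction of pages that are up-to-date (stale_since is None)."""
    return sum(1 for s in stale_since if s is None) / len(stale_since)


def database_age(now: float, stale_since: List[Optional[float]]) -> float:
    """Average age: 0 for up-to-date pages, now - modification time otherwise."""
    return sum(0.0 if s is None else now - s for s in stale_since) / len(stale_since)


# Hypothetical snapshot at time t = 10 (days): pages 2 and 4 went stale
# at times 7 and 4, the other two copies are still up-to-date.
snapshot = [None, 7.0, None, 4.0]
print(database_freshness(snapshot))   # 2 of 4 pages fresh
print(database_age(10.0, snapshot))   # (0 + 3 + 0 + 6) / 4
```

Freshness only records whether a copy is stale, while age also grows the longer a stale copy is left unrefreshed.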

Page 18: Change Metrics

[Figure: $F(e_i)$ (between 0 and 1) and $A(e_i)$ (starting at 0) plotted against time, with "update" and "refresh" events marked]

Time averages:

Page 19: Trick Question

• Two-page database
• e1 changes daily
• e2 changes once a week
• Can visit one page per week
• How should we visit pages?
– e1 e2 e1 e2 e1 e2 e1 e2 ... [uniform]
– e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]
– e1 e1 e1 e1 e1 e1 ...
– e2 e2 e2 e2 e2 e2 ...
– ?

[Figure: e1 and e2 on the web, with copies of e1 and e2 in the database]

Page 20: Proportional Often Not Good!

• Visit fast changing e1
– get 1/2 day of freshness
• Visit slow changing e2
– get 1/2 week of freshness
• Visiting e2 is a better deal!
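One way to read the "1/2 day" and "1/2 week" figures is sketched below. It assumes, purely for illustration, that a page changes exactly once per period $P$ and that the visit falls at a uniformly random point within that period, so the time $U$ for which the refreshed copy stays fresh is uniform on $[0, P]$.

```latex
\mathbb{E}[U] = \int_0^P u \cdot \frac{1}{P} \, du = \frac{P}{2},
\qquad
P = 1 \text{ day} \Rightarrow \tfrac{1}{2} \text{ day},
\qquad
P = 1 \text{ week} \Rightarrow \tfrac{1}{2} \text{ week}.
```

Under that assumption each visit to e2 buys seven times as much fresh time as a visit to e1, which is why spending the single weekly visit on e2 is the better deal.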

Page 21: Optimal Refresh Frequency

• Problem
– Given $\lambda_i$'s and $f$,
– find $f_1, f_2, \ldots, f_N$ that maximize
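The problem can be explored numerically for the two-page example from the trick question. The sketch below rests on assumptions that are not stated on the slides: changes follow the Poisson model, each page is refreshed at regular intervals of 1/f, the objective is the time-averaged freshness of the database, and the budget is a total of one visit per week. Under those assumptions a copy is still fresh with probability e^(-λu) at time u after a refresh, and averaging this over one refresh interval gives (1 - e^(-λ/f)) / (λ/f). All function and variable names are made up.

```python
import math
from typing import List


def expected_freshness(change_rate: float, refresh_freq: float) -> float:
    """Time-averaged freshness of one page: (1 - exp(-r)) / r with r = rate / freq."""
    if refresh_freq <= 0:
        return 0.0  # a page that is never refreshed is treated as always stale
    r = change_rate / refresh_freq
    return (1.0 - math.exp(-r)) / r


def database_freshness(change_rates: List[float], refresh_freqs: List[float]) -> float:
    pairs = zip(change_rates, refresh_freqs)
    return sum(expected_freshness(lam, f) for lam, f in pairs) / len(change_rates)


change_rates = [1.0, 1.0 / 7.0]  # e1 changes daily, e2 weekly (changes per day)
budget = 1.0 / 7.0               # assumed total: one visit per week (visits per day)

policies = {
    "uniform": [budget / 2, budget / 2],
    "proportional": [budget * 7 / 8, budget / 8],
    "e1 only": [budget, 0.0],
    "e2 only": [0.0, budget],
}
scores = {name: database_freshness(change_rates, freqs) for name, freqs in policies.items()}

# Grid search over the share of the budget given to e1.
steps = 1000
best_share = max(
    (k / steps for k in range(steps + 1)),
    key=lambda s: database_freshness(change_rates, [s * budget, (1 - s) * budget]),
)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("best share of visits for e1:", best_share)
```

In this setting the proportional policy scores below the uniform one, and the grid search puts the whole budget on e2, in line with the argument that visiting the slowly changing page is the better deal.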

Page 22: Optimal Refresh Frequency

• Shape of curve is the same in all cases
• Holds for any change frequency distribution

Page 23: Optimal Refresh for Age

• Shape of curve is the same in all cases
• Holds for any change frequency distribution

Page 24: Comparing Policies

Policy        Freshness   Age
Proportional  0.12        400 days
Uniform       0.57        5.6 days
Optimal       0.62        4.3 days

Based on statistics from the experiment and a revisit frequency of once every month

Page 25: Not Every Page is Equal!

• Some pages are “more important”

[Figure: e1 and e2, with labels "Accessed by users 20 times/day" and "Accessed by users 10 times/day"]

In general,

$$F(S) = w_1 F(e_1) + w_2 F(e_2)$$
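A weighted freshness of this form can be computed as in the sketch below. Deriving the weights from user access counts and normalizing them to sum to 1 are assumptions made for this example, as is the function name; which page receives which access count is also arbitrary here.

```python
from typing import List


def weighted_freshness(weights: List[float], fresh: List[bool]) -> float:
    """Sum of w_i * F(e_i), where F(e_i) is 1 for a fresh page and 0 otherwise."""
    return sum(w for w, is_fresh in zip(weights, fresh) if is_fresh)


# Hypothetical weights: proportional to daily user accesses, normalized to sum to 1.
accesses_per_day = [20, 10]
total = sum(accesses_per_day)
weights = [a / total for a in accesses_per_day]

print(weighted_freshness(weights, [True, False]))  # only the popular page is fresh
print(weighted_freshness(weights, [False, True]))  # only the less popular page is fresh
```

With these weights, keeping the more frequently accessed page fresh contributes twice as much to the database freshness as keeping the other one fresh.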

Page 26: Weighted Freshness

[Figure: curves for w = 1 and w = 2; axis label: f]

Page 27: Change Frequency Estimation

• How to estimate change frequency?
– Naïve Estimator: X/T
– X: number of detected changes
– T: monitoring period
– 2 changes in 10 days: 0.2 times/day
• Incomplete change history

[Figure: timeline with a 1 day interval marked; legend: Page visited, Page changed, Change detected]

Page 28: Improved Estimator

• Based on the Poisson model
– X: number of detected changes
– N: number of accesses
– f: access frequency
• 3 changes in 10 days: 0.36 times/day
• Accounts for “missed” changes
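The slide's own formula did not survive in the transcript, so the sketch below uses one estimator that follows from the Poisson model and is offered only as an assumption. If a page is accessed at frequency f, the probability that it did not change between two accesses is e^(-λ/f); equating this with the observed fraction of accesses without a detected change, (N - X)/N, gives λ = -f ln(1 - X/N). With daily accesses assumed, this reproduces the 0.36 times/day figure for 3 changes in 10 days. The function names are made up.

```python
import math


def naive_estimate(detected_changes: int, monitoring_period: float) -> float:
    """Naive estimator X / T."""
    return detected_changes / monitoring_period


def poisson_estimate(detected_changes: int, accesses: int, access_freq: float) -> float:
    """Assumed estimator -f * ln(1 - X / N), derived from the Poisson model."""
    if detected_changes >= accesses:
        raise ValueError("estimate is undefined when every access detects a change")
    return -access_freq * math.log(1.0 - detected_changes / accesses)


# Assumed setting: one access per day for 10 days, 3 changes detected.
print(round(naive_estimate(3, 10.0), 2))
print(round(poisson_estimate(3, 10, 1.0), 2))
```

The second estimate is higher than the naive one because a detected change may hide several actual changes that happened between two accesses.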

Page 29: Improvement Significant?

• Application to a Web crawler
– Visit pages once every week for 5 weeks
– Estimate change frequency
– Adjust revisit frequency based on the estimate
» Uniform: do not adjust
» Naïve: based on the naïve estimator
» Ours: based on our improved estimator

Page 30: Improvement from Our Estimator

Policy    Detected changes   Ratio to uniform
Uniform   2,147,589          100%
Naïve     4,145,582          193%
Ours      4,892,116          228%

(9,200,000 visits in total)
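The "ratio to uniform" column can be rechecked from the detected-change counts with a few lines of arithmetic. The variable names are made up; the counts are the ones on the slide.

```python
# Detected changes per policy, as listed on the slide.
detected = {"Uniform": 2_147_589, "Naïve": 4_145_582, "Ours": 4_892_116}

baseline = detected["Uniform"]
ratio_to_uniform = {name: round(100 * count / baseline) for name, count in detected.items()}

for name, percent in ratio_to_uniform.items():
    print(f"{name}: {percent}%")
```

The rounded percentages agree with the table.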

Page 31: WebArchive Project

• Can we store the history of the Web?
– Web is ephemeral
– Study of the Web evolution
• Challenges
– Update?
– Compression?
– New storage?
– Indexing?

Page 32: Conclusion

• Exciting area and many challenges ahead!
• Thank you for your attention
• For more information visit http://www.cs.ucla.edu/~cho/