CS246: Page Selection

CS246: Page Selection. Junghoo "John" Cho (UCLA Computer Science) 2 Page Selection Infinite # of pages on the Web – E.g., infinite pages from a calendar.

Dec 19, 2015


CS246: Page Selection

Junghoo "John" Cho (UCLA Computer Science)

Page Selection

• Infinite # of pages on the Web
  – E.g., infinite pages from a calendar site

• How to select the pages to download?


Challenges Due to Infinity

• What does Web coverage mean?
  – 8 billion vs 20 billion

• How much should I download?
  – 8 billion? 100 billion?
  – How much have I covered?
  – When can I stop?

• How to maximize coverage?
  – How can we define coverage?


RankMass

• Web coverage weighted by PageRank
• Q: Why PageRank?
• A:
  – Primary ranking metric for search results
  – User’s visit probability under the random surfer model


PageRank

• A page is important if it is pointed to by many important pages

• $PR(p) = PR(p_1)/c_1 + \dots + PR(p_k)/c_k$

  $p_i$: page pointing to $p$; $c_i$: number of links in $p_i$

• PageRank of $p$ is the sum of the PageRanks of its parents, each divided by the parent's number of links

• One equation for every page
  – N equations, N unknown variables


PageRank: Random Surfer Model

• PageRank of a page: the probability that a Web surfer is at the page after many clicks, following random links

[Figure: random click]


Damping Factor and Trust Score

• Users do not always follow links
  – They get distracted and “jump” to other pages

$$PR(p_i) = d \sum_{p_j \to p_i} PR(p_j)/c_j + (1-d)\, t_i$$

  – The sum is over the pages $p_j$ that link to $p_i$
  – $d$: Damping factor. Probability of following a link.
  – $t_i$: Trust score. Non-zero only for the pages that the user trusts and jumps to.

• “TrustRank”, “Personalized PageRank”
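As a minimal sketch, the damped PageRank equation on this slide can be solved by repeated substitution. The dictionary-based graph (`links`, `trust`), the function name, the default d = 0.85, and the fixed iteration count are assumptions made for illustration; they are not part of the lecture. The sketch also assumes every page has at least one outgoing link, since otherwise some PageRank would leak out of the system.

```python
def personalized_pagerank(links, trust, d=0.85, iterations=100):
    """Iterate PR(p_i) = d * sum_j PR(p_j)/c_j + (1 - d) * t_i.

    links: dict mapping a page to the list of pages it links to (hypothetical format).
    trust: dict mapping a page to its trust score t_i; the scores should sum to 1.
    """
    pages = set(links) | set(trust)
    for outs in links.values():
        pages.update(outs)

    # Start from the trust vector; any distribution summing to 1 would do.
    pr = {p: trust.get(p, 0.0) for p in pages}
    for _ in range(iterations):
        # The (1 - d) * t_i term: the surfer jumps to a trusted page.
        nxt = {p: (1 - d) * trust.get(p, 0.0) for p in pages}
        # The d * PR(p_j)/c_j term: the surfer follows one of p_j's c_j links.
        for p_j, outs in links.items():
            if outs:
                share = d * pr[p_j] / len(outs)
                for p_i in outs:
                    nxt[p_i] += share
        pr = nxt
    return pr


if __name__ == "__main__":
    toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # made-up graph
    print(personalized_pagerank(toy_links, {"a": 1.0}))
```

With a single trusted page, this is the "always jump to one page when bored" setting used later in the lecture.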


RankMass: Definition

• RankMass of $D_C$, the set of downloaded pages:

$$RM(D_C) = \sum_{p_i \in D_C} PR(p_i)$$

  – Assuming personalized PageRank

• Now what? How can we use it for the crawling problem?
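To make the definition concrete, here is a small sketch that sums PageRank over a downloaded set. The PageRank values are made up for illustration, and the function name is hypothetical.

```python
def rank_mass(pagerank, downloaded):
    """RankMass of a download set: the sum of PR(p_i) over the downloaded pages."""
    return sum(pagerank.get(p, 0.0) for p in downloaded)


if __name__ == "__main__":
    toy_pagerank = {"a": 0.5, "b": 0.3, "c": 0.2}  # made-up values summing to 1
    print(rank_mass(toy_pagerank, {"a", "c"}))
```

In practice the exact PageRank values are not available without downloading the whole Web, which is the difficulty the following slides address.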


Two Crawling Challenges

• Coverage guarantee:
  – Given $\epsilon$, make sure we download at least $1 - \epsilon$ of the RankMass:

$$RM(D_C) \ge 1 - \epsilon$$

• Crawling efficiency:
  – For a given $|D_C|$, pick $D_C$ such that $RM(D_C)$ is the maximum


RankMass Guarantee

• Q: How can we provide a RankMass guarantee when we stop?

• Q: How do we calculate RankMass without downloading the whole Web?

• Q: Is there any way to provide the guarantee without knowing the exact PageRank?


RankMass Guarantee

• We can’t compute the exact PageRank, but we can lower-bound it

• How? Let’s start with a simple case


Single Trusted Page

• $t_1 = 1$; $t_i = 0$ ($i \ne 1$)

• Always jump to $p_1$ when bored

• $N_L(p_1)$: pages reachable from $p_1$ in at most $L$ links


Single Trusted Page

• Q: What is the probability of getting to a page $L$ links away from $p_1$?


RankMass Lower Bound: Single Trusted Page

• Assuming the trust vector $T^{(1)}$, the sum of the PageRank values of all L-neighbors of $p_1$ is at least $d^{L+1}$ close to 1. That is:

$$\sum_{p_i \in N_L(p_1)} PR(p_i) \ge 1 - d^{L+1}$$
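A sketch of why the bound holds, under the random surfer model with a single trusted page $p_1$: in the long run, the number of links $k$ the surfer has followed since the last jump to $p_1$ has probability $(1-d)\,d^k$. Whenever $k \le L$, the surfer must be at a page in $N_L(p_1)$, so summing over those cases gives the bound. The variable $k$ is introduced here only for this explanation.

```latex
\sum_{p_i \in N_L(p_1)} PR(p_i)
  \;\ge\; \sum_{k=0}^{L} (1-d)\, d^{k}
  \;=\; (1-d)\,\frac{1 - d^{L+1}}{1-d}
  \;=\; 1 - d^{L+1}
```

The inequality is not an equality in general because a surfer who has followed more than $L$ links may still happen to be inside $N_L(p_1)$.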


PageRank Linearity

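• Let $PR_k = [PR_k(p_i)]$ be the PageRank vector based on trust vector $T_k = [t_{ik}]$. That is (summing over the pages $p_j$ that link to $p_i$),

$$PR_k(p_i) = d \sum_{p_j \to p_i} PR_k(p_j)/c_j + (1-d)\, t_{ik}$$

• Let $T_3 = w\,T_1 + (1-w)\,T_2$ with $0 \le w \le 1$. Then for any page $p_i$,

$$PR_3(p_i) = w\,PR_1(p_i) + (1-w)\,PR_2(p_i)$$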
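One way to see the linearity is to write the PageRank equations in matrix form. The matrix $M$ below is notation assumed for this sketch: its entry $M_{ij}$ is $1/c_j$ if $p_j$ links to $p_i$ and 0 otherwise. For $d < 1$ the matrix $I - dM$ is invertible, so the PageRank vector is a fixed linear function of the trust vector.

```latex
PR_k = d\,M\,PR_k + (1-d)\,T_k
\quad\Longrightarrow\quad
PR_k = (1-d)\,(I - dM)^{-1}\,T_k

PR_3 = (1-d)\,(I - dM)^{-1}\,\bigl(w\,T_1 + (1-w)\,T_2\bigr)
     = w\,PR_1 + (1-w)\,PR_2
```

This is what allows a result proved for a single trusted page to carry over to any mixture of trusted pages.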


RankMass Lower Bound: General Case

• The RankMass of the L-neighbors of the group of all trusted pages $G$, $N_L(G)$, is at least $d^{L+1}$ close to 1. That is:

$$\sum_{p_i \in N_L(G)} PR(p_i) \ge 1 - d^{L+1}$$

• Q: Given the result, how should we download for a RankMass guarantee?
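The bound suggests a direct way to pick a crawl depth: find the smallest $L$ with $d^{L+1} \le \epsilon$. The sketch below does this by simple search; the function name and the default d = 0.85 are assumptions for illustration.

```python
def min_depth(epsilon, d=0.85):
    """Smallest L such that d**(L+1) <= epsilon, i.e. 1 - d**(L+1) >= 1 - epsilon."""
    if not (0 < epsilon < 1 and 0 < d < 1):
        raise ValueError("epsilon and d must both lie strictly between 0 and 1")
    L = 0
    while d ** (L + 1) > epsilon:
        L += 1
    return L


if __name__ == "__main__":
    print(min_depth(0.01))  # 28 for d = 0.85
```

Downloading every page within that many links of the trusted pages is then enough for the guarantee, although it may download far more pages than necessary.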


The L-Neighbor Crawler

1. L := 0
2. N[0] = {$p_i$ | $t_i > 0$}   // Start with trusted pages
3. While ($\epsilon < d^{L+1}$)
   1. Download all uncrawled pages in N[L]
   2. N[L + 1] = {all pages linked to by a page in N[L]}
   3. L = L + 1

• Essentially, a BFS (Breadth-First Search) crawling algorithm
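A minimal Python sketch of the L-Neighbor idea over an in-memory graph follows. Looking up `links[p]` stands in for downloading page `p`; the graph format, the function name, and the default d = 0.85 are assumptions. Note that this sketch tests the stopping condition right after downloading level L, so that the downloaded set is exactly $N_L(G)$ for the final L; that ordering is a choice made here, not something stated on the slide.

```python
def l_neighbor_crawl(links, trust, epsilon, d=0.85):
    """Breadth-first crawl from the trusted pages until d**(L+1) <= epsilon.

    links: dict mapping a page to the list of pages it links to (hypothetical format).
    trust: dict mapping a page to its trust score t_i.
    Returns (pages in download order, final depth L).
    """
    level = [p for p, t in trust.items() if t > 0]  # N[0]: the trusted pages
    seen = set(level)
    downloaded = []
    L = 0
    while level:
        downloaded.extend(level)  # "download" every uncrawled page in N[L]
        if d ** (L + 1) <= epsilon:
            break  # guaranteed RankMass 1 - d**(L+1) is now at least 1 - epsilon
        next_level = []
        for p in level:
            for q in links.get(p, []):
                if q not in seen:  # skip pages already crawled or queued
                    seen.add(q)
                    next_level.append(q)
        level = next_level
        L += 1
    return downloaded, L


if __name__ == "__main__":
    chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"], "e": ["f"], "f": ["g"]}
    print(l_neighbor_crawl(chain, {"a": 1.0}, epsilon=0.5))
    # (['a', 'b', 'c', 'd', 'e'], 4)
```

If the reachable graph runs out before the bound is met, the loop simply ends with everything reachable downloaded.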


Crawling Efficiency

• For a given $|D_C|$, pick $D_C$ such that $RM(D_C)$ is the maximum

• Q: Can we use L-Neighbor?
• A:
  – L-Neighbor is simple, but we need to further prioritize certain pages over others
  – Page-level prioritization


Page Level Prioritization

• Q: What page should we download first to maximize RankMass?
• A: Pages with high PageRank
• Q: How do we know which pages have high PageRank?
• The idea:
  – Calculate a PageRank lower bound for undownloaded pages
  – Give high priority to pages with a high lower bound


Calculating PageRank Lower Bound

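• $PR(p)$: probability that the random surfer is at $p$
• Break down the surfer's path by “interrupts”, that is, jumps to a trusted page
• Sum up the probabilities of all paths that start with an interrupt, jump to a trusted page, and end at $p$

[Figure: an example path that starts with an interrupt (probability $1-d$), jumps to a trusted page $p_j$ (probability $t_j$), and follows links through pages P1 … P5 to $p_i$. Each link followed from a page with $c$ links contributes a factor $d \cdot 1/c$; the factors shown are $d \cdot 1/3$, $d \cdot 1/4$ and $d \cdot 1/5$.]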


Calculating PageRank Lower Bound

• Q: What if we sum up the probabilities of only a subset of the paths to $p$?
• A: We get a “lower bound” on the PageRank of $p$
• Basic idea
  – Start with the set of trusted pages $G$
  – Enumerate paths to a page $p$ as we discover links
  – Sum up the probability of each discovered path to $p$

• Not every path is needed, only the ones that we have discovered so far
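The path-sum view can be written out as follows. The symbols $W_{ji}$ (all link paths from trusted page $p_j$ to $p_i$, including the empty path when $j = i$) and $D_{ji} \subseteq W_{ji}$ (the paths discovered so far) are notation assumed for this sketch. A path $w = (q_0 = p_j, q_1, \dots, q_n = p_i)$ contributes the interrupt probability, the jump probability, and one factor $d/c$ per link followed.

```latex
PR(p_i) \;=\; \sum_{p_j \in G} \; \sum_{w \in W_{ji}} (1-d)\, t_j \prod_{m=0}^{n-1} \frac{d}{c_{q_m}}
\;\;\ge\;\;
\sum_{p_j \in G} \; \sum_{w \in D_{ji}} (1-d)\, t_j \prod_{m=0}^{n-1} \frac{d}{c_{q_m}}
```

Since every term is non-negative, dropping the undiscovered paths can only lower the sum, which is why the partial sum is a valid lower bound.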


RankMass Crawler: High Level

• Dynamically update the lower bound on PageRank
  – By enumerating paths to pages

• Download the page with the highest lower bound
  – Sum of downloaded lower bounds = RankMass coverage


RankMass Crawler

• CRM = 0   // CRM: crawled RankMass
• $rm_i = (1 - d)\,t_i$ for each $t_i > 0$   // $rm_i$: RankMass (PageRank lower bound) of $p_i$
• While (CRM < $1 - \epsilon$):
  – Pick $p_i$ with the largest $rm_i$
  – Download $p_i$ if not downloaded yet
  – CRM = CRM + $rm_i$   // we have downloaded $p_i$
  – For each $p_j$ linked to by $p_i$: $rm_j = rm_j + (d/c_i)\,rm_i$   // update RankMass based on the discovered links from $p_i$
  – $rm_i = 0$
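The loop above can be sketched in Python over an in-memory graph, where looking up `links[p]` stands in for downloading page `p`. The graph format, the function name, the default d = 0.85, and the `max_steps` safety cap are assumptions for illustration. A plain `max` over a dictionary is used for clarity; a priority queue would be the more usual choice at scale.

```python
def rankmass_crawl(links, trust, epsilon, d=0.85, max_steps=100_000):
    """Greedy crawl that always expands the page with the largest lower bound rm_i.

    links: dict mapping a page to the list of pages it links to (hypothetical format).
    trust: dict mapping a page to its trust score t_i; the scores should sum to 1.
    Returns (pages in download order, crawled RankMass CRM).
    """
    rm = {p: (1 - d) * t for p, t in trust.items() if t > 0}  # rm_i = (1 - d) t_i
    crm = 0.0
    order, done = [], set()
    for _ in range(max_steps):
        if crm >= 1 - epsilon or not rm:
            break
        p_i = max(rm, key=rm.get)  # page with the largest lower bound
        mass = rm.pop(p_i)         # popping also resets rm_i to 0
        if p_i not in done:        # download p_i only on its first selection
            done.add(p_i)
            order.append(p_i)
        crm += mass
        outs = links.get(p_i, [])
        for p_j in outs:           # push d/c_i of the mass along each link
            rm[p_j] = rm.get(p_j, 0.0) + d * mass / len(outs)
    return order, crm


if __name__ == "__main__":
    toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # made-up graph
    print(rankmass_crawl(toy_links, {"a": 1.0}, epsilon=0.05))
```

In this sketch a page without outgoing links absorbs its mass, so on such graphs the loop may end because `rm` is empty before CRM reaches the target.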


Experimental Setup

• HTML files only
• Algorithms simulated over a Web graph
• Crawled between Dec. 2003 and Jan. 2004
• 141 million URLs spanning 6.9 million host names
• 233 top-level domains


Metrics Of Evaluation

1. How much RankMass is actually collected during the crawl

2. How much RankMass is “known” to have been collected during the crawl


L-Neighbor


RankMass


Algorithm Efficiency

Algorithm     Downloads required for           Downloads required for
              above 0.98 guaranteed RankMass   above 0.98 actual RankMass
L-Neighbor    7 million                        65,000
RankMass      131,072                          27,939
Optimal       27,101                           27,101


Summary

• Web crawler and its challenges
• Page selection problem
• PageRank
• RankMass guarantee
• Computing PageRank lower bound
• RankMass crawling algorithm
• Any questions?