Page 1: Pagerank (1)

The PageRank Citation Ranking: Bringing Order to the Web

Larry Page et al.

Stanford University

Presented by

Guoqiang Su & Wei Li

Page 2: Pagerank (1)

Contents

Motivation
Related work
PageRank & Random Surfer Model
Implementation
Application
Conclusion

Page 3: Pagerank (1)

Motivation

The web is heterogeneous and unstructured
No quality control on the web
Commercial interest in manipulating rankings

Page 4: Pagerank (1)

Related Work

Academic citation analysis
Link-based analysis
Clustering methods of link structure
Hubs & Authorities Model

Page 5: Pagerank (1)

Backlink

Link structure of the web
Approximation of importance / quality

Page 6: Pagerank (1)

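PageRank

Pages with lots of backlinks are important
Backlinks coming from important pages convey more importance to a page

$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$

where $B_u$ is the set of pages linking to $u$, $N_v$ is the number of links out of $v$, and $c$ is a normalization factor

Problem: Rank Sink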

Page 7: Pagerank (1)

Rank Sink

A cycle of pages that some incoming link points to

Problem: the loop accumulates rank but never distributes any rank outside
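A small sketch of the rank sink effect, not taken from the slides: the four-page graph, the `propagate` helper and the 100 iterations are all assumptions chosen for illustration. Pages 0 and 1 link to each other, page 0 also links into the loop formed by pages 2 and 3, and that loop has no link back out.

```python
def propagate(links, ranks):
    """One step of the simple ranking: each page splits its rank evenly
    over its outgoing links (hypothetical helper, no escape term)."""
    out_degree = {}
    for src, _ in links:
        out_degree[src] = out_degree.get(src, 0) + 1
    new_ranks = [0.0] * len(ranks)
    for src, dst in links:
        new_ranks[dst] += ranks[src] / out_degree[src]
    return new_ranks

# Assumed toy graph: 0 <-> 1, 0 -> 2, and the loop 2 <-> 3.
links = [(0, 1), (1, 0), (0, 2), (2, 3), (3, 2)]
ranks = [0.25] * 4  # start from a uniform assignment

for _ in range(100):
    ranks = propagate(links, ranks)

print(ranks)  # nearly all rank now sits on pages 2 and 3
```

Every pass moves half of page 0's rank into the loop and nothing ever comes back, so pages 0 and 1 drain toward zero.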

Page 8: Pagerank (1)

Escape Term

Solution: Rank Source

$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c E(u)$

c is maximized and $\|R\|_1 = 1$
E(u) is some vector over the web pages
– uniform, favorite page etc.

Page 9: Pagerank (1)

Matrix Notation

$R = c (A^T + E e^T) R$

R is the dominant eigenvector and c is the dominant eigenvalue of $(A^T + E e^T)$ because c is maximized
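One way to check that the matrix form agrees with the rank-source equation $R(u) = c \sum_{v \in B_u} R(v)/N_v + c E(u)$. Two things are assumed here because the slide does not define them: $e$ is the all-ones vector, and $A_{vu} = 1/N_v$ when page $v$ links to page $u$ (0 otherwise).

```latex
\begin{align*}
(A^T R)(u) &= \sum_{v} A_{vu} R(v) = \sum_{v \in B_u} \frac{R(v)}{N_v} \\
e^T R &= \sum_{u} R(u) = \|R\|_1 = 1 \qquad (R \ge 0) \\
c\,(A^T + E e^T)\,R &= c\,A^T R + c\,E\,(e^T R) = c\,(A^T R + E)
\end{align*}
```

Reading the last line component by component gives back the rank-source equation for each page $u$.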

Page 10: Pagerank (1)

Computing PageRank

$R_0 \leftarrow S$ - initialize vector over web pages
loop:
  $R_{i+1} \leftarrow A^T R_i$ - new ranks: sum of normalized backlink ranks
  $d \leftarrow \|R_i\|_1 - \|R_{i+1}\|_1$ - compute normalizing factor
  $R_{i+1} \leftarrow R_{i+1} + d E$ - add escape term
  $\delta \leftarrow \|R_{i+1} - R_i\|_1$ - control parameter
while $\delta > \epsilon$ - stop when converged
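The loop above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function name `pagerank`, the edge-list input, the uniform defaults for the start vector and the rank source, and the tolerance `eps` are all assumptions.

```python
def pagerank(links, n, source=None, eps=1e-8, max_iter=1000):
    """Iterate R <- A^T R + d*E until the L1 change drops to eps or below.

    links  : list of (src, dst) pairs with page IDs in range(n)
    source : rank source E as a list summing to 1 (assumed uniform if omitted)
    """
    out_degree = [0] * n
    for src, _ in links:
        out_degree[src] += 1

    E = source if source is not None else [1.0 / n] * n
    R = [1.0 / n] * n  # R_0 <- S, assumed uniform here

    for _ in range(max_iter):
        # New ranks: sum of normalized backlink ranks (A^T R).
        R_next = [0.0] * n
        for src, dst in links:
            R_next[dst] += R[src] / out_degree[src]

        # Normalizing factor: rank lost in this step, e.g. on pages without links.
        d = sum(R) - sum(R_next)

        # Escape term: hand the lost rank back according to E.
        R_next = [r + d * e for r, e in zip(R_next, E)]

        # Control parameter: L1 distance between successive rank vectors.
        delta = sum(abs(a - b) for a, b in zip(R_next, R))
        R = R_next
        if delta <= eps:
            break
    return R


# Assumed toy graph: page 2 has no outgoing links, so it loses rank each step.
ranks = pagerank([(0, 1), (0, 2), (1, 2)], 3)
print(ranks)  # approximately [2/11, 3/11, 6/11]
```

Passing a different `source`, for example all weight on one page, changes where the lost rank is reinjected, which is the knob the slides later use for personalization.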

Page 11: Pagerank (1)

Random Surfer Model

PageRank corresponds to the probability distribution of a random walk on the web graph

E(u) can be rephrased as follows: the random surfer periodically gets bored and jumps to a different page, so is not kept in a loop forever
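The random surfer reading can be illustrated with a short simulation. Everything specific here is an assumption made for the sketch: the function name `simulate_surfer`, the three-page graph, the jump probability of 0.15, a uniform choice of jump target, and the fixed seed.

```python
import random


def simulate_surfer(out_links, steps, jump_prob=0.15, seed=0):
    """Return the fraction of steps a simulated surfer spends on each page.

    out_links : dict mapping each page to the list of pages it links to
    jump_prob : assumed chance of getting bored at each step
    """
    rng = random.Random(seed)
    pages = sorted(out_links)
    visits = {page: 0 for page in pages}
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if not out_links[page] or rng.random() < jump_prob:
            # Bored (or stuck on a page without links): jump to a random page.
            page = rng.choice(pages)
        else:
            # Otherwise follow one of the current page's links at random.
            page = rng.choice(out_links[page])
    return {page: count / steps for page, count in visits.items()}


# Assumed toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
frequencies = simulate_surfer({0: [1, 2], 1: [2], 2: [0]}, steps=200_000)
print(frequencies)
```

In this graph page 1 has a single backlink, and it is shared with page 2, so the surfer ends up there least often.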

Page 12: Pagerank (1)

Implementation

Computing resources: 24 million pages, 75 million URLs
Memory and disk storage
– Weight vector (4-byte float)
– Matrix A (linear access)

Page 13: Pagerank (1)

Implementation (cont'd)

Unique integer ID for each URL
Sort and remove dangling links
Initial rank assignment
Iterate until convergence
Add back dangling links and recompute
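The first two steps can be sketched as follows. The helper names `assign_ids` and `remove_dangling` and the example URLs are hypothetical, and a dangling link is taken here to mean a link to a page with no outgoing links. Removing such links can expose new ones, so the removal repeats until nothing changes.

```python
def assign_ids(url_links):
    """Map each URL to a unique integer ID and rewrite the links with IDs."""
    ids = {}
    links = []
    for src, dst in url_links:
        for url in (src, dst):
            if url not in ids:
                ids[url] = len(ids)
        links.append((ids[src], ids[dst]))
    return ids, links


def remove_dangling(links):
    """Repeatedly drop links that point to pages without outgoing links.

    Returns (kept, removed) so the removed links can be added back later.
    """
    links = sorted(set(links))
    removed = []
    while True:
        has_out = {src for src, _ in links}
        kept = [(s, d) for s, d in links if d in has_out]
        dropped = [(s, d) for s, d in links if d not in has_out]
        if not dropped:
            return kept, removed
        removed.extend(dropped)
        links = kept


# Hypothetical URLs: a <-> b, b -> c, c -> d, where d has no outgoing links.
ids, links = assign_ids([("a", "b"), ("b", "a"), ("b", "c"), ("c", "d")])
kept, removed = remove_dangling(links)
print(kept)     # [(0, 1), (1, 0)]
print(removed)  # [(2, 3), (1, 2)]
```

Keeping the removed links around matches the last step on the slide, where the dangling links are added back before a final recomputation.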

Page 14: Pagerank (1)

Convergence Properties

Graph (V, E) is an expander with factor $\alpha$ if for all (not too large) subsets S: $|A_S| \ge \alpha |S|$, where $A_S$ is the neighborhood of S

Eigenvalue separation: the largest eigenvalue is sufficiently larger than the second-largest eigenvalue

A random walk converges fast to a limiting probability distribution on the set of nodes in the graph.

Page 15: Pagerank (1)

Convergence Properties (cont'd)

PageRank computation converges in roughly O(log(|V|)) iterations because the web graph G is rapidly mixing.

Page 16: Pagerank (1)

Personalized PageRank

Rank Source E can be initialized:
– uniformly over all pages: e.g. copyright warnings, disclaimers, mailing list archives result in overly high ranking
– total weight on a single page, e.g. Netscape, McCarthy: great variation of ranks under different single pages as rank source
– and everything in between, e.g. server root pages: allow manipulation by commercial interests

Page 17: Pagerank (1)

Applications I

Estimate web traffic
– Server/page aliases
– Link/traffic disparity, e.g. porn sites, free web-mail

Backlink predictor
– Citation counts have been used to predict future citations
– Very difficult to map the citation structure of the web completely
– Avoids the local maxima that citation counts get stuck in and gets better performance

Page 18: Pagerank (1)

Applications II - Ranking Proxy

Surfer's Navigation Aid

Annotating links by PageRank (bar graph)

Not query dependent

Page 19: Pagerank (1)

Issues

Users are not random walkers
– Content based methods
Starting point distribution
– Actual usage data as starting vector
Reinforcing effects/bias towards main pages
How about traffic to ranking pages?
No query specific rank
Linkage spam
– PageRank favors pages that managed to get other pages to link to them
– Linkage not necessarily a sign of relevancy, only of promotion (advertisement…)

Page 20: Pagerank (1)

Evaluation I

Page 21: Pagerank (1)

Evaluation II

Page 22: Pagerank (1)

Conclusion

PageRank is a global ranking based on the web's graph structure
PageRank uses backlink information to bring order to the web
PageRank can separate out representative pages as cluster centers
A great variety of applications

