Page 1: Pagerank

The PageRank Citation Ranking: Bringing Order to the Web

Larry Page et al.

Stanford University

Presented by

Guoqiang Su & Wei Li

Page 2: Pagerank

Contents

Motivation
Related work
PageRank & Random Surfer Model
Implementation
Application
Conclusion

Page 3: Pagerank

Motivation

Web: heterogeneous and unstructured
No quality control on the web
Commercial interest to manipulate ranking

Page 4: Pagerank

Related Work

Academic citation analysis
Link-based analysis
Clustering methods of link structure
Hubs & Authorities Model

Page 5: Pagerank

Backlink

Link structure of the web
Approximation of importance / quality

Page 6: Pagerank

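PageRank

Pages with lots of backlinks are important
Backlinks coming from important pages convey more importance to a page

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$

where $B_u$ is the set of pages that link to $u$, $N_v$ is the number of links out of $v$, and $c$ is a normalization factor

Problem: Rank Sink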

Page 7: Pagerank

Rank Sink

A cycle of pages that is pointed to by some incoming link

Problem: this loop accumulates rank but never distributes any rank outside
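The rank sink effect can be reproduced on a toy graph. The sketch below is illustrative only: the four page names and the link structure are made up, and c is fixed at 1 so that total rank is conserved and the drift into the loop is easy to see.

```python
def simplified_step(ranks, out_links):
    """One update R(u) = sum over backlinks v of R(v) / N_v, with c = 1."""
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in out_links.items():
        share = ranks[page] / len(targets)  # rank is split evenly over out-links
        for target in targets:
            new_ranks[target] += share
    return new_ranks


# Hypothetical graph: "a" and "b" link to each other, and "a" also links
# into the loop x <-> y, which has no link back out.
out_links = {"a": ["b", "x"], "b": ["a"], "x": ["y"], "y": ["x"]}
ranks = {page: 0.25 for page in out_links}

for _ in range(50):
    ranks = simplified_step(ranks, out_links)

loop_rank = ranks["x"] + ranks["y"]
print(f"rank held by the loop: {loop_rank:.6f}")
print(f"rank left outside:     {ranks['a'] + ranks['b']:.2e}")
```

Every pass through "a" leaks half of its rank into the loop and nothing ever comes back, so the loop ends up holding essentially all of the rank.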

Page 8: Pagerank

Escape Term

Solution: Rank Source

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c E(u)$$

c is maximized and $\|R\|_1 = 1$
E(u) is some vector over the web pages
– uniform, favorite page etc.

Page 9: Pagerank

Matrix Notation

$$R = c \, (A^T + E e^T) \, R$$

R is the dominant eigenvector and c is the dominant eigenvalue of $(A^T + E e^T)$ because c is maximized
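The matrix form follows from the escape-term equation in one step. As a sketch, assume that $A$ has entries $A_{v,u} = 1/N_v$ when page $v$ links to page $u$ and 0 otherwise, that $e$ is the all-ones vector, and that $R$ has non-negative entries. These conventions are assumptions; the slide does not define them.

```latex
\begin{align*}
R &= c \,(A^T R + E)           && \text{escape-term equation in vector form} \\
  &= c \,(A^T R + E\, e^T R)   && \text{since } e^T R = \|R\|_1 = 1 \\
  &= c \,(A^T + E e^T)\, R
\end{align*}
```

Under these assumptions $(A^T R)_u = \sum_{v \in B_u} R(v)/N_v$, so the first line is the per-page formula written for all pages at once.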

Page 10: Pagerank

Computing PageRank

$R_0 \leftarrow S$ - initialize vector over web pages
loop:
    $R_{i+1} \leftarrow A^T R_i$ - new ranks: sum of normalized backlink ranks
    $d \leftarrow \|R_i\|_1 - \|R_{i+1}\|_1$ - compute normalizing factor
    $R_{i+1} \leftarrow R_{i+1} + dE$ - add escape term
    $\delta \leftarrow \|R_{i+1} - R_i\|_1$ - control parameter
while $\delta > \epsilon$ - stop when converged
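The loop above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is a made-up four-page example, the start vector S is set equal to a uniform E, and `epsilon` and `max_iter` are arbitrary choices.

```python
def pagerank(out_links, source, epsilon=1e-10, max_iter=1000):
    """Iterate R <- A^T R + d E until the L1 change drops to epsilon."""
    pages = list(out_links)
    ranks = dict(source)  # R_0 <- S (assumption: S is the same vector as E)
    for _ in range(max_iter):
        # New ranks: sum of normalized backlink ranks.
        new_ranks = {page: 0.0 for page in pages}
        for page, targets in out_links.items():
            for target in targets:
                new_ranks[target] += ranks[page] / len(targets)
        # d on the slide: rank lost in this step (all ranks are non-negative,
        # so the L1 norm is just the sum).
        lost = sum(ranks.values()) - sum(new_ranks.values())
        # Add the escape term d * E.
        for page in pages:
            new_ranks[page] += lost * source[page]
        # Control parameter: L1 distance between successive rank vectors.
        delta = sum(abs(new_ranks[page] - ranks[page]) for page in pages)
        ranks = new_ranks
        if delta <= epsilon:
            break
    return ranks


# Hypothetical graph; "pdf" has no out-links, so its rank is lost in the
# multiplication step and handed back through E.
out_links = {
    "home": ["docs", "blog", "pdf"],
    "docs": ["blog"],
    "blog": ["home"],
    "pdf": [],
}
uniform = {page: 1.0 / len(out_links) for page in out_links}
ranks = pagerank(out_links, uniform)
for page, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:5s} {rank:.4f}")
```

In this sketch the escape term is non-zero only because "pdf" has no out-links; on a graph where every page links somewhere, the loop as written never loses rank and E would have no effect.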

Page 11: Pagerank

Random Surfer Model

PageRank corresponds to the probability distribution of a random walk on the web graph

E(u) can be rephrased as follows: the random surfer periodically gets bored and jumps to a different page, so the surfer is not kept in a loop forever
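The random surfer can be simulated directly. The sketch below is a hedged illustration: the graph is made up, the boredom probability of 0.15 and the uniform jump target are assumptions not stated on the slide, and the seed is fixed only to make runs repeatable.

```python
import random


def simulate_surfer(out_links, steps, boredom=0.15, seed=0):
    """Return the fraction of steps the surfer spends on each page."""
    rng = random.Random(seed)
    pages = sorted(out_links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        targets = out_links[current]
        if not targets or rng.random() < boredom:
            current = rng.choice(pages)    # bored (or stuck): jump anywhere
        else:
            current = rng.choice(targets)  # otherwise follow a random link
    return {page: visits[page] / steps for page in pages}


# Hypothetical graph: three pages link to "hub", and "hub" links back to "a".
out_links = {"hub": ["a"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
frequencies = simulate_surfer(out_links, steps=100_000)
for page, freq in frequencies.items():
    print(f"{page:3s} {freq:.3f}")
```

Pages "b" and "c" have no backlinks, so the surfer only reaches them by jumping; "hub" and "a" collect most of the visits.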

Page 12: Pagerank

Implementation

Computing resources
– 24 million pages
– 75 million URLs
Memory and disk storage
– Weight vector (4-byte float)
– Matrix A (linear access)
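The size of the weight vector follows from the two figures on the slide. The arithmetic below assumes one 4-byte weight per URL and decimal megabytes; both are assumptions about how the slide's numbers combine.

```python
NUM_URLS = 75_000_000   # URL count from the slide
BYTES_PER_WEIGHT = 4    # one 4-byte float per URL (assumed layout)

weight_vector_bytes = NUM_URLS * BYTES_PER_WEIGHT
weight_vector_mb = weight_vector_bytes / 1_000_000  # decimal megabytes

print(f"{weight_vector_mb:.0f} MB per weight vector")
```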

Page 13: Pagerank

Implementation (cont'd)

Unique integer ID for each URL
Sort and remove dangling links
Rank initial assignment
Iteration until convergence
Add back dangling links and re-compute
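The first two steps can be sketched as follows. This is an illustration under assumptions: the URLs are made up, a dangling link is taken to mean a link whose target has no out-links, removal is repeated until nothing changes, and the sorting step is omitted.

```python
def assign_ids(links):
    """Map each URL to a unique integer ID in order of first appearance."""
    ids = {}
    for src, dst in links:
        for url in (src, dst):
            if url not in ids:
                ids[url] = len(ids)
    return ids


def remove_dangling(edges):
    """Repeatedly drop links whose target has no out-links of its own."""
    edges = set(edges)
    while True:
        sources = {src for src, _ in edges}
        kept = {(src, dst) for src, dst in edges if dst in sources}
        if kept == edges:  # nothing removed in this pass: done
            return kept
        edges = kept


# Hypothetical link list (source URL, target URL).
links = [
    ("a.com", "b.com"),
    ("b.com", "a.com"),
    ("b.com", "c.com"),
    ("c.com", "d.com"),
]
ids = assign_ids(links)
edges = {(ids[src], ids[dst]) for src, dst in links}
core = remove_dangling(edges)
print(ids)
print(sorted(core))
```

Removing the link to "d.com" turns "c.com" into a page without out-links, which is why the removal runs in several passes.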

Page 14: Pagerank

Convergence Properties

Graph (V, E) is an expander with factor $\alpha$ if for all (not too large) subsets S: $|A_S| \ge \alpha |S|$, where $A_S$ is the neighborhood of S (the nodes reachable by links out of S)
Eigenvalue separation: the largest eigenvalue is sufficiently larger than the second-largest eigenvalue
A random walk converges fast to a limiting probability distribution on the set of nodes in the graph.

Page 15: Pagerank

Convergence Properties (cont'd)

PageRank computation converges in roughly O(log(|V|)) iterations because the web graph G is rapidly mixing.

Page 16: Pagerank

Personalized PageRank

Rank source E can be initialized:
– uniformly over all pages: pages such as copyright warnings, disclaimers and mailing list archives end up ranked overly high
– with total weight on a single page, e.g. Netscape, McCarthy: ranks vary greatly depending on which single page is the rank source
– and everything in between, e.g. server root pages: allows manipulation by commercial interests
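The effect of putting all of E on a single page can be seen on a small graph. The sketch below is illustrative: the graph and page names are made up, the total weight of 0.15 for E is an assumption, and the iteration is a normalized power iteration on $(A^T + E e^T)$ rather than the exact loop from the "Computing PageRank" slide.

```python
def personalized_pagerank(out_links, source, iterations=100):
    """Iterate R <- (A^T R + E) / ||A^T R + E||_1 from a uniform start."""
    pages = sorted(out_links)
    ranks = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_ranks = dict(source)  # start each round from E(u)
        for page, targets in out_links.items():
            for target in targets:
                new_ranks[target] += ranks[page] / len(targets)
        total = sum(new_ranks.values())  # renormalize so that ||R||_1 = 1
        ranks = {page: new_ranks[page] / total for page in pages}
    return ranks


def single_page_source(pages, favorite, weight=0.15):
    """Rank source E with all of its weight on one page (weight is assumed)."""
    return {page: (weight if page == favorite else 0.0) for page in pages}


# Hypothetical graph: two clusters (a, b) and (c, d) that link to each other.
out_links = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c", "a"]}
by_a = personalized_pagerank(out_links, single_page_source(out_links, "a"))
by_c = personalized_pagerank(out_links, single_page_source(out_links, "c"))
for page in sorted(out_links):
    print(f"{page}: E on a -> {by_a[page]:.3f}   E on c -> {by_c[page]:.3f}")
```

The graph is symmetric under swapping the two clusters, so any difference between the two columns comes from the choice of rank source alone.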

Page 17: Pagerank

Applications I

Estimate web traffic
– Server/page aliases
– Link/traffic disparity, e.g. porn sites, free web-mail
Backlink predictor
– Citation counts have been used to predict future citations
– It is very difficult to map the citation structure of the web completely
– PageRank avoids the local maxima that citation counts get stuck in and gets better performance

Page 18: Pagerank

Applications II - Ranking Proxy

Surfer's Navigation Aid

Annotating links by PageRank (bar graph)

Not query dependent

Page 19: Pagerank

Issues

Users are no random walkers
– Content based methods
Starting point distribution
– Actual usage data as starting vector
Reinforcing effects/bias towards main pages
How about traffic to ranking pages?
No query specific rank
Linkage spam
– PageRank favors pages that managed to get other pages to link to them
– Linkage not necessarily a sign of relevancy, only of promotion (advertisement…)

Page 20: Pagerank

Evaluation I

Page 21: Pagerank

Evaluation II

Page 22: Pagerank

Conclusion

PageRank is a global ranking based on the web's graph structure
PageRank uses backlink information to bring order to the web
PageRank can separate out representative pages as cluster centers
A great variety of applications

