Lecture 4 CS492 Special Topics in Computer Science Distributed Algorithms and Systems

Feb 24, 2016

Yi-Ching Yeh

Page 1: Lecture 4

Lecture 4

CS492 Special Topics in Computer Science
Distributed Algorithms and Systems

Page 2: Lecture 4

“The PageRank Citation Ranking: Bringing Order to the Web”

L. Page, S. Brin, R. Motwani, T. Winograd, 1998

Fall 2008 CS492

Page 3: Lecture 4

Origin of “Google”
- Googol: 10^100
- Motivation behind
  - Human-maintained indices such as Yahoo!
  - Explosive growth

[Figure: hostnames and active sites. Source: http://news.netcraft.com]

Page 4: Lecture 4

Design Goals of Google
- Improved search quality
  - In 1997, only 1 of the top 4 search engines could find itself
  - High precision in finding relevant documents was necessary
- Academic search engine research
  - Search engine technology went commercial: a black art
  - To build systems that a good number of people could use
  - To build an architecture to support novel research on large-scale Web data

Page 5: Lecture 4

Weakness of Existing Approaches
- Calculate similarities
  - Based on a flat, vector-space model of each page
  - Prone to cheating (Web spamming or search engine persuasion)

Page 6: Lecture 4

Basic Idea of PageRank
- Exploit the topological structure of hypertextual systems

Page 7: Lecture 4

Simple Example

[Figure: graph of pages A, B, and C; values shown: 0.2, 0.4, 0.4]

Page 8: Lecture 4

Related Work
- Academic citation analysis
- Similarities
  - Graph structure: paper = node, web page = node; citation = link, URL = link
  - “Node” authority is independent of “node” content
- Differences
  - Uniform unit of information (paper) versus great variability in quality, usage, citations, and length
  - Equal link weight versus variable importance: a backlink from Yahoo! vs. one from a friend

Page 9: Lecture 4

Which Page Should Be Ranked Higher?

[Figure: pages A and B; label “John Doe”]

Page 10: Lecture 4

Simple Expression

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$

- $R(u)$: page rank of $u$
- $B_u$: set of pages pointing at $u$
- $N_v$: out-degree of $v$

Question: What is the role of $c$?
Answer: It keeps the total rank of all web pages constant.
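The simple expression can be tried on a three-page graph. The following is a minimal sketch, not the lecture's own code: it writes R(u) for the rank of page u, B_u for the pages pointing at u, and N_v for the out-degree of v (notation assumed from the PageRank paper), and it assumes the hypothetical links A->B, A->C, B->C, C->A, since the transcript does not preserve the edges of the Simple Example figure. The constant c is applied by rescaling the ranks to sum to 1 after every pass.

```python
# Hypothetical link structure (an assumption; the slide's edges are not in the transcript).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def simple_pagerank(links, iterations=100):
    """Iterate R(u) = c * sum(R(v) / N_v for v in B_u), keeping the total rank at 1.

    Assumes every page has at least one outgoing link (no dangling pages).
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            share = rank[v] / len(targets)  # R(v) / N_v
            for u in targets:
                new_rank[u] += share
        total = sum(new_rank.values())
        # Rescaling by 1/total plays the role of c: the total rank stays constant.
        rank = {p: r / total for p, r in new_rank.items()}
    return rank

ranks = simple_pagerank(links)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
# {'A': 0.4, 'B': 0.2, 'C': 0.4}
```

Under these assumed links the ranks settle at 0.4, 0.2, and 0.4, the same three values that appear on the Simple Example slide.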

Page 11: Lecture 4

Dangling links
- Pages without outgoing pointers
  - Example: pages not yet downloaded
- Do not affect the calculation much
  - Remove them, calculate ranks, and add them back
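The removal step can be sketched as follows. This is an illustration under assumptions, not the procedure used by the authors: `split_dangling` is a hypothetical helper, the graph is a made-up example, and only a single removal pass is shown. How ranks are assigned when the dangling pages are added back is not specified on the slide, so the sketch stops after the split.

```python
def split_dangling(links):
    """Return (core, dangling): the graph without dangling pages, and the removed pages."""
    # Pages listed with an empty target list have no outgoing pointers.
    dangling = {p for p, targets in links.items() if not targets}
    # Pages that appear only as link targets (e.g. not yet downloaded) have none either.
    for targets in links.values():
        dangling.update(t for t in targets if t not in links)
    # Single pass only: removal can leave new pages without out-links,
    # which this sketch does not handle.
    core = {
        p: [t for t in targets if t not in dangling]
        for p, targets in links.items()
        if p not in dangling
    }
    return core, dangling

# Hypothetical example: D has no out-links, E was never downloaded.
web = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A", "E"], "D": []}
core, dangling = split_dangling(web)
```

Ranks would then be computed on `core` alone, with `dangling` kept aside for the add-back step.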

Page 12: Lecture 4

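Loop

[Figure: pages A, B, and C forming a loop]

Question: What are the ranks of A, B, and C?
Answer: Infinite! The loop accumulates rank from incoming links but never distributes any of it (a rank sink).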
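A small experiment makes the rank sink visible. The sketch below is an assumption-laden illustration: the graph (a hypothetical page X feeding a two-page loop A, B) and the helper `step` are not from the slides. Because this version keeps the total rank fixed at 1, the sink shows up as the loop absorbing the whole rank rather than as an unbounded value.

```python
# Hypothetical graph: X points into a loop that never links back out.
links = {"X": ["A"], "A": ["B"], "B": ["A"]}

def step(rank, links):
    """One pass of the undamped update: each page splits its rank over its out-links."""
    new_rank = {p: 0.0 for p in links}
    for v, targets in links.items():
        for u in targets:
            new_rank[u] += rank[v] / len(targets)
    return new_rank

rank = {p: 1.0 / 3 for p in links}
for _ in range(10):
    rank = step(rank, links)

loop_share = rank["A"] + rank["B"]  # rank trapped inside the loop
```

After the first pass X holds no rank at all, and A and B keep passing the full total back and forth, so the iteration also never settles on fixed values for A and B.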

Page 13: Lecture 4

Basic Algorithm

[Equation not preserved in the transcript; its legend follows]
- page rank of a page
- set of pages pointing at that page
- out-degree of a page
- damping factor
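Since the slide's equation did not survive, the sketch below assumes the damped form given in [BP98], PR(u) = (1 - d) + d * sum(PR(v) / C(v)) over the pages v that link to u, with d = 0.85 as suggested there. Whether the slide used exactly this form is an assumption. The function name `damped_pagerank` and the test graph (a hypothetical page X feeding a two-page loop) are illustrative only.

```python
def damped_pagerank(links, d=0.85, iterations=200):
    """Iterate PR(u) = (1 - d) + d * sum(PR(v) / C(v)) over pages v linking to u.

    Assumes every page has at least one outgoing link.
    """
    rank = {p: 1.0 for p in links}
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in links}  # every page keeps a base rank of 1 - d
        for v, targets in links.items():
            for u in targets:
                new_rank[u] += d * rank[v] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: X feeds a loop A <-> B that never links back out.
loop = {"X": ["A"], "A": ["B"], "B": ["A"]}
ranks = damped_pagerank(loop)
```

With damping, X keeps a rank of 1 - d = 0.15 instead of dropping to zero, and A and B converge to finite values, so the loop no longer swallows everything.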

Page 14: Lecture 4

Matrix Representation

[Equation not preserved in the transcript]

Question: Where to start?

Page 15: Lecture 4

Iterative Algorithm

[Equation not preserved in the transcript]

Question: Will it converge?
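Both questions, where to start and whether the iteration converges, can be explored numerically. The following is a sketch under assumptions: `power_iterate` is a hypothetical helper that repeats x <- M x until the L1 change falls below a tolerance, and M is the column-stochastic matrix of a made-up graph (A->B, A->C, B->C, C->A). Convergence here is a property of this particular graph, which is strongly connected and aperiodic; it is not guaranteed for arbitrary link matrices.

```python
def power_iterate(M, start, eps=1e-12, max_iter=10_000):
    """Repeat x <- M x until the L1 change drops below eps; return (x, iterations)."""
    x = list(start)
    for k in range(1, max_iter + 1):
        y = [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]
        delta = sum(abs(a - b) for a, b in zip(y, x))
        x = y
        if delta < eps:
            return x, k
    raise RuntimeError("no convergence within max_iter")

# Column-stochastic matrix: M[i][j] is the share of page j's rank passed to page i.
# Hypothetical graph A->B, A->C, B->C, C->A, with pages ordered A, B, C.
M = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]

uniform, n_uniform = power_iterate(M, [1 / 3, 1 / 3, 1 / 3])
skewed, n_skewed = power_iterate(M, [1.0, 0.0, 0.0])
```

Both start vectors lead to the same ranks (0.4, 0.2, 0.4) for this matrix; only the number of iterations differs.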

Page 16: Lecture 4


Example

[LM04]

Page 17: Lecture 4


Turn the Problem into a Markov Process

[LM04]

Page 18: Lecture 4


Evenly Split Rank of Dangling Links

[LM04]
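One way to read "evenly split" is the fix described in [LM04]: in the row-stochastic transition matrix, the all-zero row of a dangling page is replaced by a uniform row of 1/n. The sketch below illustrates that reading; `stochastic_matrix` and the four-page graph are hypothetical and not taken from the slide.

```python
def stochastic_matrix(links, pages):
    """Row-stochastic transition matrix; rows of dangling pages become uniform 1/n."""
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    P = []
    for p in pages:
        targets = links.get(p, [])
        if targets:
            row = [0.0] * n
            for t in targets:
                row[index[t]] += 1.0 / len(targets)
        else:
            # Dangling page: its rank is split evenly over all n pages.
            row = [1.0 / n] * n
        P.append(row)
    return P

# Hypothetical graph in which D has no outgoing links.
pages = ["A", "B", "C", "D"]
links = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": []}
P = stochastic_matrix(links, pages)
```

Every row of `P` now sums to 1, so the matrix describes a valid Markov chain even though D links to nothing.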

Page 19: Lecture 4

Final Solution
- Eigenvector of P = steady-state rank
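The steady-state claim can be checked on a small chain. This sketch assumes P is row-stochastic and that the rank vector is the left eigenvector for eigenvalue 1 (pi P = pi), following the convention in [LM04]. The helper `stationary` and the 3-state matrix are made up for illustration.

```python
def stationary(P, eps=1e-12, max_iter=10_000):
    """Left power iteration: pi <- pi P until the L1 change drops below eps."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < eps:
            return nxt
        pi = nxt
    raise RuntimeError("no convergence within max_iter")

# Hypothetical row-stochastic matrix (each row sums to 1).
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
pi = stationary(P)

# Applying P once more should leave pi unchanged.
check = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
```

For this matrix the stationary vector is (0.25, 0.5, 0.25), and `check` reproduces it.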

Page 20: Lecture 4


Spam Rank

[BGS05]

Page 21: Lecture 4

Questions
- Where to start?
  - Find a nondegenerate start vector
- What if two pages point to each other and to no one else, and a third page points to one of them?
  - The damping factor guarantees that there is no rank sink

Page 22: Lecture 4

References
[BP98] Sergey Brin, Lawrence Page, “The anatomy of a large-scale hypertextual Web search engine,” Computer Networks and ISDN Systems, Vol. 30, 1998.
[BGS05] Monica Bianchini, Marco Gori, Franco Scarselli, “Inside PageRank,” ACM Transactions on Internet Technology, Vol. 5, No. 1, Feb. 2005.
[LM04] Amy N. Langville, Carl Meyer, “Deeper inside PageRank,” Internet Mathematics, Vol. 1, No. 3, 2004.
[K99] Jon Kleinberg, “Authoritative sources in a hyperlinked environment,” Journal of the ACM, Vol. 46, No. 5, 1999.