Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle

Jan 03, 2016

Page 1: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Overview of Web Ranking Algorithms: HITS and PageRank

April 6, 2006
Presented by: Bill Eberle

Page 2: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Overview

Problem
Web as a Graph
HITS
PageRank
Comparison

Page 3: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Problem

Specific queries (scarcity problem).

Broad-topic queries (abundance problem).

Goal: to find the smallest set of “authoritative” sources.

Page 4: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Web as a Graph

Web pages as nodes of a graph. Links as directed edges.

[Figure: a small web graph whose nodes are "my page", www.uta.edu, and www.google.com, joined by directed links.]

Page 5: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Link Structure of the Web

Forward links (out-edges).

Backward links (in-edges).

Approximation of importance/quality: a page may be of high quality if it is referred to by many other pages, and by pages of high quality.
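As a minimal sketch of the graph view above, the forward and backward links can be kept in two adjacency maps. The function name `build_link_maps` and the three example edges are assumptions made for illustration; the slides name the pages but the link directions used here are hypothetical.

```python
from collections import defaultdict


def build_link_maps(edges):
    """Build forward (out-edge) and backward (in-edge) maps from (source, target) pairs."""
    forward = defaultdict(set)
    backward = defaultdict(set)
    for source, target in edges:
        forward[source].add(target)   # source links out to target
        backward[target].add(source)  # target is linked to by source
    return forward, backward


# Hypothetical links between the pages named on the slide.
edges = [
    ("my page", "www.uta.edu"),
    ("my page", "www.google.com"),
    ("www.uta.edu", "www.google.com"),
]
forward, backward = build_link_maps(edges)
```

Counting the entries of `backward[page]` gives the simple "referred to by many other pages" signal; HITS and PageRank refine it by also weighting who the referring pages are.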

Page 6: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

HITS

HITS (Hyperlink-Induced Topic Search)

“Authoritative Sources in a Hyperlinked Environment”, Jon Kleinberg, Cornell University. 1998.

Page 7: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Authorities and Hubs

An authority is a page that has relevant information about the topic.

A hub is a page that has a collection of links to pages about that topic.

[Figure: a hub h pointing to authorities a1, a2, a3, a4.]

Page 8: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Authorities and Hubs (cont.)

Good hubs are the ones that point to good authorities.

Good authorities are the ones that are pointed to by good hubs.

[Figure: hubs h1 to h5 linking to authorities a1 to a6.]

Page 9: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Finding Authorities and Hubs

First, construct a focused sub-graph of the Web.

Second, compute Hubs and Authorities from the sub-graph.

Page 10: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Construction of Sub-graph

[Figure: pipeline with stages labeled Topic, Search Engine, Root set pages, Crawler, Expanded set pages; the expanded set contains the root set and forward-link pages.]

Page 11: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Root Set and Base Set

Use the query term to collect a root set of pages from a text-based search engine (AltaVista).

[Figure: the root set.]

Page 12: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Root Set and Base Set (cont.)

Expand root set into base set by including (up to a designated size cut-off):

All pages linked to by pages in root set

All pages that link to a page in root set

[Figure: the root set nested inside the larger base set.]
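The expansion step could be sketched roughly as follows. The function name `expand_root_set`, the dictionary-based link maps, and the `max_in_links` parameter are assumptions for illustration; in particular, applying the size cut-off as a per-page cap on in-linking pages is one plausible reading of "designated size cut-off", not something the slides specify.

```python
def expand_root_set(root_set, forward, backward, max_in_links=50):
    """Grow a root set into a base set.

    forward:  page -> pages it links to
    backward: page -> pages that link to it
    max_in_links: assumed cap on in-linking pages added per root page
    """
    base_set = set(root_set)
    for page in root_set:
        # All pages linked to by a page in the root set.
        base_set.update(forward.get(page, ()))
        # Pages that link to a root page, capped so popular pages do not flood the set.
        in_links = sorted(backward.get(page, ()))
        base_set.update(in_links[:max_in_links])
    return base_set
```

Sorting the in-links only makes the sketch deterministic; any choice of pages up to the cap would serve the same purpose.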

Page 13: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Hubs & Authorities Calculation

Iterative algorithm on Base Set: authority weights a(p), and hub weights h(p).

Set authority weights a(p) = 1, and hub weights h(p) = 1 for all p.

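Repeat the following two operations (and then re-normalize a and h to have unit norm):

$$a(p) = \sum_{q \,:\, q \text{ points to } p} h(q)$$

[Figure: v1, v2, v3 each point to p, so a(p) = h(v1) + h(v2) + h(v3).]

$$h(p) = \sum_{q \,:\, p \text{ points to } q} a(q)$$

[Figure: p points to v1, v2, v3, so h(p) = a(v1) + a(v2) + a(v3).]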
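The two update operations can be sketched in a few lines of Python. This is an illustrative version under stated assumptions: the names `hits` and `normalize`, the fixed iteration count, and the choice to compute hub weights from the freshly updated authority weights are all choices made here, not details given in the slides.

```python
import math


def normalize(weights):
    """Scale weights so the sum of squares is 1 (unit Euclidean norm)."""
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {page: w / norm for page, w in weights.items()}


def hits(links, iterations=50):
    """links: page -> iterable of pages it points to. Returns (authority, hub) weights."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all q that point to p
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # h(p) = sum of a(q) over all q that p points to
        new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub
```

For example, `hits({"h1": ["a1", "a2"], "h2": ["a1"]})` gives a1 a higher authority weight than a2, because both hubs point to it, and gives h1 a higher hub weight than h2.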

Page 14: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Example

[Figure: example graph; each node is labeled with its (hub, authority) weights: (0.45, 0.45), (0.45, 0.45), (0.45, 0.45), (0.45, 0.45).]

Page 15: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Example (cont.)

[Figure: the same graph after one update; node labels give (hub, authority) weights: (0.9, 0.45), (1.35, 0.9), (0.45, 0.9), (0.45, 0.9).]

Page 16: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Algorithmic Outcome

Applying iterative multiplication (power iteration) to any “non-degenerate” initial vector converges to the principal eigenvector.

Hubs and authorities emerge as the outcome of this process.

The largest entries of the principal eigenvector identify the top hubs and authorities.

Page 17: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Results

Although HITS is only link-based (it completely disregards page content), results are quite good in many tested queries.

When the author tested the query “search engines”:

The algorithm returned Yahoo!, Excite, Magellan, Lycos, AltaVista.

However, none of these pages described themselves as a “search engine” (at the time of the experiment).

Page 18: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Issues

Starting from a narrow topic, HITS tends to drift toward a more general one.

Hub pages with many links can cause this topic drift, because they may point to authorities on different topics.

Pages from a single domain / website can dominate the result if they all point to one page, which is not necessarily a good authority.

Page 19: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Possible Enhancements

Use weighted sums for link calculation.

Take advantage of “anchor text”: the text surrounding the link itself.

Break hubs into smaller pieces. Analyze each piece separately, instead of the whole hub page as one.

Disregard or minimize the influence of links inside one domain.

IBM expanded HITS into Clever; not seen as a viable real-time search engine.

Page 20: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

PageRank

“The PageRank Citation Ranking: Bringing Order to the Web”, Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, Stanford University. 1998.

Page 21: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Basic Idea

Back-links coming from important pages convey more importance to a page. For example, if a web page has a link off the Yahoo home page, it may be just one link but it is a very important one.

A page has high rank if the sum of the ranks of its back-links is high. This covers both the case when a page has many back-links and when a page has a few highly ranked back-links.

Page 22: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

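Definition

My page’s rank is equal to the sum of the ranks of all the pages pointing to me, each divided by that page’s number of outgoing links.

$$\mathrm{Rank}(u) = \sum_{v \in B_u} \frac{\mathrm{Rank}(v)}{N_v}$$

$B_u$ = set of pages with links to u

$N_v$ = number of links from v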

Page 23: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Simplified PageRank Example

$$\mathrm{Rank}(u) = c \sum_{v \in B_u} \frac{\mathrm{Rank}(v)}{N_v}$$

Rank(u) is the rank of page u, and c is a normalization constant (c < 1 to cover for pages with no outgoing links).

Page 24: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Expanded Definition

R(u): page rank of page u
c: factor used for normalization (< 1)
B_u: set of pages pointing to u
N_v: number of outbound links of v
R(v): page rank of page v that points to u
E(u): distribution over web pages to which a random surfer periodically jumps (set to 0.15)

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c\,E(u)$$

Page 25: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Problem 1 - Rank Sink

A cycle of pages that is pointed to by some incoming link but has no links leading out of the cycle.

The loop will accumulate rank but never distribute it.
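A tiny experiment makes the rank sink visible. The sketch below repeats the simplified update (each page's rank is the sum of its back-links' ranks divided by their out-link counts) on a hypothetical three-page graph; the page names S, X, Y and the function name `simplified_step` are assumptions for illustration.

```python
def simplified_step(rank, links):
    """One pass of Rank(u) = sum over back-links v of Rank(v) / N_v."""
    new_rank = {page: 0.0 for page in rank}
    for v, targets in links.items():
        for u in targets:
            new_rank[u] += rank[v] / len(targets)
    return new_rank


# Hypothetical graph: S links into a two-page loop X <-> Y that has no exit.
links = {"S": ["X"], "X": ["Y"], "Y": ["X"]}
rank = {page: 1.0 / 3 for page in links}
for _ in range(10):
    rank = simplified_step(rank, links)
```

After the loop, all of the rank sits inside the X/Y cycle and S is left with none, which is the behaviour the escape term is meant to counter.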

Page 26: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Problem 2 - Dangling Links

Many Web pages have no forward links; links that point to such pages are dangling links.

Dangling links do not affect the ranking of any other page directly, so they are removed until all the PageRanks are calculated and then added back.

Page 27: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Random Surfer Model

PageRank corresponds to the probability distribution of a random walk on the web graph.

Page 28: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Solution - Escape Term

Escape term: E(u) can be thought of as the random surfer getting bored periodically and jumping to a different page, rather than staying in the loop forever.

E is a vector over all the web pages that accounts for each page’s escape probability (a user-defined parameter).

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c\,E(u)$$

Page 29: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

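PageRank Computation

$R_0 \leftarrow S$ - initialize vector over web pages

Loop:

$R_{i+1} \leftarrow A^T R_i$ - new ranks: sum of normalized backlink ranks

$d \leftarrow \lVert R_i \rVert_1 - \lVert R_{i+1} \rVert_1$ - compute normalizing factor

$R_{i+1} \leftarrow R_{i+1} + dE$ - add escape term

$\delta \leftarrow \lVert R_{i+1} - R_i \rVert_1$ - control parameter

While $\delta > \epsilon$ - stop when converged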
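The loop above might be written out as follows. This is a sketch under several assumptions: the function name `pagerank`, the dictionary-based graph, the uniform starting vector standing in for S, the `max_iter` safety limit, and the requirement that the escape vector sums to 1 (so that the total rank is preserved) are all choices made for illustration.

```python
def pagerank(links, escape, epsilon=1e-10, max_iter=1000):
    """links: page -> list of pages it links to.
    escape: page -> E(page), assumed to sum to 1; every linked page must appear here.
    """
    pages = list(escape)
    rank = {p: 1.0 / len(pages) for p in pages}  # R_0 <- S (uniform start, an assumption)
    for _ in range(max_iter):
        # R_{i+1} <- A^T R_i: each page passes rank / N_v along each out-link.
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for u in targets:
                new_rank[u] += rank[v] / len(targets)
        # d <- ||R_i||_1 - ||R_{i+1}||_1: rank lost, e.g. at pages with no out-links.
        d = sum(rank.values()) - sum(new_rank.values())
        # R_{i+1} <- R_{i+1} + dE: redistribute the lost rank via the escape vector.
        for p in pages:
            new_rank[p] += d * escape[p]
        # delta <- ||R_{i+1} - R_i||_1
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta <= epsilon:
            break
    return rank
```

For a graph with links A to B, A to C, B to C, and C to A, and a uniform escape vector, this settles at roughly 0.4 for A, 0.2 for B, and 0.4 for C.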

Page 30: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Matrices

A is a square matrix whose rows and columns correspond to web pages u and v.

Given that A is a matrix and R is a vector over all the Web pages, the rank vector we want is the dominant eigenvector, the one associated with the maximal eigenvalue.

Page 31: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Example

AT=

Page 32: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Example (cont.)

A =

R = Normalized =

A x = λ x

(A - λI) x = 0

R = c A R = M R

c: eigenvalue

R: eigenvector of A

Page 33: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Implementation

1. URL -> id
2. Store each hyperlink in a database.
3. Sort link structure by Parent id.
4. Remove dangling links.
5. Calculate the PR giving each page an initial value.
6. Iterate until convergence.
7. Add the dangling links.
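Step 4 can be sketched as below, assuming the link structure is already held in memory as a dictionary. The function name `remove_dangling` and the decision to repeat the pass until nothing changes (since removing one dangling link can expose another) are assumptions made for illustration.

```python
def remove_dangling(links):
    """Drop links that point to pages with no outgoing links.

    links: page -> iterable of pages it links to.
    Returns (pruned_links, removed), where removed lists the dropped
    (source, target) pairs so they can be added back after ranking.
    """
    pruned = {page: set(targets) for page, targets in links.items()}
    removed = []
    while True:
        # A link is dangling if its target has no remaining outgoing links.
        dangling = [
            (source, target)
            for source, targets in pruned.items()
            for target in targets
            if not pruned.get(target)
        ]
        if not dangling:
            break
        for source, target in dangling:
            pruned[source].discard(target)
            removed.append((source, target))
    return pruned, removed
```

Keeping the `removed` list around corresponds to step 7, where the dangling links are added back once the ranks of the remaining pages have converged.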

Page 34: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Example

[Figure: three linked pages, Page A, Page B, and Page C, with out-link counts $N_A = 2$, $N_B = 1$, $N_C = 1$.]

Which of these three has the highest page rank?

Page 35: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

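Example (cont.)

[Figure: Page A, Page B, and Page C with $N_A = 2$, $N_B = 1$, $N_C = 1$.]

$$
\begin{aligned}
\mathrm{Rank}(A) &= 0 + 0 + \frac{\mathrm{Rank}(C)}{1} \\
\mathrm{Rank}(B) &= \frac{\mathrm{Rank}(A)}{2} + 0 + 0 \\
\mathrm{Rank}(C) &= \frac{\mathrm{Rank}(A)}{2} + \frac{\mathrm{Rank}(B)}{1} + 0
\end{aligned}
$$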

Page 36: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

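Example (cont.)

Re-write the system of equations as a Matrix-Vector product.

$$
\begin{pmatrix} \mathrm{Rank}(A) \\ \mathrm{Rank}(B) \\ \mathrm{Rank}(C) \end{pmatrix}
=
\begin{pmatrix} 0 & 0 & 1 \\ \tfrac{1}{2} & 0 & 0 \\ \tfrac{1}{2} & 1 & 0 \end{pmatrix}
\begin{pmatrix} \mathrm{Rank}(A) \\ \mathrm{Rank}(B) \\ \mathrm{Rank}(C) \end{pmatrix}
$$

The PageRank vector is simply an eigenvector (scalar*vector = matrix*vector) of the coefficient matrix.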
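One way to check the eigenvector claim numerically is plain power iteration on the coefficient matrix of this example. The sketch below is illustrative only: the name `power_iteration`, the uniform starting vector, the fixed step count, and the convention of rescaling the ranks to sum to 1 are assumptions, not part of the slides.

```python
# Coefficient matrix of the three-page example (rows and columns ordered A, B, C).
M = [
    [0.0, 0.0, 1.0],  # Rank(A) = Rank(C) / 1
    [0.5, 0.0, 0.0],  # Rank(B) = Rank(A) / 2
    [0.5, 1.0, 0.0],  # Rank(C) = Rank(A) / 2 + Rank(B) / 1
]


def power_iteration(matrix, steps=200):
    """Repeatedly multiply a start vector by the matrix, rescaling to sum to 1."""
    n = len(matrix)
    r = [1.0 / n] * n
    for _ in range(steps):
        r = [sum(matrix[i][j] * r[j] for j in range(n)) for i in range(n)]
        total = sum(r)
        r = [x / total for x in r]
    return r


ranks = power_iteration(M)  # approximately [0.4, 0.2, 0.4] for A, B, C
```

Multiplying `M` by the resulting vector returns the same vector, which is what it means for the rank vector to be an eigenvector with eigenvalue 1.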

Page 37: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

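Example (cont.)

[Figure: Page A, Page B, and Page C with $N_A = 2$, $N_B = 1$, $N_C = 1$, each labeled with its PageRank.]

Solving the system with the ranks normalized to sum to 1 gives PageRank = 0.4 for Page A, PageRank = 0.2 for Page B, and PageRank = 0.4 for Page C.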

Page 38: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Example (cont.)

[Table: PR(A), PR(B), PR(C) for the graph of pages A, B, C over iterations 0 through 12, with d = 0.5.]

Page 39: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Convergence

The number of iterations PageRank needs to converge is roughly O(log(|V|)).

Page 40: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Other Applications

Help user decide if a site is trustworthy.

Estimate web traffic.

Spam detection and prevention.

Predict citation counts.

Page 41: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

Issues

Users are not random walkers.

Starting point distribution (actual usage data as starting vector).

Bias towards main pages.

Linkage spam.

No query specific rank.

Page 42: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

PageRank vs. HITS

PageRank (Google)

computed for all web pages stored in the database prior to the query

computes authorities only

trivial and fast to compute

HITS (CLEVER)

performed on the set of retrieved web pages for each query

computes authorities and hubs

easy to compute, but real-time execution is hard

Page 43: Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle.

References

“Authoritative Sources in a Hyperlinked Environment”, Jon Kleinberg, Cornell University. 1998.

“The PageRank Citation Ranking: Bringing Order to the Web”, Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, Stanford University. 1998.