
PageRank and related algorithms

PageRank and HITS

Jacob Kogan

Department of Mathematics and Statistics

University of Maryland, Baltimore County

Baltimore, Maryland 21250


May 15, 2006

Basic References

L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation index: bringing order to the web. Stanford Digital Library Technologies Project, 1998, citeseer.ist.psu.edu/page98pagerank.html.

J. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46:5, pp. 604–632, 1999.

P. Berkhin. A survey on PageRank computing. Internet Mathematics, vol. 2, no. 1, pp. 73–120, 2005.


PageRank

PageRank is a global “importance” ranking of every web page.

The method is based on the graph of the web. The model is inspired by academic citation analysis.

If a page is linked from an “important” page (the Yahoo home page, for example), then this link should contribute more to the page's “importance” than links from “obscure” pages.


The graph and the matrix

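$G(V,E)$ is a directed graph.

$V$ is the set of vertices/nodes (say, $n$ HTML pages).

$E$ is the set of directed edges (hyperlinks).

The $n \times n$ adjacency matrix is $A = (A_{ij})$, where

\[
A_{ij} =
\begin{cases}
1 & \text{if page } i \longrightarrow j, \\
0 & \text{otherwise.}
\end{cases}
\]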


Transition matrix P

$P = (P_{ij})$, where

\[
P_{ij} = \frac{A_{ij}}{\operatorname{odeg}(i)}
\]

($\operatorname{odeg}(i)$, the out-degree of node $i$, is the number of outgoing links), so that

\[
\sum_{j} P_{ij} = 1
\]

($P$ is row stochastic).
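As a small illustration of the two definitions above, the following Python sketch builds $A$ and $P$ from an edge list. The function name `transition_matrix`, the 0-based node numbering, and the three-node example graph are assumptions made for this illustration and do not come from the slides. Rows of nodes without outgoing links are simply left as zeros here.

```python
def transition_matrix(n, edges):
    """Return (A, P) for a directed graph on nodes 0..n-1.

    edges is a list of (i, j) pairs meaning "page i links to page j".
    """
    # Adjacency matrix: A[i][j] = 1 if i -> j, else 0.
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1

    # Transition matrix: P[i][j] = A[i][j] / odeg(i).
    P = []
    for row in A:
        odeg = sum(row)  # out-degree of this node
        # Assumption: a node with odeg == 0 keeps an all-zero row.
        P.append([a / odeg if odeg else 0.0 for a in row])
    return A, P


# Hypothetical example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
A, P = transition_matrix(3, edges)
row_sums = [sum(row) for row in P]
```

Every node in this example has at least one outgoing link, so each entry of `row_sums` equals 1.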


Random Surfer Model

A surfer travels along the directed graph $G$. $P_{ij}$, $j = 1, \ldots, n$, is the probability that the surfer moves from node $i$ to node $j$.

If at step $k$ the probability of the surfer being located at node $i$ is $p^{(k)}_i$, so that

\[
p^{(k)} = \left( p^{(k)}_1, \ldots, p^{(k)}_n \right),
\]

then

\[
p^{(k+1)} = P^T p^{(k)}.
\]

$p^{(k+1)}$ is a probability distribution!


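$q = P^T p$

If $p = (p_1, \ldots, p_n)$, $p_i \ge 0$, $\sum p_i = 1$, and $q = (q_1, \ldots, q_n)$, $q = P^T p$, then $q_i \ge 0$, $\sum q_i = 1$.

\[
\sum_{i=1}^{n} q_i
= \sum_{j=1}^{n} \left( \sum_{i=1}^{n} P_{ij} p_i \right)
= \sum_{i=1}^{n} p_i \sum_{j=1}^{n} P_{ij}
= \sum_{i=1}^{n} p_i = 1.
\]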
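The statement above can be checked numerically. The sketch below applies $q = P^T p$ three times on a small row-stochastic matrix and keeps track of the result. The helper name `step` and the example matrix are assumptions chosen for illustration only.

```python
def step(P, p):
    """One surfer step: q = P^T p, i.e. q[j] = sum over i of P[i][j] * p[i]."""
    n = len(p)
    return [sum(P[i][j] * p[i] for i in range(n)) for j in range(n)]


# Hypothetical row-stochastic matrix for the graph 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

# The surfer starts at node 0 with probability 1.
p = [1.0, 0.0, 0.0]
for _ in range(3):
    p = step(P, p)

total = sum(p)  # stays equal to 1 because every row of P sums to 1
```

After three steps the vector is still nonnegative and its entries still add up to 1, as the derivation predicts.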




Dangling Pages

Pages that have no outgoing links are called dangling pages, or sinks, or attractors.

With dangling pages the transition matrix $P$ has zero rows and fails to be stochastic.
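To make the failure concrete, the following sketch uses a hypothetical three-node graph in which node 1 has no outgoing links. It marks the dangling nodes and shows that one step $q = P^T p$ no longer preserves the total probability. The example graph and the variable names are assumptions for illustration.

```python
# Hypothetical graph: 0 -> 1, 0 -> 2, 2 -> 0; node 1 has no out-links.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0],   # zero row: node 1 is dangling
     [1.0, 0.0, 0.0]]
n = len(P)

# d[i] = 1 if node i is dangling (its row of P is zero), else 0.
d = [1 if sum(row) == 0 else 0 for row in P]

# One step q = P^T p starting from the uniform distribution.
p = [1.0 / n] * n
q = [sum(P[i][j] * p[i] for i in range(n)) for j in range(n)]

# Probability mass that sat on the dangling node is not passed on.
lost = sum(p) - sum(q)
```

Here `lost` equals the probability that was located at the dangling node, which is one third for the uniform starting vector.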


PageRank

Definition. A PageRank vector is a non-negative stationary point of the transformation

\[
q = P^T p
\]

(a stationary distribution for a Markov chain).

What can be done in the presence of dangling pages?




What can be done?

removal of dangling pages,

renormalization of $P^T p^{(k+1)}$,

to add a self link to each dangling page,

to introduce an “ideal page” with a self link to each dangling page,

to modify the matrix $P$ by introducing artificial links that uniformly connect dangling pages to pages ($P' = P + d v^T$).


PageRank

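\[
v = \begin{pmatrix} \frac{1}{n} \\ \vdots \\ \frac{1}{n} \end{pmatrix},
\qquad
d = \begin{pmatrix} \delta(\operatorname{odeg}(1), 0) \\ \vdots \\ \delta(\operatorname{odeg}(n), 0) \end{pmatrix}
\]

Here $\delta$ is the Kronecker delta, so $d_i = 1$ exactly when page $i$ is dangling. Consider

\[
P'' = c \left[ P + d v^T \right] + (1 - c)\, e v^T,
\]

where $e$ is the vector of all ones. Then

\[
y = (P'')^T x = c P^T x + c v \left( d^T x \right) + (1 - c) v \left( e^T x \right).
\]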
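A minimal sketch of this construction, written for a small dense matrix, is given below. The function name `modified_matrix`, the example graph, and the value $c = 0.85$ are assumptions for illustration; the slides do not fix a value of $c$. The sketch also assumes that $e$ is the all-ones vector, so that $e v^T$ has every row equal to $v^T$.

```python
def modified_matrix(P, c):
    """Return P'' = c (P + d v^T) + (1 - c) e v^T as a dense list of rows."""
    n = len(P)
    v = [1.0 / n] * n                                  # uniform vector v
    d = [1.0 if sum(row) == 0 else 0.0 for row in P]   # dangling indicator d
    return [[c * (P[i][j] + d[i] * v[j]) + (1.0 - c) * v[j]
             for j in range(n)]
            for i in range(n)]


# Hypothetical graph with a dangling node (node 1).
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]

M = modified_matrix(P, c=0.85)
row_sums = [sum(row) for row in M]
```

In this example every row of `M` sums to 1 and every entry is positive, so the modified matrix is row stochastic even though $P$ is not.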


PageRank computation

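Let $x$ be a vector in $\mathbb{R}^n$ with nonnegative entries, and let $P = (P_{ij})$ be an $n \times n$ matrix with nonnegative entries such that

\[
\text{either } \sum_{j} P_{ij} = 1, \text{ or } \sum_{j} P_{ij} = 0.
\]

Let $d \in \mathbb{R}^n$ be such that $d_i = \delta(\operatorname{odeg}(i), 0)$. Then

\[
|P^T x| = |x| - d^T x
\]

(where $|y| = |y|_1 = |y_1| + \cdots + |y_n|$).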



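\[
P^T x =
\begin{pmatrix}
P_{11} & P_{21} & \cdots & P_{n1} \\
P_{12} & P_{22} & \cdots & P_{n2} \\
\cdots & \cdots & \cdots & \cdots \\
P_{1n} & P_{2n} & \cdots & P_{nn}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \cdots \\ x_n \end{pmatrix}
= x_1 \begin{pmatrix} P_{11} \\ P_{12} \\ \cdots \\ P_{1n} \end{pmatrix}
+ x_2 \begin{pmatrix} P_{21} \\ P_{22} \\ \cdots \\ P_{2n} \end{pmatrix}
+ \cdots
+ x_n \begin{pmatrix} P_{n1} \\ P_{n2} \\ \cdots \\ P_{nn} \end{pmatrix}
\]

Hence

\[
\left| P^T x \right| = x_1 \sum_{j} P_{1j} + x_2 \sum_{j} P_{2j} + \cdots + x_n \sum_{j} P_{nj}.
\]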






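\[
\left| P^T x \right| = x_1 \Big( \sum_{j} P_{1j} \Big) + \cdots + x_n \Big( \sum_{j} P_{nj} \Big).
\]

\[
|x| - d^T x = x_1 + \cdots + x_n - \delta(\operatorname{odeg}(1), 0)\, x_1 - \cdots - \delta(\operatorname{odeg}(n), 0)\, x_n.
\]

Each row sum $\sum_j P_{ij}$ is 1 for a non-dangling page and 0 for a dangling page, that is, $\sum_j P_{ij} = 1 - \delta(\operatorname{odeg}(i), 0)$, so the two expressions agree.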
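A small worked instance may help to see the identity in action. The matrix and the vector below are hypothetical values chosen for illustration (three pages, with page 2 dangling); they are not taken from the slides.

```latex
\[
P = \begin{pmatrix} 0 & \tfrac12 & \tfrac12 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix},
\qquad
x = \begin{pmatrix} 0.2 \\ 0.5 \\ 0.3 \end{pmatrix},
\qquad
d = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}
\]
\[
P^T x
= 0.2 \begin{pmatrix} 0 \\ \tfrac12 \\ \tfrac12 \end{pmatrix}
+ 0.5 \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}
+ 0.3 \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}
= \begin{pmatrix} 0.3 \\ 0.1 \\ 0.1 \end{pmatrix},
\qquad
\left| P^T x \right| = 0.5
\]
\[
|x| - d^T x = (0.2 + 0.5 + 0.3) - 0.5 = 0.5
\]
```

Both sides equal 0.5: the mass 0.5 that sat on the dangling page is exactly what is missing from $|P^T x|$.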


PageRank

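\[
y = (P'')^T x = c P^T x + \underbrace{c v \left( d^T x \right) + (1 - c) v \left( e^T x \right)}_{\gamma v}.
\]

For $x$ with nonnegative entries we have $e^T x = |x|$, so the coefficient $\gamma$ of $v$ is

\[
\gamma = c \left( d^T x \right) + (1 - c) |x|
= |x| - \left( c |x| - c \left( d^T x \right) \right)
= |x| - \left| c P^T x \right|.
\]

Hence $y$ can be computed as follows:

1. $y \longleftarrow c P^T x$,

2. $\gamma = |x| - |y|$,

3. $y \longleftarrow y + \gamma v$.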
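The three steps above can be repeated until the vector stops changing. The sketch below is one possible way to do this in plain Python for a small dense matrix. The function name `pagerank`, the default $c = 0.85$, the uniform starting vector, and the $L_1$ stopping test with tolerance `tol` are all assumptions made for this illustration and are not prescribed by the slides.

```python
def pagerank(P, c=0.85, tol=1e-10, max_iter=1000):
    """Iterate y = (P'')^T x using the three-step update, without forming P''."""
    n = len(P)
    v = [1.0 / n] * n   # uniform vector v
    x = v[:]            # assumed starting vector: uniform distribution
    for _ in range(max_iter):
        # Step 1: y <- c P^T x
        y = [c * sum(P[i][j] * x[i] for i in range(n)) for j in range(n)]
        # Step 2: gamma = |x| - |y| (all entries are nonnegative, so |.| is a plain sum)
        gamma = sum(x) - sum(y)
        # Step 3: y <- y + gamma v
        y = [y[j] + gamma * v[j] for j in range(n)]
        # Assumed stopping criterion: small L1 change between iterates.
        if sum(abs(y[j] - x[j]) for j in range(n)) < tol:
            return y
        x = y
    return x


# Hypothetical graph with a dangling node (node 1): 0 -> 1, 0 -> 2, 2 -> 0.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
ranks = pagerank(P)
```

The point of the three-step form is that only products with the sparse matrix $P$ are needed; the dense matrix $P''$ is never built.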


Hyperlink Induced Topic Search (HITS)

works with a subgraph specific to a particular query (rather than with the full graph),

computes two weights (authority and hub) for each webpage,

allows clustering of results for multi-topic or polarized queries.


Root and Focused Sets

Root set: The top $t$ (around 200) results are recalled for a given query (the results are picked according to a text-based relevance criterion).

Focused set: All pages pointed to by the out-links of the root set are added, along with up to $d$ (about 50) pages corresponding to the in-links of each page in the root set.
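The expansion from the root set to the focused set can be sketched as follows. The function name `focused_set` and the dictionaries `out_links` and `in_links` are hypothetical stand-ins for whatever link index is available. The choice to keep the first $d$ in-links of each page is also an assumption, since the description above does not say which in-links are selected.

```python
def focused_set(root, out_links, in_links, d=50):
    """Expand a root set of pages into a focused set.

    out_links[p] lists the pages that p points to;
    in_links[p] lists the pages that point to p.
    """
    focused = set(root)
    for page in root:
        # Add every page that the root page points to.
        focused.update(out_links.get(page, []))
        # Add at most d pages that point to the root page
        # (assumption: simply the first d in the given order).
        focused.update(list(in_links.get(page, []))[:d])
    return focused


# Tiny hypothetical example with d = 2.
root = ["a", "b"]
out_links = {"a": ["c"], "b": ["c", "d"]}
in_links = {"a": ["e", "f", "g"], "b": []}
result = focused_set(root, out_links, in_links, d=2)
```

In this example page `g` is left out because page `a` already contributed its two allowed in-links.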


Hubs and Authorities

Define “authorities” and “hubs” as follows:

1. a page $p$ is an authority if it is pointed to by many pages,

2. a page $p$ is a hub if it points to many pages.

To measure the “authority” and the “hub” of the pages we consider $L_2$ unit norm vectors $a$ and $h$ of dimension $|V|$, so that $a[p]$ is the “authority” and $h[p]$ is the “hub” of the page $p$.


Hubs and Authorities

The following is an iterative process that computes the vectors.

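1. set $t = 0$

2. assign initial values $a^{(t)}$ and $h^{(t)}$

3. normalize the vectors $a^{(t)}$ and $h^{(t)}$, so that

\[
\sum_{p} \left( a^{(t)}[p] \right)^2 = \sum_{p} \left( h^{(t)}[p] \right)^2 = 1
\]

4. set

\[
a^{(t+1)}[p] = \sum_{q \longrightarrow p} h^{(t)}[q],
\quad \text{and} \quad
h^{(t+1)}[p] = \sum_{p \longrightarrow q} a^{(t+1)}[q]
\]

5. if (stopping criterion fails) then increment $t$ by 1, goto Step 3; else stop.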
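The iteration above can be sketched in a few lines of Python. The function names `normalize` and `hits`, the all-ones starting vectors, and the fixed number of iterations used in place of a stopping criterion are assumptions for illustration. The sketch also assumes the graph has at least one edge, so that the norms are nonzero.

```python
import math


def normalize(x):
    """Scale x to unit L2 norm (assumes x is not the zero vector)."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return [xi / norm for xi in x]


def hits(A, iterations=50):
    """Return (a, h): authority and hub vectors for adjacency matrix A."""
    n = len(A)
    a = normalize([1.0] * n)
    h = normalize([1.0] * n)
    for _ in range(iterations):
        # a[p] = sum of h[q] over all q that point to p
        a = [sum(A[q][p] * h[q] for q in range(n)) for p in range(n)]
        # h[p] = sum of the new a[q] over all q that p points to
        h = [sum(A[p][q] * a[q] for q in range(n)) for p in range(n)]
        a, h = normalize(a), normalize(h)
    return a, h


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
a, h = hits(A)
```

In this example node 2, which is pointed to by both other nodes, receives the largest authority weight, and node 0, which points to both other nodes, receives the largest hub weight.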


Adjacency Matrix

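Let $A$ be the adjacency matrix of the graph $G$, i.e.

\[
A_{ij} =
\begin{cases}
1 & \text{if page } i \longrightarrow j, \\
0 & \text{otherwise.}
\end{cases}
\]

Note that

\[
a^{(t+1)} = \frac{A^T h^{(t)}}{\left\| A^T h^{(t)} \right\|},
\quad \text{and} \quad
h^{(t+1)} = \frac{A a^{(t+1)}}{\left\| A a^{(t+1)} \right\|}.
\]

This yields

\[
a^{(t+1)} = \frac{A^T A\, a^{(t)}}{\left\| A^T A\, a^{(t)} \right\|},
\quad \text{and} \quad
h^{(t+1)} = \frac{A A^T h^{(t)}}{\left\| A A^T h^{(t)} \right\|}.
\]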


Eigenvectors

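\[
a^{(t)} = \frac{\left( A^T A \right)^t a^{(0)}}{\left\| \left( A^T A \right)^t a^{(0)} \right\|},
\quad \text{and} \quad
h^{(t)} = \frac{\left( A A^T \right)^t h^{(0)}}{\left\| \left( A A^T \right)^t h^{(0)} \right\|}.
\]

Let $v$ and $w$ be unit eigenvectors corresponding to the maximal eigenvalues of the symmetric matrices $A^T A$ and $A A^T$, respectively. The above arguments lead to the following result:

\[
\lim_{t \to \infty} a^{(t)} = v, \qquad \lim_{t \to \infty} h^{(t)} = w.
\]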
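As a hedged illustration of this result, consider a hypothetical three-page graph with links $1 \to 2$, $1 \to 3$, and $2 \to 3$ (an example chosen here, not taken from the slides). The limits are the usual power-method limits, which assumes that the maximal eigenvalue is simple and that the starting vectors are not orthogonal to $v$ and $w$; both hold in this example for positive starting vectors.

```latex
\[
A = \begin{pmatrix} 0 & 1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix},
\qquad
A^T A = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix},
\qquad
A A^T = \begin{pmatrix} 2 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}
\]
\[
\lambda_{\max} = \frac{3 + \sqrt{5}}{2} \approx 2.618
\quad \text{for both matrices}
\]
\[
v \propto \begin{pmatrix} 0 \\ 1 \\ \frac{1 + \sqrt{5}}{2} \end{pmatrix}
\approx \begin{pmatrix} 0 \\ 0.526 \\ 0.851 \end{pmatrix},
\qquad
w \propto \begin{pmatrix} \frac{1 + \sqrt{5}}{2} \\ 1 \\ 0 \end{pmatrix}
\approx \begin{pmatrix} 0.851 \\ 0.526 \\ 0 \end{pmatrix}
\]
```

Page 3, which is pointed to by both other pages, has the largest authority weight, and page 1, which points to both other pages, has the largest hub weight.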

