PageRank and related algorithms
PageRank and HITS
Jacob Kogan
Department of Mathematics and Statistics
University of Maryland, Baltimore County
Baltimore, Maryland 21250
May 15, 2006
Basic References
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the web. Stanford Digital Library Technologies Project, 1998, citeseer.ist.psu.edu/page98pagerank.html.
J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), pp. 604-632, 1999.
P. Berkhin. A survey on PageRank computing. Internet Mathematics, 2(1), pp. 73–120, 2005.
Jacob Kogan, UMBC PageRank and related algorithms, optimization 2/21
PageRank
PageRank is a global “importance” ranking of every web page.
The method is based on the graph of the web. The model is inspired by academic citation analysis.
If a page is linked from an “important” page (the Yahoo home page, for example), then this link should contribute more to the page's “importance” than links from “obscure” pages.
The graph and the matrix
G(V,E) is a directed graph
V are the vertices/nodes (say n HTML pages)
E are the directed edges (hyperlinks)
The $n \times n$ adjacency matrix $A = (A_{ij})$:
\[
A_{ij} = \begin{cases} 1 & \text{if page } i \longrightarrow j, \\ 0 & \text{otherwise.} \end{cases}
\]
Transition matrix P
$P = (P_{ij})$, where
\[
P_{ij} = \frac{A_{ij}}{\mathrm{odeg}(i)}
\]
($\mathrm{odeg}(i)$, the out-degree of a node $i$, is the number of outgoing links), so that
\[
\sum_j P_{ij} = 1
\]
(P is row stochastic).
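As a minimal sketch of this construction, the following pure-Python snippet builds P from an adjacency matrix and checks the row sums. The 3-page web and the helper name `transition_matrix` are hypothetical, chosen only for illustration; leaving a row of zeros when the out-degree is 0 is an assumption made here to avoid division by zero.

```python
def transition_matrix(A):
    """Divide each row of the adjacency matrix A by its out-degree.

    Assumption: a row whose out-degree is 0 is kept as a zero row.
    """
    P = []
    for row in A:
        odeg = sum(row)  # number of outgoing links of this page
        P.append([a / odeg if odeg else 0.0 for a in row])
    return P


# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
A = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
P = transition_matrix(A)
row_sums = [sum(row) for row in P]
print(P)         # [[0.0, 0.5, 0.5], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
print(row_sums)  # [1.0, 1.0, 1.0]
```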
Random Surfer Model
A surfer travels along the directed graph G. $P_{ij}$, $j = 1, \dots, n$, is the probability that the surfer moves from node $i$ to node $j$.
If at step $k$ the probability of the surfer being located at node $i$ is $p^{(k)}_i$, so that
\[
p^{(k)} = \left( p^{(k)}_1, \dots, p^{(k)}_n \right),
\]
then
\[
p^{(k+1)} = P^T p^{(k)}.
\]
$p^{(k+1)}$ is a probability distribution!
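The update can be traced by hand on a small case. The sketch below applies it three times to a hypothetical 3-page row-stochastic matrix (the matrix and the helper name `step` are illustrative assumptions) and records each distribution; every entry of `history` sums to 1.

```python
def step(P, p):
    """Return q = P^T p, i.e. q[j] = sum_i P[i][j] * p[i]."""
    n = len(p)
    return [sum(P[i][j] * p[i] for i in range(n)) for j in range(n)]


# Hypothetical row-stochastic matrix for the web 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
P = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
p = [1.0, 0.0, 0.0]  # the surfer starts at node 0
history = [p]
for _ in range(3):
    p = step(P, p)
    history.append(p)

for k, dist in enumerate(history):
    print(k, dist, sum(dist))
# 0 [1.0, 0.0, 0.0] 1.0
# 1 [0.0, 0.5, 0.5] 1.0
# 2 [0.5, 0.0, 0.5] 1.0
# 3 [0.5, 0.25, 0.25] 1.0
```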
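$q = P^T p$
If $p = (p_1, \dots, p_n)$ with $p_i \ge 0$, $\sum_i p_i = 1$, and $q = (q_1, \dots, q_n)$, $q = P^T p$, then $q_j \ge 0$ and $\sum_j q_j = 1$. Indeed, $q_j = \sum_i P_{ij} p_i \ge 0$, and
\[
\sum_{j=1}^n q_j = \sum_{j=1}^n \left( \sum_{i=1}^n P_{ij} p_i \right) = \sum_{i=1}^n p_i \sum_{j=1}^n P_{ij} = \sum_{i=1}^n p_i = 1.
\]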
Dangling Pages
Pages that have no outgoing links are called dangling pages, or sinks, or attractors.
With dangling pages the transition matrix P has zero rows, and fails to be stochastic.
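A small numerical illustration of why this matters, under the assumption of a hypothetical 3-page web in which page 2 is dangling: one application of $q = P^T p$ loses exactly the probability mass that was sitting on the dangling page.

```python
def step(P, p):
    """Return q = P^T p, i.e. q[j] = sum_i P[i][j] * p[i]."""
    n = len(p)
    return [sum(P[i][j] * p[i] for i in range(n)) for j in range(n)]


# Hypothetical web: 0 -> 1, 0 -> 2, 1 -> 2; page 2 has no outgoing links
P = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],  # zero row: page 2 is dangling
]
dangling = [i for i, row in enumerate(P) if sum(row) == 0]

p = [0.25, 0.25, 0.5]
q = step(P, p)
lost = sum(p) - sum(q)

print(dangling)  # [2]
print(q)         # [0.0, 0.125, 0.375]
print(lost)      # 0.5, the mass p[2] that was on the dangling page
```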
PageRank
Definition. A PageRank vector is a non-negative stationary point of the transformation
\[
q = P^T p,
\]
that is, a vector $p \ge 0$ with $p = P^T p$ (a stationary distribution for a Markov chain).
What can be done in the presence of dangling pages?
What can be done?
removing the dangling pages,
renormalizing $P^T p^{(k+1)}$,
adding a self-link to each dangling page,
introducing an “ideal page” with a self-link, to which each dangling page links,
modifying the matrix P by introducing artificial links that uniformly connect dangling pages to all pages ($P' = P + d v^T$).
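The last option can be sketched in a few lines. The snippet assumes that $v$ is the uniform vector $(1/n, \dots, 1/n)$ and that $d$ is the 0/1 indicator of dangling pages; the helper name `fix_dangling` and the 4-page web are hypothetical.

```python
def fix_dangling(P):
    """Return P' = P + d v^T, assuming v is uniform and d marks zero rows."""
    n = len(P)
    v = [1.0 / n] * n
    d = [1.0 if sum(row) == 0 else 0.0 for row in P]  # 1 for dangling pages
    return [[P[i][j] + d[i] * v[j] for j in range(n)] for i in range(n)]


# Hypothetical 4-page web; page 2 has no outgoing links
P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
P_prime = fix_dangling(P)
print(P_prime[2])                     # [0.25, 0.25, 0.25, 0.25]
print([sum(row) for row in P_prime])  # [1.0, 1.0, 1.0, 1.0]
```

Only the zero row changes, so $P'$ is row stochastic while the rows of non-dangling pages are left untouched.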
PageRank
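\[
v = \begin{pmatrix} 1/n \\ \vdots \\ 1/n \end{pmatrix}, \qquad
d = \begin{pmatrix} \delta(\mathrm{odeg}(1), 0) \\ \vdots \\ \delta(\mathrm{odeg}(n), 0) \end{pmatrix}
\]
Here $\delta$ is the Kronecker delta, so $d_i = 1$ exactly when page $i$ is dangling.
Consider, for a constant $0 < c < 1$,
\[
P'' = c \left[ P + d v^T \right] + (1 - c)\, e v^T,
\]
where $e = (1, \dots, 1)^T$. Then
\[
y = P''^T x = c P^T x + c v \left( d^T x \right) + (1 - c) v \left( e^T x \right).
\]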
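The three-term expression can be checked numerically against the dense matrix. The sketch below builds $P''$ explicitly for a hypothetical 4-page web and compares $P''^T x$ with the formula that uses only $P$, $d$ and $v$. The value c = 0.85 is an assumption (a commonly used choice, not fixed by the text), and `transpose_times` is an illustrative helper.

```python
def transpose_times(M, x):
    """Return M^T x, i.e. y[j] = sum_i M[i][j] * x[i]."""
    n = len(x)
    return [sum(M[i][j] * x[i] for i in range(n)) for j in range(n)]


# Hypothetical 4-page web; page 2 is dangling
P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
n = len(P)
c = 0.85  # assumed value
v = [1.0 / n] * n
d = [1.0 if sum(row) == 0 else 0.0 for row in P]

# Dense matrix P'' = c [P + d v^T] + (1 - c) e v^T; entry (i, j) of e v^T is v[j]
P2 = [[c * (P[i][j] + d[i] * v[j]) + (1 - c) * v[j] for j in range(n)]
      for i in range(n)]

x = [0.1, 0.2, 0.3, 0.4]
y_dense = transpose_times(P2, x)

# The same vector from the three-term formula, touching only P, d and v
Px = transpose_times(P, x)
dTx = sum(di * xi for di, xi in zip(d, x))
eTx = sum(x)
y_formula = [c * Px[j] + c * v[j] * dTx + (1 - c) * v[j] * eTx
             for j in range(n)]

max_gap = max(abs(a - b) for a, b in zip(y_dense, y_formula))
print(max_gap < 1e-12)  # True
```

The point of the formula is that $P''$ is dense even when $P$ is sparse, so the dense matrix is something one would rather not form for a web-sized graph.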
PageRank computation
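Let $x$ be a vector in $\mathbb{R}^n$ with nonnegative entries, and let $P = (P_{ij})$ be an $n \times n$ matrix with nonnegative entries such that, for each $i$, either $\sum_j P_{ij} = 1$ or $\sum_j P_{ij} = 0$.
Let $d \in \mathbb{R}^n$ with $d_i = \delta(\mathrm{odeg}(i), 0)$. Then
\[
|P^T x| = |x| - d^T x
\]
(where $|y| = |y|_1 = |y_1| + \cdots + |y_n|$).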
PageRank computation
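\[
P^T x =
\begin{pmatrix}
P_{11} & P_{21} & \cdots & P_{n1} \\
P_{12} & P_{22} & \cdots & P_{n2} \\
\cdots & \cdots & \cdots & \cdots \\
P_{1n} & P_{2n} & \cdots & P_{nn}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
= x_1 \begin{pmatrix} P_{11} \\ P_{12} \\ \vdots \\ P_{1n} \end{pmatrix}
+ x_2 \begin{pmatrix} P_{21} \\ P_{22} \\ \vdots \\ P_{2n} \end{pmatrix}
+ \cdots
+ x_n \begin{pmatrix} P_{n1} \\ P_{n2} \\ \vdots \\ P_{nn} \end{pmatrix}
\]
Hence
\[
\left| P^T x \right| = x_1 \sum_j P_{1j} + x_2 \sum_j P_{2j} + \cdots + x_n \sum_j P_{nj}.
\]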
PageRank computation
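\[
\left| P^T x \right| = x_1 \Big( \sum_j P_{1j} \Big) + \cdots + x_n \Big( \sum_j P_{nj} \Big).
\]
\[
|x| - d^T x = x_1 + \cdots + x_n - \delta(\mathrm{odeg}(1), 0)\, x_1 - \cdots - \delta(\mathrm{odeg}(n), 0)\, x_n.
\]
The two right-hand sides agree because $\sum_j P_{ij} = 1 - \delta(\mathrm{odeg}(i), 0)$ for every $i$.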
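A quick numerical check of the identity $|P^T x| = |x| - d^T x$, on a hypothetical 4-page web with one dangling page. The second vector has mixed signs and shows that the identity relies on the entries of $x$ being nonnegative, since the derivation writes $|P^T x|$ as a plain sum; the helper names are illustrative.

```python
def l1(y):
    """1-norm |y| = |y_1| + ... + |y_n|."""
    return sum(abs(t) for t in y)


def transpose_times(M, x):
    """Return M^T x, i.e. y[j] = sum_i M[i][j] * x[i]."""
    n = len(x)
    return [sum(M[i][j] * x[i] for i in range(n)) for j in range(n)]


# Hypothetical 4-page web; page 2 is dangling
P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
d = [1.0 if sum(row) == 0 else 0.0 for row in P]

x = [1.0, 2.0, 3.0, 4.0]  # nonnegative entries
lhs = l1(transpose_times(P, x))
rhs = l1(x) - sum(di * xi for di, xi in zip(d, x))
print(lhs, rhs)  # 7.0 7.0

x_mixed = [1.0, -1.0, 0.0, 0.0]  # mixed signs: the identity no longer holds
lhs_mixed = l1(transpose_times(P, x_mixed))
rhs_mixed = l1(x_mixed) - sum(di * xi for di, xi in zip(d, x_mixed))
print(lhs_mixed, rhs_mixed)  # 1.0 2.0
```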
PageRank
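\[
y = P''^T x = c P^T x + \underbrace{c v \left( d^T x \right) + (1 - c) v \left( e^T x \right)}.
\]
For nonnegative $x$ we have $e^T x = |x|$, so the coefficient of $v$ in the underbraced terms is
\[
c \left( d^T x \right) + (1 - c) |x| = |x| - \left( c |x| - c \left( d^T x \right) \right) = |x| - \left| c P^T x \right|.
\]
Hence y can be computed as follows:
1. $y \longleftarrow c P^T x$,
2. $\gamma = |x| - |y|$,
3. $y \longleftarrow y + \gamma v$.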
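The three steps can be wrapped in a loop to give a simple PageRank iteration. This is a minimal sketch, not a tuned implementation: the damping value c = 0.85, the uniform starting vector, the 1-norm stopping test with tolerance `tol`, and the function name `pagerank` are all assumptions made for the example.

```python
def pagerank(P, c=0.85, tol=1e-10, max_iter=1000):
    """Iterate y = P''^T x using only P, following the three steps above.

    P may contain zero rows (dangling pages); v is taken to be uniform.
    """
    n = len(P)
    v = [1.0 / n] * n
    x = v[:]  # assumed starting vector
    for _ in range(max_iter):
        # Step 1: y <- c P^T x
        y = [c * sum(P[i][j] * x[i] for i in range(n)) for j in range(n)]
        # Step 2: gamma = |x| - |y| (entries are nonnegative, so |.| is a sum)
        gamma = sum(x) - sum(y)
        # Step 3: y <- y + gamma v
        y = [y[j] + gamma * v[j] for j in range(n)]
        # Assumed stopping criterion: small 1-norm change between iterates
        if sum(abs(y[j] - x[j]) for j in range(n)) < tol:
            return y
        x = y
    return x


# Hypothetical 4-page web; page 2 is dangling
P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
rank = pagerank(P)
print([round(r, 3) for r in rank])
```

In this example page 2 receives links from pages 0 and 1 and ends up with the largest score. Each iteration keeps $|y| = |x|$, because the mass removed in step 1 is exactly what step 3 redistributes along $v$.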
Hyperlink Induced Topic Search (HITS)
works with a subgraph specific to a particular query (rather than with the full graph),
computes two weights (authority and hub) for each webpage,
allows clustering of results for multi-topic or polarized queries.
Root and Focused Sets
Root set: The top t (around 200) results are retrieved for a given query (the results are picked according to a text-based relevance criterion).
Focused set: All pages pointed to by outlinks of the root set are added, along with up to d (about 50) pages corresponding to inlinks of each page in the root set.
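The expansion from root set to focused set can be sketched as follows. The link graph, the page names, and the helper `focused_set` are hypothetical. The text does not say which inlinks to keep when a page has more than d of them, so the sketch simply keeps the first d encountered, which is an assumption.

```python
def focused_set(root, outlinks, d=50):
    """Expand a root set: add all outlink targets and up to d inlink sources per root page."""
    # Invert the link graph to obtain, for each page, the pages that point to it
    inlinks = {}
    for page, targets in outlinks.items():
        for target in targets:
            inlinks.setdefault(target, []).append(page)

    focused = set(root)
    for page in root:
        focused.update(outlinks.get(page, []))     # every page the root page points to
        focused.update(inlinks.get(page, [])[:d])  # at most d pages pointing to it (assumed: first d)
    return focused


# Hypothetical link graph: page -> list of pages it links to
outlinks = {
    "r1": ["a", "b"],
    "r2": ["b"],
    "x": ["r1"],
    "y": ["r1"],
    "z": ["r1"],
    "w": ["a"],
}
root = ["r1", "r2"]  # assumed to come from a text-based search
focused = focused_set(root, outlinks, d=2)
print(sorted(focused))  # ['a', 'b', 'r1', 'r2', 'x', 'y']
```

With d = 2 only two of the three pages linking to `r1` are kept, and `w` is left out because it links to no root page.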
Hubs and Authorities
Define “authorities” and “hubs” as follows:
1. a page p is an authority if it is pointed to by many pages,
2. a page p is a hub if it points to many pages.
To measure the “authority” and the “hub” of the pages we consider $L_2$ unit-norm vectors a and h of dimension |V|, so that a[p] is the “authority” and h[p] is the “hub” of the page p.
Hubs and Authorities
The following is an iterative process that computes the vectors.
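1. set $t = 0$
2. assign initial values $a^{(t)}$ and $h^{(t)}$
3. normalize the vectors $a^{(t)}$ and $h^{(t)}$, so that
\[
\sum_p \left( a^{(t)}[p] \right)^2 = \sum_p \left( h^{(t)}[p] \right)^2 = 1
\]
4. set
\[
a^{(t+1)}[p] = \sum_{q \longrightarrow p} h^{(t)}[q], \quad \text{and} \quad h^{(t+1)}[p] = \sum_{p \longrightarrow q} a^{(t+1)}[q]
\]
5. if (stopping criterion fails) then increment t by 1, go to Step 3; else stop.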
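The iteration can be written out directly from the steps above. This is a sketch under stated assumptions: the initial vectors are uniform, the unspecified stopping criterion is taken to be a small maximum change between successive normalized iterates, and the graph is assumed to have at least one edge so that the norms are nonzero. The names `hits` and `normalize` and the 4-page graph are hypothetical.

```python
import math


def normalize(vec):
    """Scale vec to unit L2 norm (assumes vec is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def hits(A, tol=1e-12, max_iter=1000):
    """Return (authority, hub) vectors for adjacency matrix A (A[i][j] = 1 if i -> j)."""
    n = len(A)
    a = normalize([1.0] * n)  # assumed initial values
    h = normalize([1.0] * n)
    for _ in range(max_iter):
        # Step 4: a[p] sums h[q] over q -> p; h[p] sums the new a[q] over p -> q
        a_new = [sum(A[q][p] * h[q] for q in range(n)) for p in range(n)]
        h_new = [sum(A[p][q] * a_new[q] for q in range(n)) for p in range(n)]
        # Step 3: normalize both vectors
        a_new, h_new = normalize(a_new), normalize(h_new)
        # Assumed stopping criterion: largest entrywise change is below tol
        change = max(abs(u - w) for u, w in zip(a_new + h_new, a + h))
        a, h = a_new, h_new
        if change < tol:
            break
    return a, h


# Hypothetical graph: 0 -> 2, 0 -> 3, 1 -> 2, 3 -> 2
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
a, h = hits(A)
print([round(x, 4) for x in a])
print([round(x, 4) for x in h])
```

In this example page 2, which is pointed to by three pages, gets the largest authority weight, and page 0, which points to both linked-to pages, gets the largest hub weight.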
Adjacency Matrix
Let A be the adjacency matrix of the graph G , i.e.
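\[
A_{ij} = \begin{cases} 1 & \text{if page } i \longrightarrow j, \\ 0 & \text{otherwise.} \end{cases}
\]
Note that
\[
a^{(t+1)} = \frac{A^T h^{(t)}}{\left\| A^T h^{(t)} \right\|}, \quad \text{and} \quad h^{(t+1)} = \frac{A a^{(t+1)}}{\left\| A a^{(t+1)} \right\|}.
\]
This yields
\[
a^{(t+1)} = \frac{A^T A a^{(t)}}{\left\| A^T A a^{(t)} \right\|}, \quad \text{and} \quad h^{(t+1)} = \frac{A A^T h^{(t)}}{\left\| A A^T h^{(t)} \right\|}.
\]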
Eigenvectors
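\[
a^{(t)} = \frac{\left( A^T A \right)^t a^{(0)}}{\left\| \left( A^T A \right)^t a^{(0)} \right\|}, \quad \text{and} \quad h^{(t)} = \frac{\left( A A^T \right)^t h^{(0)}}{\left\| \left( A A^T \right)^t h^{(0)} \right\|}.
\]
Let v and w be unit eigenvectors corresponding to the maximal eigenvalues of the symmetric matrices $A^T A$ and $A A^T$, respectively. The above arguments lead to the following result:
\[
\lim_{t \to \infty} a^{(t)} = v, \qquad \lim_{t \to \infty} h^{(t)} = w.
\]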
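The limit can be checked on a small example by running the power iteration on $A^T A$ and testing the eigenvector equation. The sketch assumes a hypothetical 4-page graph whose matrix $A^T A$ has a simple largest eigenvalue, $2 + \sqrt{2}$, and a starting vector that is not orthogonal to the corresponding eigenvector; convergence of the power method in general relies on such conditions. The helper name `power_iteration` and the step count are illustrative.

```python
import math


def power_iteration(M, steps=100):
    """Repeatedly apply M and renormalize in the L2 norm, starting from the all-ones vector."""
    n = len(M)
    v = [1.0] * n
    for _ in range(steps):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    return v


# Hypothetical graph: 0 -> 2, 0 -> 3, 1 -> 2, 3 -> 2
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
n = len(A)
# M = A^T A, with entries M[i][j] = sum_k A[k][i] * A[k][j]
M = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
     for i in range(n)]

v = power_iteration(M)
Mv = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
lam = sum(vi * mi for vi, mi in zip(v, Mv))  # Rayleigh quotient v^T M v
residual = math.sqrt(sum((mi - lam * vi) ** 2 for vi, mi in zip(v, Mv)))

print(round(lam, 6))    # 3.414214, i.e. 2 + sqrt(2)
print(residual < 1e-9)  # True: v satisfies M v = lam v up to rounding
```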