Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle
Overview
Problem Web as a Graph HITS PageRank Comparison
Problem
Specific queries suffer from the scarcity problem (few pages contain the needed information).
Broad-topic queries suffer from the abundance problem (far too many relevant pages).
Goal: to find the smallest set of "authoritative" sources.
Web as a Graph
Web pages as nodes of a graph. Links as directed edges.
[Figure: example graph with nodes "my page", www.uta.edu, and www.google.com connected by directed edges]
Link Structure of the Web Forward links (out-edges). Backward links (in-edges). Approximation of importance/quality:
a page may be of high quality if it is referred to by many other pages, and by pages of high quality.
HITS
HITS (Hyperlinked-Induced Topic Search)
“Authoritative Sources in a Hyperlinked Environment”, Jon Kleinberg, Cornell University. 1998.
Authorities and Hubs
Authority is a page which has relevant information about the topic.
Hub is a page which has collection of links to pages about that topic.
[Figure: a hub h pointing to authorities a1 through a4]
Authorities and Hubs (cont.) Good hubs are the ones that point to
good authorities. Good authorities are the ones that
are pointed to by good hubs.
[Figure: hubs h1 through h5 pointing to authorities a1 through a6]
Finding Authorities and Hubs
First, construct a focused sub-graph of the WWW.
Second, compute hubs and authorities from that sub-graph.
Construction of Sub-graph
[Figure: a topic query is sent to a search engine, producing a root set of pages; a crawler follows forward links to build the expanded set]
Root Set and Base Set
Use the query term to collect a root set of pages from a text-based search engine (AltaVista).
Root Set and Base Set (cont.)
Expand the root set into a base set by including (up to a designated size cut-off):
All pages linked to by pages in the root set.
All pages that link to a page in the root set.
[Figure: the root set surrounded by the larger base set]
Hubs & Authorities Calculation
Iterative algorithm on the base set: authority weights a(p) and hub weights h(p).
Set authority weights a(p) = 1 and hub weights h(p) = 1 for all pages p.
Repeat the following two operations (and then re-normalize a and h to have unit norm):

a(p) = Σ_{q : q points to p} h(q)
h(p) = Σ_{q : p points to q} a(q)
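The two update operations can be sketched in a few lines of Python (a minimal illustration, not Kleinberg's implementation; the example graph in the usage note below is invented):

```python
import math

def hits(graph, iterations=20):
    """graph: dict mapping each page to the list of pages it links to."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    auth = {p: 1.0 for p in pages}   # a(p) = 1 for all p
    hub = {p: 1.0 for p in pages}    # h(p) = 1 for all p
    for _ in range(iterations):
        # a(p) = sum of h(q) over all pages q that point to p
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, []))
                for p in pages}
        # h(p) = sum of a(q) over all pages q that p points to
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        # re-normalize both weight vectors to unit norm
        for w in (auth, hub):
            norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
            for p in w:
                w[p] /= norm
    return auth, hub
```

For example, with graph = {"h1": ["a1", "a2"], "h2": ["a1"]}, page a1 receives the highest authority weight (two hubs point to it) and h1 the highest hub weight.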
Example
[Figure: a small example graph after the first iteration; every node has hub weight 0.45 and authority weight 0.45]
Example (cont.)
[Figure: the same graph after the next iteration; the weights begin to separate, e.g. hub 0.9 / authority 0.45, hub 1.35 / authority 0.9, hub 0.45 / authority 0.9]
Algorithmic Outcome
Repeated multiplication (power iteration) converges to the principal eigenvector from any "non-degenerate" initial vector.
Hubs and authorities emerge as the outcome of this process.
The principal eigenvector contains the highest-scoring hubs and authorities.
Results
Although HITS is purely link-based (it completely disregards page content), results were quite good on many tested queries.
When the authors tested the query "search engines", the algorithm returned Yahoo!, Excite, Magellan, Lycos, and AltaVista.
However, none of these pages described themselves as a "search engine" (at the time of the experiment).
Issues
From a narrow topic, HITS tends to drift toward a more general one.
The nature of hub pages, with their many links, can cause this drift: a single hub may point to authorities on several different topics.
Pages from a single domain/website can dominate the result if they all point to one page, which is not necessarily a good authority.
Possible Enhancements
Use weighted sums in the link calculation.
Take advantage of "anchor text", the text surrounding the link itself.
Break hubs into smaller pieces and analyze each piece separately, instead of treating the whole hub page as one.
Disregard or minimize the influence of links within a single domain.
IBM expanded HITS into Clever, but it was not seen as viable for a real-time search engine.
PageRank
“The PageRank Citation Ranking: Bringing Order to the Web”, Lawrence Page and Sergey Brin, Stanford University. 1998.
Basic Idea
Back-links coming from important pages convey more importance to a page. For example, if a web page has a link off the Yahoo! home page, it may be just one link, but it is a very important one.
A page has high rank if the sum of the ranks of its back-links is high. This covers both the case when a page has many back-links and when a page has a few highly ranked back-links.
Definition
My page's rank is the sum of the (normalized) ranks of all the pages pointing to me:

Rank(u) = Σ_{v ∈ B_u} Rank(v) / N_v

where N_v is the number of links from v, and B_u is the set of pages with links to u.
Simplified PageRank Example
Rank(u) = c Σ_{v ∈ B_u} Rank(v) / N_v, where Rank(u) is the rank of page u, and c is a normalization constant (c < 1 to cover for pages with no outgoing links).
Expanded Definition
R(u): page rank of page u
c: factor used for normalization (< 1)
B_u: set of pages pointing to u
N_v: number of outbound links of v
R(v): page rank of page v that points to u
E(u): distribution of web pages to which a random surfer periodically jumps (set to 0.15)

R(u) = c Σ_{v ∈ B_u} R(v) / N_v + c E(u)
Problem 1 - Rank Sink
A cycle of pages reached by some incoming link, but with no links out of the cycle.
The loop will accumulate rank but never distribute it.
Problem 2 - Dangling Links In general, many Web pages do not have either back links or
forward links.
Dangling links do not affect the ranking of any other page directly, so they are removed until all the PageRanks are calculated.
Random Surfer Model
PageRank corresponds to the probability distribution of a random walk on the web graph.
Solution – Escape Term
The escape term E(u) can be thought of as the random surfer periodically getting bored and jumping to a different page, rather than staying in a loop forever.
E is a vector over all web pages that accounts for each page's escape probability (a user-defined parameter).

R(u) = c Σ_{v ∈ B_u} R(v) / N_v + c E(u)
PageRank Computation
R_0 <- S                             (initialize vector over web pages)
Loop:
  R_{i+1} <- A^T R_i                 (new ranks: sum of normalized back-link ranks)
  d <- ||R_i||_1 - ||R_{i+1}||_1     (compute normalizing factor)
  R_{i+1} <- R_{i+1} + d E           (add escape term)
  delta <- ||R_{i+1} - R_i||_1       (control parameter)
While delta > epsilon                (stop when converged)
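The loop can be sketched as follows (an illustrative Python version with a uniform escape vector E, not the original implementation; the example graph in the usage note is invented):

```python
def pagerank(graph, escape=0.15, eps=1e-8):
    """graph: dict mapping each page to the list of pages it links to."""
    pages = sorted(set(graph) | {q for ts in graph.values() for q in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}       # R0: uniform starting vector
    while True:
        new = {p: 0.0 for p in pages}
        for v in pages:
            targets = graph.get(v, [])
            if targets:
                # distribute v's rank evenly over its forward links
                for u in targets:
                    new[u] += rank[v] / len(targets)
            else:
                # dangling page: spread its rank uniformly
                for u in pages:
                    new[u] += rank[v] / n
        # escape term: jump to a random page with probability `escape`
        new = {p: (1 - escape) * new[p] + escape / n for p in pages}
        delta = sum(abs(new[p] - rank[p]) for p in pages)  # L1 change
        rank = new
        if delta < eps:                      # stop when converged
            return rank
```

For graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}, the returned ranks sum to 1, with C ranked highest, then A, then B.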
Matrices
A is designated to be a matrix whose rows and columns correspond to web pages u and v, with A_{u,v} = 1/N_u if there is an edge from u to v, and 0 otherwise.
With R a vector over all the web pages, the PageRank vector is the dominant eigenvector of this matrix: the one associated with the maximal eigenvalue.
Example
[Figure: the link matrices A^T and A for a small example graph, together with the rank vector R and its normalized form]
A x = λ x, so |A - λI| x = 0.
R = c A R, i.e. R is an eigenvector of A, with c the corresponding scaling constant.
Implementation
1. Map each URL to an id.
2. Store each hyperlink in a database.
3. Sort the link structure by parent id.
4. Remove dangling links.
5. Calculate the PageRank, giving each page an initial value.
6. Iterate until convergence.
7. Add back the dangling links.
Example
[Figure: Page A links to pages B and C (N_A = 2); Page B links to Page C (N_B = 1); Page C links to Page A (N_C = 1)]
Which of these three has the highest page rank?

Rank(A) = Rank(C) / 1
Rank(B) = Rank(A) / 2
Rank(C) = Rank(A) / 2 + Rank(B) / 1
Example (cont.)
Re-write the system of equations as a matrix-vector product:

| Rank(A) |   |  0   0   1 | | Rank(A) |
| Rank(B) | = | 1/2  0   0 | | Rank(B) |
| Rank(C) |   | 1/2  1   0 | | Rank(C) |

The PageRank vector is simply an eigenvector (scalar*vector = matrix*vector) of the coefficient matrix.
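As a quick check (a sketch, not part of the slides), one can verify that the normalized rank vector R = (0.4, 0.2, 0.4) for pages A, B, C is a fixed point of this matrix:

```python
# Coefficient matrix from the three-page example (rows: A, B, C).
A = [
    [0.0, 0.0, 1.0],   # Rank(A) = Rank(C)
    [0.5, 0.0, 0.0],   # Rank(B) = Rank(A)/2
    [0.5, 1.0, 0.0],   # Rank(C) = Rank(A)/2 + Rank(B)
]
R = [0.4, 0.2, 0.4]    # normalized so the ranks sum to 1

AR = [sum(A[i][j] * R[j] for j in range(3)) for i in range(3)]
print(AR)  # prints [0.4, 0.2, 0.4]: R is an eigenvector with eigenvalue 1
```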
Example (cont.)
[Figure: the same three-page graph with the solved ranks]
Page A: PageRank = 0.4
Page C: PageRank = 0.4
Page B: PageRank = 0.2
Example (cont.)
[Figure: iteration table of PR(A), PR(B), PR(C) for the same graph of pages A, B, C, computed with damping factor d = 0.5]
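The iteration can be reproduced with a short sketch (the damping value d = 0.5 is from the slide; the non-normalized update rule PR(p) = (1 - d) + d Σ_{q->p} PR(q)/N_q is an assumption here, as the slide only shows the resulting table):

```python
# Iterating PR(p) = (1 - d) + d * sum(PR(q)/N_q for q linking to p)
# on the three-page graph (A -> B, A -> C, B -> C, C -> A) with d = 0.5.
d = 0.5
pr = {"A": 1.0, "B": 1.0, "C": 1.0}   # initial values
for _ in range(50):
    pr = {
        "A": (1 - d) + d * pr["C"],                   # C -> A, N_C = 1
        "B": (1 - d) + d * pr["A"] / 2,               # A -> B, N_A = 2
        "C": (1 - d) + d * (pr["A"] / 2 + pr["B"]),   # A -> C and B -> C
    }
print({p: round(v, 3) for p, v in pr.items()})
# prints {'A': 1.077, 'B': 0.769, 'C': 1.154}
```

Under this rule the ranks converge to PR(A) = 14/13, PR(B) = 10/13, PR(C) = 15/13, which sum to the number of pages rather than to 1.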
Example (cont.)
Convergence
On realistic link graphs, the PageRank computation converges in roughly O(log |V|) iterations.
Other Applications
Help users decide if a site is trustworthy.
Estimate web traffic. Spam detection and prevention. Predict citation counts.
Issues
Users are not random walkers.
Starting point distribution (actual usage data as the starting vector).
Bias towards main pages.
Link spam.
No query-specific rank.
PageRank vs. HITS
PageRank (Google):
computed for all web pages stored in the database, prior to the query
computes authorities only
trivial and fast to compute
HITS (CLEVER):
performed on the set of retrieved web pages, for each query
computes authorities and hubs
easy to compute, but real-time execution is hard
References
“Authoritative Sources in a Hyperlinked Environment”, Jon Kleinberg, Cornell University. 1998.
“The PageRank Citation Ranking: Bringing Order to the Web”, Lawrence Page and Sergey Brin, Stanford University. 1998.