Link Analysis: PageRank


Mar 22, 2016

Transcript
Page 1: Link Analysis:  PageRank

Link Analysis: PageRank

Page 2: Link Analysis:  PageRank

Ranking Nodes on the Graph

• Web pages are not equally “important”: www.joe-schmoe.com vs. www.stanford.edu

• Since there is large diversity in the connectivity of the web graph, we can rank pages by their link structure

Slides by Jure Leskovec: Mining Massive Datasets 2

Page 3: Link Analysis:  PageRank


Link Analysis Algorithms

• We will cover the following link analysis approaches to computing the importance of nodes in a graph:
– PageRank
– Hubs and Authorities (HITS)
– Topic-Specific (Personalized) PageRank
– Web Spam Detection Algorithms

Page 4: Link Analysis:  PageRank

Links as Votes

• Idea: Links as votes
– A page is more important if it has more links

• In-coming links? Out-going links?

• Think of in-links as votes:
– www.stanford.edu has 23,400 in-links
– www.joe-schmoe.com has 1 in-link

• Are all in-links equal?
– Links from important pages count more
– Recursive question!

Page 5: Link Analysis:  PageRank

Simple Recursive Formulation

• Each link’s vote is proportional to the importance of its source page

• If page p with importance x has n out-links, each link gets x/n votes

• Page p’s own importance is the sum of the votes on its in-links


Page 6: Link Analysis:  PageRank

PageRank: The “Flow” Model

• A “vote” from an important page is worth more
• A page is important if it is pointed to by other important pages
• Define a “rank” rj for node j:

rj = Σi→j ri / di    (di … out-degree of node i)

[Example graph, “The web in 1839”: three nodes y, a, m with links y→y, y→a, a→y, a→m, m→a]

Flow equations:
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2

Page 7: Link Analysis:  PageRank

Solving the Flow Equations

• 3 equations, 3 unknowns, no constant terms
– No unique solution: any solution can be rescaled

• An additional constraint forces uniqueness:
– ry + ra + rm = 1
– ry = 2/5, ra = 2/5, rm = 1/5

• Gaussian elimination works for small examples, but we need a better method for large, web-scale graphs
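As a quick sanity check (a NumPy sketch, not part of the original slides; it assumes NumPy is available), the three flow equations plus the normalization constraint can be solved directly:

```python
import numpy as np

# Flow equations for the 3-node example (y, a, m), written as r = M r,
# with the extra constraint ry + ra + rm = 1.
M = np.array([[0.5, 0.5, 0.0],   # ry = ry/2 + ra/2
              [0.5, 0.0, 1.0],   # ra = ry/2 + rm
              [0.0, 0.5, 0.0]])  # rm = ra/2

# Stack (M - I) r = 0 with the normalization row and solve in the
# least-squares sense (the system is consistent, so the residual is 0).
A_sys = np.vstack([M - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
r, *_ = np.linalg.lstsq(A_sys, b, rcond=None)
print(r)  # approx [0.4, 0.4, 0.2], i.e. ry = 2/5, ra = 2/5, rm = 1/5
```

This reproduces the unique normalized solution from the slide; for web-scale graphs we switch to power iteration instead.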


Page 8: Link Analysis:  PageRank

PageRank: Matrix Formulation

• Stochastic adjacency matrix M
– Let page j have dj out-links
– If j → i, then Mij = 1/dj, else Mij = 0

• M is a column-stochastic matrix
– Columns sum to 1

• Rank vector r: a vector with one entry per page
– ri is the importance score of page i
– Σi ri = 1

• The flow equations can be written r = M · r
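A minimal sketch of the construction above (the helper `build_M` and the adjacency-dict format are illustrative choices, not from the slides):

```python
import numpy as np

# Build the column-stochastic matrix M from an out-link list:
# M[i][j] = 1/d_j if j -> i, else 0, where d_j is j's out-degree.
def build_M(out_links, n):
    M = np.zeros((n, n))
    for j, dests in out_links.items():
        for i in dests:
            M[i, j] = 1.0 / len(dests)
    return M

# The y/a/m example, with 0 = y, 1 = a, 2 = m
out_links = {0: [0, 1], 1: [0, 2], 2: [1]}
M = build_M(out_links, 3)
print(M)
print(M.sum(axis=0))  # every column sums to 1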


Page 9: Link Analysis:  PageRank

Example

• Suppose page j links to 3 pages, including i

[Figure: page j has 3 out-links, one of them pointing to page i; column j of M has three entries equal to 1/3, so the product M · r contributes rj/3 to ri]

Page 10: Link Analysis:  PageRank

Eigenvector Formulation

• The flow equations can be written r = M ∙ r

• So the rank vector r is an eigenvector of the stochastic web matrix M
– In fact, it is the first (principal) eigenvector, with corresponding eigenvalue 1
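This claim is easy to verify numerically (a NumPy sketch, not part of the original slides): the eigenvalue closest to 1 of the example matrix is exactly 1, and its normalized eigenvector is the rank vector.

```python
import numpy as np

# Confirm the rank vector is an eigenvector of M with eigenvalue 1.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
w, V = np.linalg.eig(M)
k = np.argmin(np.abs(w - 1.0))   # locate the eigenvalue closest to 1
r = np.real(V[:, k])
r = r / r.sum()                  # normalize so the entries sum to 1
print(np.real(w[k]), r)          # eigenvalue approx 1, r approx [0.4, 0.4, 0.2]
```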


Page 11: Link Analysis:  PageRank

Example: Flow Equations & M

r = M · r

        y    a    m
  y  [  ½    ½    0 ]      ry = ry/2 + ra/2
  a  [  ½    0    1 ]      ra = ry/2 + rm
  m  [  0    ½    0 ]      rm = ra/2

Page 12: Link Analysis:  PageRank

Power Iteration Method

• Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks

• Power iteration: a simple iterative scheme
– Initialize: r(0) = [1/N, …, 1/N]T
– Iterate: r(t+1) = M · r(t)
– Stop when |r(t+1) – r(t)|1 < ε
• |x|1 = Σ1≤i≤N |xi| is the L1 norm
• Can use any other vector norm, e.g., Euclidean

Equivalently, entry-wise:

rj(t+1) = Σi→j ri(t) / di    (di … out-degree of node i)
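The scheme above can be sketched in a few lines (a NumPy illustration, not part of the original slides; `power_iterate` is a hypothetical helper name):

```python
import numpy as np

# Minimal power iteration (no teleport yet), following the slide.
def power_iterate(M, eps=1e-10):
    N = M.shape[0]
    r = np.full(N, 1.0 / N)                  # r(0) = [1/N, ..., 1/N]
    while True:
        r_next = M @ r                       # r(t+1) = M r(t)
        if np.abs(r_next - r).sum() < eps:   # L1 stopping criterion
            return r_next
        r = r_next

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iterate(M))  # converges to approx [0.4, 0.4, 0.2]
```

On this well-behaved example it converges to the same (2/5, 2/5, 1/5) solution found earlier; the next slides show graphs where it does not.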

Page 13: Link Analysis:  PageRank

PageRank: How to solve?

• Power Iteration:
– Set rj(0) = 1/N
– Iterate: rj(t+1) = Σi→j ri(t) / di

• Example (iterations 0, 1, 2, …):

  ry     1/3   1/3    5/12    9/24         6/15
  ra  =  1/3   3/6    1/3    11/24    …    6/15
  rm     1/3   1/6   3/12     1/6          3/15

        y    a    m
  y  [  ½    ½    0 ]      ry = ry/2 + ra/2
  a  [  ½    0    1 ]      ra = ry/2 + rm
  m  [  0    ½    0 ]      rm = ra/2

Page 14: Link Analysis:  PageRank

Random Walk Interpretation

Imagine a random web surfer:
– At any time t, the surfer is on some page u
– At time t+1, the surfer follows an out-link from u uniformly at random
– Ends up on some page v linked from u
– Process repeats indefinitely

Let:
p(t) … vector whose ith coordinate is the probability that the surfer is at page i at time t
– p(t) is a probability distribution over pages

rj = Σi→j ri / dout(i)

Page 15: Link Analysis:  PageRank

The Stationary Distribution

• Where is the surfer at time t+1?
– Follows a link uniformly at random:
p(t+1) = M · p(t)

• Suppose the random walk reaches a state where
p(t+1) = M · p(t) = p(t)
then p(t) is a stationary distribution of the random walk

• Our rank vector r satisfies r = M · r
– So it is a stationary distribution for the random walk

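The random-walk interpretation can be checked empirically (a simulation sketch, not part of the original slides): a long walk on the y/a/m graph should visit pages in proportion to the rank vector (2/5, 2/5, 1/5).

```python
import random
from collections import Counter

# Simulate the random surfer on the y/a/m graph and compare the
# empirical visit frequencies with the stationary ranks.
out_links = {'y': ['y', 'a'], 'a': ['y', 'm'], 'm': ['a']}
random.seed(0)
page = 'y'
visits = Counter()
for _ in range(200_000):
    page = random.choice(out_links[page])  # follow an out-link at random
    visits[page] += 1
total = sum(visits.values())
print({p: round(visits[p] / total, 3) for p in 'yam'})
# close to {'y': 0.4, 'a': 0.4, 'm': 0.2}
```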

Page 16: Link Analysis:  PageRank


PageRank: Three Questions

• Does this converge?

• Does it converge to what we want?

• Are results reasonable?

rj(t+1) = Σi→j ri(t) / di    or equivalently    r(t+1) = M · r(t)

Page 17: Link Analysis:  PageRank

Does This Converge?

• Example (iterations 0, 1, 2, …):

  ra     1   0   1   0
  rb  =  0   1   0   1

[Graph: two nodes a and b linking only to each other; under rj(t+1) = Σi→j ri(t) / di the scores oscillate forever and never converge]
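The oscillation is easy to reproduce (a NumPy sketch, not part of the original slides):

```python
import numpy as np

# The 2-cycle a <-> b: power iteration oscillates with period 2
# and never converges.
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])
r = np.array([1.0, 0.0])
history = []
for _ in range(4):
    r = M @ r
    history.append(r.copy())
print(history)  # [0,1], [1,0], [0,1], [1,0]: a period-2 oscillation
```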

Page 18: Link Analysis:  PageRank

Does it Converge to What We Want?

• Example (iterations 0, 1, 2, …):

  ra     1   0   0   0
  rb  =  0   1   0   0

[Graph: a → b, where b is a dead end; all importance leaks out and the scores go to 0]

Page 19: Link Analysis:  PageRank


Problems with the “Flow” Model

2 problems:

• Some pages are “dead ends” (have no out-links)
– Such pages cause importance to “leak out”

• Spider traps (all out-links are within the group)
– Eventually spider traps absorb all importance

Page 20: Link Analysis:  PageRank

Problem: Spider Traps

• Power Iteration:
– Set rj(0) = 1/N
– Iterate: rj(t+1) = Σi→j ri(t) / di

• Example (iterations 0, 1, 2, …):

  ry     1/3   2/6   3/12    5/24        0
  ra  =  1/3   1/6   2/12    3/24   …    0
  rm     1/3   3/6   7/12   16/24        1

        y    a    m
  y  [  ½    ½    0 ]      ry = ry/2 + ra/2
  a  [  ½    0    0 ]      ra = ry/2
  m  [  0    ½    1 ]      rm = ra/2 + rm

(m links only to itself: a spider trap that eventually absorbs all importance)

Page 21: Link Analysis:  PageRank

Solution: Random Teleports

• The Google solution for spider traps: At each time step, the random surfer has two options:
– With probability β, follow a link at random
– With probability 1-β, jump to some page uniformly at random
– Common values for β are in the range 0.8 to 0.9

• The surfer will teleport out of a spider trap within a few time steps


Page 22: Link Analysis:  PageRank

Problem: Dead Ends

• Power Iteration:
– Set rj(0) = 1/N
– Iterate: rj(t+1) = Σi→j ri(t) / di

• Example (iterations 0, 1, 2, …):

  ry     1/3   2/6   3/12   5/24        0
  ra  =  1/3   1/6   2/12   3/24   …    0
  rm     1/3   1/6   1/12   2/24        0

        y    a    m
  y  [  ½    ½    0 ]      ry = ry/2 + ra/2
  a  [  ½    0    0 ]      ra = ry/2
  m  [  0    ½    0 ]      rm = ra/2

(m has no out-links; importance “leaks out” and all scores go to 0)

Page 23: Link Analysis:  PageRank

Solution: Dead Ends

• Teleports: Follow random teleport links with probability 1.0 from dead ends
– Adjust the matrix accordingly:

        y    a    m                  y    a    m
  y  [  ½    ½    0 ]          y  [  ½    ½    ⅓ ]
  a  [  ½    0    0 ]    →     a  [  ½    0    ⅓ ]
  m  [  0    ½    0 ]          m  [  0    ½    ⅓ ]

Page 24: Link Analysis:  PageRank

Why Do Teleports Solve the Problem?

Markov chains:
• Set of states X
• Transition matrix P, where Pij = P(Xt = i | Xt-1 = j)
• π specifies the probability of being at each state x ∈ X
• Goal is to find π such that π = P · π

(Compare with the PageRank iteration r(t+1) = M · r(t))

Page 25: Link Analysis:  PageRank


Why is This Analogy Useful?

• Theory of Markov chains

• Fact: For any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector as long as P is stochastic, irreducible and aperiodic.

Page 26: Link Analysis:  PageRank

Make M Stochastic

• Stochastic: Every column sums to 1
• A possible solution: Add teleport links out of the dead end m (the green links in the original figure)

        y    a    m
  y  [  ½    ½    ⅓ ]      ry = ry/2 + ra/2 + rm/3
  a  [  ½    0    ⅓ ]      ra = ry/2 + rm/3
  m  [  0    ½    ⅓ ]      rm = ra/2 + rm/3

S = M + (1/n) · 1 · aT
• ai = 1 if node i has out-degree 0, ai = 0 otherwise
• 1 … vector of all 1s
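A sketch of this patch in NumPy (not part of the original slides; `make_stochastic` is a hypothetical helper, and the formula is written for the column-stochastic convention used here, where a dead-end column of M is all zeros):

```python
import numpy as np

# S = M + (1/n) * 1 * a^T, where a marks nodes with out-degree 0:
# each dead-end column of M gets 1/n in every row.
def make_stochastic(M):
    n = M.shape[0]
    a = (M.sum(axis=0) == 0).astype(float)   # dead-end indicator per column
    return M + np.outer(np.ones(n), a) / n

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])   # m (last column) is a dead end
S = make_stochastic(M)
print(S)                 # last column becomes [1/3, 1/3, 1/3]
print(S.sum(axis=0))     # [1. 1. 1.]
```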

Page 27: Link Analysis:  PageRank


Make M Aperiodic

• A chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k.

• A possible solution: Add green links

[Figure: the y/a/m graph with added teleport links that break the periodicity]

Page 28: Link Analysis:  PageRank


Make M Irreducible

• Irreducible: From any state, there is a non-zero probability of reaching any other state

• A possible solution: Add green links

[Figure: the y/a/m graph with added teleport links making it fully connected]

Page 29: Link Analysis:  PageRank


Solution: Random Jumps

• Google’s solution that does it all:
– Makes M stochastic, aperiodic, irreducible

• At each step, the random surfer has two options:
– With probability β, follow a link at random
– With probability 1-β, jump to some random page

• PageRank equation [Brin-Page, 98]:

rj = Σi→j β · ri / di + (1-β) / N

di … out-degree of node i

From now on: we assume M has no dead ends; that is, we follow random teleport links with probability 1.0 from dead ends

Page 30: Link Analysis:  PageRank

The Google Matrix

• PageRank equation [Brin-Page, 98]:

rj = Σi→j β · ri / di + (1-β) / N

• The Google Matrix A:

A = β · M + (1-β) · [1/N]N×N

• A is stochastic, aperiodic and irreducible, so the power iteration r(t+1) = A · r(t) converges to a unique rank vector

• What is β?
– In practice β = 0.85 (make about 5 steps and jump)

Page 31: Link Analysis:  PageRank

Random Teleports (β = 0.8)

          [ 1/2  1/2   0 ]          [ 1/3  1/3  1/3 ]     [ 7/15  7/15   1/15 ]
A = 0.8 · [ 1/2   0    0 ]  + 0.2 · [ 1/3  1/3  1/3 ]  =  [ 7/15  1/15   1/15 ]
          [  0   1/2   1 ]          [ 1/3  1/3  1/3 ]     [ 1/15  7/15  13/15 ]

(rows/columns ordered y, a, m; each edge of the teleport-augmented graph has weight 0.8 · (link probability) + 0.2 · ⅓)

  ry     1/3   0.33   0.28   0.26         7/33
  ra  =  1/3   0.20   0.20   0.18    …    5/33
  rm     1/3   0.46   0.52   0.56        21/33
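The example above can be reproduced directly (a NumPy sketch, not part of the original slides): build A for β = 0.8 and power-iterate to the stationary ranks (7/33, 5/33, 21/33).

```python
import numpy as np

# Google matrix for the y/a/m example with the spider trap at m.
beta = 0.8
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
N = M.shape[0]
A = beta * M + (1 - beta) * np.full((N, N), 1.0 / N)

r = np.full(N, 1.0 / N)      # start uniform
for _ in range(100):
    r = A @ r                # r(t+1) = A r(t)
print(r * 33)                # approx [7, 5, 21]
```

Note how the teleports keep the trap node m from absorbing everything: it ends with 21/33 of the importance rather than all of it.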

Page 32: Link Analysis:  PageRank

Computing PageRank

• Key step is matrix-vector multiplication
– rnew = A · rold

• Easy if we have enough main memory to hold A, rold, rnew

• Say N = 1 billion pages
– We need 4 bytes for each entry (say)
– 2 billion entries for the two vectors, approx 8GB
– Matrix A has N² entries
• 10¹⁸ is a large number!

A = β · M + (1-β) · [1/N]N×N

          [ 1/2  1/2   0 ]          [ 1/3  1/3  1/3 ]     [ 7/15  7/15   1/15 ]
A = 0.8 · [ 1/2   0    0 ]  + 0.2 · [ 1/3  1/3  1/3 ]  =  [ 7/15  1/15   1/15 ]
          [  0   1/2   1 ]          [ 1/3  1/3  1/3 ]     [ 1/15  7/15  13/15 ]

Page 33: Link Analysis:  PageRank

Matrix Formulation

• Suppose there are N pages
• Consider a page j, with set of out-links dj
• We have Mij = 1/|dj| when j→i, and Mij = 0 otherwise
• The random teleport is equivalent to:
– Adding a teleport link from j to every other page with probability (1-β)/N
– Reducing the probability of following each out-link from 1/|dj| to β/|dj|
– Equivalently: taxing each page a fraction (1-β) of its score and redistributing it evenly


Page 34: Link Analysis:  PageRank

Rearranging the Equation

• r = A · r, where A = β · M + (1-β) · [1/N]N×N
• r = β · M · r + (1-β) · [1/N]N×N · r
• (1-β) · [1/N]N×N · r = [(1-β)/N]N, since Σi ri = 1
• So we get: r = β · M · r + [(1-β)/N]N

[x]N … a vector of length N with all entries x

Page 35: Link Analysis:  PageRank

Sparse Matrix Formulation

• We just rearranged the PageRank equation:

r = β · M · r + [(1-β)/N]N

• where [(1-β)/N]N is a vector with all N entries (1-β)/N

• M is a sparse matrix!
– 10 links per node, approx 10N entries

• So in each iteration, we need to:
– Compute rnew = β · M · rold
– Add a constant value (1-β)/N to each entry in rnew
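One iteration of this sparse formulation can be sketched as follows (an illustrative sketch, not the slides' own code; `sparse_iteration` and the adjacency-dict format are assumptions):

```python
import numpy as np

# One sparse PageRank iteration: r_new = beta * M * r_old + (1-beta)/N,
# with M stored as an out-link adjacency dict instead of a dense matrix.
def sparse_iteration(out_links, r_old, beta=0.85):
    N = len(r_old)
    r_new = np.full(N, (1 - beta) / N)        # the constant teleport term
    for j, dests in out_links.items():
        share = beta * r_old[j] / len(dests)  # beta * r_old(j) / d_j
        for i in dests:
            r_new[i] += share
    return r_new

out_links = {0: [0, 1], 1: [0, 2], 2: [1]}    # the y/a/m example, no dead ends
r = np.full(3, 1 / 3)
for _ in range(100):
    r = sparse_iteration(out_links, r)
print(r, r.sum())  # the scores sum to 1
```

Because M has no dead ends here, each iteration preserves the total score of 1, as the rearranged equation promises.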

Page 36: Link Analysis:  PageRank

Sparse Matrix Encoding

• Encode the sparse matrix using only its nonzero entries
– Space proportional roughly to the number of links
– Say 10N, or 4 · 10 · 1 billion = 40GB
– Still won’t fit in memory, but will fit on disk

source node | degree | destination nodes
     0      |   3    | 1, 5, 7
     1      |   5    | 17, 64, 113, 117, 245
     2      |   2    | 13, 23

Page 37: Link Analysis:  PageRank

Basic Algorithm: Update Step

• Assume enough RAM to fit rnew into memory
– Store rold and matrix M on disk

• Then 1 step of power-iteration is:

Initialize all entries of rnew to (1-β)/N
For each page p (of out-degree n):
  Read into memory: p, n, dest1, …, destn, rold(p)
  for j = 1…n: rnew(destj) += β · rold(p) / n

src | degree | destination
 0  |   3    | 1, 5, 6
 1  |   4    | 17, 64, 113, 117
 2  |   2    | 13, 23
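The update step can be sketched in plain Python (an illustration, not the slides' code: the on-disk scan is replaced by an in-memory list of `(page, degree, destinations)` records in the same layout as the table above):

```python
# One update step: r_new = beta * M * r_old + (1-beta)/N, streaming
# over (source, out-degree, destinations) records.
def update_step(records, r_old, beta=0.85):
    N = len(r_old)
    r_new = [(1 - beta) / N] * N             # initialize to (1-beta)/N
    for p, n, dests in records:              # one record per source page
        for dest in dests:
            r_new[dest] += beta * r_old[p] / n
    return r_new

records = [(0, 2, [0, 1]), (1, 2, [0, 2]), (2, 1, [1])]  # y/a/m again
r = [1 / 3] * 3
for _ in range(50):
    r = update_step(records, r)
print(r)  # stationary scores; the entries sum to 1
```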

Page 38: Link Analysis:  PageRank


Analysis

• Assume enough RAM to fit rnew into memory
– Store rold and matrix M on disk

• In each iteration, we have to:
– Read rold and M
– Write rnew back to disk
– IO cost = 2|r| + |M|

• Question:
– What if we could not even fit rnew in memory?

Page 39: Link Analysis:  PageRank

Block-based Update Algorithm

src | degree | destination
 0  |   4    | 0, 1, 3, 5
 1  |   2    | 0, 5
 2  |   2    | 3, 4

[Figure: rnew is split into blocks of two entries ({0,1}, {2,3}, {4,5}); M and rold are scanned once per block]

Page 40: Link Analysis:  PageRank

Slides by Jure Leskovec: Mining Massive Datasets 40

Analysis of Block Update

• Similar to a nested-loop join in databases
– Break rnew into k blocks that fit in memory
– Scan M and rold once for each block

• k scans of M and rold
– Cost per iteration: k(|M| + |r|) + |r| = k|M| + (k+1)|r|

• Can we do better?
– Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration

Page 41: Link Analysis:  PageRank

Block-Stripe Update Algorithm

Stripe for rnew block {0, 1}:    Stripe for block {2, 3}:      Stripe for block {4, 5}:
src | degree | destination      src | degree | destination    src | degree | destination
 0  |   4    | 0, 1              0  |   4    | 3               0  |   4    | 5
 1  |   3    | 0                 2  |   2    | 3               1  |   3    | 5
 2  |   2    | 1                                               2  |   2    | 4

Page 42: Link Analysis:  PageRank


Block-Stripe Analysis

• Break M into stripes
– Each stripe contains only the destination nodes in the corresponding block of rnew

• Some additional overhead per stripe
– But it is usually worth it

• Cost per iteration:
– |M|(1+ε) + (k+1)|r|

Page 43: Link Analysis:  PageRank

Some Problems with PageRank

• Measures generic popularity of a page
– Biased against topic-specific authorities
– Solution: Topic-Specific PageRank (next)

• Uses a single measure of importance
– Other models exist, e.g., hubs-and-authorities
– Solution: Hubs-and-Authorities (next)

• Susceptible to link spam
– Artificial link topologies created in order to boost page rank
– Solution: TrustRank (next)