The PageRank Citation Ranking: Bringing Order to the Web
Page L., Brin S., Motwani R., Winograd T.
Stanford Digital Library Technologies Project
http://dbpubs.stanford.edu/pub/1999-66
Presented by Zheng Zhao
Originally designed by Soumya Sanyal
http://ranger.uta.edu/~gdas/Courses/Spring2005/DBIR/slides/The%20PageRank%20Citation%20Ranking%20-%20Redone.ppt
Outline
• Paper Citations and the Web: Motivation
• PageRank: Why it should be considered
• More PageRank: Nuts and bolts
• PageRank Unleashed: Looking under the hood
• Convergence and Random Walks: Why does it work?
• Implementation: Getting your hands dirty
• Personalized PageRank: The invisible source
• Applications: What wasn't apparent already
• Conclusions
University of Texas at Arlington
Paper Citations and the Web: Motivation
Academic citations link to other well-known papers
Papers are peer reviewed, so citations carry quality control
The web of academic documents is homogeneous in quality, usage, citation, and length
Most web pages link to other web pages as well, but without such quality control
The quality of a web page, however, is subjective to the user
The importance of a page is a quantity that is not easy to capture objectively
Contd.
A user wants to see what is most applicable to her needs first.
The job of the retrieval system is to present the most relevant documents up front.
This magnifies the notion of quality, or relative importance, of a web page:
the average quality experienced by a user should be higher than the average quality of a web page.
Notation used:
• Backlinks (inedges) : Links that point to a certain page
• Forward Links (outedges): Links that emanate from that page
PageRank: Why it should be considered
Think of a color palette:
– Colors are formed by the mixture of one or more colors
– The amount and intensity of each color you mix ultimately governs the final mixture, not the number of colors!
Now think of a web page:
– A number of back links (inedges) point to this page
– Say one back link came from Yahoo! and another came from an obscure home page
– Think of the importance of the Yahoo! page as opposed to the importance of the home page
– Now map the importance of the Yahoo! page to the amount (intensity) of one color, and the home page to another color
The point: what matters is the importance of the back links, not their number.
More PageRank: Nuts and bolts
Say that for any web page u, the set of forward links is Fu and the set of back links is Bu, with Nu = |Fu|. The rank of a page is then
    R(u) = c · Σ v∈Bu R(v) / Nv
where R(u) is the rank of page u and c is a normalization constant.
– Note: c < 1 to cover for pages with no outgoing links
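One update step of this formula can be sketched on a hypothetical three-page graph (the page names and the value of c are illustrative, not from the paper):

```python
# Toy link graph: page -> pages it links to (forward links Fu).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
c = 0.85                                  # normalization constant, c < 1
ranks = {p: 1.0 / len(links) for p in links}   # start uniform

def update(ranks):
    new = {}
    for u in links:
        # Back links Bu of u: every v whose forward links include u.
        backlink_sum = sum(ranks[v] / len(links[v])
                           for v in links if u in links[v])
        new[u] = c * backlink_sum         # R(u) = c * sum over Bu of R(v)/Nv
    return new

ranks = update(ranks)
```

Since every page here has outgoing links, one update scales the total rank mass by exactly c.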
Contd..
So what does the overall picture look like?
Let A be a matrix whose rows and columns correspond to web pages, with A[u, v] = 1/Nu if there is a link from u to v, and 0 otherwise.
With R a vector over all the web pages, R = cAR, so R is an eigenvector of A; the dominant eigenvector is the one associated with the maximal eigenvalue.
It can be found by iterating the previous equation until the recurrence converges.
– The eigenvectors associated with a given eigenvalue form what is called an eigenspace.
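The iteration is ordinary power iteration. A minimal sketch, assuming a made-up 3×3 column-stochastic matrix A (values are illustrative only):

```python
# Column-stochastic link matrix: A[i][j] is the probability of moving
# from page j to page i (each column sums to 1).
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
R = [1/3, 1/3, 1/3]                      # initial rank vector

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

# Repeatedly apply A until R stops changing: R converges to the
# dominant eigenvector (eigenvalue 1 for a stochastic matrix).
for _ in range(200):
    nxt = matvec(A, R)
    s = sum(nxt)
    nxt = [v / s for v in nxt]           # keep ranks summing to 1
    if max(abs(a - b) for a, b in zip(nxt, R)) < 1e-12:
        R = nxt
        break
    R = nxt
```

For this particular matrix the iteration settles on the stationary ranks (0.4, 0.2, 0.4).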
Contd.. (A Walk-Through Example)
Let's take an example.
[The slide shows an example Web graph and its transposed link matrix AT]
Contd..
Matrix notation: R = c A R = M R
R : eigenvector of A, with eigenvalue λ = 1/c
This is the standard eigenproblem A x = λ x, i.e. (A − λI) x = 0, with nontrivial solutions when det(A − λI) = 0
[The slide shows the matrix A and the resulting normalized rank vector R]
Contd.. (Markov Chains)
Random surfer model
– A description of a random walk through the Web graph
– A is interpreted as a transition matrix; the rank of a page is the asymptotic probability that a surfer is currently browsing that page
– This notion is fundamental to any Markovian system. For a discrete version, the following is assumed:
    Rt = M Rt−1
M: the transition matrix of a first-order (stochastic) Markov chain.
The question is: does it converge to some sensible solution (as t → ∞) regardless of the initial ranks?
Contd.. (Issues)
The above equation would converge were it not for a little problem.
This problem is called the 'rank sink' problem:
– The sink accumulates rank, but never distributes it!
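A tiny demonstration of a rank sink (the graph is hypothetical): pages B and C link only to each other, so under the pure update they soak up all the rank while A's rank drains away:

```python
# B and C form a closed loop (a sink); A only feeds rank into it.
links = {"A": ["B"], "B": ["C"], "C": ["B"]}
ranks = {p: 1/3 for p in links}

for _ in range(50):
    ranks = {
        u: sum(ranks[v] / len(links[v]) for v in links if u in links[v])
        for u in links
    }
# Nothing links to A, so its rank drops to zero; the B<->C sink
# keeps the entire rank mass circulating between its two pages.
```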
Contd..
In general, many web pages have either no backlinks or no forward links.
This results in dangling edges of the graph:
– No parent → rank 0; MT converges to a matrix whose last column is all zero
– No children → no solution; MT converges to the zero matrix
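The paper's practical fix, mentioned again in the implementation section, is to drop dangling pages before iterating (they can be added back afterwards). A sketch on a hypothetical graph:

```python
# "D" is a dangling page: it has no forward links.
links = {"A": ["B", "D"], "B": ["A"], "D": []}

def prune_dangling(links):
    pruned = {u: list(out) for u, out in links.items()}
    while True:
        dangling = [u for u, out in pruned.items() if not out]
        if not dangling:
            return pruned
        for u in dangling:                 # remove pages with no out-links
            del pruned[u]
        # Removing a page may strand links to it, and may create new
        # dangling pages, so filter and repeat until stable.
        pruned = {u: [v for v in out if v in pruned]
                  for u, out in pruned.items()}

g = prune_dangling(links)
```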
Contd.. (More Random Surfer)
How do we escape from this?
– A: We actually 'escape' from it.
Say a surfer is randomly clicking and hopping from one page to another.
If the surfer keeps coming back to the same set of pages, she will get bored (in reality, too) and try to 'escape' from this set of pages.
Hence, we associate an 'escape' factor E to account for this 'boredom'.
– How do we model this escape probability?
We define E to be a vector over all the web pages that accounts for each page's escape probability.
Contd..
Given this escape vector, how do we associate it with the original model?
In matrix notation: R′ = c(A R′ + E), where ‖R′‖₁ = 1.
It can be rewritten as R′ = c(A + E × 1ᵀ) R′, where 1 is the vector of all ones.
Hence R′ is an eigenvector of (A + E × 1ᵀ).
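A sketch of the damped update, written in the common equivalent form R ← c·A·R + (1 − c)·E. The uniform choice of E and the toy graph are assumptions for illustration; the model allows any escape vector:

```python
# The same sink-prone toy graph as before: B and C form a closed loop.
links = {"A": ["B"], "B": ["C"], "C": ["B"]}
pages = list(links)
c = 0.85
E = {p: 1 / len(pages) for p in pages}   # uniform escape vector (assumption)
ranks = dict(E)

for _ in range(100):
    ranks = {
        u: c * sum(ranks[v] / len(links[v]) for v in links if u in links[v])
           + (1 - c) * E[u]              # escape term keeps every page reachable
        for u in links
    }
# Unlike the pure update, every page keeps nonzero rank: even A,
# which nothing links to, retains its share of the escape mass.
```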
PageRank Unleashed: Looking under the hood
• What can we say about δ and ε?
• λ1 − λ2 (the 'd1' of the slides) is called the eigengap, and it controls the rate of convergence
• ε is the convergence threshold
The main algorithm:
    R0 ← S (some initial rank vector)
    loop:
        Ri+1 ← A Ri
        d ← ‖Ri‖₁ − ‖Ri+1‖₁
        Ri+1 ← Ri+1 + d E
        δ ← ‖Ri+1 − Ri‖₁
    while δ > ε
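The loop above can be sketched in Python. The matrix values and the function name are illustrative; the graph here has no dangling pages, so the redistributed mass d stays zero:

```python
def pagerank(A, E, eps=1e-8):
    n = len(A)
    R = [1.0 / n] * n                    # R0: uniform initial vector S
    while True:
        nxt = [sum(A[i][j] * R[j] for j in range(n)) for i in range(n)]
        d = sum(R) - sum(nxt)            # rank mass lost (to dangling pages)
        nxt = [nxt[i] + d * E[i] for i in range(n)]   # give it back via E
        delta = sum(abs(nxt[i] - R[i]) for i in range(n))   # L1 change
        R = nxt
        if delta < eps:                  # stop once delta falls below epsilon
            return R

# Hypothetical 3-page column-stochastic matrix (same toy example as earlier).
A = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
R = pagerank(A, E=[1/3, 1/3, 1/3])
```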
Convergence and Random Walks: Why does it work?
Irreducible, aperiodic Markov chains have a primitive transition probability matrix.
What is the issue all about?
– We need a transition-matrix model that is guaranteed to converge, and to converge to a unique stationary distribution vector.
Contd..
The addition of the escape vector E makes the original matrix A both primitive and stochastic
– This guarantees convergence
What about the addition of new links?
– Are link-analysis algorithms based on eigenvectors stable, in the sense that their results don't change significantly?
– If the connectivity of a portion of the graph is changed arbitrarily, how does that affect the results?
Ng et al. (IJCAI 2001) and Bianchini et al. (WWW 2002):
• It is possible to perturb a symmetric matrix by a quantity that grows with the eigengap and produce a constant perturbation of the dominant eigenvector
Contd..
Convergence experiments
– Expander graphs and the eigengap: every subset S has a neighborhood bounded by some factor times |S|
– Rapidly mixing random walk: convergence is guaranteed in time logarithmic in the size of the graph
Implementation: Getting your hands dirty
In 1998:
– 24 million web pages
– The crawler builds an index of links
– To do this in 5 days, 50 web pages/second need to be crawled
– With an average outdegree of 11, that is 550 links/second
– 75 million unique URLs to be compared against
– URLs are hashed to unique integer IDs
– No dangling links are kept initially
– The vector E helps with convergence issues as well
– Weights were kept for 75 million URLs at 4 bytes/weight (300 MB)
– Access to the link database is linear since it is sorted
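The URL-to-integer-ID step can be sketched as follows. This is a dictionary-based sketch with hypothetical URLs; the actual 1998 system hashed URLs to IDs for compactness:

```python
# Map each distinct URL to a small integer ID, assigned on first sight.
ids = {}

def url_id(url):
    if url not in ids:
        ids[url] = len(ids)      # next unused integer ID
    return ids[url]

a = url_id("http://example.com/")       # hypothetical URLs
b = url_id("http://example.com/page")
again = url_id("http://example.com/")   # repeat lookup: same ID back
```

Storing 4-byte integer IDs (and 4-byte rank weights) instead of full URL strings is what keeps 75 million pages within a few hundred megabytes.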