Presented By: - Chandrika B N

Presented By: - Chandrika B N. Agenda Technology Overview Introduction Link Structure of the Web Simplified PageRank Eigenvalue and Eigenvector PageRank.

Dec 24, 2015


Jean Singleton
Transcript
Page 1:

Presented By:- Chandrika B N

Page 2:

Agenda
• Technology Overview
• Introduction
• Link Structure of the Web
• Simplified PageRank
• Eigenvalue and Eigenvector
• PageRank Definition
• Random Surfer Model
• Dangling Links
• PageRank Implementation
• Convergence
• Searching with PageRank
• Personalized PageRank
• Applications
• Conclusion

Page 3:

Technology Overview

• Google recognized the need for a new kind of server setup
• It linked PCs together to quickly find each query’s answers
• This resulted in:
  – Faster response time
  – Greater scalability
  – Lower costs
• Google uses more than 200 signals (including the PageRank algorithm) to determine which pages are important
• Google then performs hypertext matching

- Google Corporate Information

Page 4:

Life of a Google Query

- Google Corporate Information

Page 5:

The mechanism

• Web Crawler: finds and retrieves pages on the web
• Repository: web pages are compressed and stored here
• Indexer: each index entry has a list of documents in which the term appears and the location within the text where it occurs
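The indexer entry described above (term, documents, positions) can be sketched as a small inverted index. This is only an illustration under assumptions: the function name `build_index`, the whitespace tokenization, and the two sample documents are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to {doc_id: [word positions]} (hypothetical helper)."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Naive tokenization: lowercase and split on whitespace (an assumption).
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


# Made-up sample documents, keyed by an integer document ID.
docs = {1: "page rank orders web pages", 2: "the web is a graph of pages"}
index = build_index(docs)
```

Looking up `index["web"]` then gives the documents containing the term together with the word offsets at which it occurs.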

Page 6:

Introduction

The WWW is very large and heterogeneous

The web pages are extremely diverse

Problem: How can the most relevant pages be ranked at the top?

Answer: Take advantage of the link structure of the Web to produce a ranking of every web page, known as PageRank

Page 7:

Link Structure of the Web

A and B are Backlinks of C

•Every page has some number of forward links (outedges) and backlinks (inedges)

•We can never know all the backlinks of a page, but we know all of its forward links

•Generally, highly linked pages are more “important”

Page 8:

PageRank

• PageRank is a method for computing a ranking for every web page based on the graph of the web
• A page has high rank if the sum of the ranks of its backlinks is high. This covers both cases:
  – the page has many backlinks
  – the page has a few highly ranked backlinks
• PageRank is a link analysis algorithm that assigns each page a numerical weight representing how important it is on the web
• The web is democratic, i.e., pages vote for pages
• Google interprets a link from page A to page B as a vote, by page A, for page B. It also analyses the page that cast the vote.
• A page is important if important pages refer to it

Page 9:

Simplified PageRank Calculation

Simple ranking function:

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$

where
• u: a web page
• B_u: the set of pages that link to u (its backlinks)
• F_u: the set of pages that u links to; N_u = |F_u| is the number of links from u
• c: a factor used for normalization

The PageRanks form a probability distribution over web pages, so the PageRanks of all web pages sum to one
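A minimal sketch of the simplified calculation, in which each page repeatedly passes its rank in equal shares to the pages it links to, so that R(u) becomes the sum of R(v)/N_v over the backlinks v of u. The three-page graph, the function name `simplified_pagerank`, and the fixed iteration count are assumptions for illustration. Because every page in this graph has at least one outgoing link, no rank is lost and the normalization factor c is simply 1.

```python
def simplified_pagerank(links, iterations=100):
    """Iterate R(u) = sum of R(v) / N_v over the pages v that link to u.

    `links` maps each page to the list of pages it links to. Every page is
    assumed to have at least one outgoing link (no dangling pages).
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            share = rank[v] / len(targets)  # R(v) / N_v
            for u in targets:
                new_rank[u] += share
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(links)
```

For this graph the ranks settle at 0.4 for A, 0.2 for B and 0.4 for C, which sum to one as a probability distribution should.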

Page 10:

Eigenvalue and Eigenvector

• Eigenvalues and eigenvectors are properties of a matrix
• In general, a matrix acts on a vector by changing both its magnitude and direction
• However, a matrix may act on certain vectors by changing only their magnitude and leaving their direction unchanged. Such a vector is called an eigenvector
• A matrix acts on an eigenvector by multiplying its magnitude by a factor called the eigenvalue
• Given a linear transformation A, a non-zero vector x is defined to be an eigenvector of the transformation if it satisfies the eigenvalue equation $A x = \lambda x$ for some scalar λ
• In this situation, the scalar λ is called an eigenvalue of A corresponding to the eigenvector x

Page 11:

Given a square matrix A, the eigenvalue equation can be expressed as

$$(A - \lambda I)\,x = 0$$

which has a non-zero solution x only if

$$\det(A - \lambda I) = 0$$

Example

$$A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$

For this A, the equation can be written as

$$\det \begin{pmatrix} 2 - \lambda & 1 \\ 1 & 2 - \lambda \end{pmatrix} = (2 - \lambda)^2 - 1 = 0$$

where λ is the eigenvalue. Solving this equation we get λ = 1 and λ = 3

Page 12:

Considering first the eigenvalue λ = 3, we have

$$\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = 3 \begin{pmatrix} x \\ y \end{pmatrix}$$

After matrix multiplication

$$\begin{pmatrix} 2x + y \\ x + 2y \end{pmatrix} = \begin{pmatrix} 3x \\ 3y \end{pmatrix}$$

This can be represented as 2 linear equations: 2x + y = 3x and x + 2y = 3y

The equations can be reduced to x = y. We can choose any non-zero value for x. Taking x = 1, we get y = 1

Eigenvector with eigenvalue 3: $(1, 1)^T$

Eigenvector with eigenvalue 1: $(1, -1)^T$ (the same steps with λ = 1 reduce to y = −x)
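The two eigenpairs can be checked numerically by confirming that A·v equals λ·v. In this sketch the matrix A = [[2, 1], [1, 2]] is inferred from the two linear equations 2x + y = 3x and x + 2y = 3y, and the eigenvector (1, −1) for λ = 1 is worked out by the same reduction rather than read from the slide. The helper name `mat_vec` is hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Matrix inferred from the equations 2x + y = 3x and x + 2y = 3y.
A = [[2, 1], [1, 2]]

# (eigenvalue, eigenvector) pairs to verify; (1, -1) is derived, not quoted.
pairs = [(3, [1, 1]), (1, [1, -1])]

# For each pair, check the eigenvalue equation A v = lambda v.
checks = [mat_vec(A, v) == [lam * x for x in v] for lam, v in pairs]
```

Both entries of `checks` come out true, so each vector is only scaled by A and not rotated.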

Page 13:

Computing PageRank given a Directed Graph

The Transition matrix A =

We get the eigenvalue λ = 1

Calculating the eigenvector

Page 14:

On substituting we get,

so the vector u is of the form

Choose v to be the unique eigenvector with the sum of all entries equal to 1

PageRank vector
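The graph and matrix from these two slides did not survive in the transcript, so the sketch below uses a hypothetical 4-page graph instead (it is an assumption, not the slide's example). It builds the transition matrix with exact fractions, checks that a candidate vector u is an eigenvector for eigenvalue 1, and then rescales u so that its entries sum to 1, which gives the PageRank vector.

```python
from fractions import Fraction

# Hypothetical graph (not the one on the slide): page -> pages it links to.
links = {1: [2, 3, 4], 2: [3, 4], 3: [1], 4: [1, 3]}
pages = sorted(links)

# Transition matrix: entry in row u, column v is 1/N_v if v links to u, else 0.
A = [[Fraction(1, len(links[v])) if u in links[v] else Fraction(0) for v in pages]
     for u in pages]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Candidate eigenvector for eigenvalue 1, found by solving A u = u by hand.
u = [Fraction(12), Fraction(4), Fraction(9), Fraction(6)]
is_eigenvector = mat_vec(A, u) == u

# Normalize so the entries sum to 1: this is the PageRank vector.
pagerank = [x / sum(u) for x in u]
```

For this graph the normalized vector is (12/31, 4/31, 9/31, 6/31), so page 1 ranks highest.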

Page 15:

Calculating the PageRank

Finding the Eigenvalue and Eigenvector

Let $A_{u,v} = 1/N_v$ if there is an edge from v to u, and $A_{u,v} = 0$ otherwise.

If R is a vector over the web pages, then R = cAR, where
• R: eigenvector of A
• 1/c: the corresponding eigenvalue

Problem: Rank Sink

• Consider two web pages that point to each other but to no other page
• Suppose there is some web page which points to one of them
• Then, during iteration, this loop will accumulate rank but will never distribute any rank
• This forms a trap called the rank sink. It can be overcome by introducing a rank source
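The rank sink can be reproduced with a few lines of simulation. The graph here is a made-up example: pages A and B point only to each other, page T points to A (and to S), and S points back to T. Iterating the simplified rank update without any rank source shows the loop {A, B} absorbing essentially all of the rank.

```python
def iterate_simplified(links, steps):
    """Run the simplified rank update (no rank source) for a fixed number of steps."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(steps):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for u in targets:
                new_rank[u] += rank[v] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph: A <-> B is the loop, T feeds it, S and T link to each other.
links = {"S": ["T"], "T": ["S", "A"], "A": ["B"], "B": ["A"]}
after = iterate_simplified(links, 60)

# Share of the total rank trapped in the A <-> B loop.
sink_share = after["A"] + after["B"]
```

After 60 steps `sink_share` is indistinguishable from 1, while S and T are left with almost nothing, which is the behaviour a rank source is meant to prevent.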

Page 16:

PageRank Definition

Let E(u) be some vector over the Web pages that corresponds to a source of rank. Then, the PageRank of a set of Web pages is an assignment, R’, to the Web pages which satisfies

$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$$

such that c is maximized and ||R’||1 = 1 (||R’||1 denotes the L1 norm of R’).

where
• R’(u): PageRank of document u
• R’(v): PageRank of document v that links to u
• N_v: number of outlinks from document v
• c: normalization factor
• E: vector over the web pages that the surfer randomly jumps to

Page 17:

Computing PageRank

S: any vector over the web pages

$R_0 \leftarrow S$
Loop:
  $R_{i+1} \leftarrow A R_i$  (calculate the $R_{i+1}$ vector using $R_i$)
  $d \leftarrow \|R_i\|_1 - \|R_{i+1}\|_1$  (calculate the normalizing factor)
  $R_{i+1} \leftarrow R_{i+1} + dE$  (find the vector $R_{i+1}$ using d)
  $\delta \leftarrow \|R_{i+1} - R_i\|_1$  (find the norm of the difference of the 2 vectors)
while $\delta > \epsilon$  (loop until convergence)
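A hedged Python sketch of this loop follows. Several details are assumptions rather than part of the slide: the damping constant `c = 0.85` applied to A (without it, and without dangling pages, no rank would be lost and d would stay at zero), the uniform choice of E and of the start vector S, the tolerance `eps`, and the four-page test graph. Ranks are non-negative, so the L1 norm is computed as a plain sum.

```python
def pagerank(links, E=None, c=0.85, eps=1e-10, max_iter=1000):
    """Iterate R <- c*A*R + d*E until the L1 change drops to eps or below.

    `links` maps each page to the pages it links to; E is the rank source
    (uniform if omitted). Returns the rank dict and the iterations used.
    """
    pages = list(links)
    n = len(pages)
    if E is None:
        E = {p: 1.0 / n for p in pages}
    R = {p: 1.0 / n for p in pages}                  # R_0 <- S
    for iteration in range(1, max_iter + 1):
        new_R = {p: 0.0 for p in pages}
        for v, targets in links.items():             # R_{i+1} <- c * A R_i
            for u in targets:
                new_R[u] += c * R[v] / len(targets)
        d = sum(R.values()) - sum(new_R.values())    # rank lost in this step
        for p in pages:                              # R_{i+1} <- R_{i+1} + d E
            new_R[p] += d * E[p]
        delta = sum(abs(new_R[p] - R[p]) for p in pages)
        R = new_R
        if delta <= eps:                             # stop once delta <= eps
            break
    return R, iteration


# Hypothetical graph containing a loop (A <-> B) fed from T.
links = {"S": ["T"], "T": ["S", "A"], "A": ["B"], "B": ["A"]}
ranks, used = pagerank(links)
```

With the rank source in place, S and T keep a non-zero rank even though the A and B loop still collects most of it, and the total rank stays at one.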

Page 18:

Random Surfer Model

The “random surfer” simply keeps clicking on successive links at random

A real web surfer is unlikely to continue in a loop forever

The surfer periodically “gets bored” and jumps to another random page
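The random surfer can be simulated directly: follow a random outgoing link most of the time, and occasionally jump to a page chosen uniformly at random. The boredom probability of 0.15, the fixed seed, the step count, and the graph are all assumptions made for this sketch. The fraction of time spent on each page then approximates its rank.

```python
import random


def random_surfer(links, steps=100_000, boredom=0.15, seed=0):
    """Estimate page importance as the fraction of steps spent on each page."""
    rng = random.Random(seed)
    pages = list(links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        # Jump to a random page when bored, or when the page has no links.
        if not links[current] or rng.random() < boredom:
            current = rng.choice(pages)
        else:
            current = rng.choice(links[current])
    return {p: count / steps for p, count in visits.items()}


# Hypothetical graph: the loop A <-> B would trap a surfer who never gets bored.
links = {"S": ["T"], "T": ["S", "A"], "A": ["B"], "B": ["A"]}
freq = random_surfer(links)
```

Because of the occasional jumps, the surfer still visits S and T, although most of the time is spent in the A and B loop.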

Page 19:

Dangling Links

Links that point to any page with no outgoing links

They do not affect the ranking of any other page directly

Problem: It is not clear where their weight should be distributed

Solution: They can be removed from the system until all the PageRanks are calculated
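One way to sketch the removal step: repeatedly drop pages with no outgoing links, together with the links pointing to them, until none remain. Removing one dangling page can leave another page without outgoing links, so several passes may be needed. The function name `remove_dangling` and the sample graph are hypothetical, and every link target is assumed to appear as a key in `links`.

```python
def remove_dangling(links):
    """Strip dangling pages pass by pass; return the pruned graph and pass count."""
    links = {page: list(targets) for page, targets in links.items()}
    passes = 0
    while True:
        dangling = {page for page, targets in links.items() if not targets}
        if not dangling:
            return links, passes
        passes += 1
        # Drop the dangling pages and every link that points to them.
        links = {page: [t for t in targets if t not in dangling]
                 for page, targets in links.items() if page not in dangling}


# Hypothetical graph: D has no outgoing links, and C links only to D.
links = {"A": ["B", "C"], "B": ["A"], "C": ["D"], "D": []}
pruned, passes = remove_dangling(links)
```

Here the first pass removes D and the second removes C, leaving only the A and B pair for the rank computation.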

Page 20:

PageRank Implementation

• Convert each URL into a unique integer ID
• Sort the link structure by ID
• Remove the dangling links
• Make an initial assignment of ranks
• Iteratively compute PageRank until convergence
• Add the dangling links back
• Recompute the rankings

NOTE: After adding the dangling links back, we need to iterate as many times as was required to remove the dangling links
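The first two steps (integer IDs and a link structure sorted by ID) might look like the sketch below. It is only an illustration: the function name `index_links`, the policy of assigning IDs in order of first appearance, and the placeholder URLs are assumptions, not details given in the slides.

```python
def index_links(url_links):
    """Assign each URL an integer ID and return the links as sorted ID pairs."""
    ids = {}

    def url_id(url):
        # IDs are handed out in order of first appearance (an assumed policy).
        return ids.setdefault(url, len(ids))

    edges = sorted((url_id(src), url_id(dst)) for src, dst in url_links)
    return ids, edges


# Placeholder (source URL, destination URL) pairs.
url_links = [
    ("http://b.example/", "http://a.example/"),
    ("http://a.example/", "http://c.example/"),
    ("http://b.example/", "http://c.example/"),
]
ids, edges = index_links(url_links)
```

Working with sorted integer pairs instead of URL strings keeps the link structure compact and lets later passes scan it sequentially.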

Page 21:

Convergence

• PR (322 Million Links): 52 iterations
• PR (161 Million Links): 45 iterations
• Scaling factor is roughly linear in log n

Page 22:

Convergence

The web is an expander-like graph

A graph is said to be an expander if:
• Every (not too large) subset of nodes S has a neighborhood that is larger than some factor α times |S|
• α is called the expansion factor

A graph has a good expansion factor if and only if the largest eigenvalue is sufficiently larger than the second-largest eigenvalue

Page 23:

Searching with PageRank

• Two search engines:
  – Title-based search engine
  – Full text search engine
• Title-based search engine
  – Searches only the “Titles”
  – Finds all the web pages whose titles contain all the query words
  – Sorts the results by PageRank
  – Very simple and cheap to implement
  – Title match ensures high precision, and PageRank ensures high quality
• Full text search engine
  – Called Google
  – Examines all the words in every stored document and also performs PageRank (Rank Merging)
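The title-based engine described above reduces to a filter followed by a sort. The sketch below assumes precomputed PageRank values and uses made-up titles and scores. The function name `title_search` and the whole-word, case-insensitive matching are illustrative choices rather than details from the slides.

```python
def title_search(query, pages):
    """Return titles containing every query word, highest PageRank first.

    `pages` is a list of (title, pagerank) pairs with precomputed ranks.
    """
    words = query.lower().split()
    matches = [(title, rank) for title, rank in pages
               if all(word in title.lower().split() for word in words)]
    matches.sort(key=lambda match: match[1], reverse=True)
    return [title for title, _ in matches]


# Made-up titles and PageRank values.
pages = [
    ("Example State University", 0.02),
    ("University of Example", 0.05),
    ("Example Bakery", 0.09),
]
results = title_search("university", pages)
```

The bakery page is excluded despite its higher rank because its title does not match, which is how the title match supplies precision while PageRank only orders the matching pages.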

Page 24:

Title-based search for University

Page 25:

Personalized PageRank

• An important component of the PageRank calculation is E
  – A vector over the web pages (used as a source of rank)
  – A powerful parameter to adjust the page ranks
• The E vector corresponds to the distribution of web pages that a random surfer periodically jumps to
• Having an E vector that is uniform over all the web pages results in some web pages with many related links receiving an overly high rank, e.g. a copyright page or forums (general search over the internet)
• Instead, in Personalized PageRank, E consists of a single web page
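To see the effect of E, the sketch below runs the same iteration twice on a symmetric four-page cycle: once with a uniform E and once with E concentrated on the single page A. The damping constant 0.85, the iteration count, the graph, and the function name `personalized_pagerank` are assumptions made for illustration.

```python
def personalized_pagerank(links, E, c=0.85, iterations=200):
    """Iterate R <- c*A*R + d*E, where d restores the rank lost in each step."""
    pages = list(links)
    R = dict(E)  # start from the rank source itself
    for _ in range(iterations):
        new_R = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for u in targets:
                new_R[u] += c * R[v] / len(targets)
        d = sum(R.values()) - sum(new_R.values())
        for p in pages:
            new_R[p] += d * E[p]
        R = new_R
    return R


# Hypothetical cycle A -> B -> C -> D -> A: no page is structurally favoured.
links = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}
uniform = personalized_pagerank(links, {p: 0.25 for p in links})
personal = personalized_pagerank(links, {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0})
```

With a uniform E every page ends at 0.25. With E placed on A alone, A ranks highest and the rank decays with distance from A along the cycle.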

Page 26:

Applications

Estimating Web Traffic
• On analyzing the statistics, it was found that there are some sites that have a very high usage, but low PageRank, e.g. links to pirated software

PageRank as Backlink Predictor
• The goal is to try to crawl the pages in as close to the optimal order as possible, i.e., in the order of their rank
• PageRank is a better predictor than citation counting

User Navigation: The PageRank Proxy
• The user receives some information about the link before they click on it
• This proxy can help users decide which links are more likely to be interesting

Page 27:

Conclusion

PageRank is a global ranking of all web pages based on their location in the Web’s graph structure

PageRank uses information which is external to the Web pages – backlinks

Backlinks from important pages are more significant than backlinks from average pages

The structure of the Web graph is very useful for information retrieval tasks.

Page 28:

References
• L. Page, S. Brin, R. Motwani, T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web, 1998
• S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine, 1998
• K. Bryan and T. Leise. The $25,000,000,000 Eigenvector: The Linear Algebra Behind Google
• Google Corporate Information: http://www.google.com/corporate/tech.html
• http://en.wikipedia.org/wiki/PageRank
• http://en.wikipedia.org/wiki/Eigenvalue,_eigenvector_and_eigenspace
• http://www.googleguide.com/google_works.html
• http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
• http://pr.efactory.de/
