Authoritative Sources in a Hyperlinked Environment, by Jon M. Kleinberg

Presented by: Vikrant Khosla and Sridhar Kameswara Nemani

Outline

• Search on the WWW: the problem in general
• Overview of the authoritative approach proposed by this paper
• Constructing a focused subgraph
• Computing hubs and authorities
• Similar page queries
• Multiple sets of hubs and authorities
• Diffusion and generalization
• Evaluation
• Conclusion

General Problem

How can the quality of search on the WWW be improved?

Quality of search requires human evaluation due to the subjectivity inherent in notions such as relevance.

The WWW is a hypertext corpus of enormous complexity and information.

This paper aims to create a link-based model that consistently identifies relevant, authoritative WWW pages for broad search topics.

Understand Query Types

There is more than one type of query and the handling of each may require different techniques.

Types of queries:

• Specific queries, e.g. “Does Netscape support the JDK 1.1 code-signing API?”
• Broad-topic queries, e.g. “Find information about the Java programming language.”
• Similar-page queries, e.g. “Find pages ‘similar’ to honda.com.”

Difficulty in Handling Queries

• Specific queries: the scarcity problem. Few pages contain the required information, and it is difficult to determine the identity of those pages.
• Broad-topic queries: the abundance problem. The number of pages that could reasonably be returned as relevant is far too large for a human user to digest.

The goal for broad-topic queries is to select, from a huge collection of relevant pages, a small set of the most “authoritative” or “definitive” ones.

Authoritative Pages

Given a particular page, how do we tell whether it is authoritative? The difficulty is related to the limitations of text-based analysis.

• A text-based ranking function can miss the natural authority. E.g. for the query “harvard”, www.harvard.edu is the proper authoritative page, but many other web pages may contain the term “harvard” more often.
• The most popular pages are often not sufficiently self-descriptive. The term “search engine” usually does not appear on the home pages of search engines such as Yahoo, AltaVista, and Excite, and the Honda and Toyota home pages hardly contain the term “automobile manufacturer”.

Analysis of Link Structure

Hyperlinks encode latent human judgment, which can be used to formulate a notion of authority.

The creation of a link represents a concrete indication of the following type of judgment: the creator of page p, by including a link to page q, has in some measure conferred authority on q.

This gives the user an opportunity to find potential authorities purely through the pages that point to them.

Potential pitfalls of this concept:

• Many links are created for reasons other than endorsement, such as navigation (e.g. main-menu links) or paid advertising.
• It is difficult to balance relevance against popularity (e.g. Yahoo, a universally popular page).

Authorities and Hubs

Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic.

Hubs are index pages that provide many useful links to relevant content pages (topic authorities).

In-degree: the number of links pointing to a page; it is one simple measure of authority.

Out-degree: the number of links from a page to other pages.

Can we operate over the entire WWW?

Local approaches: these deal with intranets, where the amount of data is much smaller than in the WWW as a whole.

Clustering approach: this dissects a heterogeneous population into subpopulations that are in some way more cohesive, but the underlying problem of filtering a vast number of pages remains the same.

Authoritative approach: this is global in nature.
• Perform a search on a text-based WWW search engine.
• Distill the broad topic from the returned pages via the discovery of authorities.

Overview of search steps

Search string → Text search engine → Authoritative approach (constructing a focused subgraph, then computing hubs and authorities) → Better-quality search result

Constructing Subgraph

The collection V of hyperlinked pages can be viewed as a directed graph G = (V, E): nodes correspond to pages, and a directed edge (p, q) ∈ E indicates the presence of a link from p to q.

Construct a focused subgraph S of the WWW with the following properties:

1. S is relatively small (so that computation is affordable).
2. S is rich in relevant pages (so that it is easier to find good authorities).
3. S contains most (or many) of the strongest authorities.

How to find S

Set Q: the set of all pages containing the query string.

Root set R: the t highest-ranked pages for the query, obtained from a text-based search engine. R satisfies properties 1 and 2.

Problems with R:
• R is a subset of Q, and Q itself is not guaranteed to satisfy property 3, so R may miss strong authorities.
• There are extremely few links between pages in R, rendering it essentially “structureless”.

However, a strong authority for the query is quite likely to be pointed to by at least one page in R.

Construct the base set S by extending the root set R to include:
• all pages linked to by pages in R;
• pages that link to a page in R, with at most d such pages added for any single page in R.

Subgraph algorithm
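The construction of the base set described above can be sketched roughly as follows. This is a minimal illustration, not the slide's original listing: `out_links` and `in_links` are hypothetical in-memory dictionaries standing in for link data that a real system would obtain from a search engine or crawl, and the function names are invented for this example.

```python
def build_base_set(root_set, out_links, in_links, d):
    """Expand the root set R into the base set S.

    out_links[p] lists the pages p links to; in_links[p] lists the pages
    linking to p. Both are hypothetical stand-ins for real link data.
    """
    base = set(root_set)
    for page in root_set:
        # Add every page that this root page points to.
        base.update(out_links.get(page, []))
        # Add pages pointing to this root page, but no more than d of them.
        base.update(list(in_links.get(page, []))[:d])
    return base


def induced_edges(base, out_links):
    """Return the links of the subgraph induced on the base set."""
    return {
        (p, q)
        for p in base
        for q in out_links.get(p, [])
        if q in base and q != p
    }


# Tiny made-up example: two root pages, three pages pointing at r1, d = 2.
root = ["r1", "r2"]
out_links = {"r1": ["a"], "r2": ["a", "b"],
             "h1": ["r1"], "h2": ["r1"], "h3": ["r1"]}
in_links = {"r1": ["h1", "h2", "h3"], "a": ["r1", "r2"], "b": ["r2"]}

base = build_base_set(root, out_links, in_links, d=2)
edges = induced_edges(base, out_links)
```

With d = 2, only two of the three pages pointing to r1 are admitted, which keeps a single very popular root page from inflating the base set.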

Observation & Heuristics

Heuristic 1: Delete all intrinsic links and keep all transverse links.

• Intrinsic link: a link between pages with the same domain name. These links generally serve navigation purposes, are less informative, and often contain repetitive information.
• Transverse link: a link between pages with different domain names.

Heuristic 2: Limit collusion by keeping only a small number of links (4 to 8) from any single domain to a given page p.

• Collusion occurs when a large number of pages from a single domain all point to a single page p.
• Such links are generally used for mass endorsement, advertisement, etc.
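The two heuristics above might be applied to a list of links along these lines. This is a sketch under stated assumptions: pages are identified by full URLs, "same domain name" is approximated by comparing URL hostnames, and the helper names `host`, `drop_intrinsic`, and `cap_per_domain` are invented for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host(url):
    """Hostname of a URL, used here as a stand-in for its domain name."""
    return (urlsplit(url).hostname or "").lower()


def drop_intrinsic(edges):
    """Heuristic 1: keep only transverse links (different hostnames)."""
    return [(p, q) for p, q in edges if host(p) != host(q)]


def cap_per_domain(edges, m):
    """Heuristic 2: keep at most m links from one domain to a given page."""
    counts = defaultdict(int)  # (source hostname, target page) -> links kept
    kept = []
    for p, q in edges:
        key = (host(p), q)
        if counts[key] < m:
            counts[key] += 1
            kept.append((p, q))
    return kept


# Made-up links: one intrinsic link and three links from a.example to one page.
links = [
    ("http://a.example/index", "http://a.example/about"),
    ("http://a.example/p1", "http://b.example/"),
    ("http://a.example/p2", "http://b.example/"),
    ("http://a.example/p3", "http://b.example/"),
    ("http://c.example/links", "http://b.example/"),
]
transverse = drop_intrinsic(links)
filtered = cap_per_domain(transverse, m=2)
```

The value m = 2 is used only to keep the example small; the slide suggests a value between 4 and 8.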

Computing Hubs & Authorities

The simplest approach would be to order pages by in-degree.

Problem: the nodes with the highest in-degree in the base set
• are not necessarily authorities and may lack any thematic unity;
• might simply be universally popular pages such as Yahoo, Google, etc.

Computing Hubs & Authorities

Observation:
• Some pages are good sources of content (authorities); others are good sources of links (hubs).
• True authority pages are pointed to by a number of good hubs.

Mutually reinforcing relationship:
• Hubs point to lots of authorities.
• Authorities are pointed to by lots of hubs.

An iterative algorithm is used to break this circularity.

Terms:
• Good hub: a page that points to many good authorities.
• Good authority: a page pointed to by many good hubs.

Overview of Algorithm

Iterative Algorithm

With each page p, we associate
• a non-negative authority weight x<p>;
• a non-negative hub weight y<p>.

The weights of each type are normalized so that their squares sum to 1.

Pages with larger x- and y-values are viewed as “better” authorities and hubs, respectively.

Iterative Algorithm

• If p points to many pages with large x-values, then it should receive a large y-value.
• If p is pointed to by many pages with large y-values, then it should receive a large x-value.

Inlinks operation I: x<p> ← Σ y<q>, summed over all pages q with (q, p) ∈ E.

Outlinks operation O: y<p> ← Σ x<q>, summed over all pages q with (p, q) ∈ E.

Algorithm
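The iterative procedure built from the I and O operations can be sketched as below. This is an illustrative reading of the slides, not the original listing: the function names, the all-ones starting weights, and the tiny example graph are assumptions made for this example.

```python
import math


def normalize(weights):
    """Scale the weights so that their squares sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: w / norm for p, w in weights.items()}


def iterate(nodes, edges, k=20):
    """Run k rounds of the I and O operations; return (authority, hub) weights."""
    x = {p: 1.0 for p in nodes}  # authority weights
    y = {p: 1.0 for p in nodes}  # hub weights
    for _ in range(k):
        # I operation: a page's authority weight is the sum of the hub
        # weights of the pages that point to it.
        new_x = {p: 0.0 for p in nodes}
        for p, q in edges:
            new_x[q] += y[p]
        # O operation: a page's hub weight is the sum of the (just updated)
        # authority weights of the pages it points to.
        new_y = {p: 0.0 for p in nodes}
        for p, q in edges:
            new_y[p] += new_x[q]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


# Made-up graph: three hub-like pages and two authority-like pages.
nodes = ["h1", "h2", "h3", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
x, y = iterate(nodes, edges)
top_authority = max(x, key=x.get)
```

In this example a1 is pointed to by all three hubs and ends up with the largest authority weight, while h1 and h2, which point to both authorities, receive larger hub weights than h3.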

Matrices Basics

Observations

Let G = (V, E), with V = {p1, p2, …, pn}, and let A denote the adjacency matrix of the graph G: the (i, j)th entry of A is 1 if (pi, pj) is an edge of G, and is 0 otherwise.

As one applies Iterate with arbitrarily large k, the vectors of authority and hub weights converge to fixed points x* and y*.

x* is the principal eigenvector of A^T A, and y* is the principal eigenvector of A A^T.

In practice the convergence of Iterate is quite rapid (k = 20 is sufficient).
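The link between the I and O operations and these eigenvectors can be written out in matrix form. The following is a sketch of the standard argument; z denotes an all-ones starting vector, which is an assumption about the initialization rather than something stated on the slide.

```latex
\begin{aligned}
\text{I operation:}\quad & x \leftarrow A^{T} y \\
\text{O operation:}\quad & y \leftarrow A x \\
\text{after } k \text{ rounds:}\quad & x_k \propto (A^{T}A)^{k-1} A^{T} z,
  \qquad y_k \propto (AA^{T})^{k} z
\end{aligned}
```

Repeated multiplication by the symmetric matrices A^T A and A A^T is a power iteration, so, provided the largest eigenvalue is strictly dominant and the starting vector is not orthogonal to the corresponding eigenvector, the normalized iterates approach the principal eigenvectors.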

Observations

• Any eigenvector algorithm can be used to compute the fixed points x* and y*.
• The Iterate formulation is nevertheless useful because it emphasizes the underlying motivation of the approach in terms of the mutually reinforcing I and O operations.
• It is not necessary to iterate I and O to convergence: one can start from initial vectors x0 and y0 and compute using a fixed bound on the number of I and O operations.

Example: Mini Web

Example: Mini Web (Cont..)

Basic Results

Observations

The method is a “pure” analysis of link structure: the text of pages is ignored when searching for authoritative pages, i.e., text-based search is used only to obtain the initial set.

The pages returned can legitimately be considered authoritative in the context of the WWW, without access to a large-scale index of the WWW, i.e., global analysis of the full WWW link structure can be replaced by a local method over a small focused subgraph.

This approach can replace the local approaches used in intranets.

Similar page queries

Example: Find pages ‘similar’ to honda.com.

Link structure can be used to infer a notion of “similarity” among pages.

Suppose we have found a page p that is of interest and that is an authoritative page on a topic. Can this help in finding similar pages?

What do users of the WWW consider to be related to p when they create pages and links?

Similar page queries

Previously our request to the search engine was: “Find t pages containing the query string.”

Now our request to the search engine is: “Find t pages pointing to p.” This yields a root set Rp, a base set Sp, and a focused subgraph Gp.

The strongest authorities in the local region of the link structure near p serve as a potential broad-topic summary of the pages related to p.

Results- Similar page queries

Multiple Sets of Hubs & Authorities

There may be several densely linked collections of hubs and authorities within the same base set.

Examples:
• “jaguar” has several different meanings.
• “randomized algorithms” arises in multiple technical communities.
• “abortion” involves groups that may not be linked to each other.

Clustering is needed in the presence of the abundance problem.

Multiple Sets of Hubs & Authorities

The non-principal eigenvectors provide a way to extract additional densely linked collections of hubs and authorities.

Non-principal eigenvectors have both positive and negative entries.

Often, the highly positive entries correspond to one cluster of pages and the highly negative entries to a different cluster.

Typically the two clusters will not be tightly intertwined.
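The sign pattern of a non-principal eigenvector can be seen on a toy graph. The sketch below is only an illustration under several assumptions: it builds the authority matrix A^T A directly, finds the principal eigenvector by power iteration, and then uses simple deflation (a swapped-in textbook technique, not something the slides prescribe) to obtain the next eigenvector. The graph, node names, and function names are all made up.

```python
import math


def authority_matrix(nodes, edges):
    """M = A^T A: M[i][j] counts the pages that link to both i and j."""
    index = {p: i for i, p in enumerate(nodes)}
    targets = {p: [] for p in nodes}
    for p, q in edges:
        targets[p].append(index[q])
    n = len(nodes)
    m = [[0.0] * n for _ in range(n)]
    for linked in targets.values():
        for i in linked:
            for j in linked:
                m[i][j] += 1.0
    return m


def multiply(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def power_iteration(m, steps=200):
    """Approximate the dominant eigenvalue and eigenvector of a symmetric matrix."""
    n = len(m)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(steps):
        w = multiply(m, v)
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    eigenvalue = sum(a * b for a, b in zip(v, multiply(m, v)))
    return eigenvalue, v


def deflate(m, eigenvalue, v):
    """Remove an already found eigenpair so the next one becomes dominant."""
    n = len(m)
    return [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
            for i in range(n)]


# Two communities (h* -> a*, g* -> b*) joined by one bridging page k.
nodes = ["h1", "h2", "h3", "g1", "g2", "k", "a1", "a2", "b1", "b2"]
edges = [(h, a) for h in ("h1", "h2", "h3") for a in ("a1", "a2")]
edges += [(g, b) for g in ("g1", "g2") for b in ("b1", "b2")]
edges += [("k", t) for t in ("a1", "a2", "b1", "b2")]

M = authority_matrix(nodes, edges)
lam1, v1 = power_iteration(M)
lam2, v2 = power_iteration(deflate(M, lam1, v1))
```

In this toy graph the principal eigenvector gives positive weight to all four authority pages, while the second eigenvector gives the a-pages and the b-pages entries of opposite sign, separating the two communities. Which community ends up on the positive side depends on the arbitrary sign of the computed eigenvector.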

Jaguar Example

The principal authority eigenvector is primarily about the Atari product.

At the positive end of the 2nd non-principal eigenvector, the pages are primarily about the Jacksonville Jaguars.

At the positive end of the 3rd non-principal eigenvector, the pages are primarily about the car.

Randomized Algorithms Example

The positive end of the first non-principal eigenvector returns home pages of theoretical computer scientists.

The negative end of the first non-principal eigenvector returns compendia of mathematical software.

At the negative end of the fourth non-principal eigenvector, the pages are primarily about wavelets.

Diffusion and Generalization

The query may not be sufficiently “broad.” In this case there will not be enough highly relevant pages in the base set to extract a sufficiently dense subgraph of relevant hubs and authorities.

When this occurs, the collection will often represent a broader topic, and the results will reflect a diffused version of the initial query.

Example: “WWW conferences” → WWW resource pages.

Evaluation

In studies conducted in 1998 with 26 queries and 37 volunteers, Clever was reported to return better authorities than Yahoo!, which in turn did better than AltaVista.

Conclusion

• The approach addresses the need to distill a broad topic for which there may be millions of relevant pages.
• It provides high-quality results in the context of what is available on the WWW globally.
• It operates without maintaining an index of the WWW or its link structure.
• It identifies complex patterns of social organization on the WWW.

Thank You
