Presented By:- Vikrant Khosla Sridhar Kameswara Nemani Authoritative Sources in a Hyperlinked environment Jon M. Kleinberg

Dec 16, 2015

Malcolm Parrish

Authoritative Sources in a Hyperlinked Environment

Jon M. Kleinberg

Presented by: Vikrant Khosla, Sridhar Kameswara Nemani

Outline
• Search on the WWW: the problem in general
• Overview of the authoritative approach proposed by this paper
• Constructing a focused subgraph
• Computing hubs and authorities
• Similar-page queries
• Multiple sets of hubs and authorities
• Diffusion and generalization
• Evaluation
• Conclusion

General Problem
• How can we improve the quality of search on the WWW?
• Judging the quality of search requires human evaluation, because notions such as relevance are inherently subjective.
• The WWW is a hypertext corpus of enormous complexity, holding a vast amount of information.
• This paper aims to create a link-based model that consistently identifies relevant, authoritative WWW pages for broad search topics.

Understand Query Types

There is more than one type of query and the handling of each may require different techniques.

Types of queries:
• Specific queries, e.g. “Does Netscape support the JDK 1.1 code-signing API?”
• Broad-topic queries, e.g. “Find information about the Java programming language.”
• Similar-page queries, e.g. find pages ‘similar’ to honda.com

Difficulty in Handling Queries
• Specific queries: the scarcity problem. Very few pages contain the required information, and it is difficult to determine the identity of those pages.
• Broad-topic queries: the abundance problem. The number of pages that could reasonably be returned as relevant is far too large for a human user to digest.
• Goal: from the huge collection of relevant pages, select a small set of the most “authoritative” or “definitive” ones.

Authoritative Pages
• Given a particular page, how do we tell whether it is authoritative?
• The problem is related to the limitations of text-based analysis.
• Text-based ranking functions: e.g. for the query “harvard”, www.harvard.edu is the natural authoritative page, but many other web pages may contain the term “harvard” more often.
• The most popular pages are often not sufficiently self-descriptive:
  • The term “search engine” usually does not appear on the home pages of search engines such as Yahoo, AltaVista and Excite.
  • The Honda and Toyota home pages hardly contain the term “automobile manufacturer”.

Analysis of Link Structure
• Hyperlinks encode latent human judgment, which can be used to formulate a notion of authority.
• The creation of a link is a concrete indication of the following type of judgment: the creator of page p, by including a link to page q, has in some measure conferred authority on q.
• This gives the user an opportunity to find potential authorities purely through the pages that point to them.
• Potential pitfalls of this concept:
  • Many links are created for reasons unrelated to authority, e.g. navigation (main menus) or paid advertisements.
  • It is difficult to strike the right balance between relevance and popularity (e.g. Yahoo).

Authorities and Hubs
• Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic.
• Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities).
• In-degree: the number of links pointing to a page. It is one simple measure of authority.
• Out-degree: the number of links from a page to other pages.
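
As a rough illustration of these two counts, the sketch below computes in-degree and out-degree from a list of links. The function name `degrees` and the page names are hypothetical and chosen only for this example.

```python
from collections import Counter

def degrees(edges):
    """Return (in_degree, out_degree) counters for a list of (source, target) links."""
    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(source for source, _ in edges)
    return in_degree, out_degree

# Hypothetical links: two hub-like pages pointing at content pages.
edges = [("hub1", "a"), ("hub1", "b"), ("hub2", "a"), ("b", "a")]
in_degree, out_degree = degrees(edges)

# Pages ordered by in-degree, the "simple measure of authority" mentioned above.
print(in_degree.most_common())
```

A page that never appears as a target simply has in-degree 0 in the counter.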

Can We Operate over the Entire WWW?
• Local approaches: deal with an intranet, where the amount of data is much smaller than on the WWW as a whole.
• Clustering approaches: dissect a heterogeneous population into subpopulations that are in some way more cohesive, but the underlying problem of filtering a vast number of pages remains the same.
• Authoritative approach: global in nature.
  • Perform a search on a text-based WWW search engine.
  • Distill the broad topic from the returned pages via the discovery of authorities.

Overview of search steps

Overview
1. Search string
2. Text search engine
3. Authoritative approach
  • Constructing focused subgraph
  • Computing hubs & authorities
4. Better quality search result

Constructing the Subgraph
• The collection V of hyperlinked pages can be viewed as a directed graph G = (V, E): nodes correspond to pages, and a directed edge (p, q) ∈ E indicates the presence of a link from p to q.
• Construct a focused subgraph S of the WWW with the following properties:
  1. S is relatively small (so that computation is affordable).
  2. S is rich in relevant pages (so that it is easier to find good authorities).
  3. S contains most (or many) of the strongest authorities.

How to Find S
• Set Q: the set of all pages containing the query string.
• Root set R: the t highest-ranked pages for the query, obtained from a text-based search engine. It satisfies properties 1 and 2.
• Problems with R:
  • R is a subset of the collection Q, and Q does not satisfy property 3.
  • There are extremely few links between pages in R, rendering it essentially “structureless”.
• However, a strong authority for the query is quite likely to be pointed to by at least one page in R.
• Construct the base set S by extending the root set R to include:
  • all pages linked to by pages in R;
  • pages that link to a page in R, at most d of them for each page in R.

Subgraph algorithm
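
The expansion of the root set R into the base set S can be sketched roughly as follows. This is a minimal sketch, not the slide's original figure: the dictionaries `out_links` and `in_links` stand in for whatever link lookups a search engine provides, and taking the first d in-linking pages in sorted order is an arbitrary choice made here only to keep the result reproducible.

```python
def build_base_set(root, out_links, in_links, d):
    """Expand root set R into base set S.

    out_links[p]: pages that p links to (hypothetical lookup table)
    in_links[p]:  pages that link to p  (hypothetical lookup table)
    d:            cap on in-linking pages added per root page
    """
    base = set(root)
    for page in root:
        # Add every page that the root page points to.
        base.update(out_links.get(page, ()))
        # Add at most d pages that point to the root page.
        pointing = sorted(in_links.get(page, ()))
        base.update(pointing[:d])
    return base

# Hypothetical example data.
root = {"r1", "r2"}
out_links = {"r1": ["a", "b"], "r2": ["b"]}
in_links = {"r1": ["h1", "h2", "h3"], "r2": ["h1"]}
base = build_base_set(root, out_links, in_links, d=2)
```

The cap d keeps a single very popular root page from pulling an unbounded number of pages into S.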

Observation & Heuristics

Heuristic 1: Delete all intrinsic links and keep all transverse links.
• Intrinsic links: links between pages with the same domain name. They generally serve navigation purposes, so they are less informative and often carry repetitive information.
• Transverse links: links between pages with different domain names.

Heuristic 2: Limit collusion by allowing only 4 to 8 pages from a single domain to point to any given page.
• Collusion: a large number of pages from a single domain all point to a single page p.
• Generally used for mass endorsement, advertisement, etc.
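
One possible way to apply both heuristics to a list of links is sketched below. It is a simplification under stated assumptions: pages are identified by URL, the host name returned by `urllib.parse.urlsplit` is used as the "domain name", and the function name `filter_links` and the default cap `m=4` are illustrative choices.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def domain(url):
    """Host name of a URL, used here as a stand-in for its domain name."""
    return urlsplit(url).hostname

def filter_links(edges, m=4):
    """Drop intrinsic links and cap links per (source domain, target page) at m."""
    kept = []
    per_domain = defaultdict(int)
    for src, dst in edges:
        # Heuristic 1: skip intrinsic links (same domain on both ends).
        if domain(src) == domain(dst):
            continue
        # Heuristic 2: allow at most m links from one domain to one page.
        key = (domain(src), dst)
        if per_domain[key] >= m:
            continue
        per_domain[key] += 1
        kept.append((src, dst))
    return kept

# Hypothetical links.
edges = [
    ("http://a.com/1", "http://a.com/2"),  # intrinsic, dropped
    ("http://a.com/1", "http://b.org/"),
    ("http://a.com/2", "http://b.org/"),
    ("http://a.com/3", "http://b.org/"),   # exceeds the cap when m=2
    ("http://c.net/", "http://b.org/"),
]
kept = filter_links(edges, m=2)
```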

Overview
1. Search string
2. Text search engine
3. Authoritative approach
  • Constructing focused subgraph
  • Computing hubs & authorities
4. Better quality search result

Computing Hubs & Authorities

• The simplest approach would be to order pages by in-degree.
• Problem: the nodes with the highest in-degree in the base set
  • might not necessarily be authorities, and may lack any thematic unity;
  • might simply be universally popular pages such as Yahoo, Google, etc.

Computing Hubs & Authorities

Observation:
• Authorities are good sources of content; hubs are good sources of links.
• True authority pages are pointed to by a number of good hubs.
• Mutually reinforcing relationship: hubs point to lots of authorities, and authorities are pointed to by lots of hubs.
• We will use an iterative algorithm to break this circularity.

Terms:
• Good hub: a page that points to many good authorities.
• Good authority: a page pointed to by many good hubs.

Overview of Algorithm

Iterative Algorithm
• With each page p, we associate
  • a non-negative authority weight x<p>
  • a non-negative hub weight y<p>
• Weights of each type are normalized so that their squares sum to 1.
• Pages with larger x and y values are “better” authorities and hubs, respectively.

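Iterative Algorithm
• If p points to many pages with large x-values, then it should receive a large y-value.
• If p is pointed to by many pages with large y-values, then it should receive a large x-value.
• In-links operation I: set x<p> to the sum of y<q> over all pages q that point to p.
• Out-links operation O: set y<p> to the sum of x<q> over all pages q that p points to.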

Algorithm
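
The iterative procedure built from the I and O operations can be sketched as below. This is a minimal sketch rather than the slide's original listing: the names `iterate` and `normalize` are chosen for this example, all weights start at 1, and the default of 20 rounds is just a small fixed bound.

```python
import math

def normalize(weights):
    """Scale weights so that their squares sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: w / norm for p, w in weights.items()}

def iterate(pages, edges, k=20):
    """Run k rounds of the I and O operations; return (authority, hub) weights."""
    x = {p: 1.0 for p in pages}  # authority weights
    y = {p: 1.0 for p in pages}  # hub weights
    for _ in range(k):
        # I operation: x<p> = sum of y<q> over all q that point to p.
        new_x = {p: 0.0 for p in pages}
        for q, p in edges:
            new_x[p] += y[q]
        # O operation: y<p> = sum of the new x<q> over all q that p points to.
        new_y = {p: 0.0 for p in pages}
        for p, q in edges:
            new_y[p] += new_x[q]
        x, y = normalize(new_x), normalize(new_y)
    return x, y

# Hypothetical mini graph: h1 and h2 act as hubs, a and b as authorities.
pages = ["h1", "h2", "a", "b"]
edges = [("h1", "a"), ("h1", "b"), ("h2", "a")]
x, y = iterate(pages, edges)
best_authority = max(x, key=x.get)
best_hub = max(y, key=y.get)
```

In this toy graph the page pointed to by both hubs ends up with the largest authority weight, and the hub pointing to both authorities ends up with the largest hub weight.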

Matrices Basics

Observations
• As one applies Iterate with arbitrarily large k, the weight vectors x and y converge to fixed points x* and y*.
• Let G = (V, E), with V = {p1, p2, …, pn}, and let A denote the adjacency matrix of the graph G: the (i, j)th entry of A is 1 if (pi, pj) is an edge of G, and is 0 otherwise.
• x* is the principal eigenvector of A^T A, and y* is the principal eigenvector of A A^T.
• The convergence of Iterate is quite rapid (k = 20 is sufficient).
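
The link between the update rules and these eigenvectors can be written out in matrix form. The short derivation below is a sketch that assumes x and y are column vectors indexed in the same order as the adjacency matrix A, and it ignores the normalization constants.

```latex
% Assumption: x and y are column vectors of authority and hub weights,
% indexed consistently with the adjacency matrix A.
\begin{align*}
\text{I operation:}\quad & x_j \leftarrow \sum_{i:(p_i,p_j)\in E} y_i = \sum_i A_{ij}\, y_i
  && \Longrightarrow\ x \leftarrow A^{T} y \\
\text{O operation:}\quad & y_i \leftarrow \sum_{j:(p_i,p_j)\in E} x_j = \sum_j A_{ij}\, x_j
  && \Longrightarrow\ y \leftarrow A x \\
\text{one round of Iterate:}\quad & x_k \propto A^{T} A\, x_{k-1}, \qquad y_k \propto A A^{T} y_{k-1}
\end{align*}
```

Repeated multiplication by a fixed matrix followed by normalization is power iteration, which is why the weights settle on the principal eigenvectors of A^T A and A A^T.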

Observations
• Any eigenvector algorithm can be used to compute the fixed points x* and y*.
• The iterative formulation is kept because it emphasizes the underlying motivation of the approach: the mutually reinforcing I and O operations.
• It is not necessary to iterate I and O to convergence: one can start from initial vectors x0 and y0 and compute the weights using a fixed bound on the number of I and O operations.

Example: Mini Web

Example: Mini Web (Cont..)

Basic Results

Observations
• This is just “pure” analysis of link structure.
  • We ignored the text when searching for authoritative pages, i.e. the text-based search only supplies the initial set.
• The pages found can legitimately be considered authoritative in the context of the WWW, without access to a large-scale index of the WWW.
  • i.e. a global analysis of the full WWW link structure can be replaced by a local method over a small focused subgraph.
• This approach can replace the local approaches used in intranets.

Similar Page Queries
• Example: find pages ‘similar’ to honda.com.
• Use link structure to infer a notion of “similarity” among pages.
• Suppose we have found a page p that is of interest and is an authoritative page on a topic. Can this help in finding similar pages?
• What do users of the WWW consider to be related to p when they create pages and links?

Similar Page Queries
• Previously, our request to the search engine was: “Find t pages containing the query string.”
• Now our request to the search engine is: “Find t pages pointing to p.”
  • Rp: root set
  • Sp: base set
  • Gp: focused subgraph
• The strongest authorities in the local region of the link structure near p are a potential broad-topic summary of the pages related to p.
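
Only the first step changes for a similar-page query, which the sketch below tries to make concrete. It assumes a hypothetical lookup table `in_links` in place of a real search engine request, and it picks the first t pages in sorted order purely to keep the example reproducible; the rest of the procedure (base set, hubs and authorities) would run unchanged on the resulting root set.

```python
def similar_page_root_set(p, in_links, t):
    """Root set Rp: up to t pages that point to page p.

    in_links[p] lists pages linking to p (hypothetical lookup table).
    """
    pointing = sorted(in_links.get(p, ()))
    return set(pointing[:t])

# Hypothetical example data.
in_links = {"honda.com": ["carsite", "dealers", "reviews"]}
root_p = similar_page_root_set("honda.com", in_links, t=2)
```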

Results: Similar Page Queries

Multiple Sets of Hubs & Authorities
• There may be several densely linked collections of hubs and authorities within the same set.
• Examples:
  • “jaguar”: has several different meanings.
  • “randomized algorithms”: arises in multiple technical communities.
  • “abortion”: involves groups that may not be linked to each other.
• Clustering in the presence of the abundance problem is needed.

Multiple Sets of Hubs & Authorities
• The non-principal eigenvectors provide a way to extract additional densely linked collections of hubs and authorities.
• Non-principal eigenvectors have both positive and negative entries.
• Often, the highly positive entries correspond to one cluster of pages and the highly negative entries to a different cluster.
• Typically the two clusters will not be tightly intertwined.
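
Given the entries of one non-principal eigenvector, the two clusters can be read off from its positive and negative ends, roughly as sketched below. The eigenvector itself is assumed to come from any eigenvector routine; the function name `split_by_sign`, the page names and the numbers are made up for illustration.

```python
def split_by_sign(pages, vec, c):
    """Return the c most positive and c most negative pages of an eigenvector.

    pages: page identifiers
    vec:   eigenvector entries, one per page (same order as pages)
    """
    ascending = sorted(zip(vec, pages))
    descending = sorted(zip(vec, pages), reverse=True)
    positive_end = [p for v, p in descending[:c] if v > 0]
    negative_end = [p for v, p in ascending[:c] if v < 0]
    return positive_end, negative_end

# Hypothetical eigenvector entries for five pages.
pages = ["a", "b", "c", "d", "e"]
vec = [0.6, 0.5, -0.1, -0.4, -0.45]
positive_end, negative_end = split_by_sign(pages, vec, c=2)
```

Each end is then treated as its own candidate community of hubs or authorities.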

Jaguar Example
• The principal authority eigenvector is primarily about the Atari product.
• At the positive end of the 2nd non-principal eigenvector, the pages are primarily about the Jacksonville Jaguars.
• At the positive end of the 3rd non-principal eigenvector, the pages are primarily about the car.

Randomized Algorithms Example
• The positive end of the first non-principal eigenvector returns home pages of theoretical computer scientists.
• The negative end of the first non-principal eigenvector returns compendia of mathematical software.
• At the negative end of the fourth non-principal eigenvector, the pages are primarily about wavelets.

Diffusion and Generalization
• The query may not be sufficiently “broad.”
• In this case there will not be enough highly relevant pages in the base set to extract a sufficiently dense subgraph of relevant hubs and authorities.
• When this occurs, the collection will often represent a broader topic, and the results will reflect a diffused version of the initial query.
• Example: “WWW conferences” -> WWW resource pages.

In studies conducted in 1998 over 26 queries and 37 volunteers, Clever reported better authorities than Yahoo!, which in turn was better than Alta Vista.

Conclusion
• We need a way to distill a broad topic for which there may be millions of relevant pages.
• The approach provides high-quality results in the context of what is available on the WWW globally.
• It operates without maintaining an index of the WWW or its link structure.
• It identifies complex patterns of social organization on the WWW.


Thank You