Authoritative Sources in a Hyperlinked Environment
Jon M. Kleinberg
Presented by: Vikrant Khosla, Sridhar Kameswara Nemani
Dec 16, 2015
Outline
Search on the WWW: the problem in general
Overview of the authoritative approach proposed by this paper
Constructing a focused subgraph
Computing hubs and authorities
Similar page queries
Multiple sets of hubs and authorities
Diffusion and generalization
Evaluation
Conclusion
General Problem
How can the quality of search on the WWW be improved?
Judging the quality of search requires human evaluation because of the subjectivity inherent in notions such as relevance.
The WWW is a hypertext corpus of enormous complexity and information.
This paper aims to create a link-based model that consistently identifies relevant, authoritative WWW pages for broad search topics.
Understand Query Types
There is more than one type of query and the handling of each may require different techniques.
Types of queries:
Specific queries
E.g. “Does Netscape support the JDK 1.1 code-signing API?”
Broad-topic queries
E.g. “Find information about the Java programming language.”
Similar page queries
E.g. “Find pages ‘similar’ to honda.com.”
Difficulty in Handling Queries
Specific queries:
Scarcity problem: very few pages contain the required information, and it is difficult to determine the identity of those pages.
Broad-topic queries:
Abundance problem: the number of pages that could reasonably be returned as relevant is far too large for a human user to digest.
Goal: select a small set of the most “authoritative” or “definitive” pages from the huge collection of relevant pages.
Authoritative Pages
Given a particular page, how do we tell whether it is authoritative?
The problem is related to the limitations of text-based analysis.
Text-based ranking function
E.g. for the query “harvard”, www.harvard.edu is the natural authoritative page, but many other web pages may contain “harvard” more often.
The most popular pages are often not sufficiently self-descriptive
The term “search engine” usually doesn’t appear on the home pages of search engines such as Yahoo, AltaVista, and Excite.
The Honda and Toyota home pages hardly contain the term “automobile manufacturer”.
Analysis of Link Structure
Hyperlinks encode a latent human judgment, which can be used to formulate a notion of authority.
The creation of a link represents a concrete indication of the following type of judgment:
The creator of page p, by including a link to page q, has in some measure conferred authority on q.
This gives the user an opportunity to find potential authorities purely through the pages that point to them.
Potential pitfalls of the above concept
Many links are created for reasons other than conferring authority, e.g. navigation (main menus) or paid advertisements.
It is difficult to strike a balance between relevance and popularity (e.g. Yahoo).
Authorities and Hubs
Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic.
Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities).
In-degree: the number of links pointing to a page; it is one simple measure of authority.
Out-degree: the number of links from a page to other pages.
Can we operate over the entire WWW?
Local approaches: deal with intranets, where the amount of data is much smaller than on the WWW as a whole.
Clustering approach: dissects a heterogeneous population into subpopulations that are in some way more cohesive, but the underlying problem of filtering a vast number of pages remains the same.
Authoritative approach: global in nature
Perform the search with a text-based WWW search engine
Distill the broad topic from the returned pages via the discovery of authorities.
Overview of search steps
Search string -> Text search engine -> Authoritative approach (constructing focused subgraph, then computing hubs & authorities) -> Better-quality search result
Constructing Subgraph
The collection V of hyperlinked pages can be viewed as a directed graph G = (V, E): nodes correspond to pages, and a directed edge (p, q) ∈ E indicates the presence of a link from p to q.
Construct a focused subgraph S of the WWW with the following properties:
1. S is relatively small (so that computation is affordable)
2. S is rich in relevant pages (so that it is easier to find good authorities)
3. S contains most (or many) of the strongest authorities
How to find S
Set Q: the set of all pages containing the query string.
Root set R: the t highest-ranked pages for the query, obtained from a text-based search engine. R satisfies properties 1 and 2.
Problems with R:
R is a subset of Q, and Q does not satisfy property 3.
There are extremely few links between pages in R, rendering it essentially “structureless”.
However, a strong authority for the query is quite likely to be pointed to by at least one page in R.
Construct the base set S by extending the root set R to include:
All pages linked to by pages in R
All pages that link to a page in R, taking at most d such pages for each page in R
Subgraph algorithm
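The base-set construction described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the paper's code: `search`, `out_links`, and `in_links` are hypothetical callables standing in for a text-based search engine and a link index, and taking the first d in-linking pages is just one arbitrary way of choosing d of them.

```python
def build_base_set(query, search, out_links, in_links, t, d):
    """Grow the root set R (top t results) into the base set S.

    search(query)   -> ranked list of page ids     (hypothetical helper)
    out_links(page) -> pages that `page` links to  (hypothetical helper)
    in_links(page)  -> pages that link to `page`   (hypothetical helper)
    """
    root = list(search(query))[:t]      # root set R: the t highest-ranked pages
    base = set(root)                    # S starts out equal to R
    for page in root:
        base.update(out_links(page))    # every page that `page` points to
        pointing = list(in_links(page))
        base.update(pointing[:d])       # at most d pages that point to `page`
    return base
```

The cap d keeps a single very popular page in R from pulling an enormous number of in-linking pages into S.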
Observations & Heuristics
Heuristic 1: Delete all intrinsic links and keep all transverse links
Intrinsic links:
• A link is intrinsic if it is between pages with the same domain name.
• Generally these links exist for navigation purposes.
• They are less informative and often carry repetitive information.
Transverse links: a link is transverse if it is between pages with different domain names.
Heuristic 2: Limit collusion by allowing only a small number of pages (4 to 8) from a single domain to point to any given page
Collusion: a large number of pages from a single domain all point to a single page p.
This is generally used for mass endorsement, advertisement, etc.
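The two heuristics can be sketched as a single link filter. This is an illustrative sketch only: comparing full host names via `urllib.parse` is an assumption about what "same domain name" means, and `filter_links` and `max_per_domain` are hypothetical names, not taken from the paper.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain(url):
    """Host name of a URL, lower-cased (assumed stand-in for 'domain name')."""
    return (urlparse(url).hostname or "").lower()


def filter_links(edges, max_per_domain):
    """Drop intrinsic links and cap the links one domain sends to one page.

    edges is an iterable of (source_url, target_url) pairs.
    """
    kept = []
    seen = defaultdict(int)  # (source domain, target page) -> links kept so far
    for src, dst in edges:
        if domain(src) == domain(dst):
            continue                      # Heuristic 1: intrinsic link, discard
        key = (domain(src), dst)
        if seen[key] >= max_per_domain:
            continue                      # Heuristic 2: this domain already used its quota
        seen[key] += 1
        kept.append((src, dst))
    return kept
```

With `max_per_domain` set somewhere in the 4 to 8 range mentioned above, a site can no longer raise the in-degree of one page just by linking to it from many of its own pages.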
Computing Hubs & Authorities
The simplest approach would be to order pages by in-degree.
Problem: the nodes with the highest in-degree in the base set
might not be authorities on the topic and may lack any thematic unity;
might simply be universally popular pages such as Yahoo or Google.
Computing Hubs & Authorities
Observation:
Authorities are good sources of content; hubs are good sources of links.
True authority pages are pointed to by a number of good hubs.
Mutually reinforcing relationship:
Hubs point to lots of authorities.
Authorities are pointed to by lots of hubs.
An iterative algorithm is used to break this circularity.
Terms:
Good hub: a page that points to many good authorities.
Good authority: a page pointed to by many good hubs.
Overview of Algorithm
Iterative Algorithm
With each page p, we associate
a non-negative authority weight x<p>
a non-negative hub weight y<p>
Weights of each type are normalized so that their squares sum to 1.
Pages with larger x- and y-values are viewed as “better” authorities and hubs, respectively.
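Iterative Algorithm
If p points to many pages with large x-values, then it should receive a large y-value.
If p is pointed to by many pages with large y-values, then it should receive a large x-value.
Inlinks operation I (updates the x-weights): x<p> ← Σ y<q>, summed over all q with (q, p) ∈ E
Outlinks operation O (updates the y-weights): y<p> ← Σ x<q>, summed over all q with (p, q) ∈ E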
Algorithm
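The iterative procedure built from the I and O operations can be sketched in plain Python as follows. This is a minimal sketch, assuming the graph is given as a set of (source, target) pairs; the function names `iterate`, `normalize`, and `top` are illustrative, not the paper's.

```python
from math import sqrt


def normalize(weights):
    """Scale the weights so that their squares sum to 1."""
    norm = sqrt(sum(v * v for v in weights.values()))
    return {p: v / norm for p, v in weights.items()} if norm else dict(weights)


def iterate(edges, k):
    """Run k rounds of the I and O operations on a set of (p, q) links."""
    pages = sorted({page for edge in edges for page in edge})
    x = {p: 1.0 for p in pages}   # authority weights
    y = {p: 1.0 for p in pages}   # hub weights
    for _ in range(k):
        # I operation: x<p> becomes the sum of y<q> over all q pointing to p.
        new_x = {p: 0.0 for p in pages}
        for q, p in edges:
            new_x[p] += y[q]
        # O operation: y<p> becomes the sum of the updated x<q> over all q that p points to.
        new_y = {p: 0.0 for p in pages}
        for p, q in edges:
            new_y[p] += new_x[q]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


def top(weights, c):
    """The c pages with the largest weights (ties broken by name)."""
    return sorted(weights, key=lambda p: (-weights[p], p))[:c]
```

Calling `top(x, c)` and `top(y, c)` on the result reports the c strongest authorities and hubs.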
Matrices Basics
Observations
As one applies Iterate with arbitrarily large values of k, the vectors x and y converge to fixed points x* and y*.
Let G = (V, E), with V = {p1, p2, …, pn}, and let A denote the adjacency matrix of the graph G: the (i, j)th entry of A is 1 if (pi, pj) is an edge of G, and is 0 otherwise.
x* is the principal eigenvector of A^T A, and y* is the principal eigenvector of A A^T.
The convergence of Iterate is quite rapid (k = 20 is sufficient).
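The eigenvector claim can be checked numerically on a small case. The sketch below assumes NumPy is available and uses a made-up four-page graph; it compares the result of 20 rounds of I and O operations with the principal eigenvectors returned by `numpy.linalg.eigh`.

```python
import numpy as np


def iterate(A, k):
    """Apply the I and O operations k times, normalizing after each round."""
    n = A.shape[0]
    x = np.ones(n)   # authority weights
    y = np.ones(n)   # hub weights
    for _ in range(k):
        x = A.T @ y                  # I operation: sum hub weights over in-links
        y = A @ x                    # O operation: sum authority weights over out-links
        x = x / np.linalg.norm(x)
        y = y / np.linalg.norm(y)
    return x, y


def principal_eigenvector(M):
    """Unit eigenvector of the symmetric matrix M for its largest eigenvalue."""
    _, vectors = np.linalg.eigh(M)       # eigenvalues come back in ascending order
    v = vectors[:, -1]
    return v if v.sum() >= 0 else -v     # fix the arbitrary sign


# Toy graph (an assumption for illustration): p0 -> p2, p0 -> p3, p1 -> p2
A = np.zeros((4, 4))
for i, j in [(0, 2), (0, 3), (1, 2)]:
    A[i, j] = 1.0

x, y = iterate(A, 20)
```

On this graph, `x` agrees with `principal_eigenvector(A.T @ A)` and `y` with `principal_eigenvector(A @ A.T)` to within floating-point tolerance.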
Observations
Any eigenvector algorithm can be used to compute the fixed points x* and y*.
The Iterate formulation is preferred here because it emphasizes the underlying motivation of the approach: the mutually reinforcing I and O operations.
There is no need to run the I and O operations to convergence.
One can start from initial vectors x0 and y0 and compute the weights using a fixed, bounded number of I and O operations.
Example: Mini Web
Example: Mini Web (Cont..)
Basic Results
Observations
This is a “pure” analysis of link structure.
The text of the pages was ignored when searching for authoritative pages, i.e., text-based search only supplies the initial set.
The pages found can legitimately be considered authoritative in the context of the WWW, without access to a large-scale index of the WWW,
i.e., global analysis of the full WWW link structure can be replaced by a local method over a small focused subgraph.
This approach can replace the local approaches used in intranets.
Similar Page Queries
Example: find pages “similar” to honda.com
Use link structure to infer a notion of “similarity” among pages.
Suppose we have found a page p that is of interest and that is an authoritative page on a topic. Can this help in finding similar pages?
What do users of the WWW consider to be related to p when they create pages and links?
Similar Page Queries
Previously our request to the search engine was: “Find t pages containing the query string.”
Now our request to the search engine is: “Find t pages pointing to p.”
This yields the root set Rp, the base set Sp, and the focused subgraph Gp.
The strongest authorities in the local region of the link structure near p can serve as a broad-topic summary of the pages related to p.
Results- Similar page queries
Multiple Sets of Hubs & Authorities
There may be several densely linked collections of hubs and authorities within the same base set.
Examples:
“jaguar”: has several different meanings.
“randomized algorithms”: arises in multiple technical communities.
“abortion”: involves groups that may not be linked to each other.
Clustering is needed in the presence of the abundance problem.
Multiple Sets of Hubs & Authorities
The non-principal eigenvectors provide a way to extract additional densely linked collections of hubs and authorities.
Non-principal eigenvectors have both positive and negative entries.
Often, the large positive entries correspond to one cluster of pages and the large negative entries to a different cluster.
Typically the two clusters are not tightly intertwined.
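The use of a non-principal eigenvector can be illustrated on a toy graph. The sketch below assumes NumPy is available; the graph (two hub/authority communities joined by one bridging hub, `h6`) and the helper names are made up for illustration. It reads off the pages at the positive and negative ends of the eigenvector of A^T A with the second-largest eigenvalue.

```python
import numpy as np


def authority_eigenvectors(nodes, edges):
    """Eigenvalues and eigenvectors of A^T A, sorted by decreasing eigenvalue."""
    index = {name: i for i, name in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for src, dst in edges:
        A[index[src], index[dst]] = 1.0
    values, vectors = np.linalg.eigh(A.T @ A)   # symmetric matrix, ascending order
    order = np.argsort(values)[::-1]
    return values[order], vectors[:, order]


def ends(vector, nodes, c):
    """The c pages with the most positive and the c with the most negative entries."""
    ranked = np.argsort(vector)                 # ascending by entry
    positive = [nodes[i] for i in ranked[::-1][:c]]
    negative = [nodes[i] for i in ranked[:c]]
    return positive, negative


# Toy graph (an assumption): hubs h1-h3 cite a1-a3, hubs h4-h5 cite a4-a5,
# and a single bridging hub h6 cites one page from each community.
nodes = ["h1", "h2", "h3", "h4", "h5", "h6", "a1", "a2", "a3", "a4", "a5"]
edges = [(h, a) for h in ("h1", "h2", "h3") for a in ("a1", "a2", "a3")]
edges += [(h, a) for h in ("h4", "h5") for a in ("a4", "a5")]
edges += [("h6", "a3"), ("h6", "a4")]

values, vectors = authority_eigenvectors(nodes, edges)
positive, negative = ends(vectors[:, 1], nodes, 2)
```

In this toy case one end of the second eigenvector holds a4 and a5 and the other holds a1 and a2; which end is labeled positive depends on the arbitrary sign of the eigenvector returned by the solver.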
Jaguar Example
The principal eigenvector of the authority weights is primarily about the Atari product.
In the positive end of the 2nd non-principal eigenvector, the pages are primarily about the Jacksonville Jaguars.
In the positive end of the 3rd non-principal eigenvector, the pages are primarily about the car.
Randomized Algorithms Example
The positive end of the first non-principal eigenvector returned home pages of theoretical computer scientists.
The negative end of the first non-principal eigenvector returned compendia of mathematical software.
In the negative end of the fourth non-principal eigenvector, the pages are primarily about wavelets.
Diffusion and Generalization
The query may not be sufficiently “broad.”
In this case there will not be enough highly relevant pages in the base set to extract a sufficiently dense subgraph of relevant hubs and authorities.
When this occurs, the collection of hubs and authorities found will often represent a broader topic, and the results will reflect a diffused version of the initial query.
Example: “WWW conferences” -> WWW resource pages.
Evaluation
In studies conducted in 1998 over 26 queries and 37 volunteers, Clever (a search system built on hub and authority analysis) was reported to return better authorities than Yahoo!, which in turn did better than AltaVista.
Conclusion
There is a need for a way to distill a broad topic for which there may be millions of relevant pages.
The approach provides high-quality results in the context of what is available on the WWW globally.
It operates without maintaining an index of the WWW or its link structure.
It identifies complex patterns of social organization on the WWW.
Thank You