Top Banner
Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger
48

Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Dec 21, 2015

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Order Out of Chaos

Analyzing the Link Structure of the Web for Directory Compilation and Search.

Presented 19.11.2000

by Benjy Weinberger

Page 2: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

The Search Problem

• The WWW is a vast collection of information: over 1 billion text pages plus a multitude of multimedia files. Over a million new resources are added every day.

• How do we find the information we need in such a large collection?

• Search is the most common activity on the web after email.

Page 3: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Types of Solutions

• There are basically two types of solution to the “search problem”:– Directories are static lists of web pages,

ordered in a topic taxonomy.– Search Engines attempt to match web sites to a

user’s query in real time.

• The distinction between the two types is becoming more and more fuzzy.

Page 4: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Types of Searches

• Two main kinds of searches:– Specific queries. E.g. “does Netscape 4.0

support the Java 1.1 code-signing API?”– Broad-topic queries. E.g. “find out information

about Che Guevara.”

• Other types of searches exist, such as:– Similar-page queries. E.g. “find pages ‘similar’

to www.cnn.com.”

Page 5: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Different Queries - Different Challenges

• Specific Queries: The difficulty is mostly that there are very few pages with the required information - the “needle in a haystack” problem.

• Broad-topic Queries: The difficulty is that there are too many relevant pages for humans to digest.

Page 6: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Broad-topic Queries

• Center around the notion of an authority - a page containing high-quality, authoritative information on the relevant topic.

• Problem: we are asking a computer to make a judgment on abstract and subjective qualities of “relevance” and “authority”.

Page 7: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Manual Solutions

• Manually edited catalogues of pages, ordered in a taxonomy of topics.

• Problems: – Completely unscalable. Only a tiny fraction of

the web can be catalogued.– Subjective. Decisions made by individual

editors.

Page 8: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Manual Solutions

• Nonetheless, solution is highly popular.

• Yahoo is still the #1 site on the web in terms of visits.

• The Open Directory Project is has over 2,000,000 sites in 250,000 categories, edited by tens of thousands of volunteers.

Page 9: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Automated Solutions - 1st Wave

• “Relevance” assumed to be associated with existence of keywords.

• “Authority” assumed to be associated with frequency and placement of keywords. – The more a page mentions the word the higher

that page’s score.– Font size, proximity to other keywords and

place in page may increase score.

Page 10: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Automated Solutions - 1st Wave

• Solution implemented by search engines such as AltaVista, HotBot, Lycos, Northern Light and many others.

• These solutions are inadequate:– The determination of “relevance” is inaccurate.

– The determination of “authority” is inaccurate.

– Don’t work for non-textual resources.

• Result: users still end up wading through thousands of results.

Page 11: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

The Web as a Graph

• But ... the web is more than just a set of pages. The hyperlinks between pages induce a digraph structure on the web.

• Hyperlinks are created by human site authors and therefore can imply some human determination of authority.

• But how can we mine this information for search?

Page 12: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Naive Approach

• Determine relevance by keyword matching.

• Determine authority by number of inbound links.

• Heuristic:– Find all pages matching keywords.– Rank them by in-degree.

Page 13: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Naive Approach - Problems

• Good authorities may not contain keywords.– Example: honda.com, toyota.com, ford.com do

not contain the term “car manufacturers”.

• Doesn’t look at who is linking to the page.

• A highly popular page will be an authority on ANY query string it contains.– Example: cnn.com, yahoo.com, amazon.com.

Page 14: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Where Do We Stand

• The hyperlink structure of the web is constructed by the deliberate annotations of millions of web site authors.

• How can we mine this seemingly random link structure for underlying meaning?

Page 15: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Short Mathematical Interlude

• Given a square matrix A, an eigenvalue is a scalar with the property that there exists a vector such that . The vector is called an eigenvector.

• The normalized eigenvector corresponding to the largest eigenvalue is called the principal eigenvector.

x

0x

xxA

Page 16: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Short Mathematical Interlude

• A stochastic matrix is a matrix with non-negative entries such that the entries in each row sum to 1.

• A stochastic matrix always has a principal eigenvector (with eigenvalue 1):

All eigenvalues are <= 1 in absolute value.

1

1

1

1

1A

Page 17: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Short Mathematical Interlude

• For any matrix A, and are symmetric.

• A symmetric real n-by-n matrix has n real eigenvalues (with multiplicity).

• For any vector x not orthogonal to the principal eigenvector y,

TAA AAT

yxA

xAnn

n

Page 18: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Short Mathematical Interlude

• The adjacency matrix of a directed graph G is the matrix A such that:

)G(E)j,i(0

)G(E)j,i(1A j,i

Page 19: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Google

• High-quality, highly popular search engine.

• Developed by Sergey Brin and Larry Page at Stanford. Later became a commercial venture.

• Combines new ranking algorithm with state-of-the-art engineering solutions for scalability and performance.

Page 20: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Google

• Offline processing assigns each site on the web a PageRank - a global ranking based on link structure analysis.

• Relevance determined by keywords.

• Authority determined by PageRank.

Page 21: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

PageRank

• Intuitive description: a page has high rank if the sum of the ranks of pages pointing to it is high.

• This is a circular definition, represented mathematically by:

where n is the total number of pages, deg(q) is q’s out degree and d is a damping factor.

n

d1

)qdeg(

)q(PRd)p(PR

pq

Page 22: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

PageRank

• PR is a probability vector on the web pages (0 <= PR(p) <= 1 and the sum is 1).

• PR corresponds to a random surfer model: – A surfer clicks at random through the web,

without hitting “back”. When she gets bored, which happens with probability 1-d, she jumps to a random page.

– PR(p) is the probability of hitting page p in this “random surf”.

Page 23: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

PageRank

• We define a matrix A for which:

Note that A is a stochastic matrix.• The PageRank vector is an eigenvector of

ji0

ji)ideg(1A j,i

11

11

J,JdABn

d1T

Page 24: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

PageRank

• It is not hard to see, using the properties of stochastic matrices, that B has a principal eigenvector.

• We can initialize PR to all ones and apply B iteratively until it converges to this eigenvector.

Page 25: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Google

• Advantages:– Superior results to other engines.– Responds to any query, not just predetermined

topics.

• Disadvantages:– PageRank is global, not contextual.– Discriminates against “orphan” pages.

Page 26: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

HITS

• HITS - Developed by Jon Kleinberg et al within the CLEVER project at IBM Almaden.

• Introduced the twin notions of Hubs and Authorities.

Page 27: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

HITS - Hubs & Authorities

• An authority is a web site containing high-quality, highly-relevant information (with respect to the given search query).

• A hub is a web site that points to many related authorities.

• Intuitively, hubs help “pull” authorities together into a dense cluster of related sites.

Page 28: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

HITS

• Hub (resp. authority) weights determine how good a hub (resp. authority) a site is.– The idea is to first find a focused subgraph of the

web relevant to the specified topic.– Then the link structure of this subgraph is analyzed

to find the hub and authority weights for each site, thus identifying the top authorities for this topic.

– Processing requirement per topic implies that uses are mainly for directory compilation.

Page 29: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

HITS

• The focused subgraph is created by first taking the t highest-ranked pages from a text-based search engine as a root set R.

• R is expanded into the base set S by taking all sites pointing to or pointed at by a site in R.

• Note that while R may fail to contain some “important” authorities, S will probably contain them.

Page 30: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

HITS

• We have a circular definition: – A good authority is a site pointed to by many

good hubs.– A good hub is a site that points to many good

authorities.

• This is the mutually reinforcing relationship.

• How can we break this circularity?

Page 31: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

HITS

• A static mathematical formulation:– For a web site p let h(p) be its hub weight and

a(p) be its authority weight. We take the vectors a, h to be normalized.

– We would like to find vectors a , h such that (up to normalization):

qppq

)q(a)p(h,)q(h)p(a

Page 32: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

HITS

• If A is the adjacency matrix of the graph in question (the focused subgraph) then the constraints can be written as:

By substitution we see that a is defined by:hA

hAa,

Aa

Aah

T

T

AAB,Ba

Baa T

Page 33: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

HITS

• In other words, a is an eigenvector of B:

• B is the co-citation matrix: B(i,j) is the number of sites that jointly point to both i and j.

• B is symmetric and has n orthogonal unit eigenvectors. Which one do we take?

Ba,aBa

Page 34: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

HITS

• Let’s look at the problem dynamically:– We initialize a(p) = h(p) = 1 for all p.– We iterate the following operations:

And renormalize after each iteration.

ATAB,Baa

Aah

)q(a)p(h

hAa

)q(h)p(aqp

T

pq

Page 35: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

HITS

• The eigenvectors of B are precisely the stationary points of this process.

• For any stationary point a, and therefore the authority weights are reinforced by the factor .

• Conclusion: The principal eigenvector represents the “densest cluster” within the focused subgraph.

aBa

Page 36: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

HITS

• As we have said, by initializing a(p)=h(p)=1, a will converge to the principal eigenvector of B. – Initializing differently may lead to convergence

to a different eigenvector.– In practice convergence is achieved after only

10-20 iterations.

Page 37: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

HITS

• The non-principal eigenvectors may represent other, less dense subclusters of the focused subgraph.

• Other subclusters can occur when:– A query string has several meanings– A concept is relevant to several communities.– The topic is highly polarized, involving groups that

are not likely to link to one another (such as “abortion”).

Page 38: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

HITS

• Unlike the principal eigenvector, the other eigenvectors have both positive and negative entries.

• Each non-principal eigenvector provides us with two densely connected clusters: The “most positive” and “most negative” entries of the vector.

• The larger the eigenvalue, the more “relevant” the corresponding eigenvector.

Page 39: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Spectral Filtering

• A generalization of HITS.

• Developed by Chakrabarti, Dom, Gibson, Raghavan and others. Some of these worked with Kleinberg on CLEVER at IBM Almaden.

• Used commercially in Quiver to mine community knowledge.

Page 40: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Spectral Filtering

• We have two kinds of entities: hubs and authorities. These can both be of the same type (such as web sites, cited documents) or of different types (people and products, bookmark folders and web sites).

• Note that we distinguish, a priori, between hubs and authorities.

Page 41: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Spectral Filtering

• The link structure consists of directed links from hubs to authorities.

• In mathematical terms we have a bipartite digraph (H;A;E) with all edges directed from H to A.

Page 42: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Spectral Filtering - Affinities

• For each hub i and authority j let a(i,j) >= 0 be the affinity of i for j.

• E.g: The percentage of common terms between documents or the number of visits a user makes to a bookmark.

• A=a(i,j) is the affinity matrix.

Page 43: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Spectral Filtering

• In SF the mutually reinforcing relationship becomes:

Or:

qp

q,ppq

p,q )q(aa)p(h,)q(ha)p(a

Aah,hAa T

Page 44: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Spectral Filtering

• Note that HITS is simply SF with each web site represented as both a hub and an authority, and with the following simple affinities:

ji0

ji1a j,i

Page 45: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Spectral Filtering

• The spectral analysis used in HITS applies here, and the principal eigenvector is used to determine the highest ranked authorities.

Page 46: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Spectral Filtering

• Notes:– The links from hubs to authorities represent the

opinions of many people regarding the authorities.

– SF mines this information for results reflecting the opinions of these people as a group.

– The SF algorithm “pulls” the densely linked cluster together.

– SF is relatively insensitive to “noise”.

Page 47: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

References

[1] J.Kleinberg. Authoritative Sources in a Hyperlinked Environment. Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 1998.

[2] S.Chakrabarti, B.Dom, D.Gibson, R.Kumar, P.Raghavan, S.Rajagopalan, A.Tomkins. Spectral Filtering for Resource Discovery. ACM SIGIR Workshop on Hypertext Information Retrieval on the Web, 1998.

[3] S.Brin, L.Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proc. 7th International WWW Conference, 1998.

[4] S.Brin, R.Motwani,L.Page,T.Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Manuscript.

Page 48: Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

End

• Thank you for listening!