Finding related pages in the World Wide Web A review by: Liang Pan, Nick Hitchcock, Rob Elwell, Kirtan Patel and Lian Michelson

Mar 28, 2015


Finding related pages in the World Wide Web

A review by:

Liang Pan, Nick Hitchcock, Rob Elwell, Kirtan Patel and Lian Michelson

Content

• Introduction

• Algorithms

– Companion

– Co-citation

– Netscape’s

• Evaluations

• Critique

• Conclusion

Introduction

Searching on the World Wide Web

• Common search tools include Google, Yahoo

Traditional Approach

• Keyword Query based

• Need to specify your information needs by giving relevant keywords

• Prone to errors!

Question!

What do I do if I don’t know exactly what I am looking for?

Introduction

• Another Way…

– Use a URL as search input instead of a phrase of text, e.g. www.nytimes.com

• What are the requirements?

– Fast

– High precision

– Little input data

Introduction

How does it work?

– Web graph structure

– Proposed two algorithms:

Companion

• Derived from the HITS (Hyperlink-Induced Topic Search) algorithm proposed by Kleinberg for ranking search results.

• Makes use of weights, hub and authority scores.

Co-citation

• Finds pages that are frequently co-cited with an input URL u.

[Figure: sites A, B, C; sites X, Y, Z; input URL u; result: found X, Y, Z]

Companion Algorithm

• Takes in a starting URL u as input, e.g. www.awebsite.com

• Made up of 4 steps:

– Build the vicinity graph of u

– Contract duplicates and near-duplicates in the graph

– Compute edge weights based on host-to-host connections

– Compute a hub score and an authority score for each node in the graph and return the top-ranked authority nodes

Companion Algorithm

• Uses 5 values* to help determine relevant pages:

• Go Back (B): how many parent sites of the website to include, i.e. going from u1 to p1

• Back-Forward (BF): how many child sites of each parent to include, i.e. going from u1 to p2 then to u2 (or u1)

• Forward (F): how many children of the site (pages it links to) to include, i.e. u1 to c1

• Forward-Back (FB): how many parent sites of each child to include, i.e. u1 to c1 to u3

• STOP list: websites considered not to be relevant to the page's content

[Figure: A Web-Graph. Nodes (websites) p1, p2, u1, u2, u3, c1 and c2, connected by hyperlinks]

STOP List:

• http://validator.w3.org/check?uri=referer

• www.microsoft.com/ie/dowload.html

• www.yahoo.com

*These values are determined before the algorithm is executed

Companion Algorithm

• Step 1 – Building the vicinity graph of u

• If u is itself on the STOP list then the STOP list is ignored; otherwise all sites on the list are left out of the graph

[Figure: Vicinity graph after step 1, with nodes p1, p2, c1, c2, u2 and u3]
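Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation reviewed in these slides: the function names, the toy link data and the limit values are all hypothetical, and B, BF, F and FB are read as upper limits on how many neighbours to include.

```python
from typing import Dict, Iterable, List, Set, Tuple

Links = Dict[str, List[str]]  # page -> list of pages


def invert(children: Links) -> Links:
    """Turn a 'page -> children' map into a 'page -> parents' map."""
    parents: Links = {}
    for src, dsts in children.items():
        for dst in dsts:
            parents.setdefault(dst, []).append(src)
    return parents


def build_vicinity_graph(
    u: str, children: Links, stop_list: Iterable[str],
    B: int, BF: int, F: int, FB: int,
) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Return (nodes, edges) of the neighbourhood of u (hypothetical helper)."""
    parents = invert(children)
    stop_list = list(stop_list)
    # Assumption: if u itself is on the STOP list, the list is switched off.
    stop = set() if u in stop_list else set(stop_list)

    def keep(pages: List[str], limit: int) -> List[str]:
        # Drop u and STOP-list pages, then apply the limit.
        return [p for p in pages if p != u and p not in stop][:limit]

    nodes: Set[str] = {u}
    edges: Set[Tuple[str, str]] = set()
    for p in keep(parents.get(u, []), B):            # Go Back
        nodes.add(p)
        edges.add((p, u))
        for s in keep(children.get(p, []), BF):      # Back-Forward
            nodes.add(s)
            edges.add((p, s))
    for c in keep(children.get(u, []), F):           # Forward
        nodes.add(c)
        edges.add((u, c))
        for q in keep(parents.get(c, []), FB):       # Forward-Back
            nodes.add(q)
            edges.add((q, c))
    return nodes, edges


# Toy link data loosely modelled on the web-graph figure (hypothetical).
toy_children = {
    "p1": ["u1", "u3"],
    "p2": ["u1", "u2"],
    "u1": ["c1"],
    "u3": ["c1"],
    "stop.example": ["u1"],
}
nodes, edges = build_vicinity_graph(
    "u1", toy_children, ["stop.example"], B=10, BF=10, F=10, FB=10
)
print(sorted(nodes))
```

Taking the first B (or BF, F, FB) neighbours in list order is a simplification made here; a real system would need some policy for choosing which neighbours to keep.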


Companion Algorithm

• Step 2 – Eliminate any duplication

– If one of the nodes (websites) in the graph has 10 or more links and has 95% of its links in common with another node*

• Combine the links from both nodes (union) to create one node

– This is to remove sites that are likely to be the same (e.g. mirror sites, or the same site under different names)

• Step 3 – Assign Edge Weights

– If two nodes are on the same host then the weight of the edge between them is set to zero

– If there are k links going to one site (i.e. many-to-one), the authority weight of each of those edges is set to 1/k

– If there are L links going out from one site (i.e. one-to-many), the hub weight of each of those edges is set to 1/L

• The vicinity graph of u has now been constructed!

*This clearly has its problems!!!
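The duplicate test in Step 2 can be illustrated with a minimal Python sketch. The names are hypothetical, and the sketch assumes one particular reading of the rule: both nodes must have at least 10 links, and the shared links must make up at least 95% of each node's links.

```python
from typing import Dict, Set


def near_duplicates(links_a: Set[str], links_b: Set[str],
                    min_links: int = 10, overlap_percent: int = 95) -> bool:
    """True if both link sets are large enough and almost identical."""
    if len(links_a) < min_links or len(links_b) < min_links:
        return False
    common = len(links_a & links_b)
    # Integer arithmetic avoids floating-point edge cases at exactly 95%.
    return (100 * common >= overlap_percent * len(links_a)
            and 100 * common >= overlap_percent * len(links_b))


def contract_duplicates(out_links: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Merge near-duplicate nodes, keeping the union of their links."""
    merged: Dict[str, Set[str]] = {}
    for node, links in out_links.items():
        for representative in merged:
            if near_duplicates(merged[representative], links):
                merged[representative] = merged[representative] | set(links)
                break
        else:
            # No earlier node is a near-duplicate, so keep this one.
            merged[node] = set(links)
    return merged


# Hypothetical example: a site, its mirror and an unrelated small page.
site = {f"t{i}" for i in range(20)}
mirror = (site - {"t0"}) | {"extra"}
graph = {"a": site, "a-mirror": mirror, "b": {"t1", "t2"}}
print(sorted(contract_duplicates(graph)))
```

The sketch only merges outgoing link sets; re-pointing edges that lead into a merged node is left out to keep it short.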

Companion Algorithm

• Step 4 – Compute Hub and Authority scores

• Nodes (websites) with a high authority score are expected to have relevant content

• Nodes with a high hub score are expected to contain links to relevant content

• The 10 highest authority scoring nodes are then returned as relevant pages to the starting URL u
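Steps 3 and 4 can be sketched together as a weighted version of the usual HITS iteration. This is an illustrative sketch under assumptions, not the reviewed implementation: it reads "many-to-one" as k links from pages on one host to a single page on another host, and "one-to-many" as L links from a single page to pages on one other host. The function names, the fixed iteration count and the example URLs are hypothetical.

```python
from collections import Counter
from math import sqrt
from urllib.parse import urlsplit


def host(url):
    """Host part of a URL; scheme-less URLs like 'a.com/x' are accepted."""
    return urlsplit(url if "//" in url else "//" + url).netloc


def edge_weights(edges):
    """Return (authority_weight, hub_weight) dicts keyed by (source, target)."""
    # Same-host edges get weight zero, so they are simply dropped.
    kept = {(s, d) for s, d in edges if host(s) != host(d)}
    into = Counter((host(s), d) for s, d in kept)     # k per (host, page)
    out_of = Counter((s, host(d)) for s, d in kept)   # L per (page, host)
    auth_w = {(s, d): 1 / into[(host(s), d)] for s, d in kept}
    hub_w = {(s, d): 1 / out_of[(s, host(d))] for s, d in kept}
    return auth_w, hub_w


def normalise(scores):
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {n: v / norm for n, v in scores.items()}


def hub_authority(edges, iterations=20):
    """Weighted HITS iteration; returns (hub, authority) score dicts."""
    auth_w, hub_w = edge_weights(edges)
    nodes = {n for edge in auth_w for n in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        new_auth = dict.fromkeys(nodes, 0.0)
        for (s, d), w in auth_w.items():
            new_auth[d] += hub[s] * w          # authority from incoming hubs
        new_hub = dict.fromkeys(nodes, 0.0)
        for (s, d), w in hub_w.items():
            new_hub[s] += new_auth[d] * w      # hub from outgoing authorities
        auth, hub = normalise(new_auth), normalise(new_hub)
    return hub, auth


def top_authorities(edges, u, k=10):
    """The k highest-scoring authority nodes, excluding the input URL u."""
    _, auth = hub_authority(edges)
    ranked = sorted((n for n in auth if n != u), key=auth.get, reverse=True)
    return ranked[:k]


# Hypothetical vicinity graph for the input URL "u.example/".
example_edges = [
    ("p1.example/a", "u.example/"), ("p1.example/a", "x.example/"),
    ("p2.example/a", "u.example/"), ("p2.example/a", "x.example/"),
    ("p2.example/a", "y.example/"), ("p2.example/b", "p2.example/a"),
]
print(top_authorities(example_edges, "u.example/", k=2))
```

A fixed number of iterations is used here for simplicity; checking for convergence of the scores would be an alternative stopping rule.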

Co-citation Algorithm

• Two sites are co-cited if they have a common parent, e.g. u3 and u1 are co-cited by p1

• Degree of co-citation (DoC) is the number of common parents two sites have, e.g. u3 and u1 have a DoC of 2

• The algorithm finds the siblings of a site, computes their DoC and returns the 10 sites with the highest DoC

• If fewer than 15 siblings of u have a DoC of at least 2, the algorithm restarts with a URL one level up from the original, e.g. if u = a.com/X/Y/Z then new u = a.com/X/Y

[Figure: Siblings of u1, with nodes p1, p2, u1, u2 and u3]
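The co-citation idea is small enough to sketch directly. The Python below is a hedged illustration with hypothetical names and toy data. It assumes the restart rule means "fewer than 15 siblings with a DoC of at least 2", and it assumes scheme-less URLs such as a.com/X/Y/Z so that stripping the last path element is a plain string operation.

```python
from collections import Counter


def cocitation(u, parents, children):
    """Siblings of u with their degree of co-citation, highest first."""
    degree = Counter()
    for p in parents.get(u, []):
        # Every other page linked from a parent of u is a sibling of u.
        for sibling in set(children.get(p, [])):
            if sibling != u:
                degree[sibling] += 1
    return degree.most_common()


def related_pages(u, parents, children, top=10, min_siblings=15, min_doc=2):
    """Rank siblings by DoC, moving one URL level up when too few are found."""
    while True:
        ranked = cocitation(u, parents, children)
        strong = [s for s, d in ranked if d >= min_doc]
        if len(strong) >= min_siblings or "/" not in u:
            return ranked[:top]
        u = u.rpartition("/")[0]   # a.com/X/Y/Z -> a.com/X/Y


# Toy data (hypothetical): both parents link to u1 and u3, only p2 to u2.
toy_children = {"p1": ["u1", "u3"], "p2": ["u1", "u2", "u3"]}
toy_parents = {"u1": ["p1", "p2"], "u2": ["p2"], "u3": ["p1", "p2"]}
print(cocitation("u1", toy_parents, toy_children))
```

In this toy graph u3 shares two parents with u1 and u2 shares one, so u3 ranks first.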

Netscape’s Approach

• “What's Related” function

• Not a lot of detail mentioned in the paper!

• Gets similar pages from web crawling, archiving, categorising and data mining (as opposed to just using the web graph like the previous algorithms)

• Also tries to learn from trends (i.e. what users click on after they have searched for a keyword)

Implementation

• Compaq’s Connectivity Server

– Provides connectivity information for 180 million URLs (nodes)

• Multi-threaded server to take in URLs

– Uses either the Companion or Co-citation algorithm to find related pages.

Evaluation

• Studies carried out to determine the performance of these algorithms.

• Benchmarked against Netscape’s approach.

• Re-visit initial requirements.

– Speed

– Precision

– Little data input (already achieved)

Evaluation

• Speed

– 109 milliseconds for Companion, and 195 milliseconds for Co-citation.

– Complexity of the Co-citation algorithm is in the order of O(n log n).

• Precision

Critique

• Faults within HITS not investigated. Nomura, Oyama, Hayamizu and Ishida, ‘Analysis and Improvement of HITS Algorithm for Detecting Web Communities’, show some of the problems with the algorithm.

• Requires the user to have already found something relevant to what they are looking for, e.g. I have found NYTimes and want to see what alternatives are available.

• Can it handle the scale of the web today? Tested with connectivity information for 180 million URLs; the indexable web now stands at over 11 billion pages.

• Links to friends’ web pages that are not relevant to the input URL will be taken into account. Considering the size of the web today, this may lead to bad results.

• Small, specialised population used in the test; lack of a general approach.

• The 'two clicks away' idea is no longer the case today.

Critique

Looking at the positives

• The algorithms used indeed outperform Netscape’s algorithm for finding related pages, and can be extended to handle more than one input URL*

• Easy to implement

• Many papers were consulted and used during the process of writing and implementing the work.

*at the time (1999)

Applications and Future Work

• Data Mining - Web Structure Mining

– Finding authoritative Web pages

• Classifying Web documents

– Exploring co-cited material: if pages are linked, they could be relevant to each other; if a page is pointed to, it could be important.

• Extend the algorithm to strengthen the heuristic and look beyond the 'two clicks away' idea.

• Lack of further work because the assumption is so unrealistic by today's standards

Conclusion

• Suggested a solution to the problem of searching for a topic that cannot easily be expressed as a simple text query.

• The Companion and Co-citation algorithms are fast ways of searching that differ from traditional text queries.

• Obtained a solution that can be easily adapted and implemented in web servers.

Q & A

Any Questions?

References

• Hyperlink structure of the Web: G.O. Arocena, A.O. Mendelzon and G.A. Mihaila, ‘Applications of a Web Query Language’, in: Proc. of the Sixth International World Wide Web Conference.

• Chakrabarti et al., ‘Enhanced Hypertext Categorisation using Hyperlinks’, in which links and their orders are used to categorise Web pages.

• E. Spertus, ‘ParaSite: Mining Structural Information on the Web’, also suggested using co-citation and other forms of connectivity to identify related Web pages.

• J. Kleinberg, ‘Authoritative Sources in a Hyperlinked Environment’. The HITS algorithm is used as a starting point for the Companion algorithm, which is extended and modified.

• Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto and Nivio Ziviani, ‘Linkage Similarity Measures for the Classification of Web Documents’.

• Sanjay Kumar Madria, ‘Web Mining – A Bird's Eye View’, presentation.