Information Retrieval Lecture 8 Introduction to Information Retrieval (Manning et al. 2007) Chapter 19 For the MSc Computer Science Programme Dell Zhang Birkbeck, University of London The slides are adapted from Prof. Mark Levene’s at http://www.dcs.bbk.ac.uk/~mark/download/lec2_the_structure_of_the_web.ppt

Dec 17, 2015

Neil Charles

Information Retrieval

Lecture 8

Introduction to Information Retrieval (Manning et al. 2007)

Chapter 19

For the MSc Computer Science Programme

Dell Zhang

Birkbeck, University of London

The slides are adapted from Prof. Mark Levene’s at http://www.dcs.bbk.ac.uk/~mark/download/lec2_the_structure_of_the_web.ppt

The Size of the Web

Lawrence and Giles 1999 – 800 million pages

Over 11.5 billion pages in 2005 (Google indexes over 8 billion)

Coverage – about 40% in 1999

Overlap – low

The deep (or hidden or invisible) web contains 400-550 times more information.

Capture Recapture

SE1: the reported size of search engine 1.

QSE1 and QSE2: the pages returned for a set of queries Q from two engines.

OVR: the overlap of QSE1 and QSE2

Estimate of Web size: (QSE2 x SE1) / OVR

a.k.a. Mark and Recapture

OVR / QSE2 = SE1 / Web
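
The estimate above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the cited studies: the function name `estimate_web_size`, the use of URL sets, and the toy numbers are all assumptions made for the example.

```python
def estimate_web_size(se1_size, qse1_urls, qse2_urls):
    """Capture-recapture estimate: Web ~ (|QSE2| * SE1) / OVR.

    se1_size  -- reported size of search engine 1 (SE1)
    qse1_urls -- pages returned by engine 1 for the query set Q (QSE1)
    qse2_urls -- pages returned by engine 2 for the same queries (QSE2)
    """
    qse1 = set(qse1_urls)
    qse2 = set(qse2_urls)
    overlap = len(qse1 & qse2)  # OVR: pages returned by both engines
    if overlap == 0:
        raise ValueError("no overlap between the two samples; estimate undefined")
    return len(qse2) * se1_size / overlap


# Toy numbers, made up for illustration only.
qse1 = {"a", "b", "c", "d"}
qse2 = {"c", "d", "e", "f", "g", "h"}
print(estimate_web_size(1000, qse1, qse2))  # 6 * 1000 / 2 = 3000.0
```

The estimate is undefined when the two samples share no pages, which is why the sketch raises an error in that case.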

Diameter of the Web

Compute the average shortest path between pairs of pages that have a path from one to the other.

Broder 99 – directed 16.2, undirected 6.8

Barabasi 99 – directed for nd.edu 19

Small diameter is a characteristic of a small world network

Choose a random source and destination – 75% of the time there is no directed path between them.
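
The measurement described above can be sketched with breadth-first search on a small directed graph. This is only an illustration under assumptions: the adjacency-dict representation, the function name `average_shortest_path`, and the toy graph are not from the slides, and a real crawl would need sampling rather than BFS from every page.

```python
from collections import deque


def average_shortest_path(graph):
    """Mean BFS distance over ordered pairs (u, v), u != v, with a directed path u -> v.

    `graph` maps each page to the list of pages it links to.
    Returns (average distance, fraction of ordered pairs that are connected).
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    total_dist = 0
    connected_pairs = 0
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total_dist += sum(dist.values())   # the source itself contributes 0
        connected_pairs += len(dist) - 1   # exclude the source itself

    all_pairs = len(nodes) * (len(nodes) - 1)
    if connected_pairs == 0:
        return None, 0.0
    return total_dist / connected_pairs, connected_pairs / all_pairs


# Toy graph: a 3-cycle a -> b -> c -> a, plus a page d that links into it.
toy = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
avg, frac = average_shortest_path(toy)
print(avg, frac)
```

In the toy graph no page links to `d`, so 3 of the 12 ordered pairs have no directed path, a small-scale version of the effect noted in the last bullet.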

Bowtie Model of the Web

Broder et al. 1999 – crawl of over 200 million pages and 1.5 billion links.

SCC – 27.5%

IN and OUT – 21.5% each

Tendrils and tubes – 21.5%

Disconnected – 8%
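
To make the bowtie regions concrete, the sketch below splits a small directed graph into the largest strongly connected component (SCC), IN (pages that can reach the SCC), OUT (pages reachable from the SCC), and everything else. It is a naive illustration under assumptions: the function names, the adjacency-dict input, and the single `OTHER` bucket (which lumps tendrils, tubes, and disconnected pages together) are choices made for this example, and the quadratic SCC search would not scale to a real crawl.

```python
def reachable(start_nodes, adjacency):
    """Return every node reachable from `start_nodes` by following `adjacency`."""
    seen = set(start_nodes)
    stack = list(start_nodes)
    while stack:
        u = stack.pop()
        for v in adjacency.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def bowtie(graph):
    """Split a directed graph (page -> list of linked pages) into bowtie regions."""
    nodes = set(graph)
    reverse = {}
    for u, targets in graph.items():
        for v in targets:
            nodes.add(v)
            reverse.setdefault(v, []).append(u)

    # Largest SCC, found naively: the SCC of v is the set of nodes that are
    # both reachable from v and able to reach v.
    core = set()
    assigned = set()
    for v in nodes:
        if v in assigned:
            continue
        component = reachable([v], graph) & reachable([v], reverse)
        assigned |= component
        if len(component) > len(core):
            core = component

    in_part = reachable(core, reverse) - core   # can reach the core
    out_part = reachable(core, graph) - core    # reachable from the core
    other = nodes - core - in_part - out_part   # tendrils, tubes, disconnected
    return {"SCC": core, "IN": in_part, "OUT": out_part, "OTHER": other}


# Toy graph: i -> (a -> b -> c -> a) -> o, plus an unrelated link t -> x.
toy = {"i": ["a"], "a": ["b"], "b": ["c"], "c": ["a", "o"], "t": ["x"]}
for region, pages in bowtie(toy).items():
    print(region, sorted(pages))
```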

Link Degree Distributions

How many pages have n = 1, 2, … links:

indegree: $1/n^{2.1}$

outdegree: $1/n^{2.72}$

The log-log plots are linear!
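
Counting how many pages have n inlinks or n outlinks can be sketched as follows. The edge-list input format, the function name `degree_distributions`, and the toy edges are assumptions for illustration only.

```python
from collections import Counter


def degree_distributions(edges):
    """Return (indegree histogram, outdegree histogram) for directed edges.

    Each histogram maps n to the number of pages with exactly n links.
    Pages with zero in-links (or out-links) do not appear in that histogram.
    """
    edges = list(edges)  # allow generators; we iterate twice
    indegree = Counter(target for _, target in edges)
    outdegree = Counter(source for source, _ in edges)
    return Counter(indegree.values()), Counter(outdegree.values())


# Toy link list, made up for illustration: (source page, target page).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
in_hist, out_hist = degree_distributions(edges)
print(dict(in_hist))   # one page with 1 inlink, one page with 3 inlinks
print(dict(out_hist))  # one page with 2 outlinks, two pages with 1 outlink
```

Plotting each histogram with both axes on a logarithmic scale is what produces the straight lines mentioned on this slide.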

What is a Power Law

f(i) is the proportion of objects having property i:

$f(i) = C / i^{\tau}$

E.g. f(i) = # pages, i = # inlinks

E.g. f(i) = # sites, i = # pages

E.g. f(i) = # sites, i = # users

E.g. f(i) = frequency of word, i = rank of word, from most frequent to least frequent

The log-log plot: linear relationship (straight line)
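
The straight line on the log-log plot follows from taking logarithms of a power law: if f(i) = C / i^tau, then log f(i) = log C - tau * log i, so the slope of the line is -tau. The sketch below recovers the exponent from data by fitting that line with ordinary least squares. The symbol `tau`, the function name `fit_power_law`, and the synthetic data are assumptions made for this example; a regression on the log-log plot is only a rough estimator when the data are noisy.

```python
import math


def fit_power_law(points):
    """Least-squares line through (log i, log f(i)); returns (C, tau)."""
    xs = [math.log(i) for i, f in points]
    ys = [math.log(f) for i, f in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # equals -tau for a power law
    intercept = mean_y - slope * mean_x    # equals log C
    return math.exp(intercept), -slope


# Synthetic data generated from f(i) = 1000 / i**2.1, for illustration only.
data = [(i, 1000 / i ** 2.1) for i in range(1, 51)]
C, tau = fit_power_law(data)
print(round(C, 3), round(tau, 3))
```

Because the synthetic data follow the power law exactly, the fit should return values very close to C = 1000 and tau = 2.1.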

Power Laws on the Web

inlinks (2.1)

outlinks (2.72)

Strongly connected components (2.54)

No. of web pages in a site (2.2)

No. of visitors to a site during a day (2.07)

No. of links clicked by web surfers (1.5)

PageRank (2.1)

Preferential Attachment or The Rich Get Richer

How Power Laws Arise
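
A minimal simulation can show the rich-get-richer effect. In this sketch each new node links to one existing node chosen with probability proportional to that node's current degree. The function name `preferential_attachment`, the one-link-per-new-node rule, and the starting graph of a single edge are simplifying assumptions, not details taken from the slides.

```python
import random


def preferential_attachment(num_nodes, seed=0):
    """Grow a graph one node at a time and return the list of node degrees.

    Each new node links to one existing node picked with probability
    proportional to that node's current degree.
    """
    if num_nodes < 2:
        raise ValueError("need at least two nodes")
    rng = random.Random(seed)
    degree = [1, 1]      # start from a single edge between nodes 0 and 1
    endpoints = [0, 1]   # node k appears degree[k] times in this list
    for new_node in range(2, num_nodes):
        target = rng.choice(endpoints)  # uniform over endpoints = degree-proportional
        degree.append(1)
        degree[target] += 1
        endpoints.extend([new_node, target])
    return degree


degrees = preferential_attachment(10_000)
print("max degree:", max(degrees), "median degree:", sorted(degrees)[len(degrees) // 2])
```

Keeping one list entry per edge endpoint makes the degree-proportional choice a plain uniform draw, which keeps the sketch short. Early nodes tend to accumulate many links while most nodes keep only a few, which is the skew a power-law degree distribution describes.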

Scale-Free Networks

Classic Random Graphs

Take Home Messages

The Web Graph: Large and Sparse; Capture Recapture

Small World Network: 19 Degrees of Separation

Scale Free Network: The Power Law; Rich Get Richer