Search on the Web Victor de Boer Web Technology 2015 Slides adapted from Willem Robert van Hage
Page 1: Web technology: Web search

Search on the Web

Victor de Boer Web Technology 2015

Slides adapted from Willem Robert van Hage

Page 2: Web technology: Web search

Overview

• Search engines:

– What do they do

– How do they work?

– How good are they? How to evaluate?

• Discover information laws by counting words

Page 3: Web technology: Web search

How does a search engine work?

• What is a search engine?

Page 4: Web technology: Web search

Classic Information Retrieval model

Page 5: Web technology: Web search

“Bank in Amsterdam”

Classic Information Retrieval model

Page 6: Web technology: Web search

“Bank in Amsterdam”

Classic Information Retrieval model

Page 7: Web technology: Web search

“Bank in Amsterdam”

Classic Information Retrieval model

Page 8: Web technology: Web search

How does a search engine work?

• How does a search engine know a document matches your question?

• Words have different meanings. Does the search engine know which one you need?

Page 9: Web technology: Web search

How does a search engine work?

In fact, most search engines do not know what you mean; they just make a guess. If they read “bank”, they do not know whether you mean a river bank or a financial institution.

They usually return the pages that make the majority of users happy.

Page 10: Web technology: Web search

How does a search engine work?

• So if you enter “bank”, the search engine does not necessarily know what you mean.

• But what if you enter “bank transfer”?

Page 11: Web technology: Web search

How does a search engine work?

Then the search engine still does not “know” what you mean, but will just return pages that mention both “bank” and “transfer”. If these correspond with what you meant, that is a “mere coincidence”.

Not entirely, because the word “transfer” in combination with “bank” makes the query more informative than either word on its own.

Boolean search, ad-hoc query

Page 12: Web technology: Web search

Not only ad-hoc queries

• What if you do not know what to enter as a search term (or do not want to)?

Page 13: Web technology: Web search

How does a search engine work?

Alternative search strategies:

• Browsing (Wikipedia, Yahoo! Directory)

• Social bookmarking (Digg, del.icio.us)

• Recommender systems (StumbleUpon, Amazon)

Page 14: Web technology: Web search

How does a search engine work?

• How can a search engine return documents from all over the web in less than a quarter of a second?

Page 15: Web technology: Web search

How does a search engine work?

Indexing (more later)

Multiple servers in parallel

Pre-selection based on time/origin/query

Page 16: Web technology: Web search

How does a search engine work?

• Does a search engine look up the results live on the Web?

Image: flickr/photophilde

Page 17: Web technology: Web search

How does a search engine work?

• Does a search engine maintain a copy of each document you can search for?

Page 18: Web technology: Web search

How does a search engine work?

No, the engine uses a kind of locally stored summary of each page.

Not all pages are included: duplicates and junk are thrown away.

Page 19: Web technology: Web search

CRAWLING

PREPROCESSING

BUILDING INDEX

Page 20: Web technology: Web search

Crawling

• How does a search engine know your site exists?

Search engines follow links on pages they already know, so if someone else links to your site, the engines will find you sooner or later.

This process is called “crawling”.
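The link-following idea can be sketched on a toy example. The snippet below is only an illustration: the page names and the `TOY_WEB` link table are hypothetical, and a real crawler would fetch pages over HTTP and extract links from the HTML instead of reading a dictionary.

```python
from collections import deque

# Hypothetical miniature "web": each page maps to the pages it links to.
TOY_WEB = {
    "a.example/": ["b.example/", "c.example/"],
    "b.example/": ["c.example/"],
    "c.example/": ["a.example/", "d.example/"],
    "d.example/": [],
    "island.example/": ["a.example/"],  # no known page links to this one
}

def crawl(seeds, links):
    """Breadth-first crawl: start from known pages and follow their links."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:      # only visit each page once
                seen.add(target)
                queue.append(target)
    return order

found = crawl(["a.example/"], TOY_WEB)
print(found)
```

Note that `island.example/` is never found: it links out, but no page the crawler knows links to it.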

Page 22: Web technology: Web search

Robots.txt

Image: www.s1z.ru
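The slide only names robots.txt. It is the file in which a site tells crawlers which paths they should not fetch. As a hedged illustration, Python's standard `urllib.robotparser` module can evaluate such rules; the rules, the bot name `ExampleBot` and the URLs below are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler must stay out of /private/.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse the rules directly instead of downloading them

allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/private/data.html")
print(allowed, blocked)
```

A polite crawler would run a check like `can_fetch` before requesting each URL.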

Page 23: Web technology: Web search

How does a search engine work?

• Can you crawl the entire web?

• How big is the web anyway?

Page 24: Web technology: Web search

Hubs

Almost. The web has the nice property that there are very few pages that link to many others and a lot of pages that link to very few other pages.

Page 25: Web technology: Web search

Deep Web

In addition, there is the “Deep Web”: the part of the web that is not linked to with a fixed URL (for example, data in a database).

Most of the “Deep Web” is not crawled at all.

Page 26: Web technology: Web search

How Big is the Web?

http://www.factshunt.com/2014/01/total-number-of-websites-size-of.html

• 759 Million - Total number of websites on the Web

• 510 Million - Total number of live websites (active)

• 14.3 Trillion - Webpages, live on the Internet

• 48 Billion - Webpages indexed by Google Inc.

• 14 Billion - Webpages indexed by Microsoft's Bing

Page 27: Web technology: Web search

Third site on the Web

Nederlands instituut voor subatomaire fysica Nikhef.

Page 28: Web technology: Web search

CRAWLING

PREPROCESSING

BUILDING INDEX

Back to building the index

Page 29: Web technology: Web search

Preprocessing

1. Remove HTML tags

2. Tokenization (“I am walking.” -> [I, am, walking])

3. Remove stop words (the, I, it,…)

4. Stemming (cars, car -> car; walking, walks -> walk)

Result: for each doc, a list of terms
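The four steps can be sketched in a few lines. This is a rough illustration under several assumptions: the stop-word list is a tiny made-up sample, tokens are lowercased (the slide does not say so), and `toy_stem` is a crude suffix stripper, not a real stemming algorithm such as Porter's.

```python
import re

STOP_WORDS = {"the", "i", "it", "am", "a", "is"}  # tiny illustrative list

def toy_stem(word):
    """Crude suffix stripping; real systems use a proper stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(html):
    text = re.sub(r"<[^>]+>", " ", html)                 # 1. remove HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())         # 2. tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. remove stop words
    return [toy_stem(t) for t in tokens]                 # 4. stemming

terms = preprocess("<p>I am walking. The cars and the car.</p>")
print(terms)
```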

Page 30: Web technology: Web search

CRAWLING

PREPROCESSING

BUILDING INDEX

Page 31: Web technology: Web search

Term-document matrices

Page 32: Web technology: Web search

Shakespeare

Page 33: Web technology: Web search

Term-document incidence

1 if play contains word, 0 otherwise

Sec. 1.1

• So we have a 0/1 vector for each term.

• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.

• 110100 AND 110111 AND 101111 = 100100.

Brutus AND Caesar BUT NOT Calpurnia
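The same computation can be checked with integers used as bit vectors. The three vectors come from the slide; the uncomplemented Calpurnia vector is derived here by flipping the slide's 101111, and `NUM_PLAYS` is simply the vector length.

```python
NUM_PLAYS = 6                  # one bit per play
MASK = (1 << NUM_PLAYS) - 1    # 0b111111, keeps the complement to 6 bits

brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000           # complement of this is the slide's 101111

not_calpurnia = ~calpurnia & MASK
result = brutus & caesar & not_calpurnia

print(format(result, "06b"))   # prints 100100
```

Each 1 in the result marks a play that contains Brutus and Caesar but not Calpurnia.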

Page 34: Web technology: Web search

But? Bigger collections

• Consider 1 million documents, each with about 1000 words.

• Avg 6 bytes/word including spaces/punctuation – 6GB of data in the documents.

• Say there are M = 500K distinct terms among these.

• A 500K x 1M matrix has half a trillion 0’s and 1’s (500.000.000.000).

• But it has no more than one billion 1’s (1.000.000.000).

– The matrix is extremely sparse: at most 1 cell in 500 is a 1.

• What’s a better representation?

– We only record the 1 positions.

Sec. 1.1
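The arithmetic behind these numbers can be reproduced directly. The inputs are the figures from the slide; the variable names are chosen here for illustration.

```python
docs = 1_000_000          # documents in the collection
words_per_doc = 1_000     # about 1000 words each
bytes_per_word = 6        # average, including spaces/punctuation
distinct_terms = 500_000  # M

collection_bytes = docs * words_per_doc * bytes_per_word  # raw text size
cells = distinct_terms * docs                             # matrix entries
max_ones = docs * words_per_doc                           # at most one 1 per word occurrence
density = max_ones / cells                                # upper bound on the share of 1's

print(collection_bytes, cells, max_ones, density)
```

With these inputs at most 0.2% of the cells (1 in 500) can be a 1, which is why storing only the 1 positions pays off.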

Page 35: Web technology: Web search

Inverted indices

Page 36: Web technology: Web search

Inverted index

• For each term t, we must store a list of all documents that contain t.

– Identify each by a docID, a document serial number

Dictionary       Postings (sorted by docID)

Brutus      ->   1 2 4 11 31 45 173 174

Caesar      ->   1 2 4 5 6 16 57 132

Calpurnia   ->   2 31 54 101

Sec. 1.2
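Because postings are sorted by docID, an AND query can be answered by walking two lists in step. The sketch below uses the postings from the figure; the function name `intersect` and the lowercase dictionary keys are choices made here for illustration.

```python
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}

def intersect(p1, p2):
    """Merge two docID-sorted postings lists, keeping the common docIDs."""
    i = j = 0
    common = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            common.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the list with the smaller docID
        else:
            j += 1
    return common

print(intersect(index["brutus"], index["caesar"]))      # prints [1, 2, 4]
print(intersect(index["brutus"], index["calpurnia"]))   # prints [2, 31]
```

The merge touches each posting at most once, so its cost grows with the combined length of the two lists.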

Page 37: Web technology: Web search

Inverted index construction

Documents to be indexed: Friends, Romans, countrymen.

-> Tokenizer -> Token stream: Friends Romans Countrymen

-> Linguistic modules -> Modified tokens: friend roman countryman

-> Indexer -> Inverted index:

friend -> 2 4

roman -> 1 2

countryman -> 13 16

Sec. 1.2

Page 38: Web technology: Web search

Indexer steps: Token sequence

• Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Sec. 1.2

Page 39: Web technology: Web search

Indexer steps: Sort

• Sort by terms

– And then docID

Core indexing step

Sec. 1.2

Page 40: Web technology: Web search

Indexer steps: Dictionary & Postings

• Multiple term entries in a single document are merged.

• Split into Dictionary and Postings

• Doc. frequency information is added.

Sec. 1.2
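The three indexer steps can be sketched on the two example documents. This is a minimal illustration, assuming that lowercasing is the only token "modification" and using a simple regular expression as tokenizer.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (modified token, docID) pairs.
pairs = [(token, doc_id)
         for doc_id, text in docs.items()
         for token in re.findall(r"[a-z']+", text.lower())]

# Step 2: sort by term, then by docID.
pairs.sort()

# Step 3: merge repeated entries, split into dictionary and postings,
# and record the document frequency for each term.
dictionary = {}   # term -> document frequency
postings = {}     # term -> sorted list of docIDs
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    doc_ids = sorted({doc_id for _, doc_id in group})
    postings[term] = doc_ids
    dictionary[term] = len(doc_ids)

print(postings["caesar"], dictionary["caesar"])   # prints [1, 2] 2
print(postings["killed"], dictionary["killed"])   # prints [1] 1
```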

Page 41: Web technology: Web search

Index size

• How big can your index be on a single machine?

• But let’s consider an uncompressed index of one year of Reuters news messages: does that fit in main memory?

• How big do an index and its dictionary become?

Page 42: Web technology: Web search

Reuters RCV1 statistics

statistic                                           value

documents                                           800,000

avg. # tokens per doc                               200

terms (= word types)                                400,000

avg. # bytes per token (without spaces/punct.)      4.5

avg. # bytes per term                               7.5

postings                                            100,000,000

Sec. 4.2
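A back-of-the-envelope estimate from these statistics might look as follows. The 4 bytes per posting is an assumption made here (one 32-bit docID per posting), and pointers, term frequencies and other overhead are ignored, so the result is only a rough lower bound.

```python
documents = 800_000
tokens_per_doc = 200
terms = 400_000
bytes_per_term = 7.5
postings = 100_000_000

BYTES_PER_DOCID = 4  # assumption: each posting is stored as one 32-bit docID

total_tokens = documents * tokens_per_doc        # tokens in the collection
postings_bytes = postings * BYTES_PER_DOCID      # size of all postings lists
dictionary_bytes = terms * bytes_per_term        # size of the term strings

print(total_tokens)                  # prints 160000000
print(postings_bytes / 1e6, "MB")    # prints 400.0 MB
print(dictionary_bytes / 1e6, "MB")  # prints 3.0 MB
```

Under these assumptions the postings dominate the size, and the whole uncompressed index is a few hundred megabytes.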

Page 43: Web technology: Web search

How well does a search engine work?

Measure it!

Select a representative set of queries (e.g. from a server log).

Ask a representative set of human raters to “judge” the relevance of all the search results.

Check if one engine is better than the other by counting whether it returns more relevant pages and fewer non-relevant ones (the whole truth / nothing but the truth).

For how many questions is this the case? Is this more than you would expect by pure chance?

Google

Yahoo!

Page 44: Web technology: Web search

How does a search engine work?

Page 45: Web technology: Web search

Tradeoff

better system

F-measure is the harmonic mean of precision and recall: F = 2 · precision · recall / (precision + recall)
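As a small worked example, precision is the share of returned pages that are relevant ("nothing but the truth") and recall is the share of relevant pages that are returned ("the whole truth"). The document IDs below are made up for illustration.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                 # relevant pages that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f

# Hypothetical query: 4 pages returned, 6 pages are actually relevant, 2 overlap.
p, r, f = precision_recall_f(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7, 8])
print(p, r, f)
```

Here precision is 0.5 and recall is 1/3; the harmonic mean (0.4) sits closer to the lower of the two, so a system cannot score well by maximising only one of them.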

Page 46: Web technology: Web search

Google eye-tracking

agent-seo.com Cornell University Eye-Tracking Study Data

Page 47: Web technology: Web search

Clicks

agent-seo.com Cornell University Eye-Tracking Study Data

Page 48: Web technology: Web search

Next page?

Page 49: Web technology: Web search

Precision at N

• When the number of results grows larger, the precision over the entire result set may matter less than the precision over the first N results.

• Precision at N (P@N)

• P@1 = 1.0

• P@5 = 0.6

[Plot: P@N and R@N against N, for N from 0 to 30; vertical axis from 0 to 1]
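P@N is easy to compute from a ranked list of relevance judgements. The list below is hypothetical, chosen so that it reproduces the P@1 and P@5 values on the slide (1 = relevant, 0 = not relevant).

```python
def precision_at(n, judgements):
    """Precision over the first n results of a ranked list of 0/1 judgements."""
    return sum(judgements[:n]) / n

ranked = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]  # hypothetical judgements, best rank first

print(precision_at(1, ranked))    # prints 1.0
print(precision_at(5, ranked))    # prints 0.6
print(precision_at(10, ranked))   # prints 0.4
```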

Page 50: Web technology: Web search

Ranking

State of the art search engines use all kinds of tricks for ranking.

Let’s think of a few…

Page 51: Web technology: Web search

Example weighting scheme: tf.idf

Term Frequency

Inverse Document Frequency

Every word is assigned a weight for a document. Some words are more important than others.

One version:

Page 52: Web technology: Web search

TFIDF Example

Doc1

Term      Term Count

this      1

is        1

a         2

sample    1

Doc2

Term      Term Count

this      1

is        1

another   2

example   3
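The transcript does not preserve which tf.idf formula the slides use, so the sketch below assumes one common variant: tf is the term count divided by the document length, and idf is log10(N / df), with N the number of documents and df the number of documents containing the term. The function names are chosen here for illustration.

```python
import math

doc1 = {"this": 1, "is": 1, "a": 2, "sample": 1}
doc2 = {"this": 1, "is": 1, "another": 2, "example": 3}
corpus = [doc1, doc2]

def tf(term, doc):
    """Term frequency: count of the term relative to the document length."""
    return doc.get(term, 0) / sum(doc.values())

def idf(term, docs):
    """Inverse document frequency; assumes the term occurs in at least one doc."""
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("this", doc1, corpus))      # prints 0.0
print(tf_idf("example", doc2, corpus))   # roughly 0.129
```

Under this variant "this" gets weight 0 because it occurs in every document, while "example" gets a positive weight in Doc2 because it is frequent there and absent from Doc1.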

Page 53: Web technology: Web search

Why the “Log”

• How often does the most common word appear in a corpus? How often does the second most common? Etc.

– Split the books into words, cut them up on the spaces and punctuation

– Delete all punctuation

– Sort all words

– Count the words

– Plot the counts
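The counting procedure can be tried on a toy text. The sentence below is made up and far too short to show any real regularity; with a real corpus (for example a few books) the same code yields the rank/frequency table that would be plotted.

```python
import re
from collections import Counter

text = "the cat and the dog and the bird saw the cat"  # hypothetical mini corpus

words = re.findall(r"[a-z]+", text.lower())  # split on spaces/punctuation, drop punctuation
counts = Counter(words)                      # count the words
ranked = counts.most_common()                # sort by count, most frequent first

for rank, (word, count) in enumerate(ranked, start=1):
    print(rank, word, count)
```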

Page 54: Web technology: Web search

Zipf’s law

The most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

Formally: the frequency of a word is inversely proportional to its rank in the frequency table.
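Written as a formula, with f(r) for the frequency of the word at rank r and C for a corpus-dependent constant (both symbols are introduced here for illustration), the statement and its consequence on logarithmic axes read:

```latex
f(r) \approx \frac{C}{r}
\quad\Longrightarrow\quad
\log f(r) \approx \log C - \log r
```

So f(2) is about half of f(1), f(3) about a third, and a plot of log frequency against log rank is roughly a straight line with slope -1.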

Image: wugology.com

Page 55: Web technology: Web search

Zipf’s law

On Logarithmic paper

Page 56: Web technology: Web search

But! Heaps’ Law

• Split the books into words, cut them up on the spaces and punctuation

• Delete all punctuation

• Do not sort words

• Go over all words and count the number of unique words you have seen

• Plot the results linearly.
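The vocabulary-growth count can be sketched the same way. The text is again a made-up toy sentence; on a real corpus the `growth` list is what would be plotted against the number of words read.

```python
import re

text = "the cat and the dog and the bird saw the cat"  # hypothetical mini corpus

seen = set()   # unique words encountered so far
growth = []    # vocabulary size after each word, in reading order
for word in re.findall(r"[a-z]+", text.lower()):
    seen.add(word)
    growth.append(len(seen))

print(growth)  # prints [1, 2, 3, 3, 4, 4, 4, 5, 6, 6, 6]
```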

Page 57: Web technology: Web search

Heaps’ law

• How fast does the dictionary grow?

Page 58: Web technology: Web search

Heaps’ law

Informally:

By scanning the text we will hit upon the most common words rather quickly, but we will continue to encounter new (infrequent) words, at an increasingly slow rate.

Page 59: Web technology: Web search

Other Ranking tricks

• Localisation (language, but also your mobile location)

• Personalisation

• Log analysis

• PageRank

Page 60: Web technology: Web search

PageRank (Page and Brin)

• Absolute score for a page

• Intuition: Pages that are linked to by important pages are themselves important

i.e. the PageRank value for a page u depends on the PageRank values of each page v in the set Bu (the set of all pages linking to page u), each divided by the number L(v) of links from page v:

PR(u) = Σ_{v ∈ Bu} PR(v) / L(v)

http://en.wikipedia.org/wiki/PageRank
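The description above can be turned into a small iterative computation. This is a sketch under stated assumptions, not the method of any particular search engine: the four-page link graph is made up, every page is assumed to have at least one outgoing link, and the damping factor of 0.85 is an addition that is not on the slide (it keeps the iteration stable).

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively share each page's rank over the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start with equal scores
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for v, outgoing in links.items():
            share = rank[v] / len(outgoing)       # PR(v) / L(v)
            for u in outgoing:                    # v is in Bu for each such u
                new_rank[u] += damping * share
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to.
toy_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph C ends up with the highest score because three pages link to it, and A comes second because it is linked to by the important page C.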

Page 61: Web technology: Web search

So..

• Web search is a form of information retrieval with the Web as corpus

• Inverted indexes are built using crawling, processing and indexing

• A boolean query is then matched to the index, returning pages that match

• How well a search engine works depends on user judgement

– Precision, Recall and F-measure

• Ranking is key – especially in Web search

– There are many strategies for ranking, and being good in ranking can make you very rich

Page 62: Web technology: Web search

Oh, and optimizing for Google’s ranking can make you a bit rich, and a bit cool

https://www.youtube.com/watch?v=fnSJBpB_OKQ