Page 1: Łódź, 2009

Łódź, 2009

Intelligent Text Processing, lecture 4

IR: relevance, recall and precision, vector space model, PageRank.

Szymon Grabowski, [email protected]

http://szgrabowski.kis.p.lodz.pl/IPT08/

Page 2: Łódź, 2009

2

Wildcard queries

It’s useful to have a * metacharacter (wildcard) replacing an arbitrary sequence of characters (of length 0 or more).

E.g. neighbo*r to search for both neighbor and neighbour; medic* to search for medical, medicine, medically, etc.;

Universit* Berlin – for University... or Universität...

How to handle it? We assume the easier (= faster to handle) case, where there is a single symbol *.

Page 3: Łódź, 2009

3

Permuterm index[ http://nlp.stanford.edu/IR-book/pdf/03dict.pdf ]

The permuterm index is a special word-based index for handling wildcard queries.

To all terms we append the terminator $ (e.g. dog → dog$, hello → hello$).

Now we consider all rotations of a given term and link them to the original term.

Assume the query is h*llo. We rotate it to have the wildcard at the end, i.e. llo$h*.

Now we can easily find in the vocabulary all terms starting with

llo$h (rotations of some ‘real’ terms).
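A minimal Python sketch of the idea (not from the lecture; build_permuterm and wildcard_lookup are illustrative names, and a real permuterm index would answer the prefix query with a sorted dictionary or trie rather than a linear scan):

# Minimal permuterm-index sketch: map every rotation of term$ to the term.
def build_permuterm(vocabulary):
    index = {}
    for term in vocabulary:
        t = term + "$"
        for i in range(len(t)):
            index[t[i:] + t[:i]] = term   # each rotation points back to the term
    return index

def wildcard_lookup(index, query):
    # Single-'*' query, e.g. "h*llo": rotate so that '*' ends up at the end,
    # i.e. "llo$h*", then match rotations by the prefix "llo$h".
    before, after = query.split("*")
    prefix = after + "$" + before
    return sorted({t for rot, t in index.items() if rot.startswith(prefix)})

vocab = ["hello", "hallo", "hullo", "help", "dog"]
idx = build_permuterm(vocab)
print(wildcard_lookup(idx, "h*llo"))   # ['hallo', 'hello', 'hullo']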

Page 4: Łódź, 2009

4

The Information Retrieval (IR) problem[ http://www.cs.sfu.ca/~cameron/Teaching/D-Lib/IR.html ]

Given a document collection and a query, retrieve relevant documents matching the query.

What is a relevant document?

User’s judgment needed! (No clear automatic answer...)

A broader IR definition (Manning et al., An Introduction to Information Retrieval, draft, 2008):

Page 5: Łódź, 2009

5

Precision and recall

Assume we can easily tell a relevant document from an irrelevant one (for a given query).

The result of a query is some collection of documents; how do we estimate how good this answer is?

Classic measures are used:

Precision – what % of retrieved documents are relevant.

Recall – what % of all relevant documents are retrieved.

Page 6: Łódź, 2009

6

Precision and recall, cont’d[ http://rakaposhi.eas.asu.edu/cse494/notes/ir-s07.ppt ]

100% precision: nothing but the truth.

100% recall: the whole truth.

Ideally, 100% precision AND 100% recall: the whole truth and nothing but the truth!

(Not realistic though.)

Page 7: Łódź, 2009

7

Precision and recall, cont’d[ http://rakaposhi.eas.asu.edu/cse494/notes/ir-s07.ppt ]

Precision = tp / (tp + fp)

Recall = tp / (tp + fn)

(tp – true positives: relevant documents retrieved; fp – false positives: irrelevant documents retrieved; fn – false negatives: relevant documents missed.)
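A tiny Python check of both formulas on a made-up query result (the document ids are purely illustrative):

# Precision/recall for a toy query result; documents are just ids.
retrieved = {1, 2, 3, 4, 5, 6}          # what the query returned
relevant  = {2, 3, 5, 8, 9}             # what the user actually wanted

tp = len(retrieved & relevant)          # relevant and retrieved
fp = len(retrieved - relevant)          # retrieved but irrelevant
fn = len(relevant - retrieved)          # relevant but missed

precision = tp / (tp + fp)              # 3 / 6 = 0.5
recall    = tp / (tp + fn)              # 3 / 5 = 0.6
print(precision, recall)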

Page 8: Łódź, 2009

8

Precision–recall curve[ http://rakaposhi.eas.asu.edu/cse494/notes/ir-s07.ppt ]

[Figure: precision (vertical axis) plotted against recall (horizontal axis)]

Trying to increase recall usually results in an increased percentage of rubbish answers.

Page 9: Łódź, 2009

9

Relevance is in the eye of the beholder

If we google for jaguar (with the big cats in mind) and also (or mostly) obtain links to documents on sports cars,

are we happy?

Another ambiguous term: bush. The user might’ve meant George W. Bush, Kate Bush,

the Australian bush...

Practical evaluation problem: relevance is not binary. A document may be very relevant (I found an excellent Python tutorial!), mildly relevant, or only weakly relevant (Yeah, it has some info, but the examples are dull and I’ve found two bugs in it...).

Page 10: Łódź, 2009

10

Typical IR system[ http://www.ee.technion.ac.il/courses/049011/spring05/lectures/lecture2.pdf ]

Page 11: Łódź, 2009

11

Search engine[ http://www.ee.technion.ac.il/courses/049011/spring05/lectures/lecture2.pdf ]

Page 12: Łódź, 2009

12

Classical vs web IR [ http://www.ee.technion.ac.il/courses/049011/spring05/lectures/lecture2.pdf ]

Page 13: Łódź, 2009

13

Classical vs web IR, comments [ http://www.ee.technion.ac.il/courses/049011/spring05/lectures/lecture2.pdf ]

On the Web: data are noisy due to duplications (e.g. site mirrors) and spam.

The Web is highly dynamic: indexes must be constantly updated.

The number of matches on the Web is often large/huge, so good ranking schemes are crucial.

Esp. important for not very specific queries (e.g., Python tutorial, feline diseases).

HTML documents are not pure text: they also contain images (i.e., links to them), tables, etc. – harder to analyze.

Page 14: Łódź, 2009

14

Vector space model[ http://en.wikipedia.org/wiki/Vector_space_model ]

Each document is represented as a vector. Each coordinate (dimension) corresponds to a separate term (word). The more often a given word occurs in the document, the higher its value (weight) in the vector.

This representation serves for comparing documents for similarity or ranking documents against the query

in relevance order.

First use of the VSM: as early as the 1960s, in the SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, Cornell Univ., Gerard Salton’s group.

Questions: how to assign those weights? How to compare vectors?

Page 15: Łódź, 2009

15

Term frequency–inverse document frequency (tf-idf) weight

In tf-idf, a term weight for a given document depends on its frequency in this document

(local measure), but also on how popular this term is in the whole collection (global measure).

The rarer a given term globally, the more ‘important’ its occurrences in a given document are.

Page 16: Łódź, 2009

16

tf-idf, cont’d[ http://en.wikipedia.org/wiki/Tf-idf ]

(Normalized) term frequency tf_i,j for the term t_i within document d_j:

tf_i,j = n_i,j / Σ_k n_k,j

n_i,j – the # of occurrences of the considered term in document d_j (the sum in the denominator runs over all terms of d_j)

Inverse document frequency idf_i:

idf_i = log( |D| / |{d : t_i ∈ d}| )

|D| – total # of documents in the collection, |{d : t_i ∈ d}| – the # of documents containing t_i; log(x) means log_e(x) here

Page 17: Łódź, 2009

17

tf-idf, cont’d[ http://en.wikipedia.org/wiki/Tf-idf ]

The final weight: (tf-idf)_i,j = tf_i,j * idf_i.

Example. Let the term cactus occur 4 times in a document having 200 words in total.

Let us have a collection of 10 million documents, and let cactus occur (at least once) in 10,000 of them.

The tf-idf score is: (4 / 200) * ln(10^7 / 10^4) = 0.02 * ln 1000 ≈ 0.138.

Let’s also have the term plant in the same doc 5 times. Assume that plant occurs in 50,000 documents.

The tf-idf score for plant is: (5 / 200) * ln(10^7 / (5 * 10^4)) ≈ 0.132.
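A short Python sketch (not from the slides; the tf_idf helper is illustrative) reproduces the two numbers above, with log being the natural logarithm as on the slide:

from math import log   # natural logarithm, as on the slide

def tf_idf(occurrences, doc_length, num_docs, docs_with_term):
    tf = occurrences / doc_length           # normalized term frequency
    idf = log(num_docs / docs_with_term)    # inverse document frequency
    return tf * idf

print(tf_idf(4, 200, 10_000_000, 10_000))   # cactus: ~0.138
print(tf_idf(5, 200, 10_000_000, 50_000))   # plant:  ~0.132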

Page 18: Łódź, 2009

18

Term count model

Simpler than tf-idf: use only the ‘local’ term frequency (i.e., in the given document), no matter how frequently the term occurs globally.

So, the weight w of term t in document d is just the count of occurrences:

w_t,d = tf_t,d

Similar documents are represented by similar vectors. It’s convenient to work with cosines of the angle between the vectors (rather than with the angles themselves):

cos(d1, d2) = (d1 · d2) / (|d1| * |d2|)
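To make the term count model and the cosine comparison concrete, a minimal Python sketch (illustrative only; the two toy documents and the cosine_similarity helper are not from the demo on the following slides):

from collections import Counter
from math import sqrt

def cosine_similarity(doc1, doc2):
    # Term-count vectors: the weight of a term is simply its number of occurrences.
    v1, v2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    dot = sum(v1[t] * v2[t] for t in v1)            # dot product over shared terms
    norm1 = sqrt(sum(c * c for c in v1.values()))
    norm2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2)

print(cosine_similarity("saturn is a gas giant with rings",
                        "neptune is an ice giant"))        # shares 'is', 'giant'
print(cosine_similarity("saturn is a gas giant with rings",
                        "bach wrote many cantatas"))       # no shared terms: 0.0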

Page 19: Łódź, 2009

19

Cosine similarity demo (with term count model)

Page 20: Łódź, 2009

20

Cosine similarity demo, cont’d

Page 21: Łódź, 2009

21

Cosine similarity demo, results

0 – Bach, 1 – CPU cache, 2 – Saturn, 3 – Neptune

Greater cosine values mean greater similarity.

Common words:

(0, 2): although
(0, 3): composition
(1, 2): than
(1, 3): larger, contains, different, usual
(2, 3): atmosphere, saturn, jupiter, ice, composed, those, appearance, hydrogen, helium, interior

Page 22: Łódź, 2009

22

PageRank[ http://en.wikipedia.org/wiki/PageRank ]

PageRank (PR) is a link analysis algorithm used by theGoogle search engine that assigns a numerical weighting

to each element of a hyperlinked set of documents (e.g. WWW), with the purpose of “measuring”

its relative importance within the set. A page/site is considered “important” if there are many links pointing to it, and, especially, if the links come from “important” pages.

Among documents relevant to a given query, the “important” ones have a bigger chance to be

presented among the Top 10 hits.

From http://www.google.com/technology/: Votes cast by pages that are themselves “important” weigh more

heavily and help to make other pages “important”.

Page 23: Łódź, 2009

23

A simple network example [ http://en.wikipedia.org/wiki/File:PageRanks-Example.svg ]

The score 34.3 for C, for example, means that a web surfer who chooses a random link on every page (but with 15% likelihood jumps to a random page on the whole web)

is going to be on Page C for 34.3% of the time.

Page 24: Łódź, 2009

24

PageRank algorithm example

Assume the considered network has only 4 documents (sites): A, B, C and D. The assumption is: the total PR over all sites is 1.

So, initially, PR(A) = PR(B) = PR(C) = PR(D) = 0.25.

But we also examine their links...

Page 25: Łódź, 2009

25

PageRank algorithm example, cont’d

Let’s examine PR(A). All 3 remaining pages point to it, and their vote strength depends on two factors: their own PR, and how many pages they point to.

B points only to A, so A will get 1 * PR(B) = 1*0.25 = 0.25 from it.

C points not only to A, but also to one more site (namely B), so A will get 0.5 * PR(C) = 0.5*0.25 = 0.125 from it.

Finally, D points not only to A, but also to 2 more sites (B and C), so A will get 0.33 * PR(D) = 0.33*0.25 = 0.0825 from it.

So, in total: PR(A) = 0.25 + 0.125 + 0.0825 = 0.4575.

But it’s not the end...

Page 26: Łódź, 2009

26

PageRank algorithm, damping factor

Surfing the web is not only clicking!

(The user may get bored with clicking and either select some URL from his bookmarks, or type in the address.)

It is generally assumed that, at any moment, the probability that the web user continues to click a link is about d = 0.85 (see slide 23 and the 15% prob. of the opposite event).

Then we assume that any web page is equally likely to be selected (i.e. the user makes a random jump).

The corrected PR formula (with the damping factor taken into account) will be:

PR(A) = (1–d) / N + d * (PR(B) / out(B) + PR(C) / out(C) + ...),

where N is the total number of sites, and out(X) is the number of outgoing links from site X, X = B, C, ....
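Evaluated on the four-page example from the previous slides (B → A; C → A, B; D → A, B, C), e.g. with a short Python sketch (illustrative only):

# One application of the damped formula to page A (d = 0.85, N = 4).
d, N = 0.85, 4
pr  = {"B": 0.25, "C": 0.25, "D": 0.25}   # current PR of the pages linking to A
out = {"B": 1, "C": 2, "D": 3}            # their numbers of outgoing links

pr_A = (1 - d) / N + d * sum(pr[p] / out[p] for p in out)
print(pr_A)   # ~0.4271; the next slide gets 0.426375 because it rounds 1/3 to 0.33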

Page 27: Łódź, 2009

27

PageRank algorithm example, cont’d

Let’s calculate PR(A), with the damping factor (d = 0.85, N = 4):

PR(A) = (1–0.85) / 4 + 0.85 * (0.25 + 0.125 + 0.0825) = 0.426375.

So, it is (slightly) decreased.

But what we’ve got is still a poor approximation of the ‘real’ PageRank.

That’s because we also need to calculate PR(B), PR(C) and PR(D), and their ‘new’ values will also affect PR(A). And again, a new PR(A) may affect PR(B), etc.

(It won’t in our example, since A has no outgoing links; but we can also look at the interplay between other nodes...)

So, we have recursive dependencies (‘final’ values can be approximated in a few iterations using algebraic means).
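The iteration itself can be sketched in a few lines of Python on the same four-page graph (B → A; C → A, B; D → A, B, C). One simplification to note: A has no outgoing links, and here its rank is simply allowed to leak, whereas real implementations redistribute the rank of such dangling pages:

# Iterating the damped PageRank formula on the 4-page example.
links = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["A", "B", "C"]}
d, N = 0.85, len(links)
pr = {p: 1 / N for p in links}                      # start from 0.25 each

for _ in range(50):                                 # enough iterations to stabilize here
    incoming = {p: 0.0 for p in links}
    for page, targets in links.items():
        for t in targets:
            incoming[t] += pr[page] / len(targets)  # a page splits its rank among its links
    # A is a dangling node (no out-links); its rank simply leaks in this sketch.
    pr = {p: (1 - d) / N + d * incoming[p] for p in links}

print({p: round(v, 4) for p, v in pr.items()})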

Page 28: Łódź, 2009

28

PageRank algorithm, final words

Actually, the details of PageRank ‘in action’ are not disclosed

(and they are probably modified from time to time).

Note also that the Web is a dynamic structure, so with each visit to a given site (with each crawl),

the Google engine needs to recalculate its PR.

Also, it is important to fight (penalize) malicious attempts to increase one’s PageRank (e.g. site farms).

How exactly Google detects them is again not disclosed...

Page 29: Łódź, 2009

29

PageRank, history

The PageRank idea was developed by Larry Page and then Sergey Brin, later the founders of Google Inc. (1998), starting from around 1995.

Scientific paper: Sergey Brin, Lawrence Page: The Anatomy of a Large-Scale

Hypertextual Web Search Engine. Computer Networks 30(1-7): 107-117 (1998).

Full text: http://www-db.stanford.edu/~backrub/google.html

The PageRank process has been patented, but the patent was assigned to Stanford University (not to Google). Now,

Google has exclusive license rights on the patent from Stanford University. The university received 1.8M shares

in exchange, which were sold in 2005 for $336M.