Page 1: Łódź, 2009

Łódź, 2009

Intelligent Text Processing, lecture 4

IR: relevance, recall and precision, vector space model, PageRank.

Szymon Grabowski, [email protected]

http://szgrabowski.kis.p.lodz.pl/IPT08/

Page 2: Łódź, 2009

2

Wildcard queries

It’s useful to have a * metacharacter (wildcard) replacing an arbitrary sequence of characters (of length 0 or more).

E.g. neighbo*r to search for both neighbor and neighbour; medic* to search for medical, medicine, medically, etc.;

Universit* Berlin – for University... or Universität...

How to handle it? We assume the easier (= faster to handle) case, where there is a single symbol *.

Page 3: Łódź, 2009

3

Permuterm index[ http://nlp.stanford.edu/IR-book/pdf/03dict.pdf ]

The permuterm index is a special word-based index for handling wildcard queries.

To all terms we append the terminator $ (e.g. dog → dog$, hello → hello$).

Now we consider all rotations of a given term and link them to the original term.

Assume the query is h*llo. We rotate it to have the wildcard at the end, i.e. llo$h*.

Now we can easily find in the vocabulary all terms starting with

llo$h (rotations of some ‘real’ terms).
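A minimal Python sketch of the idea (not from the lecture; build_permuterm and wildcard_lookup are illustrative names, and a real permuterm index would answer the prefix query with a sorted dictionary or trie rather than a linear scan):

# Minimal permuterm-index sketch: map every rotation of term$ to the term.
def build_permuterm(vocabulary):
    index = {}
    for term in vocabulary:
        t = term + "$"
        for i in range(len(t)):
            index[t[i:] + t[:i]] = term   # each rotation points back to the term
    return index

def wildcard_lookup(index, query):
    # Single-'*' query, e.g. "h*llo": rotate so that '*' ends up at the end,
    # i.e. "llo$h*", then match rotations by the prefix "llo$h".
    before, after = query.split("*")
    prefix = after + "$" + before
    return sorted({t for rot, t in index.items() if rot.startswith(prefix)})

vocab = ["hello", "hallo", "hullo", "help", "dog"]
idx = build_permuterm(vocab)
print(wildcard_lookup(idx, "h*llo"))   # ['hallo', 'hello', 'hullo']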

Page 4: Łódź, 2009

4

The Information Retrieval (IR) problem[ http://www.cs.sfu.ca/~cameron/Teaching/D-Lib/IR.html ]

Given a document collection and a query, retrieve relevant documents matching the query.

What is a relevant document?

User’s judgment needed! (No clear automatic answer...)

A broader IR definition (Manning et al., An Introduction to Information Retrieval, draft, 2008):

Page 5: Łódź, 2009

5

Precision and recall

Assume we can easily tell a relevant document from an irrelevant one (for a given query).

The result of a query is some collection of documents; how do we estimate how good this answer is?

Classic measures are used:

Precision – what % of retrieved documents are relevant.

Recall – what % of all relevant documents are retrieved.

Page 6: Łódź, 2009

6

Precision and recall, cont’d[ http://rakaposhi.eas.asu.edu/cse494/notes/ir-s07.ppt ]

100% precision: nothing but the truth.

100% recall: the whole truth.

Ideally, 100% precision AND 100% recall: the whole truth and nothing but the truth!

(Not realistic though.)

Page 7: Łódź, 2009

7

Precision and recall, cont’d[ http://rakaposhi.eas.asu.edu/cse494/notes/ir-s07.ppt ]

Precision = tp / (tp + fp)

Recall = tp / (tp + fn)

(tp – true positives: relevant documents retrieved; fp – false positives: irrelevant documents retrieved; fn – false negatives: relevant documents missed.)
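A tiny Python check of both formulas on a made-up query result (the document ids are purely illustrative):

# Precision/recall for a toy query result; documents are just ids.
retrieved = {1, 2, 3, 4, 5, 6}          # what the query returned
relevant  = {2, 3, 5, 8, 9}             # what the user actually wanted

tp = len(retrieved & relevant)          # relevant and retrieved
fp = len(retrieved - relevant)          # retrieved but irrelevant
fn = len(relevant - retrieved)          # relevant but missed

precision = tp / (tp + fp)              # 3 / 6 = 0.5
recall    = tp / (tp + fn)              # 3 / 5 = 0.6
print(precision, recall)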

Page 8: Łódź, 2009

8

Precision–recall curve[ http://rakaposhi.eas.asu.edu/cse494/notes/ir-s07.ppt ]

[Figure: precision (vertical axis) plotted against recall (horizontal axis)]

Trying to increase recall usually results in an increased percentage of rubbish answers.

Page 9: Łódź, 2009

9

Relevance is in the eye of the beholder

If we google for jaguar (with the big cats in mind) and also (or mostly) obtain links to documents on sports cars,

are we happy?

Another ambiguous term: bush. The user might’ve meant George W. Bush, Kate Bush,

the Australian bush...

Practical evaluation problem: relevance is not binary. A document may be very relevant (I found an excellent Python tutorial!), mildly relevant, or only weakly relevant (Yeah, it has some info, but the examples are dull and I’ve found two bugs in it...).

Page 10: Łódź, 2009

10

Typical IR system[ http://www.ee.technion.ac.il/courses/049011/spring05/lectures/lecture2.pdf ]

Page 11: Łódź, 2009

11

Search engine[ http://www.ee.technion.ac.il/courses/049011/spring05/lectures/lecture2.pdf ]

Page 12: Łódź, 2009

12

Classical vs web IR [ http://www.ee.technion.ac.il/courses/049011/spring05/lectures/lecture2.pdf ]

Page 13: Łódź, 2009

13

Classical vs web IR, comments [ http://www.ee.technion.ac.il/courses/049011/spring05/lectures/lecture2.pdf ]

On the Web: data are noisy due to duplications (e.g. site mirrors) and spam.

The Web is highly dynamic: indexes must be constantly updated.

The number of matches on the Web is often large/huge, so good ranking schemes are crucial.

Esp. important for not very specific queries (e.g., Python tutorial, feline diseases).

HTML documents are not pure text: they also contain images (i.e., links to them), tables, etc. – harder to analyze.

Page 14: Łódź, 2009

14

Vector space model[ http://en.wikipedia.org/wiki/Vector_space_model ]

Each document is represented as a vector. Each coordinate (dimension) corresponds to a separate term (word). The more often a given word occurs in the document, the higher its value (weight) in the vector.

This representation serves for comparing documents for similarity or ranking documents against the query

in relevance order.

First use of the VSM: as early as the 1960s, in the SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, Cornell Univ., Gerard Salton’s group.

Questions: how to assign those weights? How to compare vectors?

Page 15: Łódź, 2009

15

Term frequency–inverse document frequency (tf-idf) weight

In tf-idf, a term weight for a given document depends on its frequency in this document

(local measure), but also on how popular this term is in the whole collection (global measure).

The rarer a given term globally, the more ‘important’ its occurrences in a given document are.

Page 16: Łódź, 2009

16

tf-idf, cont’d[ http://en.wikipedia.org/wiki/Tf-idf ]

(Normalized) term frequency tf_i,j for the term t_i within document d_j:

tf_i,j = n_i,j / Σ_k n_k,j

n_i,j – the # of occurrences of the considered term in document d_j (the sum in the denominator runs over all terms of d_j)

Inverse document frequency idf_i:

idf_i = log( |D| / |{d : t_i ∈ d}| )

|D| – total # of documents in the collection, |{d : t_i ∈ d}| – the # of documents containing t_i; log(x) means log_e(x) here

Page 17: Łódź, 2009

17

tf-idf, cont’d[ http://en.wikipedia.org/wiki/Tf-idf ]

The final weight: (tf-idf)_i,j = tf_i,j * idf_i.

Example. Let the term cactus occur 4 times in a document having 200 words in total.

Let us have a collection of 10 million documents, and let cactus occur (at least once) in 10,000 of them.

The tf-idf score is: (4 / 200) * ln(10^7 / 10^4) = 0.02 * ln 1000 ≈ 0.138.

Let’s also have the term plant in the same doc 5 times. Assume that plant occurs in 50,000 documents.

The tf-idf score for plant is: (5 / 200) * ln(10^7 / (5 * 10^4)) ≈ 0.132.
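A short Python sketch (not from the slides; the tf_idf helper is illustrative) reproduces the two numbers above, with log being the natural logarithm as on the slide:

from math import log   # natural logarithm, as on the slide

def tf_idf(occurrences, doc_length, num_docs, docs_with_term):
    tf = occurrences / doc_length           # normalized term frequency
    idf = log(num_docs / docs_with_term)    # inverse document frequency
    return tf * idf

print(tf_idf(4, 200, 10_000_000, 10_000))   # cactus: ~0.138
print(tf_idf(5, 200, 10_000_000, 50_000))   # plant:  ~0.132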

Page 18: Łódź, 2009

18

Term count model

Simpler than tf-idf: use only the ‘local’ term frequency (i.e., in the given document), no matter how frequently the term occurs globally.

So, the weight w of term t in document d is just the count of occurrences:

w_t,d = tf_t,d

Similar documents are represented by similar vectors. It’s convenient to work with cosines of the angle between the vectors (rather than with the angles themselves):

cos(d1, d2) = (d1 · d2) / (|d1| * |d2|)
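To make the term count model and the cosine comparison concrete, a minimal Python sketch (illustrative only; the two toy documents and the cosine_similarity helper are not from the demo on the following slides):

from collections import Counter
from math import sqrt

def cosine_similarity(doc1, doc2):
    # Term-count vectors: the weight of a term is simply its number of occurrences.
    v1, v2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    dot = sum(v1[t] * v2[t] for t in v1)            # dot product over shared terms
    norm1 = sqrt(sum(c * c for c in v1.values()))
    norm2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2)

print(cosine_similarity("saturn is a gas giant with rings",
                        "neptune is an ice giant"))        # shares 'is', 'giant'
print(cosine_similarity("saturn is a gas giant with rings",
                        "bach wrote many cantatas"))       # no shared terms: 0.0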

Page 19: Łódź, 2009

19

Cosine similarity demo (with term count model)

Page 20: Łódź, 2009

20

Cosine similarity demo, cont’d

Page 21: Łódź, 2009

21

Cosine similarity demo, results

0 – Bach, 1 – CPU cache, 2 – Saturn, 3 – Neptune

Greater cosine values mean greater similarity.

Common words:

(0, 2): although
(0, 3): composition
(1, 2): than
(1, 3): larger, contains, different, usual
(2, 3): atmosphere, saturn, jupiter, ice, composed, those, appearance, hydrogen, helium, interior

Page 22: Łódź, 2009

22

PageRank[ http://en.wikipedia.org/wiki/PageRank ]

PageRank (PR) is a link analysis algorithm used by theGoogle search engine that assigns a numerical weighting

to each element of a hyperlinked set of documents (e.g. WWW), with the purpose of “measuring”

its relative importance within the set. A page/site is considered “important” if there are many links pointing to it, and, especially, if the links come from “important” pages.

Among documents relevant to a given query, the “important” ones have a bigger chance to be

presented among the Top 10 hits.

From http://www.google.com/technology/: Votes cast by pages that are themselves “important” weigh more

heavily and help to make other pages “important”.

Page 23: Łódź, 2009

23

A simple network example [ http://en.wikipedia.org/wiki/File:PageRanks-Example.svg ]

The score 34.3 for C, for example, means that a web surfer who chooses a random link on every page (but with 15% likelihood jumps to a random page on the whole web)

is going to be on Page C for 34.3% of the time.

Page 24: Łódź, 2009

24

PageRank algorithm example

Assume the considered network has only 4 documents (sites): A, B, C and D. The assumption is: the total PR over all sites is 1.

So, initially, PR(A) = PR(B) = PR(C) = PR(D) = 0.25.

But we also examine their links...

Page 25: Łódź, 2009

25

PageRank algorithm example, cont’d

Let’s examine PR(A). All 3 remaining pages point to it, and their vote strength depends on two factors: their own PR, and how many pages they point to.

B points only to A, so A will get 1 * PR(B) = 1*0.25 = 0.25 from it.

C points not only to A, but also to one more site (namely B), so A will get 0.5 * PR(C) = 0.5*0.25 = 0.125 from it.

Finally, D points not only to A, but also to 2 more sites (B and C), so A will get 0.33 * PR(D) = 0.33*0.25 = 0.0825 from it.

So, in total: PR(A) = 0.25 + 0.125 + 0.0825 = 0.4575.

But it’s not the end...

Page 26: Łódź, 2009

26

PageRank algorithm, damping factor

Surfing the web is not only clicking!

(The user may get bored with clicking and either select some URL from his bookmarks, or type in the address.)

It is generally assumed that, at any moment, the probability that the web user continues to click a link is about d = 0.85 (see slide 23 and the 15% prob. of the opposite event).

Then we assume that any web page is equally likely to be selected (i.e. the user makes a random jump).

The corrected PR formula (with the damping factor taken into account) will be:

PR(A) = (1–d) / N + d * (PR(B) / out(B) + PR(C) / out(C) + ...),

where N is the total number of sites, and out(X) is the number of outgoing links from site X, X = B, C, ....
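Evaluated on the four-page example from the previous slides (B → A; C → A, B; D → A, B, C), e.g. with a short Python sketch (illustrative only):

# One application of the damped formula to page A (d = 0.85, N = 4).
d, N = 0.85, 4
pr  = {"B": 0.25, "C": 0.25, "D": 0.25}   # current PR of the pages linking to A
out = {"B": 1, "C": 2, "D": 3}            # their numbers of outgoing links

pr_A = (1 - d) / N + d * sum(pr[p] / out[p] for p in out)
print(pr_A)   # ~0.4271; the next slide gets 0.426375 because it rounds 1/3 to 0.33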

Page 27: Łódź, 2009

27

PageRank algorithm example, cont’d

Let’s calculate PR(A), with the damping factor (d = 0.85, N = 4):

PR(A) = (1–0.85) / 4 + 0.85 * (0.25 + 0.125 + 0.0825) = 0.426375.

So, it is (slightly) decreased.

But what we’ve got is still a poor approximation of the ‘real’ PageRank.

That’s because we also need to calculate PR(B), PR(C) and PR(D), and their ‘new’ values will also affect PR(A). And again, a new PR(A) may affect PR(B), etc.

(It won’t in our example, since A has no outgoing links; but we can also look at the interplay between other nodes...)

So, we have recursive dependencies (‘final’ values can be approximated in a few iterations using algebraic means).
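The iteration itself can be sketched in a few lines of Python on the same four-page graph (B → A; C → A, B; D → A, B, C). One simplification to note: A has no outgoing links, and here its rank is simply allowed to leak, whereas real implementations redistribute the rank of such dangling pages:

# Iterating the damped PageRank formula on the 4-page example.
links = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["A", "B", "C"]}
d, N = 0.85, len(links)
pr = {p: 1 / N for p in links}                      # start from 0.25 each

for _ in range(50):                                 # enough iterations to stabilize here
    incoming = {p: 0.0 for p in links}
    for page, targets in links.items():
        for t in targets:
            incoming[t] += pr[page] / len(targets)  # a page splits its rank among its links
    # A is a dangling node (no out-links); its rank simply leaks in this sketch.
    pr = {p: (1 - d) / N + d * incoming[p] for p in links}

print({p: round(v, 4) for p, v in pr.items()})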

Page 28: Łódź, 2009

28

PageRank algorithm, final words

Actually, the details of PageRank ‘in action’ are not disclosed

(and they are probably modified from time to time).

Note also that the Web is a dynamic structure, so with each visit to a given site (with each crawl),

the Google engine needs to recalculate its PR.

Also, it is important to fight (penalize) malicious attempts to increase one’s PageRank (e.g. site farms).

How exactly Google detects them is again not disclosed...

Page 29: Łódź, 2009

29

PageRank, history

The PageRank idea was developed by Larry Page and then Sergey Brin, later the founders of Google Inc. (1998), starting from around 1995.

Scientific paper: Sergey Brin, Lawrence Page: The Anatomy of a Large-Scale

Hypertextual Web Search Engine. Computer Networks 30(1-7): 107-117 (1998).

Full text: http://www-db.stanford.edu/~backrub/google.html

The PageRank process has been patented, but the patent was assigned to Stanford University (not to Google). Now,

Google has exclusive license rights on the patent from Stanford University. The university received 1.8M shares

in exchange, which were sold in 2005 for $336M.