CS 430 / INFO 430 Information Retrieval
Lecture 12: Probabilistic Information Retrieval

Transcript
Page 1

CS 430 / INFO 430 Information Retrieval

Lecture 12

Probabilistic Information Retrieval

Page 2

Course Administration

Discussion Class: Lucene

The Web site lists four questions to think about as you prepare for this discussion class. Here is another:

Suppose that you are unhappy with the ranking of results provided by Lucene. What can you do about it?

Page 3

Course Administration

Midterm Examination

Wednesday, October 12, 7:30 to 9:00, Upson B17

The topics to be examined are all lectures and discussion class readings before the midterm break.

See the Web site for a sample paper from a previous year.

See the Web site for instructions about laptop computers.

Page 4

Course Administration

Discussion Class on October 19

This class will be held in Phillips Hall 213.

Page 5

Three Approaches to Information Retrieval

Many authors divide the methods of information retrieval into three categories:

Boolean (based on set theory)

Vector space (based on linear algebra)

Probabilistic (based on Bayesian statistics)

In practice, the latter two have considerable overlap.

Page 6

Probability revision: independent random variables and conditional probability

Let a, b be two events, with probability P(a) and P(b).

Independent events

The events a and b are independent if and only if:

P(a ∧ b) = P(a) P(b)

Conditional probability

P(a | b) is the probability of a given b, also called the conditional probability of a given b.

Conditional independence

The events a1, ..., an are conditionally independent if and only if:

P(ai | b ∧ aj) = P(ai | b)   for all i and j with i ≠ j

Page 7

Example: independent random variables and conditional probability

Independent

a and b are the results of throwing two dice

P(a=5 | b=3) = P(a=5) = 1/6

Not independent

a and b are the results of throwing two dice; t = a + b

P(t=8 | a=2) = 1/6

P(t=8 | a=1) = 0
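These values can be checked by enumerating all 36 equally likely outcomes. A minimal sketch in Python (not from the slides; the event definitions follow the example above):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of throwing two dice.
outcomes = list(product(range(1, 7), repeat=2))

def cond_prob(event, given):
    """P(event | given), computed by counting outcomes."""
    given_outcomes = [o for o in outcomes if given(o)]
    hits = [o for o in given_outcomes if event(o)]
    return Fraction(len(hits), len(given_outcomes))

# Independent: P(a=5 | b=3) = P(a=5) = 1/6
print(cond_prob(lambda o: o[0] == 5, lambda o: o[1] == 3))    # 1/6

# Not independent, with t = a + b:
print(cond_prob(lambda o: sum(o) == 8, lambda o: o[0] == 2))  # 1/6
print(cond_prob(lambda o: sum(o) == 8, lambda o: o[0] == 1))  # 0
```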

Page 8

Probability: Conditional probability

P(a) = x + y

P(b) = w + x

P(a | b) = x / (w + x)

P(a | b) P(b) = P(a ∧ b) = P(b | a) P(a)

[Figure: Venn diagram of events a and b, with region probabilities x = P(a ∧ b), y = P(a ∧ ¬b), w = P(¬a ∧ b), z = P(¬a ∧ ¬b)]

Page 9

Probability Theory -- Bayes Theorem

Notation

Let a, b be two events. P(a | b) is the probability of a given b.

Bayes Theorem

P(a | b) = P(b | a) P(a) / P(b)

P(¬a | b) = P(b | ¬a) P(¬a) / P(b)   where ¬a is the event not a

Derivation

P(a | b) P(b) = P(a ∧ b) = P(b | a) P(a)

Page 10

Probability Theory -- Bayes Theorem

Terminology used with Bayes Theorem

P(a | b) = P(b | a) P(a) / P(b)

P(a) is called the prior probability of a

P(a | b) is called the posterior probability of a given b

Page 11

Example of Bayes Theorem

Example

a: Weight over 200 lb.

b: Height over 6 ft.

[Figure: Venn diagram of events a (over 200 lb) and b (over 6 ft), with region probabilities w, x, y, z as on the previous slide]

P(a | b) = x / (w + x) = x / P(b)

P(b | a) = x / (x + y) = x / P(a)

x is P(a ∧ b)
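A small sketch that makes the arithmetic concrete (the region probabilities w, x, y, z are invented for illustration; the identities are the ones on the slide):

```python
# Hypothetical region probabilities for the Venn diagram (they must sum to 1):
# x = P(a and b), y = P(a only), w = P(b only), z = P(neither).
w, x, y, z = 0.10, 0.05, 0.15, 0.70

p_a = x + y                 # P(a): weight over 200 lb
p_b = w + x                 # P(b): height over 6 ft

p_a_given_b = x / (w + x)   # P(a | b) = x / P(b)
p_b_given_a = x / (x + y)   # P(b | a) = x / P(a)

# Bayes Theorem connects the two conditionals through the priors.
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12
print(p_a_given_b, p_b_given_a)   # 0.333..., 0.25
```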

Page 12

Probability Ranking Principle

"If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately a possible on the basis of whatever data is made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."

W.S. Cooper

Page 13

Probabilistic Ranking

Basic concept:

"For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents.

"By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically."

Van Rijsbergen

Page 14

Probabilistic Principle

Basic concept:

The probability that a document is relevant to a query is assumed to depend only on the terms in the query and the terms used to index the document.

Given a user query q, the ideal answer set, R, is the set of all relevant documents.

Given a user query q and a document dj in the collection, the probabilistic model estimates the probability that the user will find dj relevant, i.e., that dj is a member of R.

Page 15

Probabilistic Principle

Initial probabilities:

Given a query q and a document dj, the model needs an estimate of the probability that the user finds dj relevant, i.e., P(R | dj).

Similarity measure:

The similarity (dj, q) is the ratio of the probability that dj is relevant to q, to the probability that dj is not relevant to q.

This measure is near zero when the probability that the document is relevant is small, and becomes large as the probability of relevance approaches one.

Page 16

Probabilistic Principle

similarity (dj, q) = P(R | dj) / P(R̄ | dj)

= [P(dj | R) P(R)] / [P(dj | R̄) P(R̄)]   by Bayes Theorem

= [P(dj | R) / P(dj | R̄)] × k   where k = P(R) / P(R̄) is constant

P(dj | R) is the probability of randomly selecting dj from R; R̄ is the set of non-relevant documents.

Page 17

Binary Independence Retrieval Model (BIR)

Let x = (x1, x2, ... xn) be the term incidence vector for dj. xi = 1 if term i is in the document and 0 otherwise.

We estimate P(dj | R) by P(x | R)

If the index terms are independent (given R)

P(x | R) = P(x1 ∧ x2 ∧ ... ∧ xn | R)

= P(x1 | R) P(x2 | R) ... P(xn | R)

= ∏ P(xi | R)

{This is known as the Naive Bayes probabilistic model.}
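A minimal sketch of this factorization (the per-term probabilities are invented for illustration):

```python
# Hypothetical per-term probabilities P(xi = 1 | R) for a 4-term vocabulary.
p = [0.8, 0.1, 0.5, 0.3]

def prob_given_R(x, p):
    """P(x | R) under term independence: a product of per-term factors."""
    result = 1.0
    for xi, pi in zip(x, p):
        result *= pi if xi == 1 else (1.0 - pi)
    return result

print(prob_given_R([1, 0, 1, 0], p))   # 0.8 * 0.9 * 0.5 * 0.7 = 0.252
```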

Page 18

Binary Independence Retrieval Model (BIR)

S = similarity (dj, q) = k × ∏ P(xi | R) / ∏ P(xi | R̄)

Since the xi are either 0 or 1, this can be written:

S = k × ∏_{xi = 1} [P(xi = 1 | R) / P(xi = 1 | R̄)] × ∏_{xi = 0} [P(xi = 0 | R) / P(xi = 0 | R̄)]

Page 19

Binary Independence Retrieval Model (BIR)

For terms that appear in the query let

pi = P(xi = 1 | R)

ri = P(xi = 1 | R̄)

For terms that do not appear in the query assume

pi = ri

Then:

S = k × ∏_{xi = qi = 1} (pi / ri) × ∏_{xi = 0, qi = 1} [(1 - pi) / (1 - ri)]

= k × ∏_{xi = qi = 1} [pi (1 - ri) / (ri (1 - pi))] × ∏_{qi = 1} [(1 - pi) / (1 - ri)]

where the last product is constant for a given query.

Page 20

Binary Independence Retrieval Model (BIR)

Taking logs and ignoring factors that are constant for a given query, we have:

similarity (d, q) = ∑ log { pi (1 - ri) / [(1 - pi) ri] }

where the summation is taken over those terms that appear in both the query and the document.
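A minimal sketch of ranking with this similarity measure (the documents, the query, and the estimates of pi and ri are invented for illustration):

```python
import math

def bir_similarity(doc_terms, query_terms, p, r):
    """Sum of log[pi (1 - ri) / ((1 - pi) ri)] over terms that appear
    in both the query and the document."""
    score = 0.0
    for term in query_terms & doc_terms:
        pi, ri = p[term], r[term]
        score += math.log(pi * (1 - ri) / ((1 - pi) * ri))
    return score

# Hypothetical estimates for the two query terms.
p = {"probabilistic": 0.6, "retrieval": 0.5}
r = {"probabilistic": 0.1, "retrieval": 0.4}

query = {"probabilistic", "retrieval"}
docs = [{"probabilistic", "retrieval", "model"},
        {"retrieval", "boolean"}]

# Rank the documents by descending similarity.
for d in sorted(docs, key=lambda d: -bir_similarity(d, query, p, r)):
    print(sorted(d), bir_similarity(d, query, p, r))
```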

Page 21

Relationship to Term Vector Space Model

Suppose that, in the term vector space, document d is represented by a vector that has component in dimension i of:

log { pi (1 - ri) / [(1 - pi) ri] }

and the query q is represented by a vector with value 1 in each dimension that corresponds to a term in the query.

Then the Binary Independence Retrieval similarity (d, q) is the inner product of these two vectors.

Thus this approach can be considered as a probabilistic way of determining term weights in the vector space model.
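Continuing the previous sketch, the same score can be computed as an inner product of such vectors (again with invented values; terms with pi = ri get weight log 1 = 0, so they drop out):

```python
import math

vocab = ["probabilistic", "retrieval", "model", "boolean"]
p = {"probabilistic": 0.6, "retrieval": 0.5}
r = {"probabilistic": 0.1, "retrieval": 0.4}

def weight(term):
    # Terms not in the query have pi = ri, giving weight log(1) = 0.
    pi, ri = p.get(term, 0.5), r.get(term, 0.5)
    return math.log(pi * (1 - ri) / ((1 - pi) * ri))

doc = {"probabilistic", "retrieval", "model"}
doc_vec = [weight(t) if t in doc else 0.0 for t in vocab]
query_vec = [1.0 if t in {"probabilistic", "retrieval"} else 0.0 for t in vocab]

# The inner product reproduces the BIR similarity from the previous sketch.
print(sum(dv * qv for dv, qv in zip(doc_vec, query_vec)))   # ~3.008
```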

Page 22

Practical Application

The probabilistic model is an alternative to the term vector space model.

The Binary Independence Retrieval similarity measure is used instead of the cosine similarity measure to rank all documents against the query q.

Techniques such as stoplists and stemming can be used with either model.

Variations to the model result in slightly different expressions for the similarity measure.

Page 23

Practical Application

Early uses of probabilistic information retrieval were based on relevance feedback.

R is a set of documents that are guessed to be relevant and R̄ is its complement.

1. Guess a preliminary probabilistic description of R and use it to retrieve a first set of documents.

2. Interact with the user to refine the description of R (relevance feedback).

3. Repeat, thus generating a succession of approximations to R.

Page 24

Initial Estimates of P(xi | R)

Initial guess, with no information to work from:

pi = P(xi | R) = c

ri = P(xi | R̄) = ni / N

where:

c is an arbitrary constant, e.g., 0.5

ni is the number of documents that contain xi

N is the total number of documents in the collection

Page 25

Initial Similarity Estimates

With these assumptions:

similarity (d, q) = ∑ log { pi (1 - ri) / [(1 - pi) ri] }

= ∑ log {(N - ni) / ni}

where the summation is taken over those terms that appear in both the query and the document.
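A sketch of this initial, collection-statistics-only ranking (N and the document frequencies ni are invented for illustration); note that log{(N - ni)/ni} behaves like an inverse document frequency weight:

```python
import math

N = 1000                                       # documents in the collection
n = {"probabilistic": 20, "retrieval": 250}    # document frequencies ni

def initial_similarity(doc_terms, query_terms):
    """With pi = 0.5 and ri = ni / N, the weight reduces to log((N - ni) / ni)."""
    return sum(math.log((N - n[t]) / n[t]) for t in query_terms & doc_terms)

print(initial_similarity({"probabilistic", "retrieval", "model"},
                         {"probabilistic", "retrieval"}))   # log(49) + log(3)
```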

Page 26

Improving the Estimates of P(xi | R)

Human feedback -- relevance feedback

Automatically

(a) Run query q using initial values. Consider the t top ranked documents. Let si be the number of these documents that contain the term xi.

(b) The new estimates are:

pi = P(xi | R) = si / t

ri = P(xi | R̄) = (ni - si) / (N - t)
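A sketch of step (b) with invented values for N, t, ni, and si:

```python
N, t = 1000, 10    # collection size; number of top-ranked documents examined
n_i = 250          # documents in the collection containing term xi
s_i = 7            # top-ranked documents containing term xi

p_i = s_i / t                  # new estimate of P(xi | R)
r_i = (n_i - s_i) / (N - t)    # new estimate of P(xi | R̄)

print(p_i, r_i)                # 0.7  0.2454...
```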

Page 27

Discussion of Probabilistic Model

Advantages

• Based on a firm theoretical foundation

Disadvantages

• Initial definition of R has to be guessed.

• Weights ignore term frequency

• Assumes independent index terms (as does the vector space model)