CS 430 / INFO 430 Information Retrieval, Lecture 10: Probabilistic Information Retrieval

Dec 21, 2015

  • Slide 1
  • CS 430 / INFO 430 Information Retrieval, Lecture 10: Probabilistic Information Retrieval
  • Slide 2
  • Course Administration. Assignment 1: You should have received results by email. Assignment 2: Will be posted on Wednesday.
  • Slide 3
  • Calculation of tf.idf. If you wish to check your calculation of tf.idf, you can use the test data in the two Excel spreadsheets linked from the testData page on the web site. Example: What are the tf and idf for the term monstrous? Definition: $tf_{ij} = f_{ij} / m_i$. From DocumentFreq1.xls, there is one posting for monstrous, in file19.txt. From AllFiles1.xls, $f_{ij} = 1$ and $m_i = 16$, so $tf_{ij} = 1/16 = 0.0625$.
  • Slide 4
  • Calculation of tf.idf (continued). Definition: $idf_j = \log_2(n/n_j) + 1$. From DocumentFreq1.xls, there is one posting for monstrous, so $n_j = 1$, and $n = 20$:
    $idf_j = \log_2(20/1) + 1 = 5.322$
    $tf_{ij} \cdot idf_j = 0.0625 \times 5.322 = 0.3326$
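The arithmetic for monstrous can be checked with a few lines of Python. This is only a sketch: the function names `tf` and `idf` are illustrative, and the inputs ($f_{ij} = 1$, $m_i = 16$, $n = 20$, $n_j = 1$) are the values quoted on the slides rather than values read from the spreadsheets.

```python
import math

def tf(f_ij: int, m_i: int) -> float:
    """Term frequency as defined on the slide: tf_ij = f_ij / m_i."""
    return f_ij / m_i

def idf(n: int, n_j: int) -> float:
    """Inverse document frequency as defined on the slide: log2(n / n_j) + 1."""
    return math.log2(n / n_j) + 1

# Values for the term "monstrous" quoted in the example.
tf_monstrous = tf(f_ij=1, m_i=16)
idf_monstrous = idf(n=20, n_j=1)
weight = tf_monstrous * idf_monstrous

print(f"tf = {tf_monstrous:.4f}, idf = {idf_monstrous:.3f}, tf.idf = {weight:.4f}")
# prints: tf = 0.0625, idf = 5.322, tf.idf = 0.3326
```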
  • Slide 5
  • Three Approaches to Information Retrieval. Many authors divide the methods of information retrieval into three categories: Boolean (based on set theory); vector space (based on linear algebra); probabilistic (based on Bayesian statistics). In practice, the latter two have considerable overlap.
  • Slide 6
  • Probability: independent random variables and conditional probability. Notation: Let a, b be two events, with probabilities $P(a)$ and $P(b)$.
    Independent events: The events a and b are independent if and only if $P(a \wedge b) = P(a)\,P(b)$.
    Conditional probability: $P(a \mid b)$ is the probability of a given b, also called the conditional probability of a given b. $P(a \mid b)\,P(b) = P(a \wedge b) = P(b \mid a)\,P(a)$
  • Slide 7
  • Example: independent random variables and conditional probability.
    Independent: a and b are the results of throwing two dice. $P(a=5 \mid b=3) = P(a=5) = 1/6$
    Not independent: a and b are the results of throwing two dice, and $t = a + b$. $P(t=8 \mid a=2) = 1/6$ and $P(t=8 \mid a=1) = 0$
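The dice probabilities can be confirmed by brute-force enumeration of the 36 equally likely outcomes. The helper `prob` below is a hypothetical name introduced for this sketch. It counts outcomes rather than applying any formula, so it also shows directly what "given" means.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (a, b) of throwing two fair dice.
OUTCOMES = list(product(range(1, 7), repeat=2))

def prob(event, given=lambda a, b: True):
    """P(event | given), computed by counting equally likely outcomes."""
    conditioned = [(a, b) for a, b in OUTCOMES if given(a, b)]
    favourable = [(a, b) for a, b in conditioned if event(a, b)]
    return Fraction(len(favourable), len(conditioned))

# Independent: knowing b tells us nothing about a.
print(prob(lambda a, b: a == 5, given=lambda a, b: b == 3))  # 1/6
print(prob(lambda a, b: a == 5))                              # 1/6

# Not independent: the total t = a + b depends on a.
print(prob(lambda a, b: a + b == 8, given=lambda a, b: a == 2))  # 1/6
print(prob(lambda a, b: a + b == 8, given=lambda a, b: a == 1))  # 0
print(prob(lambda a, b: a + b == 8))                              # 5/36
```

The unconditional value P(t=8) = 5/36 differs from both conditional values, which is what makes t and a dependent.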
  • Slide 8
  • Probability Theory: Bayesian Formulas. Notation: Let a, b be two events. $P(a \mid b)$ is the probability of a given b.
    Derivation: $P(a \mid b)\,P(b) = P(a \wedge b) = P(b \mid a)\,P(a)$
    Bayes Theorem: $P(a \mid b) = \dfrac{P(b \mid a)\,P(a)}{P(b)}$ and $P(\bar{a} \mid b) = \dfrac{P(b \mid \bar{a})\,P(\bar{a})}{P(b)}$, where $\bar{a}$ is the event not a.
  • Slide 9
  • Example of Bayes Theorem. Example: a = weight over 200 lb; b = height over 6 ft.
    [Figure: diagram of the two events, "Over 200 lb" and "Over 6 ft", with regions labeled A, B, C, D]
    $P(a \mid b) = D / (A + D) = D / P(b)$
    $P(b \mid a) = D / (D + C) = D / P(a)$
    D is $P(a \wedge b)$.
  • Slide 10
  • Probability Ranking Principle. "If a reference retrieval system's response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data is made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data." W.S. Cooper
  • Slide 11
  • Probabilistic Ranking. Basic concept: "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents. By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically." Van Rijsbergen
  • Slide 12
  • Concept. $R$ is a set of documents that are guessed to be relevant, and $\bar{R}$ is the complement of $R$.
    1. Guess a preliminary probabilistic description of $R$ and use it to retrieve a first set of documents.
    2. Interact with the user to refine the description.
    3. Repeat, thus generating a succession of approximations to $R$.
  • Slide 13
  • Probabilistic Principle. Basic concept: The probability that a document is relevant to a query is assumed to depend only on the terms in the query and the terms used to index the document. Given a user query q, the ideal answer set, $R$, is the set of all relevant documents. Given a user query q and a document $d_j$ in the collection, the probabilistic model estimates the probability that the user will find $d_j$ relevant, i.e., that $d_j$ is a member of $R$.
  • Slide 14
  • Probabilistic Principle. Similarity measure: The similarity $(d_j, q)$ is the ratio of the probability that $d_j$ is relevant to q to the probability that $d_j$ is not relevant to q. This measure runs from near zero, when the probability of relevance is small, to arbitrarily large values as the probability of relevance approaches one.
  • Slide 15
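  • Probabilistic Principle. Given a query q and a document $d_j$, the model needs an estimate of the probability that the user finds $d_j$ relevant, i.e., $P(R \mid d_j)$.
    $\text{similarity}(d_j, q) = \dfrac{P(R \mid d_j)}{P(\bar{R} \mid d_j)} = \dfrac{P(d_j \mid R)\,P(R)}{P(d_j \mid \bar{R})\,P(\bar{R})}$ (by Bayes Theorem) $= \dfrac{P(d_j \mid R)}{P(d_j \mid \bar{R})} \times k$
    where k is constant ($k = P(R)/P(\bar{R})$), and $P(d_j \mid R)$ is the probability of randomly selecting $d_j$ from $R$.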
  • Slide 16
  • Binary Independence Retrieval Model (BIR). Let $x = (x_1, x_2, \ldots, x_n)$ be the term incidence vector for $d_j$: $x_i = 1$ if term i is in the document and 0 otherwise. Let $q = (q_1, q_2, \ldots, q_n)$ be the term incidence vector for the query. We estimate $P(d_j \mid R)$ by $P(x \mid R)$. If the index terms are independent:
    $P(x \mid R) = P(x_1 \mid R)\,P(x_2 \mid R) \cdots P(x_n \mid R) = \prod_{i=1}^{n} P(x_i \mid R)$
  • Slide 17
  • Binary Independence Retrieval Model (BIR).
    $S = \text{similarity}(d_j, q) = k \prod_{i} \dfrac{P(x_i \mid R)}{P(x_i \mid \bar{R})}$
    Since the $x_i$ are either 0 or 1, this can be written:
    $S = k \prod_{x_i = 1} \dfrac{P(x_i = 1 \mid R)}{P(x_i = 1 \mid \bar{R})} \prod_{x_i = 0} \dfrac{P(x_i = 0 \mid R)}{P(x_i = 0 \mid \bar{R})}$
  • Slide 18
  • Binary Independence Retrieval Model (BIR). For terms that appear in the query let $p_i = P(x_i = 1 \mid R)$ and $r_i = P(x_i = 1 \mid \bar{R})$. For terms that do not appear in the query assume $p_i = r_i$.
    $S = k \prod_{x_i = q_i = 1} \dfrac{p_i}{r_i} \prod_{x_i = 0,\, q_i = 1} \dfrac{1 - p_i}{1 - r_i} = k \prod_{x_i = q_i = 1} \dfrac{p_i (1 - r_i)}{r_i (1 - p_i)} \prod_{q_i = 1} \dfrac{1 - p_i}{1 - r_i}$
    The last product, taken over all query terms, is constant for a given query.
  • Slide 19
  • Binary Independence Retrieval Model (BIR). Taking logs and ignoring factors that are constant for a given query, we have:
    $\text{similarity}(d, q) = \sum \log \left\{ \dfrac{p_i (1 - r_i)}{(1 - p_i)\, r_i} \right\}$
    where the summation is taken over those terms that appear in both the query and the document. This similarity measure can be used to rank all documents against the query q.
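As a minimal sketch of how the log-form similarity could be used for ranking, the code below sums $\log\{p_i(1 - r_i) / ((1 - p_i) r_i)\}$ over the terms shared by query and document. The function name `bir_similarity`, the toy documents, and the values of $p_i$ and $r_i$ are all made up for illustration. The sketch assumes every estimate lies strictly between 0 and 1, since the logarithm is undefined otherwise.

```python
import math

def bir_similarity(doc_terms, query_terms, p, r):
    """Sum log(p_i (1 - r_i) / ((1 - p_i) r_i)) over terms in both query and document."""
    score = 0.0
    for term in set(doc_terms) & set(query_terms):
        p_i, r_i = p[term], r[term]
        score += math.log(p_i * (1 - r_i) / ((1 - p_i) * r_i))
    return score

# Hypothetical estimates for two query terms (illustrative numbers only).
p = {"probabilistic": 0.5, "retrieval": 0.5}
r = {"probabilistic": 0.1, "retrieval": 0.4}

query = ["probabilistic", "retrieval"]
docs = {
    "d1": ["probabilistic", "retrieval", "model"],
    "d2": ["retrieval", "evaluation"],
    "d3": ["boolean", "model"],
}

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: bir_similarity(docs[name], query, p, r), reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

Document d3 shares no term with the query, so its sum is empty and it scores zero. The rarer term (smaller $r_i$) contributes the larger weight.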
  • Slide 20
  • Estimates of $P(x_i \mid R)$. Initial guess, with no information to work from:
    $p_i = P(x_i \mid R) = c$
    $r_i = P(x_i \mid \bar{R}) = n_i / N$
    where c is an arbitrary constant, e.g., 0.5; $n_i$ is the number of documents that contain $x_i$; and N is the total number of documents in the collection.
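To see what the initial guess implies, the sketch below plugs $p_i = c$ and $r_i = n_i / N$ into the per-term weight $\log\{p_i(1 - r_i) / ((1 - p_i) r_i)\}$ from the BIR similarity. With $c = 0.5$ the weight reduces algebraically to $\log((N - n_i)/n_i)$, so rarer terms get larger weights. The collection size and document frequencies are hypothetical, as are the function names.

```python
import math

def initial_estimates(n_i: int, N: int, c: float = 0.5):
    """Initial guess from the slide: p_i = c, r_i = n_i / N."""
    return c, n_i / N

def term_weight(p_i: float, r_i: float) -> float:
    """Per-term weight log(p_i (1 - r_i) / ((1 - p_i) r_i))."""
    return math.log(p_i * (1 - r_i) / ((1 - p_i) * r_i))

N = 1000  # hypothetical collection size
for n_i in (10, 100, 500):  # hypothetical document frequencies
    p_i, r_i = initial_estimates(n_i, N)
    w = term_weight(p_i, r_i)
    # With c = 0.5 the weight reduces to log((N - n_i) / n_i).
    assert math.isclose(w, math.log((N - n_i) / n_i), abs_tol=1e-12)
    print(n_i, round(w, 3))
```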
  • Slide 21
  • Improving the Estimates of $P(x_i \mid R)$. Human feedback: relevance feedback (discussed later). Automatically:
    (a) Run query q using initial values. Consider the t top-ranked documents. Let $s_i$ be the number of these documents that contain the term $x_i$.
    (b) The new estimates are:
    $p_i = P(x_i \mid R) = s_i / t$
    $r_i = P(x_i \mid \bar{R}) = (n_i - s_i) / (N - t)$
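Step (b) can be sketched as follows, treating each of the t top-ranked documents as a set of terms. The function name `feedback_estimates` and the example numbers are assumptions made for illustration. Note that these estimates can reach 0 or 1 (for example when $s_i = t$), in which case the log weight is undefined, and this sketch does not handle that case.

```python
def feedback_estimates(top_docs, term, n_i, N):
    """Re-estimate p_i and r_i from the t top-ranked documents.

    p_i = s_i / t and r_i = (n_i - s_i) / (N - t), as on the slide.
    """
    t = len(top_docs)
    s_i = sum(1 for doc in top_docs if term in doc)  # top documents containing the term
    return s_i / t, (n_i - s_i) / (N - t)

# Hypothetical run: 10 top-ranked documents, 6 of which contain "retrieval";
# the term occurs in n_i = 100 of N = 1000 documents overall.
top_docs = [{"retrieval", "model"}] * 6 + [{"boolean"}] * 4
p_i, r_i = feedback_estimates(top_docs, "retrieval", n_i=100, N=1000)
print(p_i, round(r_i, 4))  # 0.6 0.0949
```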
  • Slide 22
  • Discussion of Probabilistic Model. Advantages: Rests on a firm theoretical basis. Disadvantages: The initial definition of $R$ has to be guessed. Weights ignore term frequency. Assumes independent index terms (as does the vector model).