2011.02.16 - SLIDE 1 IS 240 – Spring 2011 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval Lecture 9: Probabilistic Retrieval

Dec 20, 2015


2011.02.16 - SLIDE 1IS 240 – Spring 2011

Prof. Ray Larson University of California, Berkeley

School of Information

Principles of Information Retrieval

Lecture 9: Probabilistic Retrieval


2011.02.16 - SLIDE 2IS 240 – Spring 2011

Mini-TREC

• Need to make groups
  – Today – Give me a note with group members (names and login names)
• Systems
  – SMART (not recommended…)
    • ftp://ftp.cs.cornell.edu/pub/smart
  – MG (We have a special version if interested)
    • http://www.mds.rmit.edu.au/mg/welcome.html
  – Cheshire II & 3
    • II = ftp://cheshire.berkeley.edu/pub/cheshire & http://cheshire.berkeley.edu
    • 3 = http://cheshire3.sourceforge.org
  – Zprise (Older search system from NIST)
    • http://www.itl.nist.gov/iaui/894.02/works/zp2/zp2.html
  – IRF (new Java-based IR framework from NIST)
    • http://www.itl.nist.gov/iaui/894.02/projects/irf/irf.html
  – Lemur
    • http://www-2.cs.cmu.edu/~lemur
  – Lucene (Java-based Text search engine)
    • http://jakarta.apache.org/lucene/docs/index.html
  – Galago (Also Java-based)
    • http://www.galagosearch.org
  – Others?? (See http://searchtools.com )


2011.02.16 - SLIDE 3IS 240 – Spring 2011

Mini-TREC

• Proposed Schedule
  – February 9 – Database and previous Queries
  – March 2 – Report on system acquisition and setup
  – March 9 – New Queries for testing…
  – April 18 – Results due
  – April 20 – Results and system rankings
  – April 27 – Group reports and discussion


2011.02.16 - SLIDE 4IS 240 – Spring 2011

Today

• Review
  – Clustering and Automatic Classification
• Probabilistic Models
  – Probabilistic Indexing (Model 1)
  – Probabilistic Retrieval (Model 2)
  – Unified Model (Model 3)
  – Model 0 and real-world IR
  – Regression Models
  – The “Okapi Weighting Formula”


2011.02.16 - SLIDE 5IS 240 – Spring 2011

Today

• Review
  – Clustering and Automatic Classification
• Probabilistic Models
  – Probabilistic Indexing (Model 1)
  – Probabilistic Retrieval (Model 2)
  – Unified Model (Model 3)
  – Model 0 and real-world IR
  – Regression Models
  – The “Okapi Weighting Formula”


2011.02.16 - SLIDE 6IS 240 – Spring 2011

Review: IR Models

• Set Theoretic Models
  – Boolean
  – Fuzzy
  – Extended Boolean

• Vector Models (Algebraic)

• Probabilistic Models (probabilistic)


2011.02.16 - SLIDE 7IS 240 – Spring 2011

Similarity Measures

• Simple matching (coordination level match): $|Q \cap D|$
• Dice’s Coefficient: $\dfrac{2\,|Q \cap D|}{|Q| + |D|}$
• Jaccard’s Coefficient: $\dfrac{|Q \cap D|}{|Q \cup D|}$
• Cosine Coefficient: $\dfrac{|Q \cap D|}{|Q|^{1/2} \times |D|^{1/2}}$
• Overlap Coefficient: $\dfrac{|Q \cap D|}{\min(|Q|, |D|)}$
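The five coefficients named on this slide can be sketched for queries and documents treated as plain term sets. This is a minimal illustration only: the function name `similarity_measures` and the sample terms are made up for the example, and both sets are assumed to be non-empty.

```python
import math

def similarity_measures(q, d):
    """Return the five set-based coefficients for term sets q and d.

    Assumes both sets are non-empty (otherwise the ratios are undefined).
    """
    q, d = set(q), set(d)
    inter = len(q & d)  # |Q ∩ D|
    return {
        "simple": inter,
        "dice": 2 * inter / (len(q) + len(d)),
        "jaccard": inter / len(q | d),
        "cosine": inter / math.sqrt(len(q) * len(d)),
        "overlap": inter / min(len(q), len(d)),
    }

# Hypothetical query and document term sets.
query = {"probabilistic", "retrieval", "model"}
doc = {"probabilistic", "model", "ranking", "principle"}
scores = similarity_measures(query, doc)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

All five share the same numerator idea (the size of the term overlap) and differ only in how they normalize it.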


2011.02.16 - SLIDE 8IS 240 – Spring 2011

Documents in Vector Space

[Figure: documents D1 to D11 plotted in a three-dimensional term space with axes t1, t2 and t3]


2011.02.16 - SLIDE 9IS 240 – Spring 2011

Vector Space Visualization


2011.02.16 - SLIDE 10IS 240 – Spring 2011

Vector Space with Term Weights and Cosine Matching

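[Figure: query Q and documents D1 and D2 plotted as vectors against the axes Term A and Term B, each scaled from 0 to 1.0]

$D_i = (d_{i1}, w_{d_{i1}};\ d_{i2}, w_{d_{i2}};\ \dots;\ d_{it}, w_{d_{it}})$
$Q = (q_{i1}, w_{q_{i1}};\ q_{i2}, w_{q_{i2}};\ \dots;\ q_{it}, w_{q_{it}})$

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j}\, w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2 \, \sum_{j=1}^{t} (w_{d_{ij}})^2}}$$

Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)

$$sim(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} = 0.98$$

$$sim(Q, D_1) = \frac{0.56}{\sqrt{0.58}} = 0.73$$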
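The cosine match for the example vectors Q, D1 and D2 on this slide can be recomputed with a few lines of Python. This is a small sketch; the function name `cosine` is chosen for the example, and the vectors are assumed to be non-zero and of equal length.

```python
import math

def cosine(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm = math.sqrt(sum(w * w for w in q) * sum(w * w for w in d))
    return dot / norm

# Term weights taken from the slide: (Term A, Term B).
Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

print("sim(Q, D1) =", cosine(Q, D1))
print("sim(Q, D2) =", cosine(Q, D2))
```

D2 ranks above D1 because its direction is closer to that of Q, even though D1 has the larger weight on Term A.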


2011.02.16 - SLIDE 11IS 240 – Spring 2011

Document/Document Matrix

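$$\begin{array}{c|cccc}
 & D_1 & D_2 & \cdots & D_n \\ \hline
D_1 & d_{11} & d_{12} & \cdots & d_{1n} \\
D_2 & d_{21} & d_{22} & \cdots & d_{2n} \\
\vdots & \vdots & \vdots & & \vdots \\
D_n & d_{n1} & d_{n2} & \cdots & d_{nn}
\end{array}$$

$d_{ij}$ = similarity of $D_i$ to $D_j$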


2011.02.16 - SLIDE 12IS 240 – Spring 2011

Hierarchical Methods

Single Link Dissimilarity Matrix

2 | .4
3 | .4  .2
4 | .3  .3  .3
5 | .1  .4  .4  .1
  +----------------
     1   2   3   4

Hierarchical methods: Polythetic, Usually Exclusive, Ordered. Clusters are order-independent.

$$dissimilarity = 1 - \frac{|A \cap B|}{|A| + |B|}$$


2011.02.16 - SLIDE 13IS 240 – Spring 2011

Threshold = .1

Single Link Dissimilarity Matrix

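2 | .4
3 | .4  .2
4 | .3  .3  .3
5 | .1  .4  .4  .1
  +----------------
     1   2   3   4

Links at this threshold (1 = dissimilarity at or below the threshold):

2 | 0
3 | 0  0
4 | 0  0  0
5 | 1  0  0  1
  +------------
    1  2  3  4

[Figure: graph of documents 1 to 5 with an edge for each pair marked 1 above]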


2011.02.16 - SLIDE 14IS 240 – Spring 2011

Threshold = .2

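2 | .4
3 | .4  .2
4 | .3  .3  .3
5 | .1  .4  .4  .1
  +----------------
     1   2   3   4

Links at this threshold (1 = dissimilarity at or below the threshold):

2 | 0
3 | 0  1
4 | 0  0  0
5 | 1  0  0  1
  +------------
    1  2  3  4

[Figure: graph of documents 1 to 5 with an edge for each pair marked 1 above]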


2011.02.16 - SLIDE 15IS 240 – Spring 2011

Threshold = .3

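2 | .4
3 | .4  .2
4 | .3  .3  .3
5 | .1  .4  .4  .1
  +----------------
     1   2   3   4

Links at this threshold (1 = dissimilarity at or below the threshold):

2 | 0
3 | 0  1
4 | 1  1  1
5 | 1  0  0  1
  +------------
    1  2  3  4

[Figure: graph of documents 1 to 5 with an edge for each pair marked 1 above]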
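The three threshold slides can be reproduced with a short sketch: link every pair whose dissimilarity is at or below the threshold, then read off the connected components as single-link clusters. The union-find helper and the name `single_link_clusters` are choices made for this example, not something taken from the lecture; the dissimilarity values are the ones in the matrix above.

```python
# Lower-triangular dissimilarity matrix from the slides, keyed by (row, column).
DISSIM = {
    (2, 1): 0.4,
    (3, 1): 0.4, (3, 2): 0.2,
    (4, 1): 0.3, (4, 2): 0.3, (4, 3): 0.3,
    (5, 1): 0.1, (5, 2): 0.4, (5, 3): 0.4, (5, 4): 0.1,
}

def single_link_clusters(dissim, threshold, items=(1, 2, 3, 4, 5)):
    """Group items connected by any chain of links with dissimilarity <= threshold."""
    parent = {i: i for i in items}

    def find(i):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), value in dissim.items():
        if value <= threshold + 1e-9:  # small tolerance for float comparison
            parent[find(a)] = find(b)

    clusters = {}
    for i in items:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values())

for t in (0.1, 0.2, 0.3):
    print(t, single_link_clusters(DISSIM, t))
```

Raising the threshold only ever merges clusters, which is what gives the single-link hierarchy its ordering.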


2011.02.16 - SLIDE 16IS 240 – Spring 2011

K-means & Rocchio Clustering

Agglomerative methods: Polythetic, Exclusive or Overlapping, Unordered. Clusters are order-dependent.

[Figure: a set of documents (Doc)]

1. Select initial centers (i.e. seed the space)
2. Assign docs to highest matching centers and compute centroids
3. Reassign all documents to centroid(s)

Rocchio’s method
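The three steps listed on this slide can be sketched as a small k-means-style loop. This is an illustration under stated assumptions: documents are short weight vectors, "highest matching" is taken to mean highest cosine similarity, the seeds are picked by hand, and the helper names (`assign`, `centroid`, `cluster`) are invented for the example.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))

def assign(docs, centers):
    """Put each document in the group of its best-matching center."""
    groups = [[] for _ in centers]
    for doc in docs:
        best = max(range(len(centers)), key=lambda k: cosine(doc, centers[k]))
        groups[best].append(doc)
    return groups

def cluster(docs, seeds, passes=2):
    centers = list(seeds)                      # step 1: seed the space
    groups = assign(docs, centers)             # step 2: assign to best center
    for _ in range(passes):
        # Recompute centroids (an empty group keeps its old center).
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
        groups = assign(docs, centers)         # step 3: reassign to centroids
    return groups

# Hypothetical two-term document vectors.
docs = [(0.9, 0.1), (0.8, 0.3), (0.2, 0.7), (0.1, 0.9)]
groups = cluster(docs, seeds=[docs[0], docs[3]])
print(groups)
```

Because the result depends on which seeds are chosen and on the order of reassignment, a different seed set can yield different clusters.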


2011.02.16 - SLIDE 17IS 240 – Spring 2011

Clustering

• Advantages:
  – See some main themes
• Disadvantage:
  – Many ways documents could group together are hidden

• Thinking point: what is the relationship to classification systems and facets?


2011.02.16 - SLIDE 18IS 240 – Spring 2011

Automatic Class Assignment

[Figure: documents (Doc) submitted to a Search Engine]

1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category

Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered. Clusters are order-independent, usually based on an intellectually derived scheme.
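A toy version of the four assignment steps might look like the following. Everything specific here is an assumption made for illustration: the class names, the pseudo-document terms, and the simple shared-term count used in place of a real search engine's ranking function.

```python
from collections import Counter

def score(doc_terms, class_terms):
    """Count shared term occurrences (a stand-in for a real ranking function)."""
    doc, cls = Counter(doc_terms), Counter(class_terms)
    return sum(min(doc[t], cls[t]) for t in doc)

def assign_classes(doc_terms, pseudo_docs, threshold=None):
    """Rank classes for a document; return the top class, or all over threshold."""
    ranked = sorted(
        ((score(doc_terms, terms), name) for name, terms in pseudo_docs.items()),
        reverse=True,
    )
    if threshold is None:
        return [ranked[0][1]]                              # top-ranked category
    return [name for s, name in ranked if s >= threshold]  # all over threshold

# Step 1: hypothetical pseudo-documents, one per class.
pseudo_docs = {
    "automobiles": "car engine vehicle motor wheel".split(),
    "libraries": "catalog book library index shelf".split(),
}
# Step 2: search using the document's own contents as the query.
doc = "the library catalog is an index of every book".split()

print(assign_classes(doc, pseudo_docs))               # top-ranked class only
print(assign_classes(doc, pseudo_docs, threshold=1))  # every class scoring >= 1
```

Using a threshold allows overlapping assignment (a document can land in several classes), while taking only the top-ranked class gives exclusive assignment.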


2011.02.16 - SLIDE 19IS 240 – Spring 2011

Automatic Categorization in Cheshire II

• Cheshire supports a method we call “classification clustering” that relies on having a set of records that have been indexed using some controlled vocabulary.

• Classification clustering has the following steps…


2011.02.16 - SLIDE 20IS 240 – Spring 2011

Start with a collection of documents.


2011.02.16 - SLIDE 21IS 240 – Spring 2011

Classify and index with controlled vocabulary.

Ideally, use a database already indexed.

[Figure label: Index]


2011.02.16 - SLIDE 22IS 240 – Spring 2011

Problem: Controlled vocabularies can be difficult for people to use.

“pass mtr veh spark ign eng”

[Figure label: Index]


2011.02.16 - SLIDE 23IS 240 – Spring 2011

Solution: Entry Level Vocabulary Indexes.

“pass mtr veh spark ign eng” = “Automobile”

[Figure labels: Index, EVI]


2011.02.16 - SLIDE 24IS 240 – Spring 2011

EVI example

User Query: “Automobile”

EVI 1
Index term: “pass mtr veh spark ign eng”

EVI 2
Index term: “automobiles” OR “internal combustible engines”


2011.02.16 - SLIDE 25IS 240 – Spring 2011

But why stop there?

Index

EVI


2011.02.16 - SLIDE 26IS 240 – Spring 2011

“Which EVI do I use?”

[Figure: multiple Index and EVI pairs]


2011.02.16 - SLIDE 27IS 240 – Spring 2011

EVI to EVIs

[Figure: multiple Index and EVI pairs, plus EVI2]


2011.02.16 - SLIDE 28IS 240 – Spring 2011

Find Plutonium

In: Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil

Why not treat language the same way?


2011.02.16 - SLIDE 29IS 240 – Spring 2011

Find Plutonium

In: Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil

Statistical association: $W(c,t) = 2[\log L(p_1, a, a+b) + \dots]$

Digital library resources


2011.02.16 - SLIDE 30IS 240 – Spring 2011

Cheshire II - Two-Stage Retrieval

• Using the LC Classification System
  – Pseudo-Document created for each LC class containing terms derived from “content-rich” portions of documents in that class (e.g., subject headings, titles, etc.)
  – Permits searching by any term in the class
  – Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first.
  – User selects classes to feed back for the “second stage” search of documents.
• Can be used with any classified/Indexed collection.


2011.02.16 - SLIDE 31IS 240 – Spring 2011

Cheshire EVI Demo


2011.02.16 - SLIDE 32IS 240 – Spring 2011

Problems with Vector Space

• There is no real theoretical basis for the assumption of a term space
  – it is more for visualization than having any real basis
  – most similarity measures work about the same regardless of model
• Terms are not really orthogonal dimensions
  – Terms are not independent of all other terms


2011.02.16 - SLIDE 33IS 240 – Spring 2011

Today

• Review
  – Clustering and Automatic Classification
• Probabilistic Models
  – Probabilistic Indexing (Model 1)
  – Probabilistic Retrieval (Model 2)
  – Unified Model (Model 3)
  – Model 0 and real-world IR
  – Regression Models
  – The “Okapi Weighting Formula”


2011.02.16 - SLIDE 34IS 240 – Spring 2011

Probabilistic Models

• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query

• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)

• Relies on accurate estimates of probabilities


2011.02.16 - SLIDE 35IS 240 – Spring 2011

Probability Ranking Principle

• If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.

Stephen E. Robertson, J. Documentation 1977


2011.02.16 - SLIDE 36IS 240 – Spring 2011

Model 1 – Maron and Kuhns

• Concerned with estimating probabilities of relevance at the point of indexing:
  – If a patron came with a request using term ti, what is the probability that she/he would be satisfied with document Dj ?


2011.02.16 - SLIDE 37IS 240 – Spring 2011

Probability theory (detour)

• To get to the Bayesian statistical inference used in both model 1 and 2…

The "chain rule" says:

$$P(A,B) = P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$

The "partition rule" says:

$$P(B) = P(A,B) + P(\bar{A},B)$$

Also, "and" is commutative:

$$P(B,C \mid A) = P(C,B \mid A)$$


2011.02.16 - SLIDE 38IS 240 – Spring 2011

Probability Theory

• The “Bayes’ Rule” (AKA: Bayesian Inference) says

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \left[\frac{P(B \mid A)}{\sum_{X \in \{A, \bar{A}\}} P(B \mid X)\,P(X)}\right] P(A)$$


2011.02.16 - SLIDE 39IS 240 – Spring 2011

Bayes’ theorem

$$p(A \mid B) = \frac{p(A)\,p(B \mid A)}{p(B)}$$

• $p(A)$: probability of A
• $p(B)$: probability of B
• $p(B \mid A)$: probability of B given A
• $p(A \mid B)$: probability of A given B

For example: A: disease, B: symptom


2011.02.16 - SLIDE 40IS 240 – Spring 2011

Bayes’ Theorem: Application

Toss a fair coin. If it lands head up, draw a ball from box 1; otherwise, draw a ball from box 2. If the ball is blue, what is the probability that it is drawn from box 2?

Box1: p(box1) = .5, P(red ball | box1) = .4, P(blue ball | box1) = .6
Box2: p(box2) = .5, P(red ball | box2) = .5, P(blue ball | box2) = .5

$$p(\text{box2} \mid \text{blue ball}) = \frac{p(\text{box2})\,p(\text{blue ball} \mid \text{box2})}{p(\text{blue ball})}$$

$$= \frac{p(\text{box2})\,p(\text{blue ball} \mid \text{box2})}{p(\text{box1})\,p(\text{blue ball} \mid \text{box1}) + p(\text{box2})\,p(\text{blue ball} \mid \text{box2})}$$

$$= \frac{.5 \times .5}{.5 \times .6 + .5 \times .5} = \frac{.25}{.55} = 0.4545...$$
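The two-box calculation is a direct application of Bayes' rule with the partition rule in the denominator, and can be checked with a few lines of Python. The function name `posterior` and the dictionary layout are choices made for this sketch; the probabilities are the ones given on the slide.

```python
def posterior(prior, likelihood, hypothesis):
    """Bayes' rule: P(h | e) = P(h) P(e | h) / sum over h' of P(h') P(e | h')."""
    evidence = sum(prior[h] * likelihood[h] for h in prior)  # P(blue ball)
    return prior[hypothesis] * likelihood[hypothesis] / evidence

prior = {"box1": 0.5, "box2": 0.5}            # fair coin picks the box
blue_given_box = {"box1": 0.6, "box2": 0.5}   # P(blue ball | box)

p = posterior(prior, blue_given_box, "box2")
print(f"P(box2 | blue ball) = {p:.4f}")
```

The same function gives P(box1 | blue ball) by changing the last argument, and the two posteriors sum to 1.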


2011.02.16 - SLIDE 41IS 240 – Spring 2011

Bayes’ Theorem: Application in IR

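Goal: want to estimate the probability that a document D is relevant to a given query.

$$p(R \mid D) = \frac{p(R)\,p(D \mid R)}{p(D)} = \frac{p(R)\,p(D \mid R)}{p(R)\,p(D \mid R) + p(\bar{R})\,p(D \mid \bar{R})}$$

It is often useful to estimate log odds of probability of relevance:

$$\log O(R \mid D) = \log \frac{p(R \mid D)}{p(\bar{R} \mid D)} = \log \frac{p(D \mid R)\,p(R)}{p(D \mid \bar{R})\,p(\bar{R})}$$

where $p(R \mid D) = 1 - p(\bar{R} \mid D)$.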


2011.02.16 - SLIDE 42IS 240 – Spring 2011

Bayes’ Theorem: Application in IR

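• If documents are represented by binary vectors,

$$D = (x_1, x_2, \dots, x_n)$$

where $x_t$ is 1 if document D contains term t and 0 otherwise, then

$$p(D \mid R) = \prod_{t=1}^{n} p_t^{x_t} (1 - p_t)^{1 - x_t}$$

$$p(D \mid \bar{R}) = \prod_{t=1}^{n} u_t^{x_t} (1 - u_t)^{1 - x_t}$$

where $p_t = P(x_t = 1 \mid R = 1)$ is the prob. of a term appearing in a relevant document and $u_t$ is the prob. of appearing in a non-relevant doc. Taking the log of the ratio of these two products and collecting the terms that do not depend on $x_t$ into a constant gives

$$\log O(R \mid D) = \sum_{t=1}^{n} w_t x_t + \text{constant}$$

$$w_t = \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)}$$

Robertson & Sparck Jones term weighting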
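As a rough check on the binary independence derivation, the sketch below scores a binary document vector with the usual term weight w_t = log[p_t(1 - u_t) / (u_t(1 - p_t))] and confirms that it differs from the full log likelihood ratio only by a document-independent constant. The probabilities are invented for the example, and the function names are illustrative.

```python
import math

def term_weight(p_t, u_t):
    """w_t = log[p_t (1 - u_t) / (u_t (1 - p_t))]."""
    return math.log(p_t * (1 - u_t) / (u_t * (1 - p_t)))

def score(x, p, u):
    """Sum of w_t over the terms present in the document (x_t = 1)."""
    return sum(term_weight(pt, ut) for xt, pt, ut in zip(x, p, u) if xt)

def log_likelihood_ratio(x, p, u):
    """log p(D|R) - log p(D|not R) under the binary independence products."""
    total = 0.0
    for xt, pt, ut in zip(x, p, u):
        total += math.log(pt / ut) if xt else math.log((1 - pt) / (1 - ut))
    return total

# Hypothetical per-term probabilities for a three-term vocabulary.
p = [0.8, 0.5, 0.3]   # P(term present | relevant)
u = [0.2, 0.5, 0.1]   # P(term present | non-relevant)
x = [1, 0, 1]         # binary document vector

constant = log_likelihood_ratio([0, 0, 0], p, u)  # same for every document
print(score(x, p, u) + constant, log_likelihood_ratio(x, p, u))
```

Since the constant (and the prior odds log O(R)) is the same for every document, ranking by the summed weights alone gives the same order as ranking by log O(R | D).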


2011.02.16 - SLIDE 43IS 240 – Spring 2011

Bayes Theorem: Application in IR


2011.02.16 - SLIDE 44IS 240 – Spring 2011

Model 1

• A patron submits a query (call it Q) consisting of some specification of her/his information need. Different patrons submitting the same stated query may differ as to whether or not they judge a specific document to be relevant. The function of the retrieval system is to compute for each individual document the probability that it will be judged relevant by a patron who has submitted query Q.

Robertson, Maron & Cooper, 1982


2011.02.16 - SLIDE 45IS 240 – Spring 2011

Model 1 Bayes

• A is the class of events of using the system
• Di is the class of events of Document i being judged relevant
• Ij is the class of queries consisting of the single term Ij
• P(Di|A,Ij) = probability that if a query is submitted to the system then a relevant document is retrieved

$$P(D_i \mid A, I_j) = \frac{P(D_i \mid A) \cdot P(I_j \mid A, D_i)}{P(I_j \mid A)}$$


2011.02.16 - SLIDE 46IS 240 – Spring 2011

Model 2

• Documents have many different properties; some documents have all the properties that the patron asked for, and other documents have only some or none of the properties. If the inquiring patron were to examine all of the documents in the collection she/he might find that some having all the sought after properties were relevant, but others (with the same properties) were not relevant. And conversely, he/she might find that some of the documents having none (or only a few) of the sought after properties were relevant, others not. The function of a document retrieval system is to compute the probability that a document is relevant, given that it has one (or a set) of specified properties.

Robertson, Maron & Cooper, 1982


2011.02.16 - SLIDE 47IS 240 – Spring 2011

Model 2 – Robertson & Sparck Jones

Given a term t and a query q

                         Document Relevance
                         +         -
Document       +         r         n-r          n
indexing       -         R-r       N-n-R+r      N-n
                         R         N-R          N


2011.02.16 - SLIDE 48IS 240 – Spring 2011

Robertson-Sparck Jones Weights

• Retrospective formulation --

$$\log \frac{\left(\dfrac{r}{R - r}\right)}{\left(\dfrac{n - r}{N - n - R + r}\right)}$$


2011.02.16 - SLIDE 49IS 240 – Spring 2011

Robertson-Sparck Jones Weights

• Predictive formulation

$$w^{(1)} = \log \frac{\left(\dfrac{r + 0.5}{R - r + 0.5}\right)}{\left(\dfrac{n - r + 0.5}{N - n - R + r + 0.5}\right)}$$
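The predictive Robertson-Sparck Jones weight is straightforward to compute from the four contingency-table counts. In this sketch the function name `rsj_weight` and the sample counts are illustrative; the 0.5 smoothing constant is exposed as a parameter `k` only for convenience.

```python
import math

def rsj_weight(r, n, R, N, k=0.5):
    """Predictive Robertson-Sparck Jones weight.

    r: relevant documents containing the term
    n: documents containing the term
    R: relevant documents
    N: documents in the collection
    k: smoothing constant added to each cell (0.5 in the predictive form)
    """
    relevant_odds = (r + k) / (R - r + k)
    nonrelevant_odds = (n - r + k) / (N - n - R + r + k)
    return math.log(relevant_odds / nonrelevant_odds)

# Hypothetical counts: the term occurs in 10 of 1000 documents,
# and in 5 of the 10 documents judged relevant.
print(rsj_weight(r=5, n=10, R=10, N=1000))

# With no relevance information (r = R = 0) the weight reduces to an
# IDF-like quantity, log((N - n + 0.5) / (n + 0.5)).
print(rsj_weight(r=0, n=10, R=0, N=1000))
```

The 0.5 added to each cell keeps the weight finite when a cell of the table is zero, which the retrospective form cannot guarantee.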


2011.02.16 - SLIDE 50IS 240 – Spring 2011

Probabilistic Models: Some Unifying Notation

• D = All present and future documents

• Q = All present and future queries

• (Di,Qj) = A document query pair

• x = class of similar documents, $x \subseteq D$

• y = class of similar queries, $y \subseteq Q$

• Relevance is a relation:

$$R = \{(D_i, Q_j) \mid D_i \in D,\ Q_j \in Q,\ \text{document } D_i \text{ is judged relevant by the user submitting } Q_j\}$$


2011.02.16 - SLIDE 51IS 240 – Spring 2011

Probabilistic Models

• Model 1 -- Probabilistic Indexing, P(R|y,Di)

• Model 2 -- Probabilistic Querying, P(R|Qj,x)

• Model 3 -- Merged Model, P(R| Qj, Di)

• Model 0 -- P(R|y,x)

• Probabilities are estimated based on prior usage or relevance estimation


2011.02.16 - SLIDE 52IS 240 – Spring 2011

Probabilistic Models

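[Figure: the document space D and the query space Q, with a document class x containing document Di and a query class y containing query Qj]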


2011.02.16 - SLIDE 53IS 240 – Spring 2011

Logistic Regression

• Based on work by William Cooper, Fred Gey and Daniel Dabney

• Builds a regression model for relevance prediction based on a set of training data

• Uses less restrictive independence assumptions than Model 2
  – Linked Dependence


2011.02.16 - SLIDE 54IS 240 – Spring 2011

Dependence assumptions

• In Model 2 term independence was assumed
  – P(R|A,B) = P(R|A)P(R|B)
  – This is not very realistic as we have discussed before
• Cooper, Gey, and Dabney proposed linked dependence:
  – If two or more retrieval clues are statistically dependent in the set of all relevance-related query-document pairs then they are statistically dependent to a corresponding degree in the set of all nonrelevance-related pairs.
  – Thus dependency in the relevant and nonrelevant documents is linked


2011.02.16 - SLIDE 55IS 240 – Spring 2011

Linked Dependence

• Linked Dependence Assumption: there exists a positive real number K such that the following two conditions hold:
  – $P(A,B \mid R) = K\,P(A \mid R)\,P(B \mid R)$
  – $P(A,B \mid \bar{R}) = K\,P(A \mid \bar{R})\,P(B \mid \bar{R})$
  – When K=1 this is the same as binary independence

$$\frac{P(A_1, \dots, A_N \mid R)}{P(A_1, \dots, A_N \mid \bar{R})} = \prod_{i=1}^{N} \frac{P(A_i \mid R)}{P(A_i \mid \bar{R})}$$


2011.02.16 - SLIDE 56IS 240 – Spring 2011

Linked Dependence

• The Odds of an event E:

$$O(E) = \frac{P(E)}{P(\bar{E})}$$

$$\frac{O(R \mid A_1, \dots, A_N)}{O(R)} = \prod_{i=1}^{N} \frac{O(R \mid A_i)}{O(R)}$$

(See paper for details)

• Multiplying by O(R) and taking logs we get:

$$\log O(R \mid A_1, \dots, A_N) = \log O(R) + \sum_{i=1}^{N} [\log O(R \mid A_i) - \log O(R)]$$
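The log-odds form of linked dependence says that each clue shifts the prior log odds by its own difference from the prior. A minimal sketch of that combination step follows; the probabilities are made up, and the function names `log_odds` and `combine` are chosen for the example.

```python
import math

def log_odds(p):
    """log O(E) = log[P(E) / P(not E)] for a probability 0 < p < 1."""
    return math.log(p / (1 - p))

def combine(prior_p, clue_ps):
    """log O(R | A_1..A_N) = log O(R) + sum_i [log O(R | A_i) - log O(R)]."""
    base = log_odds(prior_p)
    return base + sum(log_odds(p) - base for p in clue_ps)

# Hypothetical values: prior P(R) = 0.1, and two clues that individually
# raise the probability of relevance to 0.3 and 0.4.
combined = combine(0.1, [0.3, 0.4])
print("combined log odds:", combined)
```

A clue whose individual probability equals the prior contributes zero, so uninformative clues leave the estimate unchanged.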


2011.02.16 - SLIDE 57IS 240 – Spring 2011

Logistic Regression

• The logistic function:

$$f(z) = \frac{e^z}{e^z + 1} = \frac{1}{1 + e^{-z}}$$

• The logistic function is useful because it can take as an input any value from negative infinity to positive infinity, whereas the output is confined to values between 0 and 1.

• The variable z represents the exposure to some set of independent variables, while ƒ(z) represents the probability of a particular outcome, given that set of explanatory variables.

• The variable z is a measure of the total contribution of all the independent variables used in the model and is known as the logit.
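Both algebraic forms of the logistic function are useful in practice: picking the form according to the sign of z avoids overflow in the exponential. The following is a small sketch of that idea (the function name `logistic` is just a label for this example).

```python
import math

def logistic(z):
    """f(z) = e^z / (e^z + 1) = 1 / (1 + e^-z), evaluated without overflow."""
    if z >= 0:
        return 1 / (1 + math.exp(-z))   # e^-z is at most 1 here
    ez = math.exp(z)                    # e^z is below 1 here
    return ez / (ez + 1)

for z in (-5, -1, 0, 1, 5):
    print(z, round(logistic(z), 4))
```

Whatever the size of z, the output stays between 0 and 1, which is why the logit can be an unbounded linear combination of the independent variables.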


2011.02.16 - SLIDE 58IS 240 – Spring 2011

Probabilistic Models: Logistic Regression

• Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables.

Log odds of relevance is a linear function of attributes:

$$\log O(R \mid q_i, d_j, t_k) = c_0 + c_1 v_1 + c_2 v_2 + \dots + c_n v_n$$

Term contributions summed:

$$\log O(R \mid q_i, d_j) = \sum_{k=1}^{m} [\log O(R \mid q_i, d_j, t_k) - \log O(R)]$$

Probability of Relevance is inverse of log odds:

$$P(R \mid q_i, d_j) = \frac{1}{1 + e^{-\log O(R \mid q_i, d_j)}}$$


2011.02.16 - SLIDE 59IS 240 – Spring 2011

Logistic Regression

[Figure: plot of Relevance (0 to 100) against Term Frequency in Document (0 to 60)]


2011.02.16 - SLIDE 60IS 240 – Spring 2011

Probabilistic Models: Logistic Regression

Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients. At retrieval the probability estimate is obtained by:

$$P(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$$

for the 6 X attribute measures shown on the next slide.


2011.02.16 - SLIDE 61IS 240 – Spring 2011

Probabilistic Models: Logistic Regression attributes (“TREC3”)

• $X_1 = \dfrac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$ : Average Absolute Query Frequency
• $X_2 = QL$ : Query Length
• $X_3 = \dfrac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$ : Average Absolute Document Frequency
• $X_4 = DL$ : Document Length
• $X_5 = \dfrac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}$ : Average Inverse Document Frequency
• $IDF = \dfrac{N - n_{t_j}}{n_{t_j}}$ : Inverse Document Frequency
• $X_6 = \log M$ : Number of Terms in common between query and document -- logged
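To make the six attribute measures concrete, here is a hedged sketch that computes them for one query-document pair and feeds the linear combination through the logistic function. Several things are assumptions of this example rather than facts from the slides: the coefficient values are placeholders (real ones come from fitting the regression on training data), X2 and X4 are taken as the raw lengths, and the linear combination is treated as log odds, following the earlier statement that log odds of relevance is a linear function of the attributes.

```python
import math

def attributes(qaf, daf, n_t, N, ql, dl):
    """Six attribute measures for one query-document pair.

    qaf, daf, n_t: per-term lists over the M terms shared by query and document
    (query frequency, document frequency within the document, and number of
    documents containing the term). N: collection size. ql, dl: lengths.
    """
    M = len(qaf)
    x1 = sum(math.log(f) for f in qaf) / M             # avg. absolute query freq.
    x2 = ql                                            # query length (raw; assumed)
    x3 = sum(math.log(f) for f in daf) / M             # avg. absolute doc. freq.
    x4 = dl                                            # document length (raw; assumed)
    x5 = sum(math.log((N - n) / n) for n in n_t) / M   # avg. inverse doc. freq.
    x6 = math.log(M)                                   # matching terms, logged
    return [x1, x2, x3, x4, x5, x6]

def relevance_probability(coeffs, xs):
    """Logistic transform of c0 + sum(ci * Xi), treated here as log odds."""
    z = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], xs))
    return 1 / (1 + math.exp(-z))

# Hypothetical pair with two matching terms.
xs = attributes(qaf=[1, 2], daf=[3, 1], n_t=[10, 100], N=1000, ql=5, dl=200)
# Placeholder coefficients c0..c6, NOT fitted values.
coeffs = [-3.5, 0.4, -0.1, 0.1, -0.005, 0.2, 1.0]
print(xs)
print(relevance_probability(coeffs, xs))
```

At retrieval time only the attribute values change from document to document; the coefficients stay fixed once estimated from the training sample.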


2011.02.16 - SLIDE 62IS 240 – Spring 2011

Current use of Probabilistic Models

• Most of the major systems in TREC now use the “Okapi BM-25 formula” (or Language Models -- more on those later) which incorporates the Robertson-Sparck Jones weights…

$$w^{(1)} = \log \frac{\left(\dfrac{r + 0.5}{R - r + 0.5}\right)}{\left(\dfrac{n - r + 0.5}{N - n - R + r + 0.5}\right)}$$


2011.02.16 - SLIDE 63IS 240 – Spring 2011

Okapi BM-25

$$\sum_{T \in Q} w^{(1)} \, \frac{(k_1 + 1)\,tf}{K + tf} \, \frac{(k_3 + 1)\,qtf}{k_3 + qtf}$$

• Where:
• Q is a query containing terms T
• K is k1((1-b) + b.dl/avdl)
• k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000
• tf is the frequency of the term in a specific document
• qtf is the frequency of the term in a topic from which Q was derived
• dl and avdl are the document length and the average document length measured in some convenient unit (e.g. bytes)
• w(1) is the Robertson-Sparck Jones weight.
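A compact sketch of the BM25 sum for a single document is given below. It assumes no relevance information is available (so r = R = 0 in the Robertson-Sparck Jones weight), picks k3 = 7 from the stated 7-1000 range, and uses invented counts; the function and argument names are illustrative only.

```python
import math

def rsj_weight(n, N, r=0, R=0):
    """Predictive Robertson-Sparck Jones weight w(1); r = R = 0 means no feedback."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

def bm25(query_tf, doc_tf, doc_freq, N, dl, avdl, k1=1.2, b=0.75, k3=7.0):
    """Sum over query terms of w(1) * tf component * qtf component."""
    K = k1 * ((1 - b) + b * dl / avdl)   # document-length normalization
    total = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue                     # absent terms contribute nothing
        w1 = rsj_weight(doc_freq[term], N)
        total += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return total

# Hypothetical counts: a one-term query, a document of average length.
query_tf = {"okapi": 1}
doc_tf = {"okapi": 2, "weighting": 1}
doc_freq = {"okapi": 10}                 # documents containing each term
print(bm25(query_tf, doc_tf, doc_freq, N=1000, dl=100, avdl=100))
```

The tf component saturates as tf grows, and K makes the same tf count for less in a longer-than-average document.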


2011.02.16 - SLIDE 64IS 240 – Spring 2011

Probabilistic Models

Advantages

• Strong theoretical basis
• In principle should supply the best predictions of relevance given available information
• Can be implemented similarly to Vector

Disadvantages

• Relevance information is required -- or is “guestimated”
• Important indicators of relevance may not be terms -- though terms only are usually used
• Optimally requires on-going collection of relevance information


2011.02.16 - SLIDE 65IS 240 – Spring 2011

Vector and Probabilistic Models

• Support “natural language” queries

• Treat documents and queries the same

• Support relevance feedback searching

• Support ranked retrieval

• Differ primarily in theoretical basis and in how the ranking is calculated
  – Vector assumes relevance
  – Probabilistic relies on relevance judgments or estimates