2003.11.13 - SLIDE 1IS 202 – FALL 2003 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2003

2003.11.13 - SLIDE 1IS 202 – FALL 2003

Prof. Ray Larson & Prof. Marc Davis

UC Berkeley SIMS

Tuesday and Thursday 10:30 am - 12:00 pm

Fall 2003

http://www.sims.berkeley.edu/academics/courses/is202/f03/

SIMS 202:

Information Organization

and Retrieval

Lecture 19: Probabilistic IR and Relevance Feedback

2003.11.13 - SLIDE 2IS 202 – FALL 2003

Lecture Overview

• Review
  – Vector Representation
  – Term Weights
  – Vector Matching
  – Clustering

• Probabilistic Models of IR

• Relevance Feedback

Credit for some of the slides in this lecture goes to Marti Hearst


2003.11.13 - SLIDE 4IS 202 – FALL 2003

Document Vectors

ID  nova  galaxy  heat  h'wood  film  role  diet  fur
A   10 5 3
B   5 10
C   10 8 7
D   9 10 5
E   10 10
F   9 10
G   5 7 9
H   6 10 2 8
I   7 5 1 3

2003.11.13 - SLIDE 5IS 202 – FALL 2003

Vector Space Documents and Queries

docs   t1   t2   t3   RSV = Q·Di
D1     1    0    1    4
D2     1    0    0    1
D3     0    1    1    5
D4     1    0    0    1
D5     1    1    1    6
D6     1    1    0    3
D7     0    1    0    2
D8     0    1    0    2
D9     0    0    1    3
D10    0    1    1    5
D11    0    0    1    4
Q      1    2    3
       q1   q2   q3

[Figure: documents D1 to D11 grouped by which of the terms t1, t2, and t3 they contain]

• Boolean term combinations
• Q is a query, also represented as a vector
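The RSV column can be reproduced with a plain inner product. The following is a minimal sketch, assuming the query weights (1, 2, 3) and a handful of the binary document rows from the table; the names `rsv` and `rank` are illustrative and do not come from the lecture.

```python
# Query weights and a few binary document vectors copied from the table.
QUERY = (1, 2, 3)
DOCS = {
    "D1": (1, 0, 1),
    "D3": (0, 1, 1),
    "D5": (1, 1, 1),
    "D6": (1, 1, 0),
    "D9": (0, 0, 1),
}


def rsv(query, doc):
    """Retrieval status value: inner product of query and document vectors."""
    return sum(q * d for q, d in zip(query, doc))


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing RSV."""
    scored = [(doc_id, rsv(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for doc_id, score in rank(QUERY, DOCS):
        print(doc_id, score)
```

Because the document vectors are binary here, the score is simply the sum of the query weights of the terms a document contains.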

2003.11.13 - SLIDE 6IS 202 – FALL 2003

Documents in Vector Space

[Figure: documents D1 to D11 plotted in the three-dimensional space whose axes are the terms t1, t2, and t3]

2003.11.13 - SLIDE 7IS 202 – FALL 2003

Binary Weights

• Only the presence (1) or absence (0) of a term is included in the vector



2003.11.13 - SLIDE 8IS 202 – FALL 2003

Raw Term Weights

• The frequency of occurrence for the term in each document is included in the vector



2003.11.13 - SLIDE 9IS 202 – FALL 2003

tf*idf weights

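$$w_{ik} = tf_{ik} \cdot \log(N / n_k)$$

where:

$T_k$ = term $k$ in document $D_i$

$tf_{ik}$ = frequency of term $T_k$ in document $D_i$

$idf_k$ = inverse document frequency of term $T_k$ in collection $C$

$N$ = total number of documents in the collection $C$

$n_k$ = the number of documents in $C$ that contain $T_k$

$$idf_k = \log\left(\frac{N}{n_k}\right)$$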

2003.11.13 - SLIDE 10IS 202 – FALL 2003

Inverse Document Frequency

• IDF provides high values for rare words and low values for common words

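For a collection of 10000 documents (N = 10000):

$$\log\left(\frac{10000}{1}\right) = 4$$

$$\log\left(\frac{10000}{20}\right) = 2.698$$

$$\log\left(\frac{10000}{5000}\right) = 0.301$$

$$\log\left(\frac{10000}{10000}\right) = 0$$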
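The four values on this slide follow from a base-10 logarithm. A minimal sketch, assuming base 10 (which is what reproduces the slide's numbers); the function name `idf` is illustrative.

```python
import math


def idf(total_docs, docs_with_term):
    """Inverse document frequency log10(N / n) for a term found in n of N documents."""
    return math.log10(total_docs / docs_with_term)


if __name__ == "__main__":
    N = 10000
    for n in (1, 20, 5000, 10000):
        # Rare terms (small n) get high values, common terms get low values.
        print(n, round(idf(N, n), 3))
```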

2003.11.13 - SLIDE 11IS 202 – FALL 2003

tf*idf Normalization

• Normalize the term weights (so longer vectors are not unfairly given more weight)
  – Normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive

$$w_{ik} = \frac{tf_{ik} \, \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 \, [\log(N / n_k)]^2}}$$
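As a rough illustration of the normalization, the sketch below computes the raw tf*idf weights for one document and divides each by the vector length, so the squared weights sum to 1. The function name, the base-10 logarithm, and the sample counts are assumptions made for the example.

```python
import math


def tfidf_weights(term_freqs, doc_freqs, total_docs):
    """Length-normalized tf*idf weights for one document.

    term_freqs[k] is tf_ik, doc_freqs[k] is n_k, total_docs is N.
    """
    raw = [tf * math.log10(total_docs / n) for tf, n in zip(term_freqs, doc_freqs)]
    norm = math.sqrt(sum(w * w for w in raw))
    if norm == 0:
        # No term carries any weight (all tf are 0 or every term is in every document).
        return [0.0] * len(raw)
    return [w / norm for w in raw]


if __name__ == "__main__":
    # Hypothetical document: three terms with frequencies 3, 1, 2.
    print(tfidf_weights([3, 1, 2], [20, 5000, 1], 10000))
```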

2003.11.13 - SLIDE 12IS 202 – FALL 2003

Vector Space Similarity

• Now, the similarity of two documents is:

$$sim(D_i, D_j) = \sum_{k=1}^{t} w_{ik} \cdot w_{jk}$$

• This is also called the cosine, or normalized inner product
  – The normalization was done when weighting the terms
  – Note that the $w_{ik}$ weights can be stored in the vectors/inverted files for the documents

2003.11.13 - SLIDE 13IS 202 – FALL 2003

Vector Space Matching

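[Figure: Q, D1, and D2 drawn as vectors on axes Term A and Term B, each running from 0 to 1.0, with the angles labeled 1 and 2]

$$D_i = (d_{i1}, w_{d_{i1}};\ d_{i2}, w_{d_{i2}};\ \ldots;\ d_{it}, w_{d_{it}})$$

$$Q = (q_{i1}, w_{q_{i1}};\ q_{i2}, w_{q_{i2}};\ \ldots;\ q_{it}, w_{q_{it}})$$

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} \, w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2 \cdot \sum_{j=1}^{t} (w_{d_{ij}})^2}}$$

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

$$sim(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} = 0.98$$

$$sim(Q, D_1) = \frac{0.56}{\sqrt{0.58}} = 0.74$$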
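The worked example can be checked with a few lines of code. This is a minimal sketch using the Q, D1, and D2 vectors from the slide; `cosine` is an illustrative name. At full precision sim(Q, D1) comes out near 0.733; the slide's 0.74 results from rounding the product under the square root to 0.58 before dividing.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

if __name__ == "__main__":
    print("sim(Q, D1) =", round(cosine(Q, D1), 3))
    print("sim(Q, D2) =", round(cosine(Q, D2), 3))
```

D2 ranks above D1 because its direction is closer to that of Q, even though D1 is the longer vector.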

2003.11.13 - SLIDE 14IS 202 – FALL 2003

Vector Space Visualization

2003.11.13 - SLIDE 15IS 202 – FALL 2003

Document/Document Matrix

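$$\begin{array}{c|cccc}
 & D_1 & D_2 & \cdots & D_n \\
\hline
D_1 & d_{11} & d_{12} & \cdots & d_{1n} \\
D_2 & d_{21} & d_{22} & \cdots & d_{2n} \\
\vdots & \vdots & \vdots & & \vdots \\
D_n & d_{n1} & d_{n2} & \cdots & d_{nn}
\end{array}$$

$d_{ij}$ = similarity of $D_i$ to $D_j$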

2003.11.13 - SLIDE 16IS 202 – FALL 2003

Text Clustering

Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw

[Figure: documents plotted on axes Term 1 and Term 2]

2003.11.13 - SLIDE 18IS 202 – FALL 2003

Problems with Vector Space

• There is no real theoretical basis for the assumption of a term space
  – it is more for visualization than having any real basis
  – most similarity measures work about the same regardless of model
• Terms are not really orthogonal dimensions
  – Terms are not independent of all other terms
• Retrieval efficiency vs. indexing and update efficiency for stored pre-calculated weights


2003.11.13 - SLIDE 20IS 202 – FALL 2003

Probabilistic Models

• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query

• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)

• Relies on accurate estimates of probabilities

2003.11.13 - SLIDE 21IS 202 – FALL 2003

Probability Ranking Principle

• “If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.”

Stephen E. Robertson, J. Documentation 1977

2003.11.13 - SLIDE 22IS 202 – FALL 2003

Model 1 – Maron and Kuhns

• Concerned with estimating probabilities of relevance at the point of indexing:
  – If a patron came with a request using term ti, what is the probability that she/he would be satisfied with document Dj?

2003.11.13 - SLIDE 23IS 202 – FALL 2003

Model 1

• A patron submits a query (call it Q) consisting of some specification of her/his information need. Different patrons submitting the same stated query may differ as to whether or not they judge a specific document to be relevant. The function of the retrieval system is to compute for each individual document the probability that it will be judged relevant by a patron who has submitted query Q.

Robertson, Maron & Cooper, 1982

2003.11.13 - SLIDE 24IS 202 – FALL 2003

Model 1 – Bayes

• A is the class of events of using the library

• Di is the class of events of Document i being judged relevant

• Ij is the class of queries consisting of the single term Ij

• P(Di|A,Ij) = probability that if a query is submitted to the system then a relevant document is retrieved

$$P(D_i \mid A, I_j) = \frac{P(D_i \mid A) \cdot P(I_j \mid A, D_i)}{P(I_j \mid A)}$$

2003.11.13 - SLIDE 25IS 202 – FALL 2003

Model 2

• Documents have many different properties; some documents have all the properties that the patron asked for, and other documents have only some or none of the properties. If the inquiring patron were to examine all of the documents in the collection she/he might find that some having all the sought after properties were relevant, but others (with the same properties) were not relevant. And conversely, he/she might find that some of the documents having none (or only a few) of the sought after properties were relevant, others not. The function of a document retrieval system is to compute the probability that a document is relevant, given that it has one (or a set) of specified properties.

Robertson, Maron & Cooper, 1982

2003.11.13 - SLIDE 26IS 202 – FALL 2003

Model 2 – Robertson & Sparck Jones

Given a term t and a query q:

                         Document Relevance
Document Indexing        +          -              Total
        +                r          n-r            n
        -                R-r        N-n-R+r        N-n
      Total              R          N-R            N

2003.11.13 - SLIDE 27IS 202 – FALL 2003

Robertson-Sparck Jones Weights

• Retrospective formulation

$$\log \frac{r / (R - r)}{(n - r) / (N - n - R + r)}$$

2003.11.13 - SLIDE 28IS 202 – FALL 2003


• Predictive formulation

$$w^{(1)} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)}$$
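A small sketch of the predictive Robertson-Sparck Jones weight, using the r, R, n, N counts from the contingency table. The function name, the natural logarithm, and the sample counts are assumptions; the slides do not state a log base.

```python
import math


def rsj_weight(r, R, n, N, k=0.5):
    """Predictive Robertson-Sparck Jones weight for one term.

    r: relevant documents containing the term
    R: relevant documents in total
    n: documents containing the term
    N: documents in the collection
    k: the 0.5 correction that keeps the ratio finite when a cell is zero
    """
    relevant_odds = (r + k) / (R - r + k)
    nonrelevant_odds = (n - r + k) / (N - n - R + r + k)
    return math.log(relevant_odds / nonrelevant_odds)


if __name__ == "__main__":
    # Hypothetical counts: the term occurs in 8 of 10 known relevant documents.
    print(rsj_weight(r=8, R=10, n=20, N=10000))
    # With no relevance information (r = R = 0) the weight behaves like an idf.
    print(rsj_weight(r=0, R=0, n=20, N=10000))
```

With r = R = 0 the expression reduces to log((N - n + 0.5) / (n + 0.5)), which is why the weight is usable before any relevance judgments exist.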

2003.11.13 - SLIDE 29IS 202 – FALL 2003

Probabilistic Models: Some Unifying Notation

• D = All present and future documents

• Q = All present and future queries

• (Di,Qj) = A document query pair

• x = class of similar documents, $x \subseteq D$

• y = class of similar queries, $y \subseteq Q$

• Relevance (R) is a relation:

$$R = \{ (D_i, Q_j) \mid D_i \in D,\ Q_j \in Q,\ \text{document } D_i \text{ is judged relevant by the user submitting } Q_j \}$$

2003.11.13 - SLIDE 30IS 202 – FALL 2003


• Model 1 -- Probabilistic Indexing, P(R|y,Di)

• Model 2 -- Probabilistic Querying, P(R|Qj,x)

• Model 3 -- Merged Model, P(R| Qj, Di)

• Model 0 -- P(R|y,x)

• Probabilities are estimated based on prior usage or relevance estimation

2003.11.13 - SLIDE 31IS 202 – FALL 2003


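[Figure: the document set D and the query set Q, with an individual document Di inside a class x of similar documents and an individual query Qj inside a class y of similar queries]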

2003.11.13 - SLIDE 32IS 202 – FALL 2003

Logistic Regression

• Another approach to estimating probability of relevance

• Based on work by William Cooper, Fred Gey and Daniel Dabney

• Builds a regression model for relevance prediction based on a set of training data

• Uses less restrictive independence assumptions than Model 2
  – Linked Dependence

2003.11.13 - SLIDE 33IS 202 – FALL 2003

So What’s Regression?

• A method for fitting a curve (not necessarily a straight line) through a set of points using some goodness-of-fit criterion

• The most common type of regression is linear regression

2003.11.13 - SLIDE 34IS 202 – FALL 2003

What’s Regression?

• Least Squares Fitting is a mathematical procedure for finding the best fitting curve to a given set of points by minimizing the sum of the squares of the offsets ("the residuals") of the points from the curve

• The sum of the squares of the offsets is used instead of the offset absolute values because this allows the residuals to be treated as a continuous differentiable quantity

2003.11.13 - SLIDE 35IS 202 – FALL 2003

Logistic Regression

[Figure: plot of Relevance (vertical axis, 0 to 100) against Term Frequency in Document (horizontal axis, 0 to 60)]

2003.11.13 - SLIDE 36IS 202 – FALL 2003

Probabilistic Models: Logistic Regression

• Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables

Log odds of relevance is a linear function of attributes:

$$\log O(R \mid q_i, d_j, t_k) = c_0 + c_1 v_1 + c_2 v_2 + \ldots + c_n v_n$$

Term contributions summed:

$$\log O(R \mid q_i, d_j) = \sum_{k=1}^{m} \left[ \log O(R \mid q_i, d_j, t_k) - \log O(R) \right]$$

Probability of relevance is inverse of log odds:

$$P(R \mid q_i, d_j) = \frac{1}{1 + e^{-\log O(R \mid q_i, d_j)}}$$
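The two steps, a linear model for the log odds followed by the logistic transformation into a probability, can be sketched as follows. The function names and the coefficient values are purely illustrative; real coefficients would come from fitting the regression to training data.

```python
import math


def log_odds(coefficients, attributes):
    """Linear model c0 + c1*v1 + ... + cn*vn; coefficients[0] is the intercept c0."""
    intercept, slopes = coefficients[0], coefficients[1:]
    return intercept + sum(c * v for c, v in zip(slopes, attributes))


def probability_from_log_odds(value):
    """Logistic transformation: P = 1 / (1 + exp(-log odds))."""
    return 1.0 / (1.0 + math.exp(-value))


if __name__ == "__main__":
    # Hypothetical coefficients and attribute values, not estimates from any collection.
    coefficients = [-3.5, 1.2, 0.4]
    attributes = [2.0, 1.5]
    lo = log_odds(coefficients, attributes)
    print(lo, probability_from_log_odds(lo))
```

Log odds of 0 correspond to a probability of 0.5; ranking by log odds and ranking by probability give the same order because the transformation is monotonic.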

2003.11.13 - SLIDE 37IS 202 – FALL 2003

Logistic Regression Attributes

$$X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$$ Average Absolute Query Frequency

$$X_2 = \sqrt{QL}$$ Query Length

$$X_3 = \frac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$$ Average Absolute Document Frequency

$$X_4 = \sqrt{DL}$$ Document Length

$$X_5 = \frac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}$$ Average Inverse Document Frequency

$$IDF_{t_j} = \frac{N - n_{t_j}}{n_{t_j}}$$ Inverse Document Frequency

$$X_6 = \log M$$ Number of Terms in common between query and document -- logged

2003.11.13 - SLIDE 38IS 202 – FALL 2003

Logistic Regression

• Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients

• At retrieval the estimate is obtained from the 6 X attribute measures shown previously; the linear combination gives the log odds of relevance, which is then converted to a probability:

$$\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$$

2003.11.13 - SLIDE 39IS 202 – FALL 2003


Advantages

• Strong theoretical basis

• In principle should supply the best predictions of relevance given available information

• Can be implemented similarly to Vector

Disadvantages

• Relevance information is required -- or is “guesstimated”

• Important indicators of relevance may not be terms -- though only terms are usually used

• Optimally requires on-going collection of relevance information

2003.11.13 - SLIDE 40IS 202 – FALL 2003

Vector and Probabilistic Models

• Support “natural language” queries

• Treat documents and queries the same

• Support relevance feedback searching

• Support ranked retrieval

• Differ primarily in theoretical basis and in how the ranking is calculated
  – Vector assumes relevance
  – Probabilistic relies on relevance judgments or estimates

2003.11.13 - SLIDE 41IS 202 – FALL 2003

Current Use of Probabilistic Models

• Virtually all the major systems in TREC now use the “Okapi BM25 formula” which incorporates the Robertson-Sparck Jones weights…

$$w^{(1)} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)}$$

2003.11.13 - SLIDE 42IS 202 – FALL 2003

Okapi BM25

$$\sum_{T \in Q} w^{(1)} \cdot \frac{(k_1 + 1) \, tf}{K + tf} \cdot \frac{(k_3 + 1) \, qtf}{k_3 + qtf}$$

• Where:
• Q is a query containing terms T
• K is k1((1-b) + b.dl/avdl)
• k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000
• tf is the frequency of the term in a specific document
• qtf is the frequency of the term in a topic from which Q was derived
• dl and avdl are the document length and the average document length measured in some convenient unit
• w(1) is the Robertson-Sparck Jones weight
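To make the pieces of the formula concrete, here is a minimal sketch that scores one document against a query. It assumes no relevance information is available, so w(1) is computed with r = R = 0; the dictionaries, function names, and sample counts are illustrative, and k3 is set to 7, the low end of the range quoted on the slide.

```python
import math


def rsj_no_relevance(n, N):
    """Robertson-Sparck Jones weight with r = R = 0 (no relevance information)."""
    return math.log((N - n + 0.5) / (n + 0.5))


def bm25_score(query_tf, doc_tf, doc_freq, N, dl, avdl, k1=1.2, b=0.75, k3=7.0):
    """Sum the per-term BM25 contributions over the query terms.

    query_tf: term -> qtf, doc_tf: term -> tf, doc_freq: term -> n.
    """
    K = k1 * ((1 - b) + b * dl / avdl)  # document length normalization
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # the tf factor is zero, so the term contributes nothing
        w1 = rsj_no_relevance(doc_freq[term], N)
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score


if __name__ == "__main__":
    # Hypothetical collection statistics and documents.
    query = {"nova": 1, "galaxy": 1}
    doc_freq = {"nova": 20, "galaxy": 50}
    print(bm25_score(query, {"nova": 3, "galaxy": 1}, doc_freq, 10000, 100, 100))
    print(bm25_score(query, {"film": 2}, doc_freq, 10000, 100, 100))
```

The tf factor saturates as tf grows, and K grows with document length, so a long document needs more occurrences of a term to reach the same score as a short one.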

2003.11.13 - SLIDE 43IS 202 – FALL 2003

Language Models

• A recent addition to the probabilistic models is “language modeling” that estimates the probability that a query could have been produced by a given document.

• This is a slight variation on the other probabilistic models that has led to some modest improvements in performance

2003.11.13 - SLIDE 44IS 202 – FALL 2003

Logistic Regression and Cheshire II

• The Cheshire II system (see readings) uses Logistic Regression equations estimated from TREC full-text data

• Used for a number of production level systems here and in the U.K.


2003.11.13 - SLIDE 46IS 202 – FALL 2003

Querying in IR System

[Diagram: Information Storage and Retrieval System. On one side, interest profiles and queries are formulated in terms of descriptors and stored (Store 1: Profiles/Search requests). On the other side, documents and data are indexed (descriptive and subject indexing) and stored (Store 2: Document representations). Both sides follow the "rules of the game" = rules for subject indexing + a thesaurus (which consists of a lead-in vocabulary and an indexing language). Comparison/matching across the storage line yields potentially relevant documents.]

2003.11.13 - SLIDE 47IS 202 – FALL 2003

Relevance Feedback in an IR System

[Diagram: the same Information Storage and Retrieval System as on the previous slide (interest profiles and queries formulated in terms of descriptors, Store 1: Profiles/Search requests; documents and data indexed into Store 2: Document representations; rules of the game and thesaurus; comparison/matching yielding potentially relevant documents), with an added feedback path labeled "Selected relevant docs".]

2003.11.13 - SLIDE 48IS 202 – FALL 2003

Query Modification

• Problem: How to reformulate the query?
  – Thesaurus expansion:
    • Suggest terms similar to query terms
  – Relevance feedback:
    • Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant

2003.11.13 - SLIDE 49IS 202 – FALL 2003

Relevance Feedback

• Main Idea:
  – Modify existing query based on relevance judgements
    • Extract terms from relevant documents and add them to the query
    • And/or re-weight the terms already in the query
  – Two main approaches:
    • Automatic (pseudo-relevance feedback)
    • Users select relevant documents
  – Users/system select terms from an automatically-generated list

2003.11.13 - SLIDE 50IS 202 – FALL 2003

Relevance Feedback

• Usually do both:
  – Expand query with new terms
  – Re-weight terms in query
• There are many variations
  – Usually positive weights for terms from relevant docs
  – Sometimes negative weights for terms from non-relevant docs
  – Remove terms ONLY in non-relevant documents

2003.11.13 - SLIDE 51IS 202 – FALL 2003

Rocchio Method

$$Q_1 = Q_0 + \frac{\beta}{n_1} \sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2} \sum_{i=1}^{n_2} S_i$$

where

$Q_0$ = the vector for the initial query

$R_i$ = the vector for the relevant document $i$

$S_i$ = the vector for the non-relevant document $i$

$n_1$ = the number of relevant documents chosen

$n_2$ = the number of non-relevant documents chosen

$\beta$ and $\gamma$ tune the importance of relevant and nonrelevant terms (in some studies best to set $\beta$ to 0.75 and $\gamma$ to 0.25)

2003.11.13 - SLIDE 52IS 202 – FALL 2003

Rocchio/Vector Illustration

[Figure: Q0, Q’, Q”, D1, and D2 plotted on axes Retrieval and Information, each running from 0 to 1.0]

Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)

Q’ = ½ * Q0 + ½ * D1 = (0.45, 0.55)
Q” = ½ * Q0 + ½ * D2 = (0.80, 0.20)

2003.11.13 - SLIDE 53IS 202 – FALL 2003

Example Rocchio Calculation

Relevant docs:

$R_1$ = (.030, 0.00, 0.00, .025, .025, .050, 0.00, 0.00, .120)

$R_2$ = (.020, .009, .020, .002, .050, .025, .100, .100, .120)

Non-rel doc:

$S_1$ = (.030, .010, .020, 0.00, .005, .025, 0.00, .020, 0.00)

Original Query:

$Q$ = (0.00, 0.00, 0.00, 0.00, .500, 0.00, .450, 0.00, .950)

Constants:

$\alpha = 1$, $\beta = 0.75$, $\gamma = 0.25$

Rocchio Calculation:

$$Q_{new} = \alpha \cdot Q + \frac{\beta}{2} (R_1 + R_2) - \frac{\gamma}{1} S_1$$

Resulting feedback query:

$Q_{new}$ = (0.011, 0.000875, 0.002, 0.01, 0.527, 0.022, 0.488, 0.033, 1.04)
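The example can be recomputed directly. The sketch below assumes the slide's vectors read as two relevant documents R1 and R2, one non-relevant document S1, and the original query Q, with constants 1, 0.75, and 0.25; the function name `rocchio` and its parameter names are illustrative.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio feedback: alpha*Q + beta*mean(relevant) - gamma*mean(nonrelevant)."""
    new_query = []
    for k in range(len(query)):
        pos = sum(doc[k] for doc in relevant) / len(relevant) if relevant else 0.0
        neg = sum(doc[k] for doc in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        new_query.append(alpha * query[k] + beta * pos - gamma * neg)
    return new_query


# Vectors as read from the example slide.
R1 = [0.030, 0.0, 0.0, 0.025, 0.025, 0.050, 0.0, 0.0, 0.120]
R2 = [0.020, 0.009, 0.020, 0.002, 0.050, 0.025, 0.100, 0.100, 0.120]
S1 = [0.030, 0.010, 0.020, 0.0, 0.005, 0.025, 0.0, 0.020, 0.0]
Q0 = [0.0, 0.0, 0.0, 0.0, 0.500, 0.0, 0.450, 0.0, 0.950]

Q_NEW = rocchio(Q0, [R1, R2], [S1])

if __name__ == "__main__":
    print([round(w, 6) for w in Q_NEW])
```

Every component of the original query keeps its weight, while terms that appear only in the relevant documents enter the query with small positive weights.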

2003.11.13 - SLIDE 54IS 202 – FALL 2003

Rocchio Method

• Rocchio automatically
  – Re-weights terms
  – Adds in new terms (from relevant docs)
    • Have to be careful when using negative terms
    • Rocchio is not a machine learning algorithm
• Most methods perform similarly
  – Results heavily dependent on test collection
• Machine learning methods are proving to work better than standard IR approaches like Rocchio

2003.11.13 - SLIDE 55IS 202 – FALL 2003

Probabilistic Relevance Feedback

Given a query term t:

                         Document Relevance
Document Indexing        +          -              Total
        +                r          n-r            n
        -                R-r        N-n-R+r        N-n
      Total              R          N-R            N

Where N is the number of documents seen

2003.11.13 - SLIDE 56IS 202 – FALL 2003


• Retrospective formulation

$$w_{new\,t} = \log \frac{r / (R - r)}{(n - r) / (N - n - R + r)}$$

2003.11.13 - SLIDE 57IS 202 – FALL 2003

Using Relevance Feedback

• Known to improve results
  – In TREC-like conditions (no user involved)
• What about with a user in the loop?
  – How might you measure this?

2003.11.13 - SLIDE 58IS 202 – FALL 2003

Relevance Feedback Summary

• Iterative query modification can improve precision and recall for a standing query

• In at least one study, users were able to make good choices by seeing which terms were suggested for R.F. and selecting among them

2003.11.13 - SLIDE 59IS 202 – FALL 2003

Alternative Notions of Relevance Feedback

• Find people whose taste is “similar” to yours
  – Will you like what they like?
• Follow a user’s actions in the background
  – Can this be used to predict what the user will want to see next?
• Track what lots of people are doing
  – Does this implicitly indicate what they think is good and not good?

2003.11.13 - SLIDE 60IS 202 – FALL 2003

Alternative Notions of Relevance Feedback

• Several different criteria to consider:
  – Implicit vs. Explicit judgements
  – Individual vs. Group judgements
  – Standing vs. Dynamic topics
  – Similarity of the items being judged vs. similarity of the judges themselves

2003.11.13 - SLIDE 61

Collaborative Filtering (Social Filtering)

• If Pam liked the paper, I’ll like the paper
• If you liked Star Wars, you’ll like Independence Day
• Rating based on ratings of similar people
  – Ignores the text, so works on text, sound, pictures, etc.
  – But: Initial users can bias ratings of future users

                    Sally   Bob   Chris   Lynn   Karen
Star Wars             7      7      3      4      7
Jurassic Park         6      4      7      4      4
Terminator II         3      4      7      6      3
Independence Day      7      7      2      2      ?

2003.11.13 - SLIDE 62

Ringo Collaborative Filtering

• Users rate musical artists from like to dislike
  – 1 = detest, 7 = can’t live without, 4 = ambivalent
  – There is a normal distribution around 4
  – However, what matters are the extremes
• Nearest Neighbors Strategy: Find similar users and predict a (weighted) average of user ratings
• Pearson r algorithm: weight by degree of correlation between user U and user J
  – 1 means very similar, 0 means no correlation, -1 dissimilar
  – Works better to compare against the ambivalent rating (4), rather than the individual’s average score

$$r_{UJ} = \frac{\sum (U - \bar{U})(J - \bar{J})}{\sqrt{\sum (U - \bar{U})^2 \cdot \sum (J - \bar{J})^2}}$$
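A rough sketch of the nearest-neighbor idea, applied to the movie ratings table from the Collaborative Filtering slide. The `center` argument switches between the usual Pearson r (each user's own mean) and the variant that compares against the ambivalent rating 4. Restricting the prediction to positively correlated users is a simplifying assumption of this sketch, not a description of how Ringo itself selects neighbors.

```python
import math

RATINGS = {
    "Sally": {"Star Wars": 7, "Jurassic Park": 6, "Terminator II": 3, "Independence Day": 7},
    "Bob": {"Star Wars": 7, "Jurassic Park": 4, "Terminator II": 4, "Independence Day": 7},
    "Chris": {"Star Wars": 3, "Jurassic Park": 7, "Terminator II": 7, "Independence Day": 2},
    "Lynn": {"Star Wars": 4, "Jurassic Park": 4, "Terminator II": 6, "Independence Day": 2},
    "Karen": {"Star Wars": 7, "Jurassic Park": 4, "Terminator II": 3},
}


def pearson(u, j, center=None):
    """Correlation of two users over the items both have rated.

    center=None uses each user's mean over the shared items;
    center=4 compares against the ambivalent rating instead.
    """
    common = [item for item in u if item in j]
    if not common:
        return 0.0
    u_c = center if center is not None else sum(u[i] for i in common) / len(common)
    j_c = center if center is not None else sum(j[i] for i in common) / len(common)
    num = sum((u[i] - u_c) * (j[i] - j_c) for i in common)
    den = math.sqrt(sum((u[i] - u_c) ** 2 for i in common)
                    * sum((j[i] - j_c) ** 2 for i in common))
    return num / den if den else 0.0


def predict(target, item, ratings, center=None):
    """Correlation-weighted average of the item's ratings by positively correlated users."""
    num = den = 0.0
    for name, theirs in ratings.items():
        if name == target or item not in theirs:
            continue
        weight = pearson(ratings[target], theirs, center)
        if weight <= 0:
            continue  # assumption of this sketch: ignore dissimilar users
        num += weight * theirs[item]
        den += weight
    return num / den if den else None


if __name__ == "__main__":
    for name in ("Sally", "Bob", "Chris", "Lynn"):
        print(name, round(pearson(RATINGS["Karen"], RATINGS[name]), 3))
    print(predict("Karen", "Independence Day", RATINGS))
```

Karen's ratings track Sally's and Bob's and run opposite to Chris's and Lynn's, so the prediction for Independence Day is driven by the two users who both rated it 7.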

2003.11.13 - SLIDE 63IS 202 – FALL 2003

Social Filtering

• Ignores the content, only looks at who judges things similarly

• Works well on data relating to “taste”
  – something that people are good at predicting about each other too

• Does it work for topic?
  – GroupLens results suggest otherwise (preliminary)
  – Perhaps for quality assessments
  – What about for assessing if a document is about a topic?

2003.11.13 - SLIDE 64

Learning Interface Agents

• Add agents in the UI, delegate tasks to them
• Use machine learning to improve performance
  – Learn user behavior, preferences
• Useful when:
  – 1) Past behavior is a useful predictor of the future
  – 2) Wide variety of behaviors amongst users
• Examples:
  – Mail clerk: Sort incoming messages in right mailboxes
  – Calendar manager: Automatically schedule meeting times?

2003.11.13 - SLIDE 65IS 202 – FALL 2003

Summary

• Relevance feedback is an effective means for user-directed query modification

• Modification can be done with either direct or indirect user input

• Modification can be done based on an individual’s or a group’s past input

2003.11.13 - SLIDE 66IS 202 – FALL 2003

Next Time

• Information Retrieval Evaluation & more on collaborative filtering

• Readings
  – An Evaluation of Retrieval Effectiveness (Blair & Maron); Carolyn
  – Rave Reviews: Acquiring Relevance Assessments from Multiple Users (Belew, Richard); Megan
  – A Case for Interaction: A Study of Interactive Information Retrieval Behavior and Effectiveness (Koenemann & Belkin); Margaret Spring
  – GroupLens: Applying Collaborative Filtering to Usenet News (Konstan, Joseph et al.); Jeff
  – Social Information Filtering: Algorithms for Automating "Word of Mouth" (Shardanand, Upendra and Maes, Pattie); Rebecca
