1 CS 430 / INFO 430 Information Retrieval Lecture 8 Query Refinement: Relevance Feedback Information Filtering

Dec 19, 2015


CS 430 / INFO 430 Information Retrieval

Lecture 8

Query Refinement: Relevance Feedback

Information Filtering


Course Administration


Query Refinement

[Figure: flowchart of the query refinement cycle. Boxes: Query formulation, Search, Display retrieved information, Reformulate query, EXIT. Arrow labels: new query, reformulated query.]


Reformulation of Query

Manual

• Add or remove search terms

• Change Boolean operators

• Change wild cards

Automatic

Change the query vector:

• Remove/add search terms

• Change weighting of search terms


Manual Reformulation: Vocabulary Tools

Feedback to user

• Information about stop lists, stemming, etc.

• Numbers of hits on each term or phrase

Suggestions to user

• Thesaurus

• Browse lists of terms in the inverted index

• Controlled vocabulary


Manual Reformulation: Document Tools

Feedback to user consists of document excerpts or surrogates

• Shows the user how the system has interpreted the query

Effective at suggesting how to restrict a search

• Shows examples of false hits

Less good at suggesting how to expand a search

• No examples of missed items


Relevance Feedback: Document Vectors as Points on a Surface

• Normalize all document vectors to be of length 1

• Then the ends of the vectors all lie on a surface with unit radius

• For similar documents, we can represent parts of this surface as a flat region

• Similar documents are represented as points that are close together on this surface
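The normalization step above can be sketched in a few lines of Python. This is only an illustration: the three-term vocabulary and the weights are made up, and the dot product is used as the similarity measure, which for unit-length vectors equals the cosine of the angle between them.

```python
import math

def normalize(vec):
    """Scale a term-weight vector to unit Euclidean length."""
    length = math.sqrt(sum(w * w for w in vec))
    if length == 0:
        return list(vec)  # an all-zero vector has no direction; leave it unchanged
    return [w / length for w in vec]

def dot(u, v):
    """Dot product; for unit vectors this is the cosine of the angle between them."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical term weights over a three-term vocabulary.
d1 = normalize([3.0, 4.0, 0.0])
d2 = normalize([6.0, 8.0, 0.0])  # same direction as d1, different original length
d3 = normalize([0.0, 0.0, 5.0])  # shares no terms with d1

sim_same = dot(d1, d2)  # close to 1: the two points coincide on the unit surface
sim_diff = dot(d1, d3)  # 0: the two points are far apart on the unit surface
```

After normalization, document length no longer affects similarity; only the direction of the vector, i.e., its position on the unit surface, matters.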


Relevance Feedback: Results of a Search

[Figure: scatter of points marking the hits from a search. x = documents found by search query]


Relevance Feedback (Concept)

[Figure: hits from original search, with the original query and the reformulated query marked. x = documents identified by user as non-relevant; o = documents identified by user as relevant]


Theoretically Best Query

[Figure: relevant and non-relevant documents, with the optimal query marked. x = non-relevant documents; o = relevant documents]


Theoretically Best Query

For a specific query, q, let:

DR be the set of all relevant documents

DN-R be the set of all non-relevant documents

sim (q, DR) be the mean similarity between query q and documents in DR

sim (q, DN-R) be the mean similarity between query q and documents in DN-R

A theoretically best query would maximize:

F = sim (q, DR) - sim (q, DN-R)
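As a rough illustration of the quantity F, the sketch below evaluates it for two candidate queries. The slide does not say which similarity measure sim is; cosine similarity is assumed here, and the tiny two-term document vectors are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors (assumed choice of sim)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean_similarity(q, docs):
    """Mean similarity between query q and every document vector in docs."""
    return sum(cosine(q, d) for d in docs) / len(docs)

def objective(q, relevant, non_relevant):
    """F = sim(q, DR) - sim(q, DN-R)."""
    return mean_similarity(q, relevant) - mean_similarity(q, non_relevant)

# Hypothetical collection over a two-term vocabulary.
relevant = [[1.0, 0.0], [1.0, 0.0]]
non_relevant = [[0.0, 1.0]]

f_good = objective([1.0, 0.0], relevant, non_relevant)  # query aligned with DR
f_bad = objective([0.0, 1.0], relevant, non_relevant)   # query aligned with DN-R
```

A query pointing toward the relevant documents and away from the non-relevant ones scores higher, which is what maximizing F asks for.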


Estimating the Best Query

In practice, DR and DN-R are not known. (The objective is to find them.)

However, the results of an initial query can be used to estimate sim (q, DR) and sim (q, DN-R).


Rocchio's Modified Query

Modified query vector

= Original query vector

+ Mean of relevant documents found by original query

- Mean of non-relevant documents found by original query


Rocchio's Modified Query

$$q_1 = q_0 + \frac{1}{n_1} \sum_{i=1}^{n_1} r_i - \frac{1}{n_2} \sum_{i=1}^{n_2} s_i$$

q0 = vector for the initial query
q1 = vector for the modified query
ri = vector for relevant document i
si = vector for non-relevant document i
n1 = number of relevant documents
n2 = number of non-relevant documents
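A minimal Python sketch of this update, assuming documents and queries are plain lists of term weights over the same vocabulary. The function names and the sample vectors are illustrative, not taken from the lecture.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def rocchio(q0, relevant, non_relevant):
    """q1 = q0 + mean(relevant) - mean(non_relevant)."""
    r_mean = mean_vector(relevant)
    s_mean = mean_vector(non_relevant)
    return [q + r - s for q, r, s in zip(q0, r_mean, s_mean)]

# Hypothetical three-term vocabulary.
q0 = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]  # n1 = 2
non_relevant = [[0.0, 0.0, 1.0]]                # n2 = 1

q1 = rocchio(q0, relevant, non_relevant)  # [2.0, 0.5, -1.0]
```

The modified query gains weight on terms that occur in the relevant documents and receives a negative weight on the term that occurs only in the non-relevant one.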


Difficulties with Relevance Feedback

[Figure: relevant and non-relevant documents, with the optimal query, the original query and the reformulated query marked. x = non-relevant documents; o = relevant documents]

Hits from the initial query are contained in the gray shaded area


Difficulties with Relevance Feedback

[Figure: relevant and non-relevant documents, with the optimal results set, the original query and the reformulated query marked. x = non-relevant documents; o = relevant documents]

What region provides the optimal results set?


Effectiveness of Relevance Feedback

Best when:

• Relevant documents are tightly clustered (similarities are large)

• Similarities between relevant and non-relevant documents are small


When to Use Relevance Feedback

Relevance feedback is most important when the user wishes to increase recall, i.e., it is important to find all relevant documents.

Under these circumstances, users can be expected to put effort into searching:

• Formulate queries thoughtfully with many terms

• Review results carefully to provide feedback

• Iterate several times

• Combine automatic query enhancement with studies of thesauruses and other manual enhancements


Relevance Feedback: Clickthrough Data

Relevance feedback methods have suffered from the unwillingness of users to provide feedback.

Joachims and others have developed methods that use clickthrough data from online searches.

Concept:

Suppose that a query delivers a set of hits to a user.

If a user skips a link a and clicks on a link b that is ranked lower, then the user's preference is taken to be rank(b) < rank(a), i.e., b should be ranked ahead of a.


Clickthrough Example

Ranking Presented to User:

1. Kernel Machines
   http://svm.first.gmd.de/

2. Support Vector Machine
   http://jbolivar.freeservers.com/

3. SVM-Light Support Vector Machine
   http://ais.gmd.de/~thorsten/svm_light/

4. An Introduction to Support Vector Machines
   http://www.support-vector.net/

5. Support Vector Machine and Kernel ... References
   http://svm.research.bell-labs.com/SVMrefs.html

User clicks on 1, 3 and 4

Ranking: (3 < 2) and (4 < 2)

Joachims
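The rule used in this example (a clicked link is preferred over every skipped link ranked above it) can be sketched as follows. The function name is made up for illustration, and this is only the pair-extraction step, not the learning method built on top of it.

```python
def preference_pairs(clicked):
    """Return (b, a) pairs meaning 'link b should rank ahead of link a'.

    clicked holds the 1-based ranks the user clicked on. A pair is produced
    for every clicked link b and every unclicked link a shown above it.
    """
    clicked_set = set(clicked)
    pairs = []
    for b in sorted(clicked_set):
        for a in range(1, b):          # links presented above b
            if a not in clicked_set:   # a was skipped
                pairs.append((b, a))
    return pairs

pairs = preference_pairs([1, 3, 4])  # [(3, 2), (4, 2)]
```

For clicks on 1, 3 and 4, only link 2 was skipped above a clicked link, so the extracted preferences are (3 < 2) and (4 < 2), matching the slide. Link 5 yields no pair because nothing below it was clicked.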


Adjusting Parameters: Relevance Feedback

$$q_1 = \alpha q_0 + \beta \frac{1}{n_1} \sum_{i=1}^{n_1} r_i - \gamma \frac{1}{n_2} \sum_{i=1}^{n_2} s_i$$

α, β and γ are weights that adjust the importance of the three vectors.

If γ = 0, the weights provide positive feedback, by emphasizing the relevant documents in the initial set.

If β = 0, the weights provide negative feedback, by reducing the emphasis on the non-relevant documents in the initial set.
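The effect of the three weights can be tried out with a small sketch. The parameter names alpha, beta and gamma (for the weights on the original query, the relevant mean and the non-relevant mean) and the sample vectors are assumptions made for this example.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def rocchio_weighted(q0, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """q1 = alpha*q0 + beta*mean(relevant) - gamma*mean(non_relevant)."""
    r_mean = mean_vector(relevant)
    s_mean = mean_vector(non_relevant)
    return [alpha * q + beta * r - gamma * s
            for q, r, s in zip(q0, r_mean, s_mean)]

# Hypothetical three-term vocabulary.
q0 = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
non_relevant = [[0.0, 0.0, 1.0]]

# Weight on the non-relevant mean set to zero: positive feedback only.
positive_only = rocchio_weighted(q0, relevant, non_relevant, gamma=0.0)

# Weight on the relevant mean set to zero: negative feedback only.
negative_only = rocchio_weighted(q0, relevant, non_relevant, beta=0.0)
```

With positive feedback only, the query moves toward the relevant documents and no term is penalized; with negative feedback only, the original terms are kept and the term from the non-relevant document is pushed down.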


Adjusting Parameters by Weighting

The modified query can be written:

q1 = w1e1 + w2e2 + ... + wnen

where the ei are unit vectors that form a basis for the term vector space, one for each term in the word list, and the wi are the corresponding weights.

If a query is used repeatedly, optimal values of the wi can be estimated using machine learning.


Adjusting Parameters by Machine Learning: Tasks and Applications

Task: Information Filtering
Application: Information Agents. Which news articles are interesting to a particular person?

Task: Text Routing
Application: Help-Desk Support. Who is an appropriate expert for a particular problem?

Task: Text Categorization
Application: Knowledge Management. Organizing a document database by semantic categories.

Joachims


Information Filtering

d1, d2, d3, ... is a stream of incoming documents that are to be divided into two sets:

R - documents judged relevant to an information need

S - documents judged not relevant to the information need

A query is defined as the vector in the term vector space:

q = (w1, w2, ..., wn)

where wi is the weight given to term i

dj will be assigned to R if similarity(q, dj) is greater than a threshold value

What is the optimal query, i.e., the optimal values of the wi?
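The filtering rule can be sketched as a loop over the incoming stream. Cosine similarity, the threshold of 0.5, the query weights and the document vectors are all assumptions chosen for illustration; finding good values for the wi is the open question the slide poses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors (assumed similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def filter_stream(q, documents, threshold):
    """Split (doc_id, vector) pairs into relevant and not-relevant id lists."""
    relevant, not_relevant = [], []
    for doc_id, vec in documents:
        if cosine(q, vec) > threshold:
            relevant.append(doc_id)
        else:
            not_relevant.append(doc_id)
    return relevant, not_relevant

# Hypothetical query weights and document stream over three terms.
q = [1.0, 1.0, 0.0]
stream = [
    ("d1", [1.0, 1.0, 0.0]),  # same direction as q
    ("d2", [0.0, 0.0, 1.0]),  # shares no terms with q
    ("d3", [1.0, 0.0, 0.0]),  # shares one of q's two terms
]

R, S = filter_stream(q, stream, threshold=0.5)
```

Here d1 and d3 are assigned to R and d2 to S. Unlike ad hoc retrieval, each document is judged as it arrives, so the query and the threshold have to be fixed in advance.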


Seeking Optimal Parameters

Theoretical approach (not successful)

Develop a theoretical model

Derive parameters

Test with users

Heuristic approach (historically important)

Develop a heuristic

Vary parameters

Test with users

Machine learning (modern approach)


Information Filtering: Seeking Optimal Parameters using Machine Learning

GENERAL:

Input:
• training examples
• design space

Training:
• automatically find the solution in design space that works well on the training data

Prediction:
• predict well on new examples

EXAMPLE: Text Retrieval

Input:
• queries with relevance judgments
• parameters of retrieval function

Training:
• find parameters so that many relevant documents are ranked highly

Prediction:
• rank relevant documents high also for new queries


Learning to Rank

Assume:
• distribution of queries P(q)
• distribution of target rankings for query P(r | q)

Given:
• collection D of documents
• independent, identically distributed training sample (qi, ri)

Design:
• set of ranking functions F
• loss function l(ra, rb)
• learning algorithm

Goal:
• find f ∈ F that minimizes l(f (q), r) integrated across all queries
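One way to write this goal as a formula, assuming queries and target rankings are drawn jointly from P(q) and P(r | q), is as an expected loss (the symbols R(f), f* and n are introduced here for illustration):

```latex
R(f) = \int l\bigl(f(q), r\bigr)\, dP(q, r),
\qquad
f^{*} = \arg\min_{f \in F} R(f)
```

Because the distributions are unknown, a learning algorithm would typically minimize the average loss over the n training pairs instead:

```latex
\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} l\bigl(f(q_i), r_i\bigr)
```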


A Loss Function for Rankings

For two orderings ra and rb, a pair is:

• concordant, if ra and rb agree in their ordering
  P = number of concordant pairs

• discordant, if ra and rb disagree in their ordering
  Q = number of discordant pairs

Loss function: l(ra, rb) = Q

Example:
ra = (a, c, d, b, e, f, g, h)
rb = (a, b, c, d, e, f, g, h)

The discordant pairs are: (c, b), (d, b)
l(ra, rb) = 2

Joachims
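The loss in this example can be checked with a short sketch that enumerates every pair of items and compares their relative order in the two rankings. The function name is illustrative, and the quadratic enumeration is chosen for clarity rather than speed.

```python
from itertools import combinations

def discordant_pairs(ra, rb):
    """List the pairs of items that the two rankings order differently.

    Both rankings must contain the same items. Each pair is reported in
    the order in which it appears in ra.
    """
    pos_a = {item: i for i, item in enumerate(ra)}
    pairs = []
    for x, y in combinations(rb, 2):  # x precedes y in rb
        if pos_a[x] > pos_a[y]:       # ra places y ahead of x
            pairs.append((y, x))
    return pairs

ra = ("a", "c", "d", "b", "e", "f", "g", "h")
rb = ("a", "b", "c", "d", "e", "f", "g", "h")

pairs = discordant_pairs(ra, rb)  # [("c", "b"), ("d", "b")]
loss = len(pairs)                 # Q = 2
```

With 8 items there are 28 pairs in total, so P = 26 pairs are concordant and Q = 2 are discordant, in agreement with the slide.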


Machine Learning: Algorithms

The choice of algorithms is a subject of active research, which is covered in several courses, notably CS 478 and CS/INFO 630.

Some effective methods include:

Naive Bayes

Rocchio Algorithm

C4.5 Decision Tree

k-Nearest Neighbors

Support Vector Machine