Slides Chap05

Jun 04, 2018
  • 8/13/2019 Slides Chap05


    Modern Information Retrieval

Chapter 5: Relevance Feedback and Query Expansion

    Introduction

    A Framework for Feedback Methods

    Explicit Relevance Feedback

    Explicit Feedback Through Clicks

    Implicit Feedback Through Local Analysis

    Implicit Feedback Through Global Analysis

    Trends and Research Issues

    Chap 05: Relevance Feedback and Query Expansion, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition p. 1


    Introduction

Most users find it difficult to formulate queries that are well designed for retrieval purposes

Yet, most users often need to reformulate their queries to obtain the results of their interest

Thus, the first query formulation should be treated as an initial attempt to retrieve relevant information

Documents initially retrieved could be analyzed for relevance and used to improve the initial query


    Introduction

The process of query modification is commonly referred to as:

relevance feedback, when the user provides information on relevant documents to a query, or

query expansion, when information related to the query is used to expand it

We refer to both of them as feedback methods

Two basic approaches of feedback methods:

explicit feedback, in which the information for query reformulation is provided directly by the users, and

implicit feedback, in which the information for query reformulation is implicitly derived by the system


    A Framework for Feedback Methods


    A Framework

Consider a set of documents Dr that are known to be relevant to the current query q

In relevance feedback, the documents in Dr are used to transform q into a modified query qm

However, obtaining information on documents relevant to a query requires the direct interference of the user

Most users are unwilling to provide this information, particularly in the Web


    A Framework

Because of this high cost, the idea of relevance feedback has been relaxed over the years

Instead of asking the users for the relevant documents, we could:

Look at documents they have clicked on; or

Look at terms belonging to the top documents in the result set

In both cases, it is expected that the feedback cycle will produce results of higher quality


    A Framework

A feedback cycle is composed of two basic steps:

Determine feedback information that is either related or expected to be related to the original query q, and

Determine how to transform query q to take this information effectively into account

The first step can be accomplished in two distinct ways:

Obtain the feedback information explicitly from the users

Obtain the feedback information implicitly from the query results or from external sources such as a thesaurus


    A Framework

In an explicit relevance feedback cycle, the feedback information is provided directly by the users

However, collecting feedback information is expensive and time consuming

In the Web, user clicks on search results constitute a new source of feedback information

A click indicates a document that is of interest to the user in the context of the current query

Notice that a click does not necessarily indicate a document that is relevant to the query


    Explicit Feedback Information


    A Framework

In an implicit relevance feedback cycle, the feedback information is derived implicitly by the system

There are two basic approaches for compiling implicit feedback information:

local analysis, which derives the feedback information from the top ranked documents in the result set

global analysis, which derives the feedback information from external sources such as a thesaurus


    Implicit Feedback Information


    Explicit Relevance Feedback


    Explicit Relevance Feedback

In a classic relevance feedback cycle, the user is presented with a list of the retrieved documents

Then, the user examines them and marks those that are relevant

In practice, only the top 10 (or 20) ranked documents need to be examined

The main idea consists of:

selecting important terms from the documents that have been identified as relevant, and

enhancing the importance of these terms in a new query formulation


    Explicit Relevance Feedback

Expected effect: the new query will be moved towards the relevant docs and away from the non-relevant ones

Early experiments have shown good improvements in precision for small test collections

Relevance feedback presents the following characteristics:

it shields the user from the details of the query reformulation process (all the user has to provide is a relevance judgement)

it breaks down the whole searching task into a sequence of small steps which are easier to grasp


    The Rocchio Method


    The Rocchio Method

Documents identified as relevant (to a given query) have similarities among themselves

Further, non-relevant docs have term-weight vectors which are dissimilar from the relevant documents

The basic idea of the Rocchio Method is to reformulate the query such that it gets:

closer to the neighborhood of the relevant documents in the vector space, and

away from the neighborhood of the non-relevant documents


    The Rocchio Method

Let us define terminology regarding the processing of a given query q, as follows:

Dr: set of relevant documents among the documents retrieved

Nr: number of documents in set Dr

Dn: set of non-relevant docs among the documents retrieved

Nn: number of documents in set Dn

Cr: set of relevant docs among all documents in the collection

N: number of documents in the collection

α, β, γ: tuning constants


    The Rocchio Method

Consider that the set Cr is known in advance

Then, the best query vector for distinguishing the relevant from the non-relevant docs is given by

qopt = (1/|Cr|) Σ_{dj ∈ Cr} dj − (1/(N − |Cr|)) Σ_{dj ∉ Cr} dj

where

|Cr| refers to the cardinality of the set Cr

dj is a weighted term vector associated with document dj, and

qopt is the optimal weighted term vector for query q


    The Rocchio Method

However, the set Cr is not known a priori

To solve this problem, we can formulate an initial query and incrementally change the initial query vector


    The Rocchio Method

There are three classic and similar ways to calculate the modified query qm, as follows:

Standard_Rocchio: qm = α·q + (β/Nr) Σ_{dj ∈ Dr} dj − (γ/Nn) Σ_{dj ∈ Dn} dj

Ide_Regular: qm = α·q + β Σ_{dj ∈ Dr} dj − γ Σ_{dj ∈ Dn} dj

Ide_Dec_Hi: qm = α·q + β Σ_{dj ∈ Dr} dj − γ·max_rank(Dn)

where max_rank(Dn) is the highest ranked non-relevant doc
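The Standard_Rocchio formula can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: sparse vectors are plain {term: weight} dicts, the α, β, γ defaults are common choices rather than prescribed values, and negative weights are clipped to zero as is usual in practice.

```python
# A minimal sketch of Standard_Rocchio over sparse {term: weight} vectors.
from collections import defaultdict

def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Return qm = alpha*q + (beta/Nr)*sum(Dr) - (gamma/Nn)*sum(Dn)."""
    qm = defaultdict(float)
    for term, w in q.items():
        qm[term] += alpha * w
    for d in rel_docs:                       # Dr: relevant documents
        for term, w in d.items():
            qm[term] += beta * w / len(rel_docs)
    for d in nonrel_docs:                    # Dn: non-relevant documents
        for term, w in d.items():
            qm[term] -= gamma * w / len(nonrel_docs)
    # Terms driven negative are usually dropped from the expanded query
    return {t: w for t, w in qm.items() if w > 0}

q = {"jaguar": 1.0}
rel = [{"jaguar": 0.8, "car": 0.6}, {"jaguar": 0.5, "car": 0.9}]
nonrel = [{"jaguar": 0.7, "cat": 0.9}]
qm = rocchio(q, rel, nonrel)
```

Note how the expanded query picks up "car" from the relevant documents while the term "cat", which appears only in the non-relevant document, is pushed to a negative weight and clipped.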


    The Rocchio Method

Three different setups of the parameters in the Rocchio formula are as follows:

α = 1, proposed by Rocchio

α = β = γ = 1, proposed by Ide

γ = 0, which yields a positive feedback strategy

The current understanding is that the three techniques yield similar results

The main advantages of the above relevance feedback techniques are simplicity and good results

Simplicity: modified term weights are computed directly from the set of retrieved documents

Good results: the modified query vector does reflect a portion of the intended query semantics (observed experimentally)


Relevance Feedback for the Probabilistic Model


    A Probabilistic Method

The probabilistic model ranks documents for a query q according to the probabilistic ranking principle

The similarity of a document dj to a query q in the probabilistic model can be expressed as

sim(dj, q) ~ Σ_{ki ∈ q ∧ ki ∈ dj} [ log( P(ki|R) / (1 − P(ki|R)) ) + log( (1 − P(ki|R̄)) / P(ki|R̄) ) ]

where

P(ki|R) stands for the probability of observing the term ki in the set R of relevant documents

P(ki|R̄) stands for the probability of observing the term ki in the set R̄ of non-relevant docs


    A Probabilistic Method

Initially, the equation above cannot be used because P(ki|R) and P(ki|R̄) are unknown

Different methods for estimating these probabilities automatically were discussed in Chapter 3

With user feedback information, these probabilities are estimated in a slightly different way

For the initial search (when there are no retrieved documents yet), assumptions often made include:

P(ki|R) is constant for all terms ki (typically 0.5)

the term probability distribution P(ki|R̄) can be approximated by the distribution in the whole collection


    A Probabilistic Method

These two assumptions yield:

P(ki|R) = 0.5        P(ki|R̄) = ni/N

where ni stands for the number of documents in the collection that contain the term ki

Substituting into the similarity equation, we obtain

siminitial(dj, q) = Σ_{ki ∈ q ∧ ki ∈ dj} log( (N − ni) / ni )

For the feedback searches, the accumulated statistics on relevance are used to evaluate P(ki|R) and P(ki|R̄)


    A Probabilistic Method

Let nr,i be the number of documents in set Dr that contain the term ki

Then, the probabilities P(ki|R) and P(ki|R̄) can be approximated by

P(ki|R) = nr,i / Nr        P(ki|R̄) = (ni − nr,i) / (N − Nr)

Using these approximations, the similarity equation can be rewritten as

sim(dj, q) = Σ_{ki ∈ q ∧ ki ∈ dj} [ log( nr,i / (Nr − nr,i) ) + log( (N − Nr − (ni − nr,i)) / (ni − nr,i) ) ]


    A Probabilistic Method

Notice that here, contrary to the Rocchio Method, no query expansion occurs

The same query terms are reweighted using feedback information provided by the user

The formula above poses problems for certain small values of Nr and nr,i

For this reason, a 0.5 adjustment factor is often added to the estimation of P(ki|R) and P(ki|R̄):

P(ki|R) = (nr,i + 0.5) / (Nr + 1)        P(ki|R̄) = (ni − nr,i + 0.5) / (N − Nr + 1)
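The smoothed estimates plug directly into the feedback similarity formula. The sketch below is an illustrative implementation assuming simple integer counts as input; the function names are made up for the example.

```python
# Probabilistic term reweighting with the 0.5 adjustment factor.
import math

def term_weight(n_i, nr_i, N, Nr):
    """log[p/(1-p)] + log[(1-q)/q] with the smoothed estimates
    p = P(ki|R)     = (nr_i + 0.5) / (Nr + 1)
    q = P(ki|R_bar) = (n_i - nr_i + 0.5) / (N - Nr + 1)."""
    p = (nr_i + 0.5) / (Nr + 1)
    q = (n_i - nr_i + 0.5) / (N - Nr + 1)
    return math.log(p / (1 - p)) + math.log((1 - q) / q)

def sim(doc_terms, query_terms, stats, N, Nr):
    """Sum the weights of query terms that occur in the document.
    stats[k] = (n_k, nr_k): collection and relevant-set doc frequencies."""
    return sum(term_weight(*stats[k], N, Nr)
               for k in query_terms if k in doc_terms)
```

A term that occurs in many of the user-marked relevant documents (high nr_i) receives a much larger weight than one that occurs in none, which is exactly the reweighting effect described above.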


    A Probabilistic Method

The main advantage of this feedback method is the derivation of new weights for the query terms

The disadvantages include:

document term weights are not taken into account during the feedback loop;

weights of terms in the previous query formulations are disregarded; and

no query expansion is used (the same set of index terms in the original query is reweighted over and over again)

Thus, this method does not in general operate as effectively as the vector modification methods


    Evaluation of Relevance Feedback


    Evaluation of Relevance Feedback

    Consider the modified query vectorqm produced byexpandingqwith relevant documents, according to theRocchio formula

    Evaluation ofqm

    :

    Compare the documents retrieved byqm with the set of relevant

    documents forq

    In general, the results show spectacular improvements

    However, a part of this improvement results from the higher ranks

    assigned to the relevant docs used to expand qinto qm

    Since the user has seen these docs already, such evaluation is

    unrealistic


    The Residual Collection

A more realistic approach is to evaluate qm considering only the residual collection

We call residual collection the set of all docs minus the set of feedback docs provided by the user

Then, the recall-precision figures for qm tend to be lower than the figures for the original query vector q

This is not a limitation because the main purpose of the process is to compare distinct relevance feedback strategies


    Explicit Feedback Through Clicks


    Explicit Feedback Through Clicks

Web search engine users not only inspect the answers to their queries, they also click on them

The clicks reflect preferences for particular answers in the context of a given query

They can be collected in large numbers without interfering with the user actions

The immediate question is whether they also reflect relevance judgements on the answers

Under certain restrictions, the answer is affirmative, as we now discuss


    Eye Tracking

Clickthrough data provides limited information on the user behavior

One approach to complement information on user behavior is to use eye tracking devices

Such commercially available devices can be used to determine the area of the screen the user is focused on

The approach allows correctly detecting the area of the screen of interest to the user in 60-90% of the cases

Further, the cases for which the method does not work can be determined

Eye Tracking

Eye movements can be classified in four types: fixations, saccades, pupil dilation, and scan paths

Fixations are a gaze at a particular area of the screen lasting for 200-300 milliseconds

This time interval is large enough to allow effective brain capture and interpretation of the image displayed

Fixations are the ocular activity normally associated with visual information acquisition and processing

That is, fixations are key to interpreting user behavior

Relevance Judgements

To evaluate the quality of the results, eye tracking is not appropriate

This evaluation requires selecting a set of test queries and determining relevance judgements for them

This is also the case if we intend to evaluate the quality of the signal produced by clicks


    User Behavior

Eye tracking experiments have shown that users scan the query results from top to bottom

The users inspect the first and second results right away, within the second or third fixation

Further, they tend to scan the top 5 or top 6 answers thoroughly, before scrolling down to see other answers

User Behavior

Percentage of times each one of the top results was viewed and clicked on by a user, for 10 test tasks and 29 subjects (Thorsten Joachims et al.)

http://portal.acm.org/citation.cfm?id=1229179.1229181

User Behavior

We notice that the users inspect the top 2 answers almost equally, but they click three times more on the first

This might be indicative of a user bias towards the search engine

That is, the users tend to trust the search engine in recommending a top result that is relevant

User Behavior

This can be better understood by presenting test subjects with two distinct result sets:

the normal ranking returned by the search engine and

a modified ranking in which the top 2 results have their positions swapped

Analysis suggests that the user displays a trust bias in the search engine that favors the top result

That is, the position of the result has great influence on the user's decision to click on it

Clicks as a Metric of Preferences

Thus, it is clear that interpreting clicks as a direct indicative of relevance is not the best approach

More promising is to interpret clicks as a metric of user preferences

For instance, a user can look at a result and decide to skip it to click on a result that appears lower

In this case, we say that the user prefers the result clicked on to the result shown higher in the ranking

This type of preference relation takes into account:

the results clicked on by the user

the results that were inspected and not clicked on

Clicks within a Same Query

To interpret clicks as user preferences, we adopt the following definitions

Given a ranking function R(qi, dj), let rk be the kth ranked result

That is, r1, r2, r3 stand for the first, the second, and the third top results, respectively

Further, a result rk may be marked as clicked, indicating that the user has clicked on the kth result

Define a preference function rk > rk−n, 0 < k − n < k, that states that, according to the click actions of the user, the kth top result is preferable to the (k − n)th result

Clicks within a Same Query

To illustrate, consider the following example regarding the click behavior of a user: among the top 10 results r1, r2, ..., r10, the user clicked on r3, r5, and r10

This behavior does not allow us to make definitive statements about the relevance of results r3, r5, and r10

However, it does allow us to make statements on the relative preferences of this user

Two distinct strategies to capture the preference relations in this case are as follows:

Skip-Above: if rk was clicked then rk > rk−n, for every higher-ranked rk−n that was not clicked

Skip-Previous: if rk was clicked and rk−1 has not been clicked, then rk > rk−1

Clicks within a Same Query

To illustrate, consider again the example above, in which the user clicked on results r3, r5, and r10

According to the Skip-Above strategy, we have:

r3 > r2; r3 > r1

And, according to the Skip-Previous strategy, we have:

r3 > r2

We notice that the Skip-Above strategy produces more preference relations than the Skip-Previous strategy
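Both strategies are straightforward to implement. The sketch below assumes clicks are given as a set of 1-based ranks; the names and representation are illustrative, not the book's code.

```python
# Skip-Above and Skip-Previous preference extraction from a click set.
def skip_above(clicked):
    """For each clicked rank k, prefer k over every unclicked rank above it."""
    prefs = set()
    for k in clicked:
        for earlier in range(1, k):
            if earlier not in clicked:
                prefs.add((k, earlier))  # (k, e) means: result k preferred to e
    return prefs

def skip_previous(clicked):
    """For each clicked rank k, prefer k over k-1 if k-1 was not clicked."""
    return {(k, k - 1) for k in clicked if k > 1 and (k - 1) not in clicked}

clicks = {3, 5, 10}  # the slides' example: clicks on r3, r5, and r10
```

On this example, Skip-Above yields r3 > r2 and r3 > r1 (plus the analogous pairs for r5 and r10), while Skip-Previous yields only the immediate pairs such as r3 > r2.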

Clicks within a Same Query

Empirical results indicate that user clicks are in agreement with judgements on the relevance of results in roughly 80% of the cases

Both the Skip-Above and the Skip-Previous strategies produce preference relations

If we swap the first and second results, the clicks still reflect preference relations, for both strategies

If we reverse the order of the top 10 results, the clicks still reflect preference relations, for both strategies

Thus, the clicks of the users can be used as a strong indicative of personal preferences

Further, they also can be used as a strong indicative of the relative relevance of the results for a given query

Clicks within a Query Chain

The discussion above was restricted to the context of a single query

However, in practice, users issue more than one query in their search for answers to a same task

The set of queries associated with a same task can be identified in live query streams

This set constitutes what is referred to as a query chain

The purpose of analysing query chains is to produce new preference relations

Clicks within a Query Chain

To illustrate, consider that two result sets in a same query chain led to the following click actions: the user did not click on any of the answers r1, r2, ..., r10 of the first result set, and clicked on answers s2 and s5 of the second result set s1, s2, ..., s10

where

rj refers to an answer in the first result set

sj refers to an answer in the second result set

In this case, the user only clicked on the second and fifth answers of the second result set

Clicks within a Query Chain

Two distinct strategies to capture the preference relations in this case are as follows:

Top-One-No-Click-Earlier: if no result of the earlier set was clicked and some result sk was clicked, then sj > r1, for all j ≤ 10

Top-Two-No-Click-Earlier: if no result of the earlier set was clicked and some result sk was clicked, then sj > r1 and sj > r2, for all j ≤ 10

According to the first strategy, the following preferences are produced by the click of the user on result s2:

s1 > r1; s2 > r1; s3 > r1; s4 > r1; s5 > r1; ...

According to the second strategy, we have:

s1 > r1; s2 > r1; s3 > r1; s4 > r1; s5 > r1; ...

s1 > r2; s2 > r2; s3 > r2; s4 > r2; s5 > r2; ...
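A minimal sketch of the two query-chain strategies, assuming (as in the example above) that the user clicked only in the second result set. The string labels s1..s10, r1, r2 and the 10-result cutoff are taken from the example; everything else is illustrative.

```python
# Query-chain preference extraction: clicks on the second result set
# produce preferences over the top answers of the earlier result set.
def top_one_no_click_earlier(clicked_second, num_results=10):
    """If any s_k was clicked, prefer every s_j to r1."""
    if not clicked_second:
        return set()
    return {(f"s{j}", "r1") for j in range(1, num_results + 1)}

def top_two_no_click_earlier(clicked_second, num_results=10):
    """As above, but prefer every s_j to both r1 and r2."""
    if not clicked_second:
        return set()
    return {(f"s{j}", rk) for j in range(1, num_results + 1)
            for rk in ("r1", "r2")}

prefs1 = top_one_no_click_earlier({2, 5})  # user clicked s2 and s5
prefs2 = top_two_no_click_earlier({2, 5})
```

As stated in the slides, the second strategy produces twice as many preference relations as the first.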

Clicks within a Query Chain

We notice that the second strategy produces twice as many preference relations as the first

These preference relations must be compared with the relevance judgements of the human assessors

The following conclusions were derived:

Both strategies produce preference relations in agreement with the relevance judgements in roughly 80% of the cases

Similar agreements are observed even if we swap the first and second results

Similar agreements are observed even if we reverse the order of the results

Clicks within a Query Chain

These results suggest:

The users provide negative feedback on whole result sets (by not clicking on them)

The users learn with the process and reformulate better queries on the subsequent iterations


    Click-based Ranking

Click-based Ranking

Clickthrough information can be used to improve the ranking

This can be done by learning a modified ranking function from click-based preferences

One approach is to use support vector machines (SVMs) to learn the ranking function

Click-based Ranking

In this case, preference relations are transformed into inequalities among weighted term vectors representing the ranked documents

These inequalities are then translated into an SVM optimization problem

The solution of this optimization problem is the optimal weights for the document terms

The approach proposes the combination of different retrieval functions with different weights
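The slides describe an SVM formulation; as a self-contained illustration, the sketch below learns a linear ranking function from the same kind of preference pairs using a perceptron-style update, a deliberately simplified stand-in for the SVM optimization. The key idea is identical: each preference da > db becomes the inequality w·(φ(da) − φ(db)) > 0, and we look for a weight vector w satisfying the inequalities.

```python
# Pairwise learning of a linear ranking function from click preferences.
# Perceptron-style stand-in for the SVM described in the text; the
# feature values below are made up for illustration.
def train_ranker(pref_pairs, dim, epochs=50, lr=0.1):
    """pref_pairs: list of (preferred_vec, other_vec) feature vectors."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pref_pairs:
            diff = [b - c for b, c in zip(better, worse)]
            score = sum(wi * di for wi, di in zip(w, diff))
            if score <= 0:  # inequality w . (better - worse) > 0 violated
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

# Each feature could be the score of a different retrieval function
# (e.g. a term-weighting score and a title-match score), so the learned
# w combines the different retrieval functions with different weights.
pairs = [((0.9, 0.2), (0.4, 0.1)), ((0.7, 0.8), (0.6, 0.3))]
w = train_ranker(pairs, dim=2)
```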


    Implicit Feedback Through Local Analysis

Local Analysis

Local analysis consists in deriving feedback information from the documents retrieved for a given query q

This is similar to a relevance feedback cycle but done without assistance from the user

Two local strategies are discussed here: local clustering and local context analysis


    Local Clustering

Local Clustering

Adoption of clustering techniques for query expansion has been a basic approach in information retrieval

The standard procedure is to quantify term correlations and then use the correlated terms for query expansion

Term correlations can be quantified by using global structures, such as association matrices

However, global structures might not adapt well to the local context defined by the current query

To deal with this problem, local clustering can be used, as we now discuss

Association Clusters

For a given query q, let

Dl: local document set, i.e., the set of documents retrieved by q

Nl: number of documents in Dl

Vl: local vocabulary, i.e., the set of all distinct words in Dl

fi,j: frequency of occurrence of a term ki in a document dj ∈ Dl

M = [mij]: term-document matrix with |Vl| rows and Nl columns

mij = fi,j: an element of matrix M

M^t: transpose of M

The matrix

    C = M M^t

is a local term-term correlation matrix
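As a sketch of these definitions, the matrix C = M M^t can be computed directly (pure Python; the toy documents and terms below are invented for illustration):

```python
# Toy local document set Dl: each document is a list of terms
# (invented for illustration).
docs = [
    ["ir", "query", "feedback"],
    ["ir", "query", "query"],
    ["feedback", "expansion"],
]

vocab = sorted({term for d in docs for term in d})    # local vocabulary Vl
# M: |Vl| x Nl term-document matrix, m_ij = f_ij
f = [[d.count(term) for d in docs] for term in vocab]

# C = M M^t: local term-term correlation matrix
n_terms, n_docs = len(vocab), len(docs)
C = [[sum(f[u][j] * f[v][j] for j in range(n_docs))
      for v in range(n_terms)]
     for u in range(n_terms)]
```

Each entry C[u][v] sums, over all local documents, the product of the frequencies of terms u and v in that document.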


Each element cu,v ∈ C expresses a correlation between terms ku and kv

This relationship between the terms is based on their joint co-occurrences inside documents of the collection

The higher the number of documents in which the two terms co-occur, the stronger is this correlation

Correlation strengths can be used to define local clusters of neighbor terms

Terms in the same cluster can then be used for query expansion

We consider three types of clusters here: association clusters, metric clusters, and scalar clusters


An association cluster is computed from a local correlation matrix C

For that, we re-define the correlation factors cu,v between any pair of terms ku and kv, as follows:

    c_{u,v} = \sum_{d_j \in D_l} f_{u,j} \times f_{v,j}

In this case the correlation matrix is referred to as a local association matrix

The motivation is that terms that co-occur frequently inside documents have a synonymity association


The correlation factors cu,v and the association matrix C are said to be unnormalized

An alternative is to normalize the correlation factors:

    c_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}

In this case the association matrix C is said to be normalized


Given a local association matrix C, we can use it to build local association clusters as follows

Let Cu(n) be a function that returns the n largest factors cu,v in C, where v varies over the set of local terms and v ≠ u

Then, Cu(n) defines a local association cluster, a neighborhood, around the term ku

Given a query q, we are normally interested in finding clusters only for the |q| query terms

This means that such clusters can be computed efficiently at query time
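A sketch of building Cu(n) from an association matrix, in both unnormalized and normalized form (the matrix values and vocabulary are invented for illustration):

```python
# Toy unnormalized local association matrix C over the vocabulary below
# (symmetric; values invented for illustration).
vocab = ["apple", "fruit", "juice", "pie"]
C = [
    [9, 4, 2, 3],
    [4, 7, 1, 0],
    [2, 1, 5, 1],
    [3, 0, 1, 20],
]

def normalized_factor(u, v):
    # c_uv / (c_uu + c_vv - c_uv)
    return C[u][v] / (C[u][u] + C[v][v] - C[u][v])

def cluster(u, n, normalized=False):
    # C_u(n): the n terms v != u with the largest correlation factors to k_u
    score = (lambda v: normalized_factor(u, v)) if normalized else (lambda v: C[u][v])
    others = [v for v in range(len(vocab)) if v != u]
    return [vocab[v] for v in sorted(others, key=score, reverse=True)[:n]]
```

With this data the normalized factors demote "pie", whose large self-correlation dominates its raw co-occurrence counts.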


    Metric Clusters


Association clusters do not take into account where the terms occur in a document

However, two terms that occur in the same sentence tend to be more correlated

A metric cluster re-defines the correlation factors cu,v as a function of their distances in documents


Let ku(n, j) be a function that returns the nth occurrence of term ku in document dj

Further, let r(ku(n, j), kv(m, j)) be a function that computes the distance between

the nth occurrence of term ku in document dj; and

the mth occurrence of term kv in document dj

We define,

    c_{u,v} = \sum_{d_j \in D_l} \sum_{n,m} \frac{1}{r(k_u(n,j), k_v(m,j))}

In this case the correlation matrix is referred to as a local metric matrix


Notice that if ku and kv are in distinct documents we take their distance to be infinity

Variations of the above expression for cu,v have been reported in the literature, such as 1/r²(ku(n, j), kv(m, j))

The metric correlation factor cu,v quantifies absolute inverse distances and is said to be unnormalized

Thus, the local metric matrix C is said to be unnormalized


An alternative is to normalize the correlation factor

For instance,

    c_{u,v} = \frac{c_{u,v}}{\text{total number of } [k_u, k_v] \text{ pairs considered}}

In this case the local metric matrix C is said to be normalized
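A sketch of the metric correlation factor, taking r as the difference between word positions and treating occurrence pairs in distinct documents as contributing nothing (distance infinity); the documents are invented for illustration:

```python
# Toy documents as ordered term lists; word positions give distances
# (invented for illustration).
docs = [
    ["query", "expansion", "improves", "query", "recall"],
    ["feedback", "helps", "query", "expansion"],
]

def positions(term, doc):
    return [i for i, t in enumerate(doc) if t == term]

def metric_corr(ku, kv, normalized=False):
    # Sum of inverse distances over all occurrence pairs of ku and kv that
    # fall in the same document; pairs in distinct documents contribute 0
    # (distance taken as infinity). Assumes ku != kv.
    total, pairs = 0.0, 0
    for doc in docs:
        for p in positions(ku, doc):
            for q in positions(kv, doc):
                total += 1.0 / abs(p - q)
                pairs += 1
    return total / pairs if normalized and pairs else total
```

The normalized variant divides by the total number of [ku, kv] pairs considered, as in the formula above.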


    Scalar Clusters


The correlation between two local terms can also be defined by comparing the neighborhoods of the two terms

The idea is that two terms with similar neighborhoods have some synonymity relationship

In this case we say that the relationship is indirect or induced by the neighborhood

We can quantify this relationship by comparing the neighborhoods of the terms through a scalar measure

For instance, the cosine of the angle between the two vectors is a popular scalar similarity measure


Let

su = (cu,x1, cu,x2, . . . , cu,xn): vector of neighborhood correlation values for the term ku

sv = (cv,y1, cv,y2, . . . , cv,ym): vector of neighborhood correlation values for the term kv

Define,

    c_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \, |\vec{s}_v|}

In this case the correlation matrix C is referred to as a local scalar matrix


The local scalar matrix C is said to be induced by the neighborhood

Let Cu(n) be a function that returns the n largest cu,v values in a local scalar matrix C, v ≠ u

Then, Cu(n) defines a scalar cluster around term ku
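A sketch of deriving a local scalar matrix from neighborhood vectors via the cosine measure (the matrix values are invented for illustration):

```python
import math

# Toy association/metric matrix whose u-th row is the neighborhood vector
# s_u of term u (values invented for illustration).
vocab = ["apple", "fruit", "juice"]
C = [
    [4.0, 2.0, 1.0],
    [2.0, 3.0, 0.5],
    [1.0, 0.5, 2.0],
]

def cosine(su, sv):
    dot = sum(a * b for a, b in zip(su, sv))
    nu = math.sqrt(sum(a * a for a in su))
    nv = math.sqrt(sum(b * b for b in sv))
    return dot / (nu * nv)

# Local scalar matrix: pairwise cosines between neighborhood vectors
S = [[cosine(C[u], C[v]) for v in range(len(vocab))] for u in range(len(vocab))]
```

Each diagonal entry is 1 (a neighborhood is identical to itself) and the matrix is symmetric, as expected for a cosine-induced correlation.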


    Neighbor Terms


Terms that belong to clusters associated with the query terms can be used to expand the original query

Such terms are called neighbors of the query terms and are characterized as follows

A term kv that belongs to a cluster Cu(n), associated with another term ku, is said to be a neighbor of ku

Often, neighbor terms represent distinct keywords that are correlated by the current query context


Consider the problem of expanding a given user query q with neighbor terms

One possibility is to expand the query as follows

For each term ku ∈ q, select m neighbor terms from the cluster Cu(n) and add them to the query

This can be expressed as follows:

    q_m = q \cup \{k_v \mid k_v \in C_u(n) \wedge k_u \in q\}

Hopefully, the additional neighbor terms kv will retrieve new relevant documents
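A sketch of the expansion rule above, with hypothetical precomputed clusters:

```python
# Hypothetical precomputed clusters C_u(n) for each query term
# (terms invented for illustration).
clusters = {
    "car":  ["automobile", "vehicle"],
    "fuel": ["gasoline", "petrol"],
}

def expand(query):
    # q_m = q united with { k_v | k_v in C_u(n), k_u in q }
    expanded = list(query)
    for ku in query:
        for kv in clusters.get(ku, []):
            if kv not in expanded:
                expanded.append(kv)
    return expanded
```

The set union in the definition is realized here as a duplicate-free list, preserving the original query terms first.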


The set Cu(n) might be composed of terms obtained using normalized and unnormalized correlation factors

Query expansion is important because it tends to improve recall

However, the larger number of documents to rank also tends to lower precision

Thus, query expansion needs to be exercised with great care and fine-tuned for the collection at hand


    Local Context Analysis


The local clustering techniques are based on the set of documents retrieved for a query

A distinct approach is to search for term correlations in the whole collection

Global techniques usually involve the building of a thesaurus that encodes term relationships in the whole collection

The terms are treated as concepts and the thesaurus is viewed as a concept relationship structure

The building of a thesaurus usually considers the use of small contexts and phrase structures


Local context analysis is an approach that combines global and local analysis

It is based on the use of noun groups, i.e., a single noun, two nouns, or three adjacent nouns in the text

Noun groups selected from the top ranked documents are treated as document concepts

However, instead of documents, passages are used for determining term co-occurrences

Passages are text windows of fixed size


More specifically, the local context analysis procedure operates in three steps

First, retrieve the top n ranked passages using the original query

Second, for each concept c in the passages compute the similarity sim(q, c) between the whole query q and the concept c

Third, the top m ranked concepts, according to sim(q, c), are added to the original query q

A weight computed as 1 - 0.9 × i/m is assigned to each concept c, where

i: position of c in the concept ranking

m: number of concepts to add to q

The terms in the original query q might be stressed by assigning a weight equal to 2 to each of them

Of these three steps, the second one is the most complex and the one which we now discuss

The similarity sim(q, c) between each concept c and the original query q is computed as follows

    sim(q, c) = \prod_{k_i \in q} \left( \delta + \frac{\log(f(c, k_i) \times idf_c)}{\log n} \right)^{idf_i}

where n is the number of top ranked passages considered and δ is a small damping constant


The function f(c, ki) quantifies the correlation between the concept c and the query term ki and is given by

    f(c, k_i) = \sum_{j=1}^{n} pf_{i,j} \times pf_{c,j}

where

pfi,j is the frequency of term ki in the j-th passage; and

pfc,j is the frequency of the concept c in the j-th passage

Notice that this is the correlation measure defined for association clusters, but adapted for passages


The inverse document frequency factors are computed as

    idf_i = \max\left(1, \frac{\log_{10}(N/np_i)}{5}\right)

    idf_c = \max\left(1, \frac{\log_{10}(N/np_c)}{5}\right)

where

N is the number of passages in the collection

npi is the number of passages containing the term ki; and

npc is the number of passages containing the concept c

The idfi factor in the exponent is introduced to emphasize infrequent query terms
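A sketch of the three formulas above put together (the passages, the collection statistics N and np, and the value δ = 0.1 are all invented/assumed for illustration):

```python
import math

# Toy top-ranked passages for a query (invented for illustration).
passages = [
    ["query", "expansion", "concept", "query"],
    ["concept", "ranking", "expansion"],
    ["query", "retrieval"],
]
n = len(passages)                 # number of top-ranked passages
N = 10                            # hypothetical total passages in the collection
np_count = {"query": 4, "expansion": 3, "concept": 2, "ranking": 1, "retrieval": 1}
delta = 0.1                       # assumed small damping constant

def idf(term):
    return max(1.0, math.log10(N / np_count[term]) / 5)

def f(c, ki):
    # Association correlation adapted to passages: sum of pf_ij * pf_cj
    return sum(p.count(ki) * p.count(c) for p in passages)

def sim(q, c):
    s = 1.0
    for ki in q:
        s *= (delta + math.log(f(c, ki) * idf(c)) / math.log(n)) ** idf(ki)
    return s
```

With this toy data all idf factors bottom out at 1, so the product reduces to the per-term log factors.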


The procedure above for computing sim(q, c) is a non-trivial variant of tf-idf ranking

It has been adjusted for operation with TREC data and did not work so well with a different collection

Thus, it is important to keep in mind that tuning might be required for operation with a different collection


    Implicit Feedback Through Global Analysis


    Global Context Analysis


The methods of local analysis extract information from the local set of documents retrieved to expand the query

An alternative approach is to expand the query using information from the whole set of documents, a strategy usually referred to as global analysis

We distinguish two global analysis procedures:

Query expansion based on a similarity thesaurus

Query expansion based on a statistical thesaurus


Query Expansion based on a Similarity Thesaurus


    Similarity Thesaurus


We now discuss a query expansion model based on a global similarity thesaurus constructed automatically

The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence

Special attention is paid to the selection of terms for expansion and to the reweighting of these terms

Terms for expansion are selected based on their similarity to the whole query


A similarity thesaurus is built using term-to-term relationships

These relationships are derived by considering that the terms are concepts in a concept space

In this concept space, each term is indexed by the documents in which it appears

Thus, terms assume the original role of documents, while documents are interpreted as indexing elements


Let

t: number of terms in the collection

N: number of documents in the collection

fi,j: frequency of term ki in document dj

tj: number of distinct index terms in document dj

Then,

    itf_j = \log \frac{t}{t_j}

is the inverse term frequency for document dj (analogous to inverse document frequency)


Within this framework, with each term ki is associated a vector k⃗i given by

    \vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})

These weights are computed as follows

    w_{i,j} = \frac{\left(0.5 + 0.5\,\frac{f_{i,j}}{\max_j(f_{i,j})}\right) itf_j}{\sqrt{\sum_{l=1}^{N} \left(0.5 + 0.5\,\frac{f_{i,l}}{\max_l(f_{i,l})}\right)^2 itf_l^2}}

where maxj(fi,j) computes the maximum of all fi,j factors for the i-th term


The relationship between two terms ku and kv is computed as a correlation factor cu,v given by

    c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j} \times w_{v,j}

The global similarity thesaurus is given by the scalar term-term matrix composed of correlation factors cu,v

This global similarity thesaurus has to be computed only once and can be updated incrementally
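A sketch of the term vectors and correlation factors (pure Python; the toy collection is invented, and terms absent from a document are kept with the 0.5 baseline factor, as the formula above literally reads):

```python
import math

# Toy collection: documents as term lists (invented for illustration).
docs = [
    ["ir", "query", "feedback", "query"],
    ["ir", "ranking"],
    ["feedback", "expansion", "ir"],
]
vocab = sorted({term for d in docs for term in d})
t = len(vocab)                                     # number of terms in the collection
itf = [math.log(t / len(set(d))) for d in docs]    # itf_j = log(t / t_j)

def freq(ki, j):
    return docs[j].count(ki)

def weight_vector(ki):
    # w_ij as defined above, normalized so that |k_i| = 1
    fmax = max(freq(ki, j) for j in range(len(docs)))
    raw = [(0.5 + 0.5 * freq(ki, j) / fmax) * itf[j] for j in range(len(docs))]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

def corr(ku, kv):
    # c_uv = k_u . k_v
    return sum(a * b for a, b in zip(weight_vector(ku), weight_vector(kv)))
```

Because each vector is length-normalized, corr(k, k) = 1 and the matrix of corr values is symmetric, which is what allows it to be computed once and updated incrementally.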


Given the global similarity thesaurus, query expansion is done in three steps as follows

First, represent the query in the same vector space used for representing the index terms

Second, compute a similarity sim(q, kv) between each term kv correlated to the query terms and the whole query q

Third, expand the query with the top r ranked terms according to sim(q, kv)


For the first step, the query is represented by a vector q⃗ given by

    \vec{q} = \sum_{k_i \in q} w_{i,q} \, \vec{k}_i

where wi,q is a term-query weight computed using the equation for wi,j, but with q in place of dj

For the second step, the similarity sim(q, kv) is computed as

    sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_i \in q} w_{i,q} \times c_{i,v}


A term kv might be closer to the whole query centroid qc than to the individual query terms

Thus, terms selected here might be distinct from those selected by previous global analysis methods


For the third step, the top r ranked terms are added to the query q to form the expanded query qm

To each expansion term kv in query qm is assigned a weight wv,qm given by

    w_{v,q_m} = \frac{sim(q, k_v)}{\sum_{k_i \in q} w_{i,q}}

The expanded query qm is then used to retrieve new documents

This technique has yielded improved retrieval performance (in the range of 20%) with three different collections
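A sketch of ranking candidate expansion terms by sim(q, kv) and computing wv,qm (the correlation factors and query weights are invented for illustration):

```python
# Hypothetical similarity-thesaurus correlation factors c_uv and term-query
# weights w_iq (all values invented for illustration).
c = {
    ("car", "automobile"): 0.8, ("car", "vehicle"): 0.6,
    ("fuel", "automobile"): 0.1, ("fuel", "gasoline"): 0.9,
}
w_q = {"car": 0.7, "fuel": 0.5}

def sim(kv):
    # sim(q, k_v) = sum over query terms of w_iq * c_iv
    return sum(w_q[ki] * c.get((ki, kv), 0.0) for ki in w_q)

def expansion_weight(kv):
    # w_v,qm = sim(q, k_v) / sum_i w_iq
    return sim(kv) / sum(w_q.values())

candidates = ["automobile", "vehicle", "gasoline"]
ranked = sorted(candidates, key=sim, reverse=True)   # top-r terms are added
```

Note how "automobile" ranks first because it accumulates correlation mass from both query terms, the centroid effect mentioned above.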


Consider a document dj which is represented in the term vector space by

    \vec{d}_j = \sum_{k_i \in d_j} w_{i,j} \, \vec{k}_i

Assume that the query q is expanded to include all the t index terms (properly weighted) in the collection

Then, the similarity sim(q, dj) between dj and q can be computed in the term vector space by

    sim(q, d_j) \propto \sum_{k_v \in d_j} \sum_{k_u \in q} w_{v,j} \times w_{u,q} \times c_{u,v}


The previous expression is analogous to the similarity formula in the generalized vector space model

Thus, the generalized vector space model can be interpreted as a query expansion technique

The two main differences are:

the weights are computed differently

only the top r ranked terms are used


Query Expansion based on a Statistical Thesaurus


    Global Statistical Thesaurus


We now discuss a query expansion technique based on a global statistical thesaurus

The approach is quite distinct from the one based on a similarity thesaurus

The global thesaurus is composed of classes that group correlated terms in the context of the whole collection

Such correlated terms can then be used to expand the original user query


To be effective, the terms selected for expansion must have high term discrimination values

This implies that they must be low frequency terms

However, it is difficult to cluster low frequency terms due to the small amount of information about them

To circumvent this problem, documents are clustered into classes

The low frequency terms in these documents are then used to define thesaurus classes


A document clustering algorithm that produces small and tight clusters is the complete link algorithm:

1. Initially, place each document in a distinct cluster

2. Compute the similarity between all pairs of clusters

3. Determine the pair of clusters [Cu, Cv] with the highest inter-cluster similarity

4. Merge the clusters Cu and Cv

5. Verify a stop criterion (if this criterion is not met then go back to step 2)

6. Return a hierarchy of clusters
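A simplified sketch of the steps above: it merges until the best complete-link similarity drops below a threshold, returning flat clusters rather than the full hierarchy (the document vectors are invented for illustration):

```python
import math

# Toy document vectors (invented for illustration).
vectors = {
    "d1": [1.0, 0.0, 1.0],
    "d2": [0.9, 0.1, 1.0],
    "d3": [0.0, 1.0, 0.1],
    "d4": [0.1, 1.0, 0.0],
}

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cluster_sim(cu, cv):
    # Complete link: similarity of the LEAST similar document pair
    return min(cos(vectors[a], vectors[b]) for a in cu for b in cv)

def complete_link(threshold):
    clusters = [[d] for d in vectors]
    while len(clusters) > 1:
        pairs = [(cluster_sim(cu, cv), i, j)
                 for i, cu in enumerate(clusters)
                 for j, cv in enumerate(clusters) if i < j]
        best, i, j = max(pairs)            # highest inter-cluster similarity
        if best < threshold:               # stop criterion
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]
```

Because the cluster similarity is the minimum over document pairs, a single dissimilar member blocks a merge, which is why the resulting clusters are small and tight.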


The similarity between two clusters is defined as the minimum of the similarities between two documents not in the same cluster

To compute the similarity between documents in a pair, the cosine formula of the vector model is used

As a result of this minimality criterion, the resultant clusters tend to be small and tight


Consider that the whole document collection has been clustered using the complete link algorithm

The figure below illustrates a portion of the whole cluster hierarchy generated by the complete link algorithm

[Figure: a portion of a cluster hierarchy over clusters Cu, Cv, and Cz, with inter-cluster similarities 0.15 and 0.11 shown in ovals]


The terms that compose each class of the global thesaurus are selected as follows

Obtain from the user three parameters:

TC: threshold class

NDC: number of documents in a class

MIDF: minimum inverse document frequency

Parameter TC determines the document clusters that will be used to generate thesaurus classes

Two clusters Cu and Cv are selected when TC is surpassed by sim(Cu, Cv)


Use NDC as a limit on the number of documents in the clusters

For instance, if both Cu+v and Cu+v+z are selected then the parameter NDC might be used to decide between the two

MIDF defines the minimum value of IDF for any term which is selected to participate in a thesaurus class


Given that the thesaurus classes have been built, they can be used for query expansion

For this, an average term weight wtC for each thesaurus class C is computed as follows

    wt_C = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|}

where

|C| is the number of terms in the thesaurus class C, and

wi,C is a weight associated with the term-class pair [ki, C]


This average term weight can then be used to compute a thesaurus class weight wC as

    w_C = \frac{wt_C}{|C|^{0.5}}

The above weight formulations have been verified through experimentation and have yielded good results
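A sketch of the two class-weight formulas (the term-class weights are invented for illustration):

```python
# Hypothetical thesaurus class C with term-class weights w_i,C
# (values invented for illustration).
class_weights = {"gasoline": 0.8, "petrol": 0.7, "diesel": 0.6}

def class_term_weight(weights):
    # wt_C: average of the term-class weights over the |C| terms
    return sum(weights.values()) / len(weights)

def class_weight(weights):
    # w_C = wt_C / |C|^0.5
    return class_term_weight(weights) / len(weights) ** 0.5
```

Dividing by |C|^0.5 dampens the weight of large classes, so small, tight classes contribute more strongly to the expanded query.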
