
1

Testing in Information Retrieval

Hsin-Hsi Chen


2

Parameters of Experimental Procedures

• Validity: the extent to which the experiment actually determines what the experimenter wishes to determine.

– Do the observed variables really represent the concepts under investigation? Does a five-point rating scale really measure user satisfaction?

– Does the number of search terms really measure search complexity?

– Does evaluation by a judge other than the user really represent document relevance?

– Can results obtained with student subjects be replicated in a corporate information center?


3

Parameters of Experimental Procedures (Continued)

– Lack of validity: order effects, learning effects, models based on inappropriate axioms, improper operational definitions, observer inaccuracies, and extraneous subject-treatment interaction

• Reliability: the extent to which the experimental results can be replicated.

– Will another experimenter get consistent results?

– Lack of reliability: small samples, unequal samples, nonrandom samples, and improper methods of statistical analysis


4

Parameters of Experimental Procedures (Continued)

• Efficiency: the extent to which an experiment is effective (i.e., valid and reliable) relative to the resources consumed

– valid: employ a larger number of variables and more discriminating variables

– reliable: increase the size and representativeness of the database and the user and query sets


5

Decision 1: To Test or Not To Test

• the purpose of the test

• the addition to knowledge that will result from its execution

• whether the addition has not already been made


6

Decision 2: What kind of test?

• laboratory test: the sources of variability stemming from users, databases, searchers, and search constraints are under the control of the experimenter.

• operational test: one or more existing systems - with their own users, databases, searchers, and search constraints - are evaluated or compared.

• differences: operational tests are closer to "real life" but provide less specific information


7

Decision 3: How to operationalize the variables

• variables: some attribute or feature - qualitative or quantitative - of a retrieval system

– examples: database, information representation, users, queries, search intermediaries, retrieval process, retrieval evaluation

– roles: independent variable, dependent variable, environmental variable


8

V1. Document collection or database

• database: a collection of documents (document: a package of information created by someone)

• variables

– size

• the number of documents

• the number of records

• the storage requirements (in bytes or blocks)

– concentration


9

V1. Document collection or database (Continued)

– form

• completeness of the representation (e.g., citation, abstract, full-text)

• publication vehicle (e.g., monograph, journal article, technical report)

– medium

• communication medium (text, sound, table, picture, graph)

• record medium (paper, electronic)


10

V2. Information Representation

• logical structure of the stored information

– Boolean model

– vector space model

– relational model

– semantic model

– cluster model

– hypergraph model

– production grammar model


11

V2. Information Representation (Continued)

• physical structure of the stored information

– B-trees and other multiway trees

– hashing

– signatures

– multilists

• indexing

– exhaustivity of indexing

• the number of topics covered by the indexing

• number of index terms/document


12

V2. Information Representation (Continued)

– Specificity of indexing

• the precision of the subject descriptions

• number of postings per term

– degree of control in indexing

• proportion of free keywords vs. controlled vocabulary terms assigned to documents

– degree of linkage in a vocabulary

• number of "see also" references in the dictionary

– accommodation of vocabulary

• extent to which the user need not know exact terms

• number of "see" references in the dictionary


13

V2. Information Representation (Continued)

– term discrimination value

• extent to which a term decreases the average similarity of the document set

– degree of pre-coordination of terms

• number of index terms per index phrase

– degree of syntactic control

• grammatical operators, role operators, relational operators

– accuracy of indexing

• number of indexing errors

• types of errors: omission and commission

– inter-indexer consistency

• ratio of the number of terms assigned by both indexers to the number of terms assigned by either


14

V3. Users

• types of user: student, scientist, business person, child

• context of use: occupational, educational, recreational

• kinds of information needed: aid in understanding, define a problem, place a problem in context, design a strategy, complete a solution

• immediacy of information need: immediate, current, future

• ...


15

V4. Queries and search statements

• terms

– query: the verbalized statement of a user's need

• real life: real information needs of a user

• artificial: generate queries from titles and other parts

– search statement: a single string, expressed in the language of the system

• Boolean expression

• vector

• an expression employing some other kind of syntax

• natural language


16

V5. The search process

• Delegated vs. end-user searching

– degrees of interaction between user and searcher within delegated searching

• Search logic or techniques used

– individual documents (Boolean, vector, extended Boolean, natural language processing, relevance feedback)

– networks of documents (clustering, browsing, spreading activation)

• Access modes for searcher: command, hypertext links, menus


17

V6. Retrieval Performance

• recall: the proportion of relevant documents that are retrieved

• precision: the proportion of retrieved documents that are relevant

• fallout: the proportion of nonrelevant items that are retrieved
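To make the three measures concrete, here is a minimal Python sketch (not from the original slides; the function and variable names are illustrative) that computes them for a single query from sets of document ids:

```python
def evaluate(retrieved, relevant, collection_size):
    """Compute recall, precision, and fallout for one query.

    retrieved: set of document ids returned by the system
    relevant: set of document ids judged relevant
    collection_size: total number of documents in the database
    """
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    nonrelevant_total = collection_size - len(relevant)
    fallout = (len(retrieved) - hits) / nonrelevant_total if nonrelevant_total else 0.0
    return recall, precision, fallout

# example: 3 of the 4 relevant documents appear in a retrieved list of 10
print(evaluate({1, 2, 3, 7, 8, 9, 10, 11, 12, 13}, {1, 2, 3, 4}, 1000))
```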


18

V6. Retrieval Performance (Continued)

• issue 1: how are the states of being retrieved and being relevant determined?

– What units are retrieved and evaluated as to relevance - full papers, sections, paragraphs, sentences?

• issue 2: if output is ranked, then recall and precision will depend on the stopping point

– Recall and precision can be calculated at each rank.

n: the size of the relevant document set
N: the size of the total document set
r_i: the retrieval rank of the ith relevant document

R_norm = 1 - [Σ_{i=1..n} r_i - Σ_{i=1..n} i] / [n(N - n)]

P_norm = 1 - [Σ_{i=1..n} log r_i - Σ_{i=1..n} log i] / log[N! / ((N - n)! n!)]
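A minimal Python sketch of the two normalized measures above, assuming the ranks of the relevant documents are known (names are illustrative, not from the original slides):

```python
from math import lgamma, log

def normalized_measures(relevant_ranks, N):
    """R_norm and P_norm for one query.

    relevant_ranks: retrieval ranks r_1..r_n of the n relevant documents (1-based)
    N: total number of documents in the collection
    """
    n = len(relevant_ranks)
    ideal = range(1, n + 1)  # best case: relevant documents at ranks 1..n
    r_norm = 1 - (sum(relevant_ranks) - sum(ideal)) / (n * (N - n))
    # log[N! / ((N - n)! n!)] computed via log-gamma to avoid huge factorials
    log_binom = lgamma(N + 1) - lgamma(N - n + 1) - lgamma(n + 1)
    p_norm = 1 - (sum(log(r) for r in relevant_ranks) - sum(log(i) for i in ideal)) / log_binom
    return r_norm, p_norm

print(normalized_measures([2, 5, 9], N=100))
```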


19

V6. Retrieval Performance (Continued)

• issue 3: combine recall and precision into an overall retrieval performance measure (see the sketch at the end of this slide)

– E-measure: E = 1 - 1 / (α/P + (1 - α)/R), where α weights precision relative to recall

– MZ metric: D = 1 - 1 / (1/P + 1/R - 1)

• issue 4: how to determine the relevant set?

– The relevant set is predetermined by some means.

• Taking the title of a paper as the query and the cited documents as the relevant set.

• There may be other documents in the database that are also relevant.
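The sketch promised under issue 3: a small Python illustration of the E-measure and the MZ metric, assuming precision and recall have already been computed (names and the example values are illustrative):

```python
def e_measure(precision, recall, alpha=0.5):
    # E = 1 - 1 / (alpha/P + (1 - alpha)/R); alpha weights precision vs. recall
    return 1 - 1 / (alpha / precision + (1 - alpha) / recall)

def mz_metric(precision, recall):
    # D = 1 - 1 / (1/P + 1/R - 1)
    return 1 - 1 / (1 / precision + 1 / recall - 1)

print(e_measure(0.4, 0.6), mz_metric(0.4, 0.6))
```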


20

V6. Retrieval Performance (Continued)

– A small document set is used for the test, and the relevance of all documents for all queries is assessed by users or system personnel.

• Small files may not be very reliable.

– A random sample of the nonretrieved set is taken and all documents in the set assessed as to relevance.

• The size of the relevant set is small compared to the size of the database, and hence a very large sample will be needed.

– In comparative tests, relative rather than absolute recall is calculated.

a_i: the set of relevant documents retrieved by the ith treatment
N(a): the number of documents in a set a

relative recall of treatment i = N(a_i) / N(∪_j a_j)
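A minimal sketch of relative recall for a comparative test, assuming each treatment's set of relevant retrieved documents is available (names are illustrative):

```python
def relative_recall(retrieved_relevant_sets):
    """retrieved_relevant_sets: list of sets, one per treatment,
    holding the relevant documents each treatment retrieved."""
    union = set().union(*retrieved_relevant_sets)  # relevant documents found by any treatment
    return [len(a) / len(union) for a in retrieved_relevant_sets]

print(relative_recall([{1, 2, 3}, {2, 3, 4, 5}]))  # e.g. [0.6, 0.8]
```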


21

V6. Retrieval Performance (Continued)

• Four dimensions of effectiveness

– How informative was the retrieved set? (precision)

– How complete was the retrieved set? (recall)

– How much time did the user spend with the system? (contact time)

– Was the experience with the system a satisfying one? (user friendly)


22

Decision 4. What database to use?

• experimental vs. operational database

• decisions in developing an operational database

– coverage of the database: subject, time period, language

– source of documents

– source of vocabulary, use of authority files

– form of documents: full text, abstracts, citations

– fields of record or records of each document

– display formats, ordering of records in displays

– windowing capabilities


23

Decision 4. What database to use? (Continued)

• popular test collections

– the Cranfield collection

– Medlars

– Communications of the ACM

– the Ontap ERIC collection


24

Decision 5. Where to get queries?

• real users

– difficult to control and difficult to involve in the search process and evaluation in a predetermined fashion

• artificial queries

– use the title of a paper as the query and the references cited as the set of relevant answers

– the records of past queries

– problem: they do not represent the information need of a person involved in the test

• other users

– inconsistencies may result


25

Decision 6: How to process queries


26

Decision 7. How will treatments be assigned to experimental units?


27

The TREC Conferences


28

Introduction

• TREC-1 (Text Retrieval Conference): November 1992

• TREC-2: August 1993

• TREC-3


29

Task

(diagram of the TREC task design: 150 training topics and 50 new test topics; 2 gigabytes of training documents and 1 gigabyte of new documents; routing queries, built from the training topics, are run against the new documents, while ad hoc queries, built from the new topics, are run against the existing documents)


30

Categories of Query Construction

• AUTOMATIC: completely automatic initial query construction

• MANUAL: manual initial construction

• INTERACTIVE: use of interactive techniques to construct the queries


31

Levels of Participation

• Category A: full participation

• Category B: full participation using a reduced database

• Category C: evaluation only

• submit up to two runs for the routing task, the ad hoc task, or both

• send in the top 1000 documents retrieved for each topic for evaluation


32

TREC-3 Participants (14 companies, 19 universities)


33

The Test Collection

• the documents

• the queries or topics

• the relevance judgements (right answers)


34

The Documents

• Disk 1 (1 GB)

– WSJ: Wall Street Journal (1987, 1988, 1989)

– AP: AP Newswire (1989)

– ZIFF: Articles from Computer Select disks (Ziff-Davis Publishing)

– FR: Federal Register (1989)

– DOE: Short abstracts from DOE publications

• Disk 2 (1 GB)

– WSJ: Wall Street Journal (1990, 1991, 1992)

– AP: AP Newswire (1988)

– ZIFF: Articles from Computer Select disks

– FR: Federal Register (1988)


35

The Documents (Continued)

• Disk 3 (1 GB)

– SJMN: San Jose Mercury News (1991)

– AP: AP Newswire (1990)

– ZIFF: Articles from Computer Select disks

– PAT: U.S. Patents (1993)

• Statistics

– document lengths: DOE (very short documents) vs. FR (very long documents)

– range of document lengths: AP (similar in length) vs. WSJ and ZIFF (wider range of lengths)


36

Document Format

<DOC>
<DOCNO> WSJ880406-0090 </DOCNO>
<HL> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </HL>
<AUTHOR> Janet Guyon (WSJ staff) </AUTHOR>
<DATELINE> New York </DATELINE>
<TEXT>
American Telephone & Telegraph Co. introduced the first of a new generation of phone services with broad implications for computer and communications ...
</TEXT>
</DOC>
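As an illustration of how such files can be processed, here is a minimal regex-based Python sketch that splits a file into documents and extracts the DOCNO and TEXT fields (assuming well-formed tags; not part of the original slides):

```python
import re

def parse_trec_file(raw):
    """Yield (docno, text) pairs from a string holding TREC-format documents."""
    for doc in re.findall(r"<DOC>(.*?)</DOC>", raw, re.S):
        docno = re.search(r"<DOCNO>\s*(.*?)\s*</DOCNO>", doc, re.S).group(1)
        text = re.search(r"<TEXT>(.*?)</TEXT>", doc, re.S).group(1).strip()
        yield docno, text

sample = "<DOC><DOCNO> WSJ880406-0090 </DOCNO><TEXT> AT&T introduced ... </TEXT></DOC>"
print(list(parse_trec_file(sample)))
```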


37

The Topics

• Issue 1

– allow a wide range of query construction methods

– keep the topic (user need) distinct from the query (the actual text submitted to the system)

• Issue 2

– increase the amount of information available about each topic

– include with each topic a clear statement of what criteria make a document relevant


38

Sample Topics used in TREC-1 and TREC-2

<top>
<head> Tipster Topic Description
<num> Number: 066
<dom> Domain: Science and Technology
<title> Topic: Natural Language Processing

<desc> Description: (one-sentence description)
Document will identify a type of natural language processing technology which is being developed or marketed in the U.S.

<narr> Narrative: (complete description of document relevance for assessors)
A relevant document will identify a company or institution developing or marketing a natural language processing technology, identify the technology, and identify one or more features of the company's product.

<con> Concept(s): (a mini-knowledge base about the topic, such as a real searcher might possess)
1. natural language processing
2. translation, language, dictionary, font
3. software applications


39

<fac> Factor(s): (allow easier automatic query building by listing specific items from the narrative that constrain the documents that are relevant)
<nat> Nationality: U.S.
</fac>
<def> Definition(s):
</top>


40

Sample Topics used in TREC-3

<num> Number: 168
<title> Topic: Financing AMTRAK

<desc> Description:
A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).

<narr> Narrative:
A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.


41

Features of topics in TREC-3

• The topics are shorter.

• The topics lack the complex structure of the earlier topics.

• The concept field has been removed.

• The topics were written by the same group of users that did the assessments.

• Summary:

– TREC-1 and 2 topics (1-150): suited to the routing task

– TREC-3 topics (151-200): suited to the ad hoc task


42

The Relevance Judgements

• For each topic, compile a list of relevant documents.

• approaches

– full relevance judgements (impossible): judging over 1M documents for each topic would result in 100M judgements

– random sample of documents (insufficient relevance sample): relevance judgements are done on the random sample only

– TREC approach (pooling method): make relevance judgements on the sample of documents selected by the various participating systems
assumption: the vast majority of relevant documents have been found, and documents that have not been judged can be assumed to be not relevant

• pooling method

– Take the top 100 documents retrieved by each system for a given topic.

– Merge them into a pool for relevance assessment.

– The sample is given to human assessors for relevance judgements.
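A minimal sketch of the pooling step for one topic, assuming each run is a ranked list of document ids (names and data structures are illustrative, not from the original slides):

```python
def build_pool(runs, depth=100):
    """runs: list of ranked document-id lists, one per participating system.
    Returns the deduplicated pool of documents to be judged for one topic."""
    pool = set()
    for ranked_list in runs:
        pool.update(ranked_list[:depth])  # top-`depth` documents from each run
    return pool                           # handed to a human assessor for judging

run_a = ["d3", "d7", "d1", "d9"]
run_b = ["d7", "d2", "d3", "d8"]
print(sorted(build_pool([run_a, run_b], depth=3)))  # ['d1', 'd2', 'd3', 'd7']
```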


43

Analysis of Completeness of Relevance Judgements


44

Overlap of Submitted Results

TREC-1 (TREC-2): top 100 documents for each run (33 runs & 40 runs)
TREC-3: top 100 (200) documents for each run (48 runs)
After pooling, each topic was judged by a single assessor to ensure the best consistency of judgement.

TREC-1 and TREC-2 differ by 7 in the number of runs, yet the proportion of unique documents retrieved differs little (39% vs. 28%), and the proportion of pooled documents judged relevant also differs little (22% vs. 19%). In TREC-3 the pool of documents provided for judgement was twice as large; the unique portion again differs little (21% vs. 20%), as does the proportion judged relevant (15% vs. 10%).


45

Evaluation


46

An Evaluation of Query Processing Strategies Using the Tipster Collection

James P. Callan and W. Bruce Croft


47

INQUERY Information Retrieval System

• Documents are indexed by the word stems and numbers that occur in the text.

• Documents are also indexed automatically by a small number of features that provide a controlled indexing vocabulary.

• When a document refers to a company by name, the document is indexed by the company name and the feature #company.

• INQUERY includes company, country, U.S. city, number and date, and person-name recognizers.


48

INQUERY Information Retrieval System

• feature operators: the #company operator matches the #company feature

• proximity operators: require their arguments to occur either in order, within some distance of each other, or within some window

• belief operators: use the maximum, sum, or weighted sum of a set of beliefs (see the sketch at the end of this list)

• synonym operators

• Boolean operators
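The sketch referred to above for the belief operators: a rough Python illustration of combining per-term beliefs by maximum, sum (average), or weighted sum. This only illustrates the idea; it is not the actual INQUERY implementation, and all names are made up.

```python
def combine_beliefs(beliefs, mode="sum", weights=None):
    """beliefs: list of per-argument belief scores in [0, 1] for one document."""
    if mode == "max":
        return max(beliefs)
    if mode == "sum":
        return sum(beliefs) / len(beliefs)  # average of the beliefs
    if mode == "wsum":
        total = sum(weights)
        return sum(w * b for w, b in zip(weights, beliefs)) / total
    raise ValueError(mode)

print(combine_beliefs([0.2, 0.8], "max"),
      combine_beliefs([0.2, 0.8], "sum"),
      combine_beliefs([0.2, 0.8], "wsum", weights=[1, 3]))
```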


49

Query Transformation in INQUERY

• Discard stop phrases.

• Recognize phrases with a stochastic part-of-speech tagger.

• Look for the word "not" in the query.

• Recognize proper names by assuming that a sequence of capitalized words is a proper name (see the sketch at the end of this list).

• Introduce synonyms for a small set of words that occur in the Factors field of TIPSTER topics.

• Introduce controlled vocabulary terms (feature operators).
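The sketch referred to above for proper-name recognition: a rough Python illustration of the capitalized-sequence heuristic (purely illustrative; the tagger-based phrase recognition and the rest of the pipeline are not reproduced):

```python
import re

def proper_names(text):
    """Treat any run of two or more capitalized words as a candidate proper name.
    Sentence-initial words are a known source of noise in this heuristic."""
    return re.findall(r"\b(?:[A-Z][a-z]+\s+)+[A-Z][a-z]+\b", text)

print(proper_names("The National Railroad Transportation Corporation is known as Amtrak."))
```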


50

Techniques for Creating Ad Hoc Queries

• Simple Queries (description-only approach)

– Use the contents of the Description field of TIPSTER topics only.

– Explore how the system behaves with very short queries.

• Multiple Sources of Information (multiple-field approach)

– Use the contents of the Description, Title, Narrative, Concept(s) and Factor(s) fields.

– Explore how a system might behave with an elaborate user interface or very sophisticated query processing.

• Interactive Query Creation

– Automatic query creation followed by simple manual modifications.

– Simulate simple user interaction with the query processing.


51

Simple Queries

• A query is constructed automatically by employing all the query processing transformations on the Description field.

• The remaining words and operators are enclosed in a weighted sum operator.

• evaluation: 11-point recall/precision averages


52


53

Multiple Sources of Information

• Q-1: Created automatically, using the T, D, N, C and F fields. Everything except the synonym and concept operators was discarded from the Narrative field. (baseline model)

• Q-3: The same as Q-1, except that recognition of phrases and proper names was disabled. (words-only query)
To determine whether phrase and proximity operators were helpful.

• Q-4: The same as Q-1, except that recognition of phrases was applied to the Narrative field.
To determine whether the simple query processing transformations would be effective on the abstract descriptions in the Narrative field.

Page 54: 1 Testing in Information Retrieval Hsin-Hsi Chen.

54

Page 55: 1 Testing in Information Retrieval Hsin-Hsi Chen.

55

Multiple Sources of Information (Continued)

• Q-6: The same as Q-1, except that only the T, C, and F fields were used.
To narrow in on the set of fields that appeared most useful.

• Q-F: The same as Q-1, with 5 additional thesaurus words or phrases added automatically to each query.
An approach to automatically discovering thesaurus terms.

• Q-7: A combination of Q-1 and Q-6.
To determine whether combining the results of two relatively similar queries could yield an improvement.

Page 56: 1 Testing in Information Retrieval Hsin-Hsi Chen.

56

A Comparison of Six Automatic Methods of Constructing Ad Hoc Queries

Phrases improved performance at low recall.

Phrases from the Narrative were not helpful.

Discarding the Description and Narrative fields did not hurt performance appreciably.

It is possible to automatically construct a useful thesaurus for a collection.

Q-1 and Q-6, which are similar, retrieve different sets of documents.

Page 57: 1 Testing in Information Retrieval Hsin-Hsi Chen.

57

Interactive Query Creation

• The system created a query using method Q-1, and then a person was permitted to modify the resulting query.

• Modifications

– add words from the Narrative field

– delete words or phrases from the query

– indicate that certain words or phrases should occur near each other within a document

• Q-M: manual addition of words or phrases from the Narrative, and manual deletion of words or phrases from the query

• Q-O: the same as Q-M, except that the user could also indicate that certain words or phrases must occur within 50 words of each other
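A rough sketch of the "must occur within 50 words of each other" constraint added in Q-O, as an unordered-window check over a tokenized document (names and the tokenization are illustrative; this is not the actual INQUERY operator):

```python
def within_window(tokens, terms, window=50):
    """True if every term occurs within `window` tokens of some occurrence
    of the first term (a rough unordered-window check)."""
    positions = [[i for i, t in enumerate(tokens) if t == term] for term in terms]
    if any(not p for p in positions):
        return False
    # brute force: anchor the window at each occurrence of the first term
    for start in positions[0]:
        if all(any(abs(p - start) < window for p in plist) for plist in positions[1:]):
            return True
    return False

doc = "government subsidies for amtrak were debated in congress last year".split()
print(within_window(doc, ["amtrak", "subsidies"], window=50))  # True
```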

Page 58: 1 Testing in Information Retrieval Hsin-Hsi Chen.

58

Recall levels (10%-60%) are acceptable because users are not likely to examine all documents retrieved.

Paragraph retrieval (within 50 words) significantly improves effectiveness.

Page 59: 1 Testing in Information Retrieval Hsin-Hsi Chen.

59

The effects of thesaurus terms and phrases on queries that were created automatically and modified manually

Cf. Q-O (42.7): thesaurus words and phrases were added after the query was modified, so they were not used in unordered window operators.

Inclusion of unordered window operators

Page 60: 1 Testing in Information Retrieval Hsin-Hsi Chen.

60

Techniques for Creating Routing Queries