© Tefko Saracevic, Rutgers University

1

1. Discussion

2. Information retrieval (IR) model (the traditional models).

3. The review of the readings.

Announcement Feb. 3, 2003


2

Information retrieval (IR): traditional model

Definition of IR

System & user components

Exact match & best match searches

Strengths & weaknesses of the two match models


3

IR: problems addressed - original definition

Calvin Mooers first introduced this term, “information retrieval”, into the literature of documentation in 1950. (Swanson, 1988)

“Information retrieval embraces the intellectual aspects of the description of information and its specification for search, and also whatever systems, techniques, or machines are employed to carry out the operation.”

Calvin Mooers, 1951


4

IR: another definition

• “Information retrieval is often regarded as being synonymous with document retrieval and nowadays, with text retrieval, implying that the task of an IR system is to retrieve documents or texts with information content that is relevant to a user’s information need” (Sparck Jones & Willett, 1997)


5

IR: Objective & problems

Provide the users with effective access to & interaction with information resources.

Problems addressed:

1. How to organize information intellectually?

2. How to specify search & interaction intellectually?

3. What systems & techniques to use for those processes?


6

IR models

• Model depicts, represents what is involved - a choice of features, processes, things for consideration

• Several IR models used over time

– traditional: oldest, most used, shows basic elements involved

– interactive: more realistic, favored now, shows also interactions involved; several models proposed

• Each has strengths, weaknesses

• We start with traditional model to illustrate many points - from general to specific examples


7

Traditional IR model

• The classic information retrieval model (Bates, 1989)

[Diagram: Document → Document representation → Match ← Query ← Information need]


8


• The “standard” IR model (Belkin, 1993)

[Diagram. User side: Information need, Representation, Query. Text side: Texts, Representation, Surrogate. Then: Comparison, Retrieved texts, Judgment, Modification]


9

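[Diagram: traditional IR model in two branches, System and User]

System: Acquisition (documents, objects) → Representation (indexing, ...) → File organization (indexed documents) → Matching (searching)

User: Problem (information need) → Representation (question) → Query (search formulation) → Matching (searching)

Matching (searching) → Retrieved objects, with feedback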


10

A few questions about the traditional models

• 1. What are the similarities and differences among these three models?

• 2. What do you learn about IR from them?

• 3. What are the weaknesses and strengths of the traditional IR model? If possible, critique these models drawing on your own experience.


11

Acquisition (system)

• Content: What is in databases

– In DIALOG first part of blue sheets: File Description, Subject Coverage

• Selection of documents & other objects from various sources

– In blue sheets: Sources

• Mostly text based documents

– Full texts, titles, abstracts ...

– But also: data, statistics, images (e.g. maps, trade marks) ...

Importance: Determines contents of databases. Key to file selection!


12

Representation of documents, objects (system)

• Indexing:

– controlled vocabulary - thesaurus

– free text terms (even in full texts)

• Abstracting; annotating

• Bibliographic description:

– author, title, source, date … metadata

• Classifying, clustering, ranking

– Basic Index, Additional Index. Limits

• Organization in fields & limits

• Manual & automatic techniques

– advantages & disadvantages

Basic to what is available for searching & displaying


13

File organization (system)

• Sequential

– record (document) by record

• Inverted

– term by term; list of records under each term

• Combination: indexes inverted, documents sequential

• When citation retrieved only, need for document files

• Large file approaches

– for efficient retrieval by computers

Enables searching & interplay
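The difference between a sequential and an inverted file can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three toy records, their ids, and the function names `sequential_search` and `build_inverted_file` are made up for this example and do not describe how DIALOG or any particular system stores its files.

```python
from collections import defaultdict

# Hypothetical toy records (record id -> text); invented for illustration.
records = {
    1: "apples and oranges from Florida",
    2: "oranges from California",
    3: "apples from Washington",
}

def sequential_search(records, term):
    """Sequential file: scan record by record and test each one."""
    return [rec_id for rec_id, text in records.items()
            if term in text.lower().split()]

def build_inverted_file(records):
    """Inverted file: term by term, with the list of records under each term."""
    inverted = defaultdict(list)
    for rec_id, text in records.items():
        # set() so a term repeated inside one record is listed only once
        for term in set(text.lower().split()):
            inverted[term].append(rec_id)
    return dict(inverted)

inverted = build_inverted_file(records)

# Both approaches find the same records; the inverted file answers with a
# single lookup instead of a scan over every record.
print(sequential_search(records, "apples"))
print(inverted["apples"])
```

In the combination approach listed above, the record ids found in the inverted index would then be used to fetch the full records from the sequential document file.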


14

Problem (user)

• Related to task situation at hand

• Varies in specificity, clarity

• Produces information need

• Ultimate criterion for effectiveness of retrieval

• Information need for the same problem may change, evolve, shift during the IR process - adjustment in searching

• Often more than one search for same problem over time

Critical for examination in interview


15

Problem (user)

• A question:

• Why may the information need for the same problem change? Have you had this experience? Tell us your story.


16

Representation - question (user & possibly system)

• Non-mediated: end user alone

• Mediated: intermediary + user

– interviews; human-human interaction

• Question analysis: selection, elaboration of terms

• Focus toward search terms & logic; selection of databases

• Subject to feedback changes

• Various tools: thesaurus ...

• Roles of intermediary

Determines contents of searching - dynamic


17

Query - search statement (user & system)

• Translation into systems requirements & limits

– start of human-computer interaction

• Selection of databases

• Search strategy - selection of:

– search terms & logic

– possible fields, delimiters

– controlled & uncontrolled vocabulary

– variations in effectiveness tactics

• Reiterations from feedback

– several feedback types: relevance feedback, magnitude feedback ...

– query expansion & modification

What & how of actual searching


18

Matching - searching (user & system)

• Process of matching, comparing

– search: what documents in the file match the query as stated?

• Various search algorithms:

– exact match - Boolean

• still most prevalent

– best match - ranking by relevance

• increasingly used e.g. on the web

– hybrids incorporating both

• e.g. Target, Rank in DIALOG

• Each has strengths, weaknesses

– no ‘perfect’ method exists

Search interactions


19

Retrieved documents (from system to user)

• Various order of output:

– Last In First Out (LIFO); sorted

– ranked by relevance

– ranked by other characteristics

• Various forms of output

– In DIALOG: Output options

• When citations only: linkage to document delivery

• Base for relevance, utility evaluation by users

• Relevance feedback

What a user sees, gets, judges


20

Exact match - Boolean search

• You retrieve exactly what you ask for in the query:

– all documents that have the term(s) with logical connection(s), and possible other restrictions (e.g. to be in titles) as stated in the query

– exactly: nothing less, nothing more

• Based on matching following rules of Boolean algebra, or algebra of sets

– ‘new algebra’

– presented by circles in Venn diagrams


21

Boolean algebra & Venn diagrams

Four basic operations:

[Venn diagram: two overlapping circles A and B; region 1 = A only, region 2 = overlap of A and B, region 3 = B only]

• A alone. All documents that have A. Shade 1 & 2. E.g. apples

• A AND B. Shade 2. E.g. apples AND oranges

• A OR B. Shade 1, 2, 3. E.g. apples OR oranges

• A NOT B. Shade 1. E.g. apples NOT oranges
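Because Boolean searching follows the algebra of sets, the four operations map directly onto Python's built-in set operators. The sketch below is only an illustration: the document ids in `apples` and `oranges` are invented, with each set standing for the documents that contain the term.

```python
# Hypothetical posting sets: ids of documents containing each term.
apples = {1, 2, 3, 4}
oranges = {3, 4, 5, 6}

a_alone = apples            # A alone: all documents that have A
a_and_b = apples & oranges  # A AND B: intersection (the overlap)
a_or_b = apples | oranges   # A OR B: union (everything in either circle)
a_not_b = apples - oranges  # A NOT B: difference (A without the overlap)

print("apples            :", sorted(a_alone))
print("apples AND oranges:", sorted(a_and_b))
print("apples OR oranges :", sorted(a_or_b))
print("apples NOT oranges:", sorted(a_not_b))
```

Each result is an unordered set, which reflects the exact-match property: a document either satisfies the statement or it does not, with no ranking among the retrieved items.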


22

Venn diagrams … cont.

Complex statements allowed, e.g.

[Venn diagram: three overlapping circles A, B, C with regions numbered 1 to 7]

(A OR B) AND C

Shade 4, 5, 6

(apples OR oranges) AND Florida

(A OR B) NOT C

Shade what?

(apples OR oranges) NOT Florida


23

Venn diagrams cont.

• Complex statements can be made

– as in ordinary algebra e.g. (2+3)x4

• As in ordinary algebra: watch for parentheses:

– 2+(3 x 4) is not the same as (2+3)x4

– (A AND B) OR C is not the same as A AND (B OR C)
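The effect of parentheses can be checked with a small Python sketch, first for ordinary arithmetic and then for sets. The document ids in `a`, `b` and `c` are invented for this example; any sets in which C contains documents outside A would show the same difference.

```python
# Ordinary algebra: grouping changes the result.
print(2 + (3 * 4))   # 14
print((2 + 3) * 4)   # 20

# Hypothetical posting sets for three terms A, B, C.
a = {1, 2}
b = {2, 3}
c = {3, 4}

left = (a & b) | c    # (A AND B) OR C
right = a & (b | c)   # A AND (B OR C)

print(sorted(left))   # [2, 3, 4]
print(sorted(right))  # [2]
print(left == right)  # False
```

The first statement lets in every document that has C, while the second requires A in every retrieved document, so the two outputs can differ greatly in size.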


24

Best match searching

• You retrieve documents ranked by how similar (close) they are to a query (as calculated by the system)

– similarity assumed as relevance

– thus, documents as answers are presented from those that are most likely relevant downwards to less & less likely relevant - can be cut at any desired number - e.g. first 10

• Algorithms (formulas) used to determine similarity

– using statistical &/or linguistic properties

• Web outputs are mostly ranked

• But DIALOG allows ranking as well, with special commands


25

Best match ... cont.

• Best match process:

– compares a set of query terms with the sets of terms in documents

– calculates a similarity between query & each document based on common terms

– sorts the documents in order of similarity

– assumes that the higher ranked documents have a higher probability of being relevant

– allows for cut-off at a chosen number

• BIG issue: What representation & similarity measures are best?

– considerable research & many tests

– many proprietary algorithms
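The best match steps above can be sketched as follows. This is a minimal illustration, not the algorithm of any real system: the similarity measure (Jaccard overlap of term sets) is an assumption chosen for simplicity, and the documents, ids, and the names `terms`, `jaccard` and `best_match` are all hypothetical.

```python
# Hypothetical toy collection (document id -> text).
docs = {
    "d1": "apples and oranges grown in florida",
    "d2": "florida oranges",
    "d3": "apples from washington",
    "d4": "car repair manual",
}

def terms(text):
    """Represent a text as its set of lowercase terms."""
    return set(text.lower().split())

def jaccard(a, b):
    """Assumed similarity: shared terms divided by all distinct terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match(query, docs, cutoff):
    """Score every document against the query, sort, and cut off."""
    q = terms(query)
    scored = [(doc_id, jaccard(q, terms(text))) for doc_id, text in docs.items()]
    # Highest similarity first; ties broken by id so the order is stable.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:cutoff]

ranked = best_match("florida oranges apples", docs, cutoff=2)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

With this measure the short document d2 outranks d1, even though d1 contains all three query terms, because d1 also carries terms that are not in the query. A different similarity measure could reverse that order, which is exactly the "BIG issue" of which representation and measure are best.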


26

Boolean vs. best match

• Boolean

– allows for logic

– provides all that has been matched

BUT

– has no particular order of output

– treats all retrievals equally - from the most to least relevant ones

– often requires examination of large outputs

• Best match

– allows for free terminology

– provides for a ranked output

– provides for cut-off - any size output

BUT

– does not include logic

– ranking method (algorithm) not transparent

• whose relevance?

– where to cut off?


27

Boolean vs. best match

• Questions about best match (just thinking).

• 1. As a user, would you trust the algorithm's judgment without reading the hits?

• 2. Is a document judged only 10% relevant to your query definitely less useful for resolving your information problem than one judged 40% relevant?


28

Strengths of traditional IR model

• Lists major components in both system & user branches

• Suggests:

– What to explain to users about system, if needed

– What to ask of users for more effective searching (problem ...)

• Selection of component(s) for concentration

– mostly ever better representation

• Provides a framework for evaluation of (static) aspects


29

Weaknesses

• Does not address nor account for interaction & judgment of results by users

– identifies interaction with search only

– interaction is a much richer process

• Many types of & variables in interaction not reflected

• Feedback has many types & functions - also not shown

• Evaluation thus one-sided

IR is a highly interactive process - thus additional model(s) needed
