2002.10.22 - SLIDE 1 IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002 http://www.sims.berkeley.edu/academics/courses/is202/f02/ SIMS 202: Information Organization and Retrieval Lecture 15: Intro to Information Retrieval

Dec 21, 2015


2002.10.22 - SLIDE 1 IS 202 – FALL 2002

Prof. Ray Larson & Prof. Marc Davis

UC Berkeley SIMS

Tuesday and Thursday 10:30 am - 12:00 pm

Fall 2002
http://www.sims.berkeley.edu/academics/courses/is202/f02/

SIMS 202: Information Organization and Retrieval

Lecture 15: Intro to Information Retrieval

2002.10.22 - SLIDE 2 IS 202 – FALL 2002

Lecture Overview

• Review

– Database Design

– Normalization

– Web-enabled Databases

• Introduction to Information Retrieval

• The Information Seeking Process

• Information Retrieval History and Developments

Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

2002.10.22 - SLIDE 3 IS 202 – FALL 2002

Models (1)

[Diagram. Labels: Application 1, Application 2, Application 3, Application 4; Conceptual requirements (one per application); External Model (one per application); Conceptual Model; Logical Model; Internal Model]

2002.10.22 - SLIDE 4 IS 202 – FALL 2002

Database System Life Cycle

1. Design
2. Physical Creation
3. Conversion
4. Integration
5. Operations
6. Growth, Change, & Maintenance

2002.10.22 - SLIDE 5 IS 202 – FALL 2002

Normal Forms

• First Normal Form (1NF)

• Second Normal Form (2NF)

• Third Normal Form (3NF)

• Boyce-Codd Normal Form (BCNF)

• Fourth Normal Form (4NF)

• Fifth Normal Form (5NF)

2002.10.22 - SLIDE 6 IS 202 – FALL 2002

Normalization

• Functional dependency of nonkey attributes on the primary key; atomic values only
• Full functional dependency of nonkey attributes on the primary key
• No transitive dependency between nonkey attributes
• Boyce-Codd and higher: all determinants are candidate keys; single multivalued dependency

2002.10.22 - SLIDE 7 IS 202 – FALL 2002

Dynamic Web Applications 2

[Diagram. Labels: Clients; Internet; Server; Web Server; Files; CGI; DBMS; database (three instances)]

2002.10.22 - SLIDE 8 IS 202 – FALL 2002

Server Interfaces

[Diagram. Labels: Web Server; Web Application Server; Web DB App; Database; HTML, JavaScript, DHTML; CGI, Web Server APIs, ColdFusion, PHP, Perl, Java, ASP; SQL; ODBC, JDBC, Native DB Interfaces]

Adapted from John P. Ashenfelter, Choosing a Database for Your Web Site

2002.10.22 - SLIDE 9 IS 202 – FALL 2002

Photo Browser

• The current photo browser uses a combination of
  – JavaScript for expandable hierarchies
  – Database in MS Access
  – ColdFusion to search the database when one of the facets is selected

• The database design for the photo database currently looks like…

2002.10.22 - SLIDE 10 IS 202 – FALL 2002

Photo Browser ER

2002.10.22 - SLIDE 11 IS 202 – FALL 2002

Photo Database

• Let's look at the photo database in the Access interface
  – Multi-facet queries
  – Queries for multiple descriptors in the same facet (harder)
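
A minimal sketch of why the same-facet case is harder, assuming a hypothetical table `photo_descriptor` with one row per (photo, facet, descriptor) assignment. The actual Access schema is not shown in this transcript, so the table, column names, and data below are illustrative only. Because no single row can hold two descriptors at once, a plain `descriptor = 'dog' AND descriptor = 'ball'` matches nothing; one workaround is to group rows by photo and count the matches.

```python
import sqlite3

# Hypothetical schema: one row per (photo, facet, descriptor) assignment.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE photo_descriptor (photo_id INTEGER, facet TEXT, descriptor TEXT)"
)
conn.executemany(
    "INSERT INTO photo_descriptor VALUES (?, ?, ?)",
    [
        (1, "place", "beach"), (1, "object", "dog"),
        (2, "place", "beach"), (2, "object", "dog"), (2, "object", "ball"),
        (3, "place", "park"), (3, "object", "ball"),
    ],
)


def photos_with_all(pairs):
    """Return photo IDs that carry every distinct (facet, descriptor) pair in `pairs`."""
    where = " OR ".join("(facet = ? AND descriptor = ?)" for _ in pairs)
    params = [value for pair in pairs for value in pair]
    sql = (
        "SELECT photo_id FROM photo_descriptor "
        f"WHERE {where} "
        "GROUP BY photo_id "
        # A photo qualifies only if it matched all requested pairs.
        "HAVING COUNT(DISTINCT facet || ':' || descriptor) = ? "
        "ORDER BY photo_id"
    )
    return [row[0] for row in conn.execute(sql, params + [len(pairs)])]


# One descriptor from each of two facets.
multi_facet = photos_with_all([("place", "beach"), ("object", "dog")])
# Two descriptors from the same facet.
same_facet = photos_with_all([("object", "dog"), ("object", "ball")])
```

The same grouping query handles both cases here; in a design with one column per facet, the multi-facet query would be a simple `AND` while the same-facet query would still need something like this.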

2002.10.22 - SLIDE 12 IS 202 – FALL 2002

Lecture Overview

• Review

– Database Design

– Normalization

– Web-enabled Databases

• Introduction to Information Retrieval

• The Information Seeking Process

• Information Retrieval History and Developments

Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

2002.10.22 - SLIDE 13 IS 202 – FALL 2002

Review: Information Overload

• “The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.” (Varian & Lyman)

• “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)
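
The per-person figure in the first quote can be checked with a line of arithmetic. This is only a sanity check, assuming a world population of roughly six billion around the time of the estimate and 1,000 megabytes per gigabyte; neither assumption is stated on the slide.

```python
total_gigabytes = 1.5e9       # yearly production quoted on the slide
world_population = 6.0e9      # assumption: roughly six billion people
megabytes_per_gigabyte = 1000  # assumption: decimal units

megabytes_per_person = total_gigabytes * megabytes_per_gigabyte / world_population
print(f"{megabytes_per_person:.0f} MB per person")
```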

2002.10.22 - SLIDE 14 IS 202 – FALL 2002

Course Outline

• Organization
  – Overview
  – Categorization
  – Metadata and markup
  – Metadata for multimedia
    • Photo Project
  – Controlled vocabularies, classification, thesauri
  – Information design
    • Thesaurus design
    • Database design

• Retrieval
  – The search process
  – Content analysis
    • Tokenization, Zipf’s law, lexical associations
  – IR implementation
  – Term weighting and document ranking
    • Vector space model
  – User interfaces
    • Overviews, query specification, providing context

2002.10.22 - SLIDE 15 IS 202 – FALL 2002

Key Issues In This Course

• How to describe information resources or information-bearing objects in ways so that they may be effectively used by those who need to use them
  – Organizing

• How to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs
  – Retrieving

2002.10.22 - SLIDE 16 IS 202 – FALL 2002

Key Issues

[Diagram. Labels: Creation; Utilization; Searching; Active; Semi-Active; Inactive; Retention/Mining; Disposition; Discard; Using; Creating; Authoring, Modifying; Organizing, Indexing; Storing, Retrieval; Distribution, Networking; Accessing, Filtering]

2002.10.22 - SLIDE 17 IS 202 – FALL 2002

IR Textbook Topics

2002.10.22 - SLIDE 18 IS 202 – FALL 2002

More Detailed View

2002.10.22 - SLIDE 19 IS 202 – FALL 2002

What We’ll Cover

A Lot

A Little

2002.10.22 - SLIDE 20 IS 202 – FALL 2002

IR Topics for 202

• The Search Process
• Information Retrieval Models
• Content Analysis/Zipf Distributions
• Evaluation of IR Systems
  – Precision/Recall
  – Relevance
  – User Studies
• System and Implementation Issues
• Web-Specific Issues
• User Interface Issues
• Special Kinds of Search

2002.10.22 - SLIDE 21 IS 202 – FALL 2002

Lecture Overview

• Review

– Database Design

– Normalization

– Web-enabled Databases

• Introduction to Information Retrieval

• The Information Seeking Process

• Information Retrieval History and Developments

Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

2002.10.22 - SLIDE 22 IS 202 – FALL 2002

The Standard Retrieval Interaction Model

2002.10.22 - SLIDE 23 IS 202 – FALL 2002

Standard Model of IR

• Assumptions:
  – Maximizing precision and recall simultaneously
  – The information need remains static
  – The value is in the resulting document set
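
For reference, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The sketch below uses made-up document numbers (an assumption for illustration) to show why maximizing both at once is hard: a narrow result set tends to score high on precision and low on recall, and a broad one the reverse.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical relevance judgments: documents 1-4 are relevant.
relevant_docs = {1, 2, 3, 4}

narrow = precision_recall([1, 2], relevant_docs)                    # few results
broad = precision_recall([1, 2, 3, 4, 5, 6, 7, 8], relevant_docs)   # many results
```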

2002.10.22 - SLIDE 24 IS 202 – FALL 2002

Problems with Standard Model

• Users learn during the search process:
  – Scanning titles of retrieved documents
  – Reading retrieved documents
  – Viewing lists of related topics/thesaurus terms
  – Navigating hyperlinks

• Some users don’t like long disorganized lists of documents

2002.10.22 - SLIDE 25 IS 202 – FALL 2002

IR is an Iterative Process

Repositories

Workspace

Goals

2002.10.22 - SLIDE 26 IS 202 – FALL 2002

IR is a Dialog

• The exchange doesn’t end with first answer
• User can recognize elements of a useful answer
• Questions and understanding change as the process continues

2002.10.22 - SLIDE 27 IS 202 – FALL 2002

Bates’ “Berry-Picking” Model

• Standard IR model
  – Assumes the information need remains the same throughout the search process

• Berry-picking model
  – Interesting information is scattered like berries among bushes
  – The query is continually shifting

2002.10.22 - SLIDE 28 IS 202 – FALL 2002

Berry-Picking Model

[Diagram: a search path passing through successive queries Q0, Q1, Q2, Q3, Q4, Q5]

A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)

2002.10.22 - SLIDE 29 IS 202 – FALL 2002

Berry-Picking Model (cont.)

• The query is continually shifting

• New information may yield new ideas and new directions

• The information need
  – Is not satisfied by a single, final retrieved set
  – Is satisfied by a series of selections and bits of information found along the way

2002.10.22 - SLIDE 30 IS 202 – FALL 2002

Information Seeking Behavior

• Two parts of a process:
  – Search and retrieval
  – Analysis and synthesis of search results

• This is a fuzzy area
  – We will look at several different working theories

2002.10.22 - SLIDE 31 IS 202 – FALL 2002

Search Tactics and Strategies

• Search Tactics
  – Bates 1979

• Search Strategies
  – Bates 1989
  – O’Day and Jeffries 1993

2002.10.22 - SLIDE 32 IS 202 – FALL 2002

Tactics vs. Strategies

• Tactic: short term goals and maneuvers
  – Operators, actions

• Strategy: overall planning
  – Link a sequence of operators together to achieve some end

2002.10.22 - SLIDE 33 IS 202 – FALL 2002

Information Search Tactics

• Monitoring tactics
  – Keep search on track

• Source-level tactics
  – Navigate to and within sources

• Term and Search Formulation tactics
  – Designing search formulation
  – Selection and revision of specific terms within search formulation

2002.10.22 - SLIDE 34 IS 202 – FALL 2002

Monitoring Tactics (Strategy-Level)

• Check
  – Compare original goal with current state
• Weigh
  – Make a cost/benefit analysis of current or anticipated actions
• Pattern
  – Recognize common strategies
• Correct Errors
• Record
  – Keep track of (incomplete) paths

2002.10.22 - SLIDE 35 IS 202 – FALL 2002

Source-Level Tactics

• “Bibble”:
  – Look for a pre-defined result set
    • E.g., a good link page on web
• Survey:
  – Look ahead, review available options
    • E.g., don’t simply use the first term or first source that comes to mind
• Cut:
  – Eliminate large proportion of search domain
    • E.g., search on rarest term first
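
The Cut example ("search on rarest term first") has a direct analogue in how a system might intersect document lists. A minimal sketch, assuming a hypothetical in-memory index that maps each term to a set of document numbers; the function name and data are illustrative, not from the lecture.

```python
def intersect_rarest_first(index, terms):
    """Intersect document sets, starting with the term that has the fewest documents."""
    postings = sorted((index.get(term, set()) for term in terms), key=len)
    if not postings:
        return set()
    result = set(postings[0])      # the rarest term cuts the search domain first
    for docs in postings[1:]:
        result &= docs
        if not result:             # nothing left to match; stop early
            break
    return result


# Hypothetical toy index: term -> set of document numbers.
toy_index = {
    "information": {1, 2, 3, 4, 5, 6, 7, 8},
    "retrieval": {2, 3, 5, 8},
    "uniterm": {3, 8},
}
matches = intersect_rarest_first(toy_index, ["information", "retrieval", "uniterm"])
```

Starting from the smallest set keeps every intermediate result no larger than the rarest term's list.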

2002.10.22 - SLIDE 36 IS 202 – FALL 2002

Source-Level Tactics (cont.)

• Stretch
  – Use source in unintended way
    • E.g., use patents to find addresses
• Scaffold
  – Take an indirect route to goal
    • E.g., when looking for references to obscure poet, look up contemporaries
• Cleave
  – Binary search in an ordered file
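
Cleave is the same halving idea a program uses to search a sorted list. A small sketch, assuming an alphabetically ordered list of names chosen only for illustration; it returns the position found and the number of entries inspected.

```python
def cleave(sorted_entries, target):
    """Binary search: return (index or -1, number of entries inspected)."""
    lo, hi = 0, len(sorted_entries)
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_entries[mid] < target:
            lo = mid + 1           # target lies in the upper half
        else:
            hi = mid               # target lies in the lower half (or at mid)
    found = lo < len(sorted_entries) and sorted_entries[lo] == target
    return (lo if found else -1), probes


# Illustrative ordered file of seven entries.
entries = ["auden", "bates", "bush", "larson", "otlet", "salton", "wells"]
position, probes = cleave(entries, "otlet")
```

Each probe discards about half of the remaining entries, so the number of probes grows with the logarithm of the file size.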

2002.10.22 - SLIDE 37 IS 202 – FALL 2002

Search Formulation Tactics

• Specify
  – Use as specific terms as possible
• Exhaust
  – Use all possible elements in a query
• Reduce
  – Subtract elements from a query
• Parallel
  – Use synonyms and parallel terms
• Pinpoint
  – Reducing parallel terms and refocusing query
• Block
  – To reject or block some terms, even at the cost of losing some relevant documents
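
In a Boolean system, several of these tactics can be read as set operations on the documents indexed under each term. The sketch below is one such reading, not a definition from Bates: Exhaust as ANDing in every query element, Reduce as dropping an element, Parallel as ORing in a synonym, and Block as NOT. The index and document numbers are hypothetical.

```python
# Hypothetical index: term -> set of document numbers.
index = {
    "lunar": {1, 2, 3, 5},
    "moon": {2, 4, 6},
    "excursion": {2, 3, 4},
    "module": {3, 4, 5, 6},
}

# Exhaust: include every query element (AND narrows the result).
exhaust = index["lunar"] & index["excursion"] & index["module"]

# Reduce: subtract an element from the query (result broadens).
reduced = index["lunar"] & index["excursion"]

# Parallel: add a synonym with OR before combining with the other element.
parallel = (index["lunar"] | index["moon"]) & index["excursion"]

# Block: reject a term with NOT, accepting that some relevant documents may be lost.
blocked = parallel - index["module"]
```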

2002.10.22 - SLIDE 38 IS 202 – FALL 2002

Term Tactics

• Move around the thesaurus
  – Superordinate, subordinate, coordinate
  – Neighbor (semantic or alphabetic)
  – Trace – pull out terms from information already seen as part of search (titles, etc.)
  – Morphological and other spelling variants
  – Antonyms (contrary)

2002.10.22 - SLIDE 39 IS 202 – FALL 2002

Additional Considerations (Bates 79)

• Add a Sort tactic!
• More detail is needed about short-term cost/benefit decision rule strategies
• When to stop?
  – How to judge when enough information has been gathered?
  – How to decide when to give up an unsuccessful search?
  – When to stop searching in one source and move to another?

2002.10.22 - SLIDE 40 IS 202 – FALL 2002

Implications

• Interfaces should make it easy to store intermediate results

• Interfaces should make it easy to follow trails with unanticipated results

• Makes evaluation more difficult

2002.10.22 - SLIDE 41 IS 202 – FALL 2002

More Later…

• Later in the course:
  – More on Search Process and Strategies
  – User interfaces to improve IR process
  – Incorporation of Content Analysis into better systems

2002.10.22 - SLIDE 42 IS 202 – FALL 2002

Restricted Form of the IR Problem

• The system has available only pre-existing, “canned” text passages

• Its response is limited to selecting from these passages and presenting them to the user

• It must select, say, 10 or 20 passages out of millions or billions!
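
The restricted problem amounts to scoring every stored passage against the query and keeping only the best few. A minimal sketch, assuming a toy scoring rule (count of shared lowercase words) that stands in for whatever ranking function a real system would use; the passages and function name are illustrative.

```python
import heapq


def top_k_passages(passages, query, k=10):
    """Return the k stored passages sharing the most words with the query."""
    query_terms = set(query.lower().split())

    def score(passage):
        # Toy score: number of distinct query words that appear in the passage.
        return len(query_terms & set(passage.lower().split()))

    # nlargest avoids sorting the whole collection when k is small.
    return heapq.nlargest(k, passages, key=score)


# Hypothetical "canned" passages.
passages = [
    "the lunar excursion module landed",
    "lunar rocks",
    "a recipe for bread",
    "excursion fares",
]
best = top_k_passages(passages, "lunar excursion module", k=1)
```

The system never writes new text here; it only selects from what is already stored, which is the restriction the slide describes.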

2002.10.22 - SLIDE 43 IS 202 – FALL 2002

Information Retrieval

• Revised Task Statement:

Build a system that retrieves documents that users are likely to find relevant to their queries

• This set of assumptions underlies the field of Information Retrieval

2002.10.22 - SLIDE 44 IS 202 – FALL 2002

Lecture Overview

• Review

– Database Design

– Normalization

– Web-enabled Databases

• Introduction to Information Retrieval

• The Information Seeking Process

• Information Retrieval History and Developments

Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

2002.10.22 - SLIDE 45 IS 202 – FALL 2002

Visions of IR Systems

• Paul Otlet, 1930’s

• Emanuel Goldberg, 1920’s - 1940’s

• H.G. Wells, “World Brain: The idea of a permanent World Encyclopedia.” (Introduction to the Encyclopedie Francaise), 1937.

• Vannevar Bush, “As we may think.” Atlantic Monthly, 1945.

2002.10.22 - SLIDE 46 IS 202 – FALL 2002

Card-Based IR Systems

• Uniterm (Casey, Perry, Berry, Kent: 1958)
  – Developed and used from mid 1940’s

EXCURSION 43821
90 241 52 63 34 25 66 17 58 49
130 281 92 83 44 75 86 57 88 119
640 122 93 104 115 146 97 158 139
870 342 157 178 199 207 248 269 298

LUNAR 12457
110 181 12 73 44 15 46 7 28 39
430 241 42 113 74 85 76 17 78 79
820 761 602 233 134 95 136 37 118 109
901 982 194 165 127 198 179 377 288 407
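
Each Uniterm card lists the numbers of the documents indexed under one term, so a two-term search means finding the numbers that appear on both cards. A small sketch using a handful of document numbers that appear on the cards above, treated here as illustrative data rather than a faithful copy of either card.

```python
# Document numbers posted on each term card (illustrative subset).
excursion_card = {17, 25, 34, 44, 52, 58, 63, 66, 90, 241}
lunar_card = {7, 12, 15, 17, 28, 39, 44, 46, 73, 241}

# Searching for LUNAR AND EXCURSION: numbers present on both cards.
common_documents = sorted(excursion_card & lunar_card)
print(common_documents)
```

A clerk did this comparison by eye; the set intersection is the same operation a Boolean retrieval system performs on its inverted index.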

2002.10.22 - SLIDE 47 IS 202 – FALL 2002

Card Systems

• Batten Optical Coincidence Cards (“Peek-a-Boo Cards”), 1948

Lunar

Excursion
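
On an optical coincidence card, each document has a fixed position and a hole is punched there if the document is indexed under the card's term; stacking cards and holding them to the light shows the documents common to all terms. The sketch below models a card as an integer bitmask, which is an illustrative modern analogy and not a description of the 1948 hardware. Document numbers are made up.

```python
def punch(doc_numbers):
    """Build a card as a bitmask: bit d is set if document d is punched."""
    card = 0
    for d in doc_numbers:
        card |= 1 << d
    return card


def holes(card):
    """List the document positions where light passes through."""
    return [d for d in range(card.bit_length()) if (card >> d) & 1]


lunar = punch([7, 17, 44, 241])
excursion = punch([17, 25, 44, 90, 241])

# Stacking the cards: light passes only where every card has a hole (bitwise AND).
stacked = lunar & excursion
visible = holes(stacked)
```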

2002.10.22 - SLIDE 48 IS 202 – FALL 2002

Card Systems

• Zatocode (edge-notched cards) Mooers, 1951

Document 1
Title: lksd ksdj sjd sjsjfkl
Author: Smith, J.
Abstract: lksf uejm jshy ksd jh uyw hhy jha jsyhe

Document 200
Title: Xksd Lunar sjd sjsjfkl
Author: Jones, R.
Abstract: Lunar uejm jshy ksd jh uyw hhy jha jsyhe

Document 34
Title: lksd ksdj sjd Lunar
Author: Smith, J.
Abstract: lksf uejm jshy ksd jh uyw hhy jha jsyhe

2002.10.22 - SLIDE 49 IS 202 – FALL 2002

Computer-Based Systems

• Bagley’s 1951 MS thesis from MIT suggested that searching 50 million item records, each containing 30 index terms, would take approximately 41,700 hours
  – Due to the need to move and shift the text in core memory while carrying out the comparisons

• 1957 – Desk Set with Katharine Hepburn and Spencer Tracy – EMERAC
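
To put Bagley's estimate in perspective, the figures can be turned into a per-comparison cost. This is back-of-the-envelope arithmetic only, assuming one comparison per index term per record (as for a single-term query); the slide does not say how the thesis counted comparisons.

```python
records = 50_000_000
terms_per_record = 30
estimated_hours = 41_700

# Assumption: one comparison for each index term on each record.
comparisons = records * terms_per_record
seconds_per_comparison = estimated_hours * 3600 / comparisons
years_of_continuous_running = estimated_hours / (24 * 365)

print(f"{comparisons:,} comparisons")
print(f"{seconds_per_comparison:.3f} seconds per comparison")
print(f"{years_of_continuous_running:.2f} years of continuous running")
```

Under that assumption the estimate works out to about a tenth of a second per comparison and more than four years of machine time for one pass over the file.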

2002.10.22 - SLIDE 50 IS 202 – FALL 2002

Historical Milestones in IR Research

• 1958 Statistical Language Properties (Luhn)
• 1960 Probabilistic Indexing (Maron & Kuhns)
• 1961 Term association and clustering (Doyle)
• 1965 Vector Space Model (Salton)
• 1968 Query expansion (Rocchio, Salton)
• 1972 Statistical Weighting (Sparck-Jones)
• 1975 2-Poisson Model (Harter, Bookstein, Swanson)
• 1976 Relevance Weighting (Robertson, Sparck-Jones)
• 1980 Fuzzy sets (Bookstein)
• 1981 Probability without training (Croft)

2002.10.22 - SLIDE 51 IS 202 – FALL 2002

Historical Milestones in IR Research (cont.)

• 1983 Linear Regression (Fox)
• 1983 Probabilistic Dependence (Salton, Yu)
• 1985 Generalized Vector Space Model (Wong, Raghavan)
• 1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et al.)
• 1990 Latent Semantic Indexing (Dumais, Deerwester)
• 1991 Polynomial & Logistic Regression (Cooper, Gey, Fuhr)
• 1992 TREC (Harman)
• 1992 Inference networks (Turtle, Croft)
• 1994 Neural networks (Kwok)

2002.10.22 - SLIDE 52 IS 202 – FALL 2002

Development of Bibliographic Databases

• Chemical Abstracts Service first produced “Chemical Titles” by computer in 1961

• Index Medicus from the National Library of Medicine soon followed with the creation of the MEDLARS database in 1961

• By 1970, most secondary publications (indexes, abstract journals, etc.) were produced by machine

2002.10.22 - SLIDE 53 IS 202 – FALL 2002

Boolean IR Systems

• Synthex at SDC, 1960
• Project MAC at MIT, 1963 (interactive)
• BOLD at SDC, 1964 (Harold Borko)
• 1964 New York World’s Fair – Becker and Hayes produced system to answer questions (based on airline reservation equipment)
• SDC began production for a commercial service in 1967 – ORBIT
• NASA-RECON (1966) becomes DIALOG
• 1972 Data Central/Mead introduced LEXIS – Full text
• Online catalogs – late 1970’s and 1980’s

2002.10.22 - SLIDE 54 IS 202 – FALL 2002

Experimental IR systems

• Probabilistic indexing – Maron and Kuhns, 1960

• SMART – Gerard Salton at Cornell – Vector space model, 1970’s

• SIRE at Syracuse

• I3R – Croft

• TREC – 1992

2002.10.22 - SLIDE 55 IS 202 – FALL 2002

The Internet and the WWW

• Gopher, Archie, Veronica, WAIS

• Tim Berners-Lee, 1991 creates WWW at CERN – originally hypertext only

• Web-crawler

• Lycos

• Alta Vista

• Inktomi

• Google

2002.10.22 - SLIDE 56 IS 202 – FALL 2002

Information Retrieval – Historical View

Research

• Boolean model, statistics of language (1950’s)
• Vector space model, probabilistic indexing, relevance feedback (1960’s)
• Probabilistic querying (1970’s)
• Fuzzy set/logic, evidential reasoning (1980’s)
• Regression, neural nets, inference networks, latent semantic indexing, TREC (1990’s)

Industry

• DIALOG, Lexis-Nexis
• STAIRS (Boolean based)
• Information industry (O($B))
• Verity TOPIC (fuzzy logic)
• Internet search engines (O($100B?)) (vector space, probabilistic)

2002.10.22 - SLIDE 57 IS 202 – FALL 2002

Research Sources in Information Retrieval

• ACM Transactions on Information Systems
• Am. Society for Information Science Journal
• Document Analysis and IR Proceedings (Las Vegas)
• Information Processing and Management (Pergamon)
• Journal of Documentation
• SIGIR Conference Proceedings
• TREC Conference Proceedings

2002.10.22 - SLIDE 58 IS 202 – FALL 2002

Research Systems Software

• INQUERY (Croft/U. Mass)
• OKAPI (Robertson)
• PRISE (Harman/NIST)
  – http://potomac.ncsl.nist.gov/prise
• SMART (Buckley/Cornell)
• CHESHIRE (Larson/Berkeley)
  – http://cheshire.lib.berkeley.edu

2002.10.22 - SLIDE 59 IS 202 – FALL 2002

Structure of an IR System

[Diagram: Information Storage and Retrieval System]
• Search Line: Interest profiles & Queries; Formulating query in terms of descriptors; Storage of profiles; Store1: Profiles/Search requests
• Storage Line: Documents & data; Indexing (Descriptive and Subject); Storage of Documents; Store2: Document representations
• Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
• Comparison/Matching; Potentially Relevant Documents

Adapted from Soergel, p. 19

2002.10.22 - SLIDE 60 IS 202 – FALL 2002

Structure of an IR System

[Diagram: Information Storage and Retrieval System]
• Search Line: Interest profiles & Queries; Formulating query in terms of descriptors; Storage of profiles; Store1: Profiles/Search requests
• Storage Line: Documents & data; Indexing (Descriptive and Subject); Storage of Documents; Store2: Document representations
• Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
• Comparison/Matching; Potentially Relevant Documents

Adapted from Soergel, p. 19

2002.10.22 - SLIDE 61 IS 202 – FALL 2002

Structure of an IR System

[Diagram: Information Storage and Retrieval System]
• Search Line: Interest profiles & Queries; Formulating query in terms of descriptors; Storage of profiles; Store1: Profiles/Search requests
• Storage Line: Documents & data; Indexing (Descriptive and Subject); Storage of Documents; Store2: Document representations
• Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
• Comparison/Matching; Potentially Relevant Documents

Adapted from Soergel, p. 19

2002.10.22 - SLIDE 62 IS 202 – FALL 2002

Structure of an IR System

[Diagram: Information Storage and Retrieval System]
• Search Line: Interest profiles & Queries; Formulating query in terms of descriptors; Storage of profiles; Store1: Profiles/Search requests
• Storage Line: Documents & data; Indexing (Descriptive and Subject); Storage of Documents; Store2: Document representations
• Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
• Comparison/Matching; Potentially Relevant Documents

Adapted from Soergel, p. 19

2002.10.22 - SLIDE 63 IS 202 – FALL 2002

Relevance (Introduction)

• In what ways can a document be relevant to a query?
  – Answer precise question precisely
    • Who is buried in Grant’s tomb? Grant.
  – Partially answer question
    • Where is Danville? Near Walnut Creek.
  – Suggest a source for more information.
    • What is lymphedema? Look in this Medical Dictionary.
  – Give background information
  – Remind the user of other knowledge
  – Others...

2002.10.22 - SLIDE 64 IS 202 – FALL 2002

Next Time

• Boolean Search Logic

• Preparing information for search
  – Lexical analysis

• Using Lexis-Nexis (Assignment 8)