2002.10.22 - SLIDE 1 IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002 http://www.sims.berkeley.edu/academics/courses/is202/f02/ SIMS 202: Information Organization and Retrieval Lecture 15: Intro to Information Retrieval
2002.10.22 - SLIDE 1 IS 202 – FALL 2002
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall 2002
http://www.sims.berkeley.edu/academics/courses/is202/f02/
SIMS 202: Information Organization and Retrieval
Lecture 15: Intro to Information Retrieval
2002.10.22 - SLIDE 2 IS 202 – FALL 2002
Lecture Overview
• Review
– Database Design
– Normalization
– Web-enabled Databases
• Introduction to Information Retrieval
• The Information Seeking Process
• Information Retrieval History and Developments
Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey
2002.10.22 - SLIDE 3 IS 202 – FALL 2002
Models (1)
[Figure: diagram relating Applications 1-4, their conceptual requirements, and their External Models to the Conceptual Model, the Logical Model, and the Internal Model]
2002.10.22 - SLIDE 4 IS 202 – FALL 2002
Database System Life Cycle
1. Design
2. Physical Creation
3. Conversion
4. Integration
5. Operations
6. Growth, Change, & Maintenance
2002.10.22 - SLIDE 5 IS 202 – FALL 2002
Normal Forms
• First Normal Form (1NF)
• Second Normal Form (2NF)
• Third Normal Form (3NF)
• Boyce-Codd Normal Form (BCNF)
• Fourth Normal Form (4NF)
• Fifth Normal Form (5NF)
2002.10.22 - SLIDE 6 IS 202 – FALL 2002
Normalization
• 1NF: functional dependency of nonkey attributes on the primary key; atomic values only
• 2NF: full functional dependency of nonkey attributes on the primary key
• 3NF: no transitive dependency between nonkey attributes
• Boyce-Codd and higher: all determinants are candidate keys; single multivalued dependency
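As a rough illustration of the functional-dependency conditions on this slide, the sketch below tests whether a dependency X → Y holds in a small table. The `holds` helper and the photo rows are hypothetical and are not taken from the course materials.

```python
def holds(rows, determinant, dependent):
    """Return True if `determinant` -> `dependent` holds in `rows`.

    rows: list of dicts; determinant and dependent: tuples of column names.
    """
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)
        value = tuple(row[c] for c in dependent)
        # The dependency fails when one determinant value maps to two results.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical rows with primary key photo_id
rows = [
    {"photo_id": 1, "photographer": "Ann", "photographer_city": "Oakland"},
    {"photo_id": 2, "photographer": "Ann", "photographer_city": "Oakland"},
    {"photo_id": 3, "photographer": "Bob", "photographer_city": "Berkeley"},
]

print(holds(rows, ("photo_id",), ("photographer",)))           # True
print(holds(rows, ("photographer",), ("photographer_city",)))  # True
print(holds(rows, ("photographer_city",), ("photo_id",)))      # False
```

In this toy table the city depends on the photographer, which in turn depends on the key. That is the kind of transitive dependency between nonkey attributes that the third condition rules out. Note that a check like this can only show that a dependency is consistent with the sample rows, not that it holds by design.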
2002.10.22 - SLIDE 7 IS 202 – FALL 2002
Dynamic Web Applications 2
[Figure: architecture diagram with the labels Clients, Internet, Server, Web Server, CGI, DBMS, Files, and three databases]
2002.10.22 - SLIDE 8 IS 202 – FALL 2002
Server Interfaces
Adapted from John P. Ashenfelter, Choosing a Database for Your Web Site
[Figure: Web Server, Web Application Server, and Database, with the interface technologies labeled as follows]
• Web DB App
• HTML, JavaScript, DHTML
• CGI, Web Server APIs
• ColdFusion, PHP, Perl, Java, ASP
• SQL
• ODBC, JDBC, native DB interfaces
2002.10.22 - SLIDE 9 IS 202 – FALL 2002
Photo Browser
• The current photo browser uses a combination of
– JavaScript for expandable hierarchies
– Database in MS Access
– ColdFusion to search the database when one of the facets is selected
• The database design for the photo database currently looks like…
2002.10.22 - SLIDE 10 IS 202 – FALL 2002
Photo Browser ER
2002.10.22 - SLIDE 11 IS 202 – FALL 2002
Photo Database
• Let’s look at the photo database in the Access interface
– Multi-facet queries
– Queries for multiple descriptors in the same facet (harder)
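One way to see why multiple descriptors in the same facet are harder is the sketch below. It uses Python's `sqlite3` module rather than Access, and the `photo`, `location`, and `subject` tables are a hypothetical stand-in, not the actual photo database schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE photo (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE location (photo_id INTEGER, name TEXT);
    CREATE TABLE subject (photo_id INTEGER, name TEXT);
    INSERT INTO photo VALUES (1, 'Park'), (2, 'Beach'), (3, 'Campus');
    INSERT INTO location VALUES (1, 'Berkeley'), (2, 'Santa Cruz'), (3, 'Berkeley');
    INSERT INTO subject VALUES (1, 'Dog'), (1, 'Ball'), (2, 'Dog'), (3, 'Building');
""")


def ids(sql, params=()):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql, params)]


# Multi-facet: one descriptor from each of two facets, one join per facet.
multi_facet = ids("""
    SELECT p.id FROM photo p
    JOIN location l ON l.photo_id = p.id
    JOIN subject s ON s.photo_id = p.id
    WHERE l.name = ? AND s.name = ?
    ORDER BY p.id
""", ("Berkeley", "Dog"))

# Same facet, naive attempt: no single subject row is both 'Dog' and 'Ball'.
naive = ids("""
    SELECT p.id FROM photo p
    JOIN subject s ON s.photo_id = p.id
    WHERE s.name = ? AND s.name = ?
""", ("Dog", "Ball"))

# Same facet, working version: join the facet table once per descriptor.
same_facet = ids("""
    SELECT p.id FROM photo p
    JOIN subject s1 ON s1.photo_id = p.id
    JOIN subject s2 ON s2.photo_id = p.id
    WHERE s1.name = ? AND s2.name = ?
    ORDER BY p.id
""", ("Dog", "Ball"))

print(multi_facet, naive, same_facet)  # [1] [] [1]
```

Under this assumed schema, the same-facet query needs the facet table joined once per descriptor, so the SQL has to be generated according to how many descriptors the user picked.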
2002.10.22 - SLIDE 12 IS 202 – FALL 2002
Lecture Overview
• Review
– Database Design
– Normalization
– Web-enabled Databases
• Introduction to Information Retrieval
• The Information Seeking Process
• Information Retrieval History and Developments
Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey
2002.10.22 - SLIDE 13 IS 202 – FALL 2002
Review: Information Overload
• “The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.” (Varian & Lyman)
• “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)
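The two figures in the first quotation can be cross-checked with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes), which the quotation does not state.

```python
# Assumption: decimal units, 1 GB = 10**9 bytes and 1 MB = 10**6 bytes
total_bytes = 1_500_000_000 * 10**9   # "roughly 1.5 billion gigabytes"
per_person_bytes = 250 * 10**6        # "250 megabytes per person"

implied_population = total_bytes // per_person_bytes
print(f"{implied_population:,}")      # 6,000,000,000
```

The two numbers are consistent with each other if the world population is taken to be about six billion.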
2002.10.22 - SLIDE 14 IS 202 – FALL 2002
Course Outline
• Organization
– Overview
– Categorization
– Metadata and markup
– Metadata for multimedia
• How to describe information resources or information-bearing objects in ways so that they may be effectively used by those who need to use them
– Organizing
• How to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs
– Retrieving
2002.10.22 - SLIDE 16 IS 202 – FALL 2002
Key Issues
[Figure: information life cycle diagram with the following labels]
• Creation, Utilization, Searching
• Active, Semi-Active, Inactive
• Retention/Mining, Disposition, Discard
• Using, Creating
• Authoring, Modifying
• Organizing, Indexing
• Storing, Retrieval
• Distribution, Networking
• Accessing, Filtering
2002.10.22 - SLIDE 17 IS 202 – FALL 2002
IR Textbook Topics
2002.10.22 - SLIDE 18 IS 202 – FALL 2002
More Detailed View
2002.10.22 - SLIDE 19 IS 202 – FALL 2002
What We’ll Cover
A Lot
A Little
2002.10.22 - SLIDE 20 IS 202 – FALL 2002
IR Topics for 202
• The Search Process
• Information Retrieval Models
• Content Analysis/Zipf Distributions
• Evaluation of IR Systems
– Precision/Recall
– Relevance
– User Studies
• System and Implementation Issues
• Web-Specific Issues
• User Interface Issues
• Special Kinds of Search
2002.10.22 - SLIDE 21 IS 202 – FALL 2002
Lecture Overview
• Review
– Database Design
– Normalization
– Web-enabled Databases
• Introduction to Information Retrieval
• The Information Seeking Process
• Information Retrieval History and Developments
Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey
2002.10.22 - SLIDE 22 IS 202 – FALL 2002
The Standard Retrieval Interaction Model
2002.10.22 - SLIDE 23 IS 202 – FALL 2002
Standard Model of IR
• Assumptions:
– Maximizing precision and recall simultaneously
– The information need remains static
– The value is in the resulting document set
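For readers who have not yet met the two measures named in the first assumption, here is a minimal sketch using the usual set-based definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The document ids and the two result sets are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, using set-based definitions."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}                          # hypothetical judgments
narrow = ["d1", "d2"]                                        # a cautious result set
broad = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]     # a generous one

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 1.0)
```

The two result sets show why maximizing both at once is hard: returning more documents tends to raise recall and lower precision.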
2002.10.22 - SLIDE 24 IS 202 – FALL 2002
Problems with Standard Model
• Users learn during the search process:
– Scanning titles of retrieved documents
– Reading retrieved documents
– Viewing lists of related topics/thesaurus terms
– Navigating hyperlinks
• Some users don’t like long disorganized lists of documents
2002.10.22 - SLIDE 25 IS 202 – FALL 2002
IR is an Iterative Process
[Figure: diagram with the labels Repositories, Workspace, and Goals]
2002.10.22 - SLIDE 26 IS 202 – FALL 2002
IR is a Dialog
• The exchange doesn’t end with the first answer
• Users can recognize elements of a useful answer
• Questions and understanding change as the process continues
2002.10.22 - SLIDE 27 IS 202 – FALL 2002
Bates’ “Berry-Picking” Model
• Standard IR model
– Assumes the information need remains the same throughout the search process
• Berry-picking model
– Interesting information is scattered like berries among bushes
– The query is continually shifting
2002.10.22 - SLIDE 28 IS 202 – FALL 2002
Berry-Picking Model
[Figure: successive queries Q0, Q1, Q2, Q3, Q4, Q5]
A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)
2002.10.22 - SLIDE 29 IS 202 – FALL 2002
Berry-Picking Model (cont.)
• The query is continually shifting
• New information may yield new ideas and new directions
• The information need
– Is not satisfied by a single, final retrieved set
– Is satisfied by a series of selections and bits of information found along the way
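The shifting-query idea can be caricatured in code. The sketch below is only a toy reading of the model, not Bates' formulation: the collection, the `search` function, and the rule that each round's new query is built from unseen terms in the documents just picked are all assumptions made for illustration.

```python
# Hypothetical toy collection: document id -> set of index terms
DOCS = {
    "d1": {"berry", "picking", "search"},
    "d2": {"search", "tactics", "thesaurus"},
    "d3": {"thesaurus", "vocabulary", "indexing"},
    "d4": {"cooking", "berry", "jam"},
}


def search(query):
    """Return ids of documents sharing at least one term with the query."""
    return sorted(d for d, terms in DOCS.items() if terms & query)


def berry_pick(first_query, is_useful, max_rounds=5):
    """Collect useful documents over several rounds while the query shifts."""
    query, used = set(first_query), set()
    basket, trail = [], []
    for _ in range(max_rounds):
        if not query:
            break
        trail.append(sorted(query))          # Q0, Q1, Q2, ...
        used |= query
        picked = [d for d in search(query) if d not in basket and is_useful(d)]
        if not picked:
            break
        basket.extend(picked)                # keep what was found this round
        new_terms = set()
        for d in picked:
            new_terms |= DOCS[d]
        query = new_terms - used             # what was read reshapes the query
    return basket, trail


basket, trail = berry_pick({"berry"}, is_useful=lambda d: d != "d4")
print(basket)  # ['d1', 'd2', 'd3']
print(trail)
```

The result is the whole basket gathered along the way. No single query in the trail would have retrieved all three documents.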
2002.10.22 - SLIDE 30 IS 202 – FALL 2002
Information Seeking Behavior
• Two parts of a process:
– Search and retrieval
– Analysis and synthesis of search results
• This is a fuzzy area
– We will look at several different working theories
2002.10.22 - SLIDE 31 IS 202 – FALL 2002
Search Tactics and Strategies
• Search Tactics
– Bates 1979
• Search Strategies
– Bates 1989
– O’Day and Jeffries 1993
2002.10.22 - SLIDE 32 IS 202 – FALL 2002
Tactics vs. Strategies
• Tactic: short term goals and maneuvers
– Operators, actions
• Strategy: overall planning
– Link a sequence of operators together to achieve some end
2002.10.22 - SLIDE 33 IS 202 – FALL 2002
Information Search Tactics
• Monitoring tactics
– Keep search on track
• Source-level tactics
– Navigate to and within sources
• Term and Search Formulation tactics
– Designing search formulation
– Selection and revision of specific terms within search formulation
2002.10.22 - SLIDE 34 IS 202 – FALL 2002
Monitoring Tactics (Strategy-Level)
• Check
– Compare original goal with current state
• Weigh
– Make a cost/benefit analysis of current or anticipated actions
• Pattern
– Recognize common strategies
• Correct Errors
• Record
– Keep track of (incomplete) paths
2002.10.22 - SLIDE 35 IS 202 – FALL 2002
Source-Level Tactics
• “Bibble”:
– Look for a pre-defined result set
• E.g., a good link page on the web
• Survey:
– Look ahead, review available options
• E.g., don’t simply use the first term or first source that comes to mind
• Cut:
– Eliminate a large proportion of the search domain
• E.g., search on rarest term first
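The Cut example has a direct counterpart in how a conjunctive query can be run against an inverted index: starting from the shortest posting list discards most of the collection in the first step. The sketch below is one possible rendering, and the tiny index is hypothetical.

```python
def and_query(index, terms):
    """Intersect posting lists for an AND query, rarest term first."""
    postings = sorted((index.get(t, set()) for t in terms), key=len)
    if not postings:
        return set()
    result = set(postings[0])        # the rarest term cuts the domain most
    for posting in postings[1:]:
        if not result:               # nothing left to narrow down
            break
        result &= posting
    return result


# Hypothetical inverted index: term -> set of document ids
index = {
    "information": {1, 2, 3, 4, 5, 6},
    "retrieval": {2, 3, 5},
    "zipf": {3},
}

print(and_query(index, ["information", "retrieval", "zipf"]))  # {3}
```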
2002.10.22 - SLIDE 36 IS 202 – FALL 2002
Source-Level Tactics (cont.)
• Stretch
– Use source in unintended way
• E.g., use patents to find addresses
• Scaffold
– Take an indirect route to goal
• E.g., when looking for references to an obscure poet, look up contemporaries
• Cleave
– Binary search in an ordered file
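Cleave corresponds to ordinary binary search. A minimal sketch using Python's standard `bisect` module follows; the `cleave` function name and the sorted list of names are illustrative only.

```python
from bisect import bisect_left


def cleave(sorted_entries, target):
    """Return the position of `target` in an ordered list, or -1 if absent."""
    i = bisect_left(sorted_entries, target)   # halves the range at each step
    if i < len(sorted_entries) and sorted_entries[i] == target:
        return i
    return -1


# Hypothetical ordered file of author names
authors = ["Auden", "Bates", "Bush", "Luhn", "Otlet", "Salton", "Wells"]

print(cleave(authors, "Otlet"))   # 4
print(cleave(authors, "Hearst"))  # -1
```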
2002.10.22 - SLIDE 37 IS 202 – FALL 2002
Search Formulation Tactics
• Specify
– Use as specific terms as possible
• Exhaust
– Use all possible elements in a query
• Reduce
– Subtract elements from a query
• Parallel
– Use synonyms and parallel terms
• Pinpoint
– Reducing parallel terms and refocusing query
• Block
– To reject or block some terms, even at the cost of losing some relevant documents
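Several of these tactics map onto simple edits of a Boolean query. In the sketch below a query is a list of concept groups (terms within a group are ORed, groups are ANDed) plus an optional set of blocked terms. This representation, the `run` function, and the four documents are assumptions for illustration, not part of Bates' description.

```python
# Hypothetical collection: document id -> set of terms
DOCS = {
    1: {"car", "repair", "manual"},
    2: {"automobile", "repair"},
    3: {"car", "repair", "toy"},
    4: {"car", "sales"},
}


def run(query, blocked=frozenset()):
    """Match documents that hit every group (AND) via any of its terms (OR)
    and that contain none of the blocked terms (NOT)."""
    return sorted(
        doc_id for doc_id, terms in DOCS.items()
        if all(terms & group for group in query) and not (terms & blocked)
    )


base = [{"car"}, {"repair"}]
exhaust = base + [{"manual"}]                   # Exhaust: add an element
reduce_ = [{"car"}]                             # Reduce: subtract an element
parallel = [{"car", "automobile"}, {"repair"}]  # Parallel: add a synonym

print(run(base))                       # [1, 3]
print(run(exhaust))                    # [1]
print(run(reduce_))                    # [1, 3, 4]
print(run(parallel))                   # [1, 2, 3]
print(run(parallel, blocked={"toy"}))  # [1, 2]  Block: document 3 is lost
```

Pinpoint is the reverse of Parallel here: going from `parallel` back to `base` drops the synonym and narrows the result again.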
2002.10.22 - SLIDE 38 IS 202 – FALL 2002
Term Tactics
• Move around the thesaurus
– Superordinate, subordinate, coordinate
– Neighbor (semantic or alphabetic)
– Trace: pull out terms from information already seen as part of search (titles, etc.)
– Morphological and other spelling variants
– Antonyms (contrary)
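The first three thesaurus moves can be sketched with a single broader-term table: superordinate goes up one level, subordinate goes down, and coordinate moves sideways to terms with the same parent. The tiny thesaurus and the function names below are hypothetical.

```python
# Hypothetical thesaurus: term -> its broader term
BROADER = {
    "poodle": "dog",
    "beagle": "dog",
    "dog": "mammal",
    "cat": "mammal",
}


def superordinate(term):
    """Move up: the broader term, or None at the top of the hierarchy."""
    return BROADER.get(term)


def subordinate(term):
    """Move down: all narrower terms."""
    return sorted(t for t, parent in BROADER.items() if parent == term)


def coordinate(term):
    """Move sideways: other terms sharing the same broader term."""
    parent = BROADER.get(term)
    if parent is None:
        return []
    return sorted(t for t, p in BROADER.items() if p == parent and t != term)


print(superordinate("dog"))  # mammal
print(subordinate("dog"))    # ['beagle', 'poodle']
print(coordinate("dog"))     # ['cat']
```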
2002.10.22 - SLIDE 39 IS 202 – FALL 2002
Additional Considerations (Bates 79)
• Add a Sort tactic!
• More detail is needed about short-term cost/benefit decision rule strategies
• When to stop?
– How to judge when enough information has been gathered?
– How to decide when to give up an unsuccessful search?
– When to stop searching in one source and move to another?
2002.10.22 - SLIDE 40 IS 202 – FALL 2002
Implications
• Interfaces should make it easy to store intermediate results
• Interfaces should make it easy to follow trails with unanticipated results
• Makes evaluation more difficult
2002.10.22 - SLIDE 41 IS 202 – FALL 2002
More Later…
• Later in the course:
– More on Search Process and Strategies
– User interfaces to improve IR process
– Incorporation of Content Analysis into better systems
2002.10.22 - SLIDE 42 IS 202 – FALL 2002
Restricted Form of the IR Problem
• The system has available only pre-existing, “canned” text passages
• Its response is limited to selecting from these passages and presenting them to the user
• It must select, say, 10 or 20 passages out of millions or billions!
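The restricted problem amounts to scoring every stored passage against the query and returning only the best few. The sketch below uses plain term overlap as the score, which is a placeholder chosen for brevity rather than a method from the lecture. The passages are made up.

```python
import heapq


def top_k(query, passages, k=10):
    """Return ids of the k passages sharing the most words with the query."""
    query_terms = set(query.lower().split())
    scored = (
        (len(query_terms & set(text.lower().split())), pid)
        for pid, text in passages.items()
    )
    # Keep only passages with some overlap, then take the k highest scores.
    best = heapq.nlargest(k, (pair for pair in scored if pair[0] > 0))
    return [pid for score, pid in best]


# Hypothetical "canned" passages
passages = {
    "p1": "information retrieval systems select passages",
    "p2": "database normalization and design",
    "p3": "evaluation of retrieval systems",
    "p4": "berry picking",
}

print(top_k("information retrieval systems", passages, k=2))  # ['p1', 'p3']
```

`heapq.nlargest` avoids sorting the whole collection, which matters when only 10 or 20 results are wanted from millions of passages.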
2002.10.22 - SLIDE 43 IS 202 – FALL 2002
Information Retrieval
• Revised Task Statement:
Build a system that retrieves documents that users are likely to find relevant to their queries
• This set of assumptions underlies the field of Information Retrieval
2002.10.22 - SLIDE 44 IS 202 – FALL 2002
Lecture Overview
• Review
– Database Design
– Normalization
– Web-enabled Databases
• Introduction to Information Retrieval
• The Information Seeking Process
• Information Retrieval History and Developments
Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey
2002.10.22 - SLIDE 45 IS 202 – FALL 2002
Visions of IR Systems
• Paul Otlet, 1930’s
• Emanuel Goldberg, 1920’s - 1940’s
• H.G. Wells, “World Brain: The idea of a permanent World Encyclopedia.” (Introduction to the Encyclopédie Française), 1937.
• Vannevar Bush, “As we may think.” Atlantic Monthly, 1945.
2002.10.22 - SLIDE 46 IS 202 – FALL 2002
Card-Based IR Systems
• Uniterm (Casey, Perry, Berry, Kent: 1958)
– Developed and used from the mid 1940’s
• Bagley’s 1951 MS thesis from MIT suggested that searching 50 million item records, each containing 30 index terms, would take approximately 41,700 hours
– Due to the need to move and shift the text in core memory while carrying out the comparisons
• 1957 – Desk Set with Katharine Hepburn and Spencer Tracy – EMERAC
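To get a feel for the scale of Bagley's estimate, the figures on this slide can be turned into per-record and per-term costs. The sketch takes the three numbers at face value and assumes the time is spread evenly over all records and terms, which the slide does not say.

```python
records = 50_000_000       # item records
terms_per_record = 30      # index terms in each record
total_hours = 41_700       # estimated search time

term_comparisons = records * terms_per_record
total_seconds = total_hours * 3600
seconds_per_record = total_seconds / records
seconds_per_term = total_seconds / term_comparisons
years = total_hours / 24 / 365

print(f"{term_comparisons:,} index terms to examine")  # 1,500,000,000
print(f"{seconds_per_record:.2f} s per record")        # 3.00
print(f"{seconds_per_term:.2f} s per index term")      # 0.10
print(f"{years:.2f} years of continuous running")      # 4.76
```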
2002.10.22 - SLIDE 50 IS 202 – FALL 2002
Historical Milestones in IR Research
• 1958 Statistical Language Properties (Luhn)
• 1960 Probabilistic Indexing (Maron & Kuhns)
• 1961 Term association and clustering (Doyle)
• 1965 Vector Space Model (Salton)
• 1968 Query expansion (Rocchio, Salton)
• 1972 Statistical Weighting (Sparck Jones)
• 1975 2-Poisson Model (Harter, Bookstein,