CS 430: Information Discovery
Lecture 26: Automated Information Retrieval

Transcript
Page 1

CS 430: Information Discovery

Lecture 26

Automated Information Retrieval

Page 2

Course Administration

Page 3

Information Discovery

People have many reasons to look for information:

• Known item: Where will I find the wording of the US Copyright Act?

• Facts: What is the capital of Barbados?

• Introduction or overview: How do diesel engines work?

• Related information: Is there a review of this article?

• Comprehensive search: What is known of the effects of global warming on hurricanes?

Page 4

Types of Information Discovery

[Diagram: types of information discovery organized along two axes. Media type: text vs. image, video, audio, etc. Activity: searching, browsing, linking. Approaches range from no human effort (statistical methods; natural language processing) to effort by the user (user-in-the-loop; catalogs and indexes, i.e. metadata). Related courses: CS 474, CS 502.]

Page 5

Automated information discovery

Creating catalog records manually is labor intensive and hence expensive.

The aim of automatic indexing is to build indexes and retrieve information without human intervention.

The aim of automated information discovery is for users to discover information without using skilled human effort to build indexes.

Page 6

Resources for automated information discovery

Computer power:

• brute force computing
• ranking methods
• automatic generation of metadata

The intelligence of the user:

• browsing
• relevance feedback
• information visualization

Page 7

Brute force computing

Few people really understand Moore's Law

-- Computing power doubles every 18 months

-- Increases 100 times in 10 years

-- Increases 10,000 times in 20 years
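
These multipliers are simply the 18-month doubling compounded; as a rough check (treating the doubling period as exact):

$$2^{120/18} \approx 100 \quad\text{(10 years)}, \qquad 2^{240/18} \approx 10{,}000 \quad\text{(20 years)}$$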

Simple algorithms + immense computing power may outperform human intelligence

Page 8

Problems with (old-fashioned) Boolean searching

With Boolean retrieval, a document either matches a query exactly or not at all

• Encourages short queries
• Requires precise choice of index terms (professional indexing)
• Requires precise formulation of queries (professional training)

Page 9

Relevance and Ranking

Classical methods assume that a document is either relevant to a query or not relevant.

Often a user will consider a document to be partially relevant.

Ranking methods: measure the degree of similarity between a query and a document.

[Diagram: requests and documents connected by a similarity measure]

Similarity: how similar is a document to a request?

Page 10

Contrast with (old-fashioned) Boolean searching

With Boolean retrieval, a document either matches a query exactly or not at all

• Encourages short queries
• Requires precise choice of index terms
• Requires precise formulation of queries (professional training)

With retrieval using similarity measures, similarities range from 0 to 1 for all documents

• Encourages long queries (to have as many dimensions as possible)
• Benefits from large numbers of index terms
• Permits queries with many terms, not all of which need match the document

Page 11

SMART System

An experimental system for automatic information retrieval

• automatic indexing to assign terms to documents and queries

• identify documents to be retrieved by calculating similarities between documents and queries

• collect related documents into common subject classes

• procedures for producing an improved search query based on information obtained from earlier searches

Gerard Salton and colleagues: Harvard 1964-1968, Cornell 1968-1988

Page 12

The index term vector space

[Figure: documents d1 and d2 plotted as vectors in a space with axes for terms t1, t2, t3]

The space has as many dimensions as there are terms in the word list.

Page 13

Vector similarity computation

Documents in a collection are assigned terms from a set of n terms

The term assignment array T is defined as:

if term j does not occur in document i, t_ij = 0

if term j occurs in document i, t_ij is greater than zero (the value of t_ij is called the weight of term j in document i)

The similarity between d_i and d_j is defined as:

$$\cos(d_i, d_j) = \frac{\sum_{k=1}^{n} t_{ik}\, t_{jk}}{|d_i|\,|d_j|}$$
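
For concreteness, here is a minimal Python sketch of this computation; the toy weight vectors and the helper name `cosine` are invented for illustration, not taken from the SMART system.

```python
import math

def cosine(d_i, d_j):
    """Cosine similarity between two term-weight vectors of equal length."""
    dot = sum(t_ik * t_jk for t_ik, t_jk in zip(d_i, d_j))
    norm_i = math.sqrt(sum(t * t for t in d_i))
    norm_j = math.sqrt(sum(t * t for t in d_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

# Two documents described by weights over the same n terms
d1 = [2, 0, 1, 0]   # weights t_1k for document 1
d2 = [1, 1, 0, 0]   # weights t_2k for document 2
print(cosine(d1, d2))   # ~0.632
```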

Page 14

Term weighting

Zipf's Law: If the words, w, in a collection are ranked, r(w), by their frequency, f(w), they roughly fit the relation:

r(w) * f(w) = c

This suggests that some terms are more effective than others in retrieval.

In particular, relative frequency is a useful measure that identifies terms that occur with substantial frequency in some documents, but with relatively low overall collection frequency.

Term weights are functions that are used to quantify these concepts.
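
As a rough illustration of checking the relation, here is a small Python sketch; the toy text and variable names are invented, and a real collection is needed before the rank-frequency products settle toward a constant.

```python
from collections import Counter

text = ("the cat sat on the mat the dog sat on the log "
        "the cat and the dog").split()

counts = Counter(text)
ranked = sorted(counts.items(), key=lambda kv: -kv[1])
for rank, (word, freq) in enumerate(ranked, start=1):
    # Zipf's law predicts rank * frequency is roughly constant
    print(rank, word, freq, rank * freq)
```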

Page 15

Term Frequency

Concept

A term that appears many times within a document is likely to be more important than a term that appears only once.

Page 16

Inverse Document Frequency

Concept

A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.

Page 17

Ranking -- Practical Experience

1. Basic method is inner (dot) product with no weighting

2. Cosine (dividing by product of lengths) normalizes for vectors of different lengths

3. Term weighting using frequency of terms in document usually improves ranking

4. Term weighting using an inverse function of the number of documents in the collection that contain the term improves ranking (e.g., IDF)

5. Weightings for document structure improve ranking

6. Relevance weightings after initial retrieval improve ranking

Effectiveness of methods depends on characteristics of the collection. In general, there are few improvements beyond simple weighting schemes.
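
A minimal sketch combining points 1-4 above (tf weighting, idf weighting, cosine normalization); the toy collection, query, and function names are invented for illustration of the general tf.idf-weighted approach, not the SMART implementation.

```python
import math
from collections import Counter

docs = [
    "global warming and hurricanes",
    "diesel engines and how engines work",
    "copyright law and the copyright act",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# idf: an inverse function of how many documents contain the term
df = Counter(term for doc in tokenized for term in set(doc))
idf = {term: math.log(N / df[term]) for term in df}

def tfidf_vector(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = tfidf_vector("effects of global warming on hurricanes".split())
doc_vectors = [tfidf_vector(toks) for toks in tokenized]
ranked = sorted(range(N), key=lambda i: cosine(query, doc_vectors[i]), reverse=True)
print(ranked)   # document 0 should rank first
```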

Page 18

Page Rank Algorithm (Google)

Concept:

The rank of a web page is higher if many pages link to it.

Links from highly ranked pages are given greater weight than links from less highly ranked pages.

Page 19

Google PageRank Model

A user:

1. Starts at a random page on the web

2a. With probability p, selects any random page and jumps to it

2b. With probability 1-p, selects a random hyperlink from the current page and jumps to the corresponding page

3. Repeats Steps 2a and 2b a very large number of times

Pages are ranked according to the relative frequency with which they are visited.
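
A small sketch of this random-surfer model, computed by repeated iteration rather than an actual random walk, and following the slide's convention that p is the probability of a random jump; the tiny link graph is invented for illustration.

```python
def pagerank(links, p=0.15, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: p / n for page in pages}          # step 2a: random jump
        for page, outlinks in links.items():
            if outlinks:                                     # step 2b: follow a random link
                share = (1 - p) * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:                                            # dangling page: jump anywhere
                for target in pages:
                    new_rank[target] += (1 - p) * rank[page] / n
        rank = new_rank
    return rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(toy_web))   # C, with the most incoming links, gets the highest rank
```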

Page 20

Compare TF.IDF to PageRank

With TF.IDF, documents are ranked according to how well they match a specific query.

With PageRank, the pages are ranked in order of importance, with no reference to a specific query.

Page 21

Latent Semantic Indexing

Objective

Replace indexes that use sets of index terms by indexes that use concepts.

Approach

Map the index term vector space into a lower dimensional space, using singular value decomposition.
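
A hedged sketch of such a mapping using NumPy's singular value decomposition; the toy term-by-document matrix and the choice of k dimensions are invented for illustration.

```python
import numpy as np

# Rows = terms, columns = documents (toy weights for illustration)
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                            # keep only the k largest singular values
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Documents represented in the k-dimensional "concept" space
doc_concepts = np.diag(s_k) @ Vt_k
print(doc_concepts.shape)        # (2, 3): 3 documents, each now a 2-dimensional vector
```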

Page 22

Use of Concept Space: Term Suggestion

Page 23

Non-Textual Materials

Content                    Attributes

maps                       lat. and long., content
photograph                 subject, date and place
bird songs and images      field mark, bird song
software                   task, algorithm
data set                   survey characteristics
video                      subject, date, etc.

Page 24

Direct Searching of Content

Sometimes it is possible to match a query against the content of a digital object. The effectiveness varies from field to field.

Examples

• Images -- crude characteristics of color, texture, shape, etc.

• Music -- optical recognition of score

• Bird song -- spectral analysis of sounds

• Fingerprints

Page 25

Image Retrieval: Blobworld

Page 26

Automated generation of metadata

• Vector methods are for textual material only.

• Metadata is needed for non-textual materials. (Vector methods can be applied to textual metadata.)

• Automated extraction of metadata is still weak because of the semantic knowledge needed.

Page 27

Surrogates for non-textual materials

Textual catalog record about a non-textual item (photograph)

Surrogate

Text-based methods of information retrieval can search a surrogate for a photograph.

Page 28

Library of Congress catalog record

CREATED/PUBLISHED: [between 1925 and 1930?]

SUMMARY: U. S. President Calvin Coolidge sits at a desk and signs a photograph, probably in Denver, Colorado. A group of unidentified men look on.

NOTES: Title supplied by cataloger. Source: Morey Engle.

SUBJECTS: Coolidge, Calvin,--1872-1933. Presidents--United States--1920-1930. Autographing--Colorado--Denver--1920-1930. Denver (Colo.)--1920-1930. Photographic prints.

MEDIUM: 1 photoprint ; 21 x 26 cm. (8 x 10 in.)

Page 29

Photographs: Cataloguing Difficulties

Automatic

• Image recognition methods are very primitive

Manual

• Photographic collections can be very large

• Many photographs may show the same subject

• Photographs have little or no internal metadata (no title page)

• The subject of a photograph may not be known (Who are the people in a picture? Where is the location?)

Page 30

Page 31

DC-dot applied to http://www.georgewbush.com/

<link rel="schema.DC" href="http://purl.org/dc">

<meta name="DC.Subject" content="George W. Bush; Bush; George Bush; President; republican; 2000 election; election; presidential election; George; B2K; Bush for President; Junior; Texas; Governor; taxes; technology; education; agriculture; health care; environment; society; social security; medicare; income tax; foreign policy; defense; government">

<meta name="DC.Description" content="George W. Bush is running for President of the United States to keep the country prosperous.">

continued on next slide

Automatic record for George W. Bush home page

Page 32

DC-dot applied to http://www.georgewbush.com/

<meta name="DC.Publisher" content="Concentric Network Corporation">

<meta name="DC.Date" scheme="W3CDTF" content="2001-01-12">

<meta name="DC.Type" scheme="DCMIType" content="Text">

<meta name="DC.Format" content="text/html">

<meta name="DC.Format" content="12223 bytes">

<meta name="DC.Identifier" content="http://www.georgewbush.com/">

Automatic record for George W. Bush home page (continued)

Page 33

Informedia: the need for metadata

A video sequence is awkward for information discovery:

• Textual methods of information retrieval cannot be applied

• Browsing requires the user to view the sequence. Fast skimming is difficult.

• Computing requirements are demanding (MPEG-1 requires 1.2 Mbits/sec).

Surrogates are required

Page 34

Multi-Modal Information Discovery

The multi-modal approach to information retrieval

Computer programs to analyze video materials for clues e.g., changes of scene.

• methods from artificial intelligence, e.g., speech recognition, natural language processing, image recognition.

• analysis of video track, sound track, closed captioning if present, any other information.

Each mode gives imperfect information. Therefore use many approaches and combine the evidence.

Page 35

Informedia Library Creation

[Diagram: video, audio, and text streams pass through speech recognition, image extraction, and natural language interpretation, then segmentation, producing segments with derived metadata]

Page 36

Harnessing the intelligence of the user

• Relevance feedback

• Support for browsing

• Information visualization

Page 37

The Human in the Loop

[Diagram: the user searches the index and receives hits; browses the repository and receives objects]

Page 38

Informedia: Information Discovery

[Diagram: the user works with segments and their derived metadata, browsing via multimedia surrogates and querying via natural language; requested segments and metadata are returned]

Page 39

MIRA

Evaluation Frameworks for Interactive Multimedia Information Retrieval Applications

• Information Retrieval techniques are beginning to be used in complex goal and task oriented systems whose main objectives are not just the retrieval of information.

• New original research in IR is being blocked or hampered by the lack of a broader framework for evaluation.

European study, 1996-99

Page 40

MIRA Aims

• Bring the user back into the evaluation process.

• Understand the changing nature of IR tasks and their evaluation.

• 'Evaluate' traditional evaluation methodologies.

• Consider how evaluation can be prescriptive of IR design

• Move towards balanced approach (system versus user)

• Understand how interaction affects evaluation.

• Support the move from static to dynamic evaluation.

• Understand how new media affects evaluation.

• Make evaluation methods more practical for smaller groups.

• Spawn new projects to develop new evaluation frameworks

Page 41

Feedback in the Vector Space Model

Document vectors as points on a surface

• Normalize all document vectors to be of length 1

• Then the ends of the vectors all lie on a surface with unit radius

• For similar documents, we can represent parts of this surface as a flat region

• Similar documents are represented as points that are close together on this surface

Page 42

Relevance feedback (concept)

[Diagram: hits from the original search plotted in the vector space; x marks documents identified as non-relevant, o marks documents identified as relevant; the reformulated query moves from the original query toward the relevant documents]
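
The slide does not name a specific formula; one standard reformulation in the vector model is Rocchio's method, developed in the SMART work. A minimal sketch with illustrative alpha, beta, gamma values and invented toy vectors:

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant documents and away from non-relevant ones."""
    q_new = alpha * query
    if len(relevant):
        q_new += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q_new -= gamma * np.mean(nonrelevant, axis=0)
    return np.clip(q_new, 0.0, None)   # negative term weights are usually dropped

query = np.array([1.0, 0.0, 1.0, 0.0])
relevant = np.array([[1.0, 1.0, 1.0, 0.0],
                     [0.5, 1.0, 0.5, 0.0]])
nonrelevant = np.array([[0.0, 0.0, 0.0, 1.0]])
print(rocchio(query, relevant, nonrelevant))
```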

Page 43

Document clustering (concept)

[Diagram: documents plotted as points in the vector space, falling into several clusters]

Document clusters are a form of automatic classification.

A document may be in several clusters.

Page 44

Browsing in Information Space

[Diagram: documents as points in information space, explored by browsing outward from a starting point]

Effectiveness depends on

(a) Starting point

(b) Effective feedback

(c) Convenience

Page 45

User Interface Concepts

Users need a variety of ways to search and browse, depending on the task being carried out and preferred style of working

• Visual icons

   one-line headlines, film strip views, video skims, transcript following of audio track

• Collages

• Semantic zooming

• Results set

• Named faces

• Skimming

Page 46

Page 47

Page 48

Page 49

Alexandria User Interface

Page 50

Page 51

Information Visualization: Tilebars

The figure represents a set of hits from a text search.

Each large rectangle represents a document or section of text.

Each row represents a search term or subquery.

The density of each small square indicates the frequency with which a term appears in a section of a document.

Hearst 1995

Page 52

Information Visualization: Dendrogram

[Figure: dendrogram clustering the items alpha, bravo, charlie, delta, echo, foxtrot, and golf, with numbered merge points]
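
A brief sketch of producing such a dendrogram with SciPy's hierarchical clustering; the item names echo the slide, but the document vectors here are random placeholders used only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

labels = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
vectors = np.random.RandomState(0).rand(len(labels), 5)   # toy document vectors

Z = linkage(vectors, method="average", metric="cosine")   # agglomerative clustering
dendrogram(Z, labels=labels)                               # draw the merge tree
plt.show()
```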

Page 53

Information Visualization: Self-Organizing Maps (SOM)

Page 54

Page 55

Google has proved ...

For a very wide range of users, entirely automated:

• selection
• indexing
• ranking

combined with:

• searching by untrained users
• online browsing

is a very effective form of information discovery.

Page 56

Searching

Changing users, changing user interfaces

From                         To

Trained user or librarian    Untrained user
Controlled vocabulary        Natural language
Fielded searching            Unfielded text
Manually created records     Full text
Boolean algorithms           Ranking methods
Stateful protocols           Stateless protocols

Page 57

Information Discovery: 1991 and 2001

                     1991          2001

Content              print         online
Computing            expensive     inexpensive
Choice of content    selective     comprehensive
Index creation       human         automatic
Frequency            one time      monthly
Vocabulary           controlled    not controlled
Query                Boolean       ranked retrieval
Users                trained       untrained