CS 430: Information Discovery
Lecture 26: Automated Information Retrieval

Transcript
Page 1

CS 430: Information Discovery

Lecture 26

Automated Information Retrieval

Page 2

Course Administration

Page 3

Information Discovery

People have many reasons to look for information:

• Known item: Where will I find the wording of the US Copyright Act?

• Facts: What is the capital of Barbados?

• Introduction or overview: How do diesel engines work?

• Related information: Is there a review of this article?

• Comprehensive search: What is known of the effects of global warming on hurricanes?

Page 4

Types of Information Discovery

[Diagram: types of information discovery organized along two axes. Media type: text vs. image, video, audio, etc. Activity: searching, browsing, linking. Approaches range from no human effort (statistical methods; natural language processing) to effort by the user (user-in-the-loop; catalogs and indexes, i.e. metadata). Related courses: CS 474, CS 502.]

Page 5

Automated information discovery

Creating catalog records manually is labor intensive and hence expensive.

The aim of automatic indexing is to build indexes and retrieve information without human intervention.

The aim of automated information discovery is for users to discover information without using skilled human effort to build indexes.

Page 6

Resources for automated information discovery

Computer power:

• brute force computing
• ranking methods
• automatic generation of metadata

The intelligence of the user:

• browsing
• relevance feedback
• information visualization

Page 7

Brute force computing

Few people really understand Moore's Law

-- Computing power doubles every 18 months

-- Increases 100 times in 10 years

-- Increases 10,000 times in 20 years
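
These multipliers are simply the 18-month doubling compounded; as a rough check (treating the doubling period as exact):

$$2^{120/18} \approx 100 \quad\text{(10 years)}, \qquad 2^{240/18} \approx 10{,}000 \quad\text{(20 years)}$$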

Simple algorithms + immense computing power may outperform human intelligence

Page 8

Problems with (old-fashioned) Boolean searching

With Boolean retrieval, a document either matches a query exactly or not at all

• Encourages short queries
• Requires precise choice of index terms (professional indexing)
• Requires precise formulation of queries (professional training)

Page 9

Relevance and Ranking

Classical methods assume that a document is either relevant to a query or not relevant.

Often a user will consider a document to be partially relevant.

Ranking methods: measure the degree of similarity between a query and a document.

[Diagram: requests and documents connected by a similarity measure]

Similarity: how similar is a document to a request?

Page 10

Contrast with (old-fashioned) Boolean searching

With Boolean retrieval, a document either matches a query exactly or not at all

• Encourages short queries
• Requires precise choice of index terms
• Requires precise formulation of queries (professional training)

With retrieval using similarity measures, similarities range from 0 to 1 for all documents

• Encourages long queries (to have as many dimensions as possible)
• Benefits from large numbers of index terms
• Permits queries with many terms, not all of which need match the document

Page 11

SMART System

An experimental system for automatic information retrieval

• automatic indexing to assign terms to documents and queries

• identify documents to be retrieved by calculating similarities between documents and queries

• collect related documents into common subject classes

• procedures for producing an improved search query based on information obtained from earlier searches

Gerard Salton and colleagues: Harvard 1964-1968, Cornell 1968-1988

Page 12

The index term vector space

[Figure: documents d1 and d2 plotted as vectors in a space with axes for terms t1, t2, t3]

The space has as many dimensions as there are terms in the word list.

Page 13

Vector similarity computation

Documents in a collection are assigned terms from a set of n terms

The term assignment array T is defined as:

if term j does not occur in document i, t_ij = 0

if term j occurs in document i, t_ij is greater than zero (the value of t_ij is called the weight of term j in document i)

The similarity between d_i and d_j is defined as:

$$\cos(d_i, d_j) = \frac{\sum_{k=1}^{n} t_{ik}\, t_{jk}}{|d_i|\,|d_j|}$$
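
For concreteness, here is a minimal Python sketch of this computation; the toy weight vectors and the helper name `cosine` are invented for illustration, not taken from the SMART system.

```python
import math

def cosine(d_i, d_j):
    """Cosine similarity between two term-weight vectors of equal length."""
    dot = sum(t_ik * t_jk for t_ik, t_jk in zip(d_i, d_j))
    norm_i = math.sqrt(sum(t * t for t in d_i))
    norm_j = math.sqrt(sum(t * t for t in d_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

# Two documents described by weights over the same n terms
d1 = [2, 0, 1, 0]   # weights t_1k for document 1
d2 = [1, 1, 0, 0]   # weights t_2k for document 2
print(cosine(d1, d2))   # ~0.632
```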

Page 14

Term weighting

Zipf's Law: If the words, w, in a collection are ranked, r(w), by their frequency, f(w), they roughly fit the relation:

r(w) * f(w) = c

This suggests that some terms are more effective than others in retrieval.

In particular, relative frequency is a useful measure that identifies terms that occur with substantial frequency in some documents, but with relatively low overall collection frequency.

Term weights are functions that are used to quantify these concepts.
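
As a rough illustration of checking the relation, here is a small Python sketch; the toy text and variable names are invented, and a real collection is needed before the rank-frequency products settle toward a constant.

```python
from collections import Counter

text = ("the cat sat on the mat the dog sat on the log "
        "the cat and the dog").split()

counts = Counter(text)
ranked = sorted(counts.items(), key=lambda kv: -kv[1])
for rank, (word, freq) in enumerate(ranked, start=1):
    # Zipf's law predicts rank * frequency is roughly constant
    print(rank, word, freq, rank * freq)
```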

Page 15

Term Frequency

Concept

A term that appears many times within a document is likely to be more important than a term that appears only once.

Page 16

Inverse Document Frequency

Concept

A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.

Page 17

Ranking -- Practical Experience

1. Basic method is inner (dot) product with no weighting

2. Cosine (dividing by product of lengths) normalizes for vectors of different lengths

3. Term weighting using frequency of terms in document usually improves ranking

4. Term weighting using an inverse function of the number of documents in the collection that contain the term improves ranking (e.g., IDF)

5. Weightings for document structure improve ranking

6. Relevance weightings after initial retrieval improve ranking

Effectiveness of methods depends on characteristics of the collection. In general, there are few improvements beyond simple weighting schemes.
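
A minimal sketch combining points 1-4 above (tf weighting, idf weighting, cosine normalization); the toy collection, query, and function names are invented for illustration of the general tf.idf-weighted approach, not the SMART implementation.

```python
import math
from collections import Counter

docs = [
    "global warming and hurricanes",
    "diesel engines and how engines work",
    "copyright law and the copyright act",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# idf: an inverse function of how many documents contain the term
df = Counter(term for doc in tokenized for term in set(doc))
idf = {term: math.log(N / df[term]) for term in df}

def tfidf_vector(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = tfidf_vector("effects of global warming on hurricanes".split())
doc_vectors = [tfidf_vector(toks) for toks in tokenized]
ranked = sorted(range(N), key=lambda i: cosine(query, doc_vectors[i]), reverse=True)
print(ranked)   # document 0 should rank first
```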

Page 18

Page Rank Algorithm (Google)

Concept:

The rank of a web page is higher if many pages link to it.

Links from highly ranked pages are given greater weight than links from less highly ranked pages.

Page 19

Google PageRank Model

A user:

1. Starts at a random page on the web

2a. With probability p, selects any random page and jumps to it

2b. With probability 1-p, selects a random hyperlink from the current page and jumps to the corresponding page

3. Repeats Steps 2a and 2b a very large number of times

Pages are ranked according to the relative frequency with which they are visited.
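
A small sketch of this random-surfer model, computed by repeated iteration rather than an actual random walk, and following the slide's convention that p is the probability of a random jump; the tiny link graph is invented for illustration.

```python
def pagerank(links, p=0.15, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: p / n for page in pages}          # step 2a: random jump
        for page, outlinks in links.items():
            if outlinks:                                     # step 2b: follow a random link
                share = (1 - p) * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:                                            # dangling page: jump anywhere
                for target in pages:
                    new_rank[target] += (1 - p) * rank[page] / n
        rank = new_rank
    return rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(toy_web))   # C, with the most incoming links, gets the highest rank
```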

Page 20

Compare TF.IDF to PageRank

With TF.IDF, documents are ranked according to how well they match a specific query.

With PageRank, the pages are ranked in order of importance, with no reference to a specific query.

Page 21

Latent Semantic Indexing

Objective

Replace indexes that use sets of index terms by indexes that use concepts.

Approach

Map the index term vector space into a lower dimensional space, using singular value decomposition.
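
A hedged sketch of such a mapping using NumPy's singular value decomposition; the toy term-by-document matrix and the choice of k dimensions are invented for illustration.

```python
import numpy as np

# Rows = terms, columns = documents (toy weights for illustration)
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                            # keep only the k largest singular values
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Documents represented in the k-dimensional "concept" space
doc_concepts = np.diag(s_k) @ Vt_k
print(doc_concepts.shape)        # (2, 3): 3 documents, each now a 2-dimensional vector
```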

Page 22

Use of Concept Space: Term Suggestion

Page 23

Non-Textual Materials

Content                    Attributes

maps                       lat. and long., content
photograph                 subject, date and place
bird songs and images      field mark, bird song
software                   task, algorithm
data set                   survey characteristics
video                      subject, date, etc.

Page 24

Direct Searching of Content

Sometimes it is possible to match a query against the content of a digital object. The effectiveness varies from field to field.

Examples

• Images -- crude characteristics of color, texture, shape, etc.

• Music -- optical recognition of score

• Bird song -- spectral analysis of sounds

• Fingerprints

Page 25

Image Retrieval: Blobworld

Page 26

Automated generation of metadata

• Vector methods are for textual material only.

• Metadata is needed for non-textual materials. (Vector methods can be applied to textual metadata.)

• Automated extraction of metadata is still weak because of the semantic knowledge needed.

Page 27

Surrogates for non-textual materials

Textual catalog record about a non-textual item (photograph)

Surrogate

Text-based methods of information retrieval can search a surrogate for a photograph.

Page 28

Library of Congress catalog record

CREATED/PUBLISHED: [between 1925 and 1930?]

SUMMARY: U. S. President Calvin Coolidge sits at a desk and signs a photograph, probably in Denver, Colorado. A group of unidentified men look on.

NOTES: Title supplied by cataloger. Source: Morey Engle.

SUBJECTS: Coolidge, Calvin,--1872-1933. Presidents--United States--1920-1930. Autographing--Colorado--Denver--1920-1930. Denver (Colo.)--1920-1930. Photographic prints.

MEDIUM: 1 photoprint ; 21 x 26 cm. (8 x 10 in.)

Page 29

Photographs: Cataloguing Difficulties

Automatic

• Image recognition methods are very primitive

Manual

• Photographic collections can be very large

• Many photographs may show the same subject

• Photographs have little or no internal metadata (no title page)

• The subject of a photograph may not be known (Who are the people in a picture? Where is the location?)

Page 30

Page 31

DC-dot applied to http://www.georgewbush.com/

<link rel="schema.DC" href="http://purl.org/dc">

<meta name="DC.Subject" content="George W. Bush; Bush; George Bush; President; republican; 2000 election; election; presidential election; George; B2K; Bush for President; Junior; Texas; Governor; taxes; technology; education; agriculture; health care; environment; society; social security; medicare; income tax; foreign policy; defense; government">

<meta name="DC.Description" content="George W. Bush is running for President of the United States to keep the country prosperous.">

continued on next slide

Automatic record for George W. Bush home page

Page 32

DC-dot applied to http://www.georgewbush.com/

<meta name="DC.Publisher" content="Concentric Network Corporation">

<meta name="DC.Date" scheme="W3CDTF" content="2001-01-12">

<meta name="DC.Type" scheme="DCMIType" content="Text">

<meta name="DC.Format" content="text/html">

<meta name="DC.Format" content="12223 bytes">

<meta name="DC.Identifier" content="http://www.georgewbush.com/">

Automatic record for George W. Bush home page (continued)

Page 33

Informedia: the need for metadata

A video sequence is awkward for information discovery:

• Textual methods of information retrieval cannot be applied

• Browsing requires the user to view the sequence. Fast skimming is difficult.

• Computing requirements are demanding (MPEG-1 requires 1.2 Mbits/sec).

Surrogates are required

Page 34

Multi-Modal Information Discovery

The multi-modal approach to information retrieval

Computer programs to analyze video materials for clues e.g., changes of scene.

• methods from artificial intelligence, e.g., speech recognition, natural language processing, image recognition.

• analysis of video track, sound track, closed captioning if present, any other information.

Each mode gives imperfect information. Therefore use many approaches and combine the evidence.

Page 35

Informedia Library Creation

[Diagram: video, audio, and text streams pass through speech recognition, image extraction, and natural language interpretation, then segmentation, producing segments with derived metadata]

Page 36

Harnessing the intelligence of the user

• Relevance feedback

• Support for browsing

• Information visualization

Page 37

The Human in the Loop

[Diagram: the user searches the index and receives hits; browses the repository and receives objects]

Page 38

Informedia: Information Discovery

[Diagram: the user works with segments and their derived metadata, browsing via multimedia surrogates and querying via natural language; requested segments and metadata are returned]

Page 39

MIRA

Evaluation Frameworks for Interactive Multimedia Information Retrieval Applications

• Information Retrieval techniques are beginning to be used in complex goal and task oriented systems whose main objectives are not just the retrieval of information.

• New original research in IR is being blocked or hampered by the lack of a broader framework for evaluation.

European study, 1996-99

Page 40

MIRA Aims

• Bring the user back into the evaluation process.

• Understand the changing nature of IR tasks and their evaluation.

• 'Evaluate' traditional evaluation methodologies.

• Consider how evaluation can be prescriptive of IR design

• Move towards balanced approach (system versus user)

• Understand how interaction affects evaluation.

• Support the move from static to dynamic evaluation.

• Understand how new media affects evaluation.

• Make evaluation methods more practical for smaller groups.

• Spawn new projects to develop new evaluation frameworks

Page 41

Feedback in the Vector Space Model

Document vectors as points on a surface

• Normalize all document vectors to be of length 1

• Then the ends of the vectors all lie on a surface with unit radius

• For similar documents, we can represent parts of this surface as a flat region

• Similar documents are represented as points that are close together on this surface

Page 42

Relevance feedback (concept)

[Diagram: hits from the original search plotted in the vector space; x marks documents identified as non-relevant, o marks documents identified as relevant; the reformulated query moves from the original query toward the relevant documents]
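
The slide does not name a specific formula; one standard reformulation in the vector model is Rocchio's method, developed in the SMART work. A minimal sketch with illustrative alpha, beta, gamma values and invented toy vectors:

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant documents and away from non-relevant ones."""
    q_new = alpha * query
    if len(relevant):
        q_new += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q_new -= gamma * np.mean(nonrelevant, axis=0)
    return np.clip(q_new, 0.0, None)   # negative term weights are usually dropped

query = np.array([1.0, 0.0, 1.0, 0.0])
relevant = np.array([[1.0, 1.0, 1.0, 0.0],
                     [0.5, 1.0, 0.5, 0.0]])
nonrelevant = np.array([[0.0, 0.0, 0.0, 1.0]])
print(rocchio(query, relevant, nonrelevant))
```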

Page 43

Document clustering (concept)

[Diagram: documents plotted as points in the vector space, falling into several clusters]

Document clusters are a form of automatic classification.

A document may be in several clusters.

Page 44

Browsing in Information Space

[Diagram: documents as points in information space, explored by browsing outward from a starting point]

Effectiveness depends on

(a) Starting point

(b) Effective feedback

(c) Convenience

Page 45

User Interface Concepts

Users need a variety of ways to search and browse, depending on the task being carried out and preferred style of working

• Visual icons

   one-line headlines, film strip views, video skims, transcript following of audio track

• Collages

• Semantic zooming

• Results set

• Named faces

• Skimming

Page 46

Page 47

Page 48

Page 49

Alexandria User Interface

Page 50

Page 51

Information Visualization: Tilebars

The figure represents a set of hits from a text search.

Each large rectangle represents a document or section of text.

Each row represents a search term or subquery.

The density of each small square indicates the frequency with which a term appears in a section of a document.

Hearst 1995

Page 52

Information Visualization: Dendrogram

[Figure: dendrogram clustering the items alpha, bravo, charlie, delta, echo, foxtrot, and golf, with numbered merge points]
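
A brief sketch of producing such a dendrogram with SciPy's hierarchical clustering; the item names echo the slide, but the document vectors here are random placeholders used only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

labels = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
vectors = np.random.RandomState(0).rand(len(labels), 5)   # toy document vectors

Z = linkage(vectors, method="average", metric="cosine")   # agglomerative clustering
dendrogram(Z, labels=labels)                               # draw the merge tree
plt.show()
```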

Page 53

Information Visualization: Self-Organizing Maps (SOM)

Page 54

Page 55

Google has proved ...

For a very wide range of users, entirely automated:

• selection
• indexing
• ranking

combined with:

• searching by untrained users
• online browsing

is a very effective form of information discovery.

Page 56

Searching

Changing users, changing user interfaces

From                         To

Trained user or librarian    Untrained user
Controlled vocabulary        Natural language
Fielded searching            Unfielded text
Manually created records     Full text
Boolean algorithms           Ranking methods
Stateful protocols           Stateless protocols

Page 57

Information Discovery: 1991 and 2001

                     1991          2001

Content              print         online
Computing            expensive     inexpensive
Choice of content    selective     comprehensive
Index creation       human         automatic
Frequency            one time      monthly
Vocabulary           controlled    not controlled
Query                Boolean       ranked retrieval
Users                trained       untrained