Dec 19, 2015


E.G.M Petrakis Multimedia Information Retrieval

Information Retrieval (IR)

Deals with the representation, storage and retrieval of unstructured data

Topics of interest: systems, languages, retrieval, user interfaces, data visualization, distributed data sets

Classical IR deals mainly with text

The evolution of multimedia databases and of the web has given new interest to IR

Multimedia IR

A multimedia information system that can store and retrieve
    attributes
    text
    2D grey-scale and color images
    1D time series
    digitized voice or music
    video

Applications

Financial, marketing: stock prices, sales etc.
    find companies whose stock prices move similarly

Scientific databases: sensor data
    weather, geological, environmental data

Office automation, electronic encyclopedias, electronic books

Medical databases
    X-rays, CT, MRI scans

Criminal investigation
    suspects, fingerprints

Personal archives
    text and color images

Queries

Formulation of the user’s information need
    free-text query
    by example document (e.g., text, image)
    e.g., in a collection of color photos, find those showing a tree close to a house

Retrieval is based on the understanding of the content of documents and of their components

Goals of Retrieval

Accuracy: retrieve the documents that the user expects in the answer
    with as few incorrect answers as possible
    all relevant answers are retrieved

Speed: retrieval has to be fast
    the system responds in real time

Accuracy of Retrieval

Depends on the query criteria, query complexity and specificity
    attribute or content criteria?

Depends on what is matched with the query
    how document content is represented
    typically feature vectors

Depends also on the matching function
    Euclidean distance
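
As an illustration of matching feature vectors, here is a minimal Python sketch of the Euclidean distance. The function name `euclidean_distance` and the use of plain lists as feature vectors are assumptions made for this example, not part of the slides.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

A distance of 0 means the two descriptions are identical; larger values mean less similar documents.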

Speed of Retrieval

The query is compared (matched) with all stored documents

The definition of similarity criteria is an important issue
    similarity or distance function between documents

Matching has to be fast: document matching has to be computationally efficient and sequential searching must be avoided

Indexing: search documents that are likely to match the query

[Figure: a database holding the documents Doc1, Doc2, ..., DocN and their descriptions (Doc1-FeatureVector, Doc2-FeatureVector, ..., DocN-FeatureVector); other labels: index, query]

Main Idea

Descriptions are extracted and stored
    same or separate storage with documents

Queries address the stored descriptions rather than the documents themselves

Images and video: the majority of stored data

Solution: retrieval by text
    text retrieval is a well-researched area
    most systems work with text and attributes

Problem

Text descriptions for images and video contained in documents are not available
    e.g., the text in a web site is not always descriptive of every particular image contained in the web site

Two approaches
    human annotations
    feature extraction

Human Annotations

Humans insert attributes, captions for images and videos

inconsistent or subjective over time and users (different users do not give the same descriptions)

retrievals will fail if queries are formulated using different keywords or descriptions

time consuming process, expensive or impossible for large databases

Feature Extraction

Features are extracted from audio, image and video

cheaper and replicable approach

consistent descriptions, but
    inexact
    low-level features (patterns, colors etc.)
    difficult to extract meaning
    different techniques for different data

Architecture of an IR System for Images

[Figure from Furht et al. 96]

Multimedia IR in Practice

Relies on text descriptions

Non-text IR
    there is nothing similar to text-based retrieval systems
    automatic IR is sometimes impossible
    requires human intervention at some level
    significant progress has been made
    domain-specific interpretations

Ranking of IR Methods

Ranking in terms of complexity and accuracy
    attributes
    text
    audio
    image
    video

Combined IR based on two or more data types for more accurate retrievals

Complex data types (e.g., video) are rich sources of information

[Figure labels: accuracy, complexity]

Similarity Searching

IR is not an exact process
    two documents are never identical!
    they can be “similar”
    searching must be “approximate”

The effectiveness of IR depends on the
    types and correctness of descriptions used
    types of queries allowed
    user uncertainty as to what he is looking for
    efficiency of search techniques

Approximate IR

A query is specified and all documents up to a pre-specified degree of similarity are retrieved and presented to the user ordered by similarity

Two common types of similarity queries:
    range queries: retrieve all documents up to a distance “threshold” T
    nearest-neighbor queries: retrieve the k best matches
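
The two query types can be sketched in Python as a plain sequential scan over feature vectors. This is only an illustration: the names `range_query` and `knn_query`, the dictionary of vectors and the choice of Euclidean distance are assumptions for this example, and a real system would use an index instead of scanning.

```python
import heapq
import math


def distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def range_query(query, docs, threshold):
    """All (distance, doc_id) pairs within the threshold, ordered by distance."""
    scored = ((distance(query, vec), doc_id) for doc_id, vec in docs.items())
    return sorted(pair for pair in scored if pair[0] <= threshold)


def knn_query(query, docs, k):
    """The k best matches as (distance, doc_id) pairs, nearest first."""
    scored = ((distance(query, vec), doc_id) for doc_id, vec in docs.items())
    return heapq.nsmallest(k, scored)
```

Both functions return the answers ordered by distance, which corresponds to presenting results ordered by similarity.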

Distance - Similarity

Decide whether two documents are similar
    distance: the lower the distance, the more similar the documents are
    similarity: the higher the similarity, the more similar the documents are

Key issue for successful retrieval: the more accurate the descriptions are, the more accurate the retrieval is

Retrieval Quality Criteria

Retrieval with as few errors as possible

Two types of errors:
    false dismissals or misses: qualifying but not retrieved documents
    false positives or false drops: retrieved but not qualifying documents

A good method minimizes both

Ranking quality: retrieve qualifying documents before non-qualifying ones

Evaluation of Retrieval

“Given a collection of documents, a set of queries and human expert’s responses to the above queries, the ideal system will retrieve exactly what the human dictated”

The deviations from the above are measured

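\[ \text{recall} = \frac{\text{number of relevant documents retrieved in the answer}}{\text{total number of relevant documents in the collection}} \]

\[ \text{precision} = \frac{\text{number of relevant documents retrieved in the answer}}{\text{number of documents retrieved}} \]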
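
A small Python sketch of the two measures, assuming the answer and the expert's judgments are given as collections of document ids. The function name and the convention of returning 0.0 for an empty denominator are choices made for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents in the answer
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if 4 documents are retrieved, 3 documents in the collection are relevant and 2 of the retrieved ones are relevant, precision is 2/4 and recall is 2/3.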

Precision-Recall Diagram

Measure precision-recall for 1, 2, ..., N answers
    high precision means few false alarms
    high recall means few false dismissals

[Figure: precision-recall diagram with both axes running from 0 to 1; the ideal method is marked]

Harmonic Mean

A single measure combining precision and recall

\[ F = \frac{2}{\frac{1}{r} + \frac{1}{p}} \]

r: recall, p: precision
    F takes values in [0,1]
    F → 1 as more retrieved documents are relevant
    F → 0 as few retrieved documents are relevant
    F is high when both precision and recall are high
    F expresses a compromise between precision and recall
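
The harmonic mean of precision and recall can be computed as in the following Python sketch. Returning 0.0 when either value is 0 is a convention assumed here to avoid a division by zero.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, in [0, 1]."""
    if precision == 0 or recall == 0:
        return 0.0  # assumed convention: F is 0 when either measure is 0
    return 2 / (1 / recall + 1 / precision)
```

Because it is a harmonic mean, F stays low when either precision or recall is low, even if the other one is high.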

Ranking Quality

The higher the Rnorm, the better the ability of a method to retrieve correct answers before incorrect ones

A human expert evaluates the answer set

Then take answers in pairs: (relevant, irrelevant)
    S+: pairs ranked correctly (the relevant entry has been retrieved before the irrelevant one)
    S-: pairs ranked incorrectly
    S+max: total ranked pairs

\[ R_{norm} = \begin{cases} \frac{1}{2}\left(1 + \frac{S^{+} - S^{-}}{S^{+}_{max}}\right) & \text{if } S^{+}_{max} > 0 \\ 1 & \text{otherwise} \end{cases} \]
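
A minimal Python sketch of the Rnorm computation follows. It assumes the answer is a strictly ranked list of expert judgments (True for relevant) and that the total number of ranked pairs equals the number of relevant answers times the number of irrelevant ones; both are assumptions of this example rather than statements from the slides.

```python
def r_norm(ranked_relevance):
    """Rnorm for a list of booleans in rank order (True = judged relevant)."""
    s_minus = 0
    irrelevant_seen = 0
    for is_relevant in ranked_relevance:
        if is_relevant:
            # Every irrelevant answer ranked above this one is a wrong pair.
            s_minus += irrelevant_seen
        else:
            irrelevant_seen += 1
    relevant_total = sum(ranked_relevance)
    irrelevant_total = len(ranked_relevance) - relevant_total
    s_max = relevant_total * irrelevant_total  # assumed: all (relevant, irrelevant) pairs
    if s_max == 0:
        return 1.0
    s_plus = s_max - s_minus
    return 0.5 * (1 + (s_plus - s_minus) / s_max)
```

With the ranking relevant, irrelevant, relevant, irrelevant there are 4 pairs, 3 ranked correctly and 1 incorrectly, so the value is 0.5 * (1 + 2/4) = 0.75.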

Searching Method

Sequential scanning: the query is matched with all stored documents
    slow if the database is large or if matching is slow

The documents must be indexed
    hashing, inverted files, B-trees, R-trees
    different data types are indexed and searched separately

Goals of IR

Summarizing, there are two general goals common to all IR systems
    effectiveness: IR must be accurate (retrieves what the user expects to see in the answer)
    efficiency: IR must be fast (faster than sequential scanning)

Query Formulation

Must be flexible and convenient

SQL queries: constraints on all attributes and data types
    may become very complex

Queries by example, e.g., by providing an example document or image

Browsing: display headers, summaries, miniatures etc.; also useful for refining the retrieved results

Access Methods for Text

Access methods for text are interesting for at least 3 reasons
    multimedia documents contain text, e.g., images often have captions
    text retrieval has several applications in itself (library automation, web search etc.)
    text retrieval research has led to useful ideas like the vector space model, information filtering, relevance feedback etc.

Text Queries

Single or multiple keyword queries
    context queries: phrases, word proximity
    Boolean: keywords with and, or, not

Natural language: free-text queries

Structured search also takes text structure into account and can be:
    flat or hierarchical, for searching in titles, paragraphs, sections, chapters
    hypertext: combines content and connectivity

Keyword Matching

The whole collection is searched
    no preprocessing
    no space overhead
    updates are easy

The user specifies a string (regular expression) and the text is parsed using a finite state automaton
    KMP, BMH algorithms
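
To make the idea of scanning the text for a keyword concrete, here is a short Python sketch of the Boyer-Moore-Horspool (BMH) algorithm for a plain string pattern. It is a simplified illustration: the function name `bmh_search` is made up for this example, and regular expressions are not handled.

```python
def bmh_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift table: for each pattern character except the last one, the
    # distance from its last occurrence to the end of the pattern.
    shift = {ch: m - 1 - k for k, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Slide by the shift of the text character under the pattern's end;
        # characters that do not occur in the pattern allow a full shift.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

No index is built, so the text itself needs no preprocessing and updates to the collection cost nothing, at the price of scanning everything for every query.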

Error Tolerant Methods

Methods that can tolerate errors

Scan text one character at a time by keeping track of matched characters and retrieve strings within a desired editing distance from the query
    Wu, Manber 92; Baeza-Yates, Gonnet 92

Extension: regular expressions built up from strings and operators: “pro (blem | tein) (s | ε) | (0 | 1 | 2)”

Error Counting Methods

Retrieve similar words or phrases e.g., misspelled words or words with different pronunciation

editing distance: a numerical estimate of the similarity between 2 strings

phonetic coding: search words with similar pronunciation (Soundex, Phonix)

N-grams: count common N-length substrings
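
The N-gram idea can be sketched in a few lines of Python. The helper names and the choice of N = 2 as the default are assumptions for this example; each distinct N-gram is counted once.

```python
def ngrams(word, n=2):
    """Set of all substrings of length n in word."""
    return {word[k:k + n] for k in range(len(word) - n + 1)}


def common_ngrams(a, b, n=2):
    """Number of distinct n-length substrings shared by a and b."""
    return len(ngrams(a, n) & ngrams(b, n))
```

For instance, "cordis" and "codris" share the bigrams "co" and "is", so they have 2 bigrams in common out of 5 each.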

Editing Distance

Minimum number of edit operations that are needed to transform the first string into the second
    edit operation: insert, delete, substitute and (sometimes) transposition of characters

d(si,ti): distance between characters
    usually d(si,ti) = 1 if si <> ti and 0 otherwise

edit(cordis, codis) = 1: r is deleted

edit(cordis, codris) = 1: transposition of rd

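Edit distance matrix for the strings aaabcccd (rows) and abbcdd (columns)

    i        0  1  2  3  4  5  6
    j           a  b  b  c  d  d
    0        0  1  2  3  4  5  6
    1  a     1  0  1  2  3  4  5
    2  a     2  1  1  2  3  4  5
    3  a     3  2  2  2  3  4  5
    4  b     4  3  2  2  3  4  5
    5  c     5  4  3  3  2  3  4
    6  c     6  5  4  4  3  3  4
    7  c     7  6  5  5  4  4  4
    8  d     8  7  6  6  5  4  4

First row and first column: initialization cost

Bottom-right entry: total cost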

Damerau-Levenshtein Algorithm

Basic recurrence relation (DP algorithm)

    edit(0,0) = 0; edit(i,0) = i; edit(0,j) = j;
    edit(i,j) = min{ edit(i-1,j) + 1,
                     edit(i,j-1) + 1,
                     edit(i-1,j-1) + d(si,tj),
                     edit(i-2,j-2) + d(si,tj-1) + d(si-1,tj) + 1 }
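
The recurrence can be turned into a short Python sketch. It assumes unit costs (d is 0 for equal characters and 1 otherwise) and applies the transposition term only when both indices are at least 2; the function names are chosen for this example.

```python
def d(a, b):
    """Distance between two characters: 0 if equal, 1 otherwise."""
    return 0 if a == b else 1


def edit(s, t):
    """Edit distance with insert, delete, substitute and adjacent transposition."""
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete the first i characters of s
    for j in range(cols):
        table[0][j] = j  # insert the first j characters of t
    for i in range(1, rows):
        for j in range(1, cols):
            best = min(
                table[i - 1][j] + 1,                           # delete
                table[i][j - 1] + 1,                           # insert
                table[i - 1][j - 1] + d(s[i - 1], t[j - 1]),   # substitute or match
            )
            if i > 1 and j > 1:
                # Transposition term; it costs exactly 1 when the last two
                # characters of the prefixes are swapped copies of each other.
                swap = d(s[i - 1], t[j - 2]) + d(s[i - 2], t[j - 1]) + 1
                best = min(best, table[i - 2][j - 2] + swap)
            table[i][j] = best
    return table[-1][-1]
```

The table has (len(s) + 1) x (len(t) + 1) entries, so time and space both grow with the product of the two string lengths.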

Matching Algorithm

Compute D(A,B)
    #A, #B: lengths of A, B
    0: null symbol
    R: cost of an edit operation

    D(0,0) = 0
    for i = 1 to #A: D(i,0) = D(i-1,0) + R(A[i] → 0);
    for j = 1 to #B: D(0,j) = D(0,j-1) + R(0 → B[j]);
    for i = 1 to #A
        for j = 1 to #B {
            1. m1 = D(i,j-1) + R(0 → B[j]);
            2. m2 = D(i-1,j) + R(A[i] → 0);
            3. m3 = D(i-1,j-1) + R(A[i] → B[j]);
            4. D(i,j) = min{m1, m2, m3};
        }
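
A Python sketch of this matching algorithm with a pluggable cost function is given below. Representing the null symbol by `None` and defaulting to unit costs are assumptions made for the example; transpositions are not covered by this version.

```python
def unit_cost(x, y):
    """Assumed cost R(x -> y): 0 for a match, 1 otherwise (None = null symbol)."""
    return 0 if x == y else 1


def edit_distance(a, b, cost=unit_cost):
    """Compute D(A, B) by dynamic programming with edit-operation costs R."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = D[i - 1][0] + cost(a[i - 1], None)      # deletions only
    for j in range(1, len(b) + 1):
        D[0][j] = D[0][j - 1] + cost(None, b[j - 1])      # insertions only
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            m1 = D[i][j - 1] + cost(None, b[j - 1])       # insert B[j]
            m2 = D[i - 1][j] + cost(a[i - 1], None)       # delete A[i]
            m3 = D[i - 1][j - 1] + cost(a[i - 1], b[j - 1])  # substitute or match
            D[i][j] = min(m1, m2, m3)
    return D[len(a)][len(b)]
```

Passing a different `cost` function changes the weights of the edit operations without touching the dynamic programming loop.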

Classical IR Models

Boolean model
    simple, based on set theory
    queries as Boolean expressions

Vector space model
    queries and documents as vectors in term space

Probabilistic model
    a probabilistic approach

Text Indexing

A document is represented by a set of index terms that summarize document contents
    adjectives, adverbs, connectives are less useful
    mainly nouns (lexicon look-up)
    requires text pre-processing (off-line)

Indexing

[Figure labels: index, data repository]

Text Preprocessing

Extract index terms from text

Processing stages
    word separation, sentence splitting
    change terms to a standard form (e.g., lowercase)
    eliminate stop-words (e.g., and, is, the, …)
    reduce terms to their base form (e.g., eliminate prefixes, suffixes)
    construct mapping between terms and documents (indexing)
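
The first four stages can be sketched in Python as follows. The stop-word list and the suffix-stripping rule are deliberately tiny, hypothetical stand-ins for a real stop list and stemmer, and the function names are made up for this example; the final mapping of terms to documents is left to the indexing step.

```python
import re

STOP_WORDS = {"and", "is", "the", "of", "a", "in"}  # toy stop list (assumption)
SUFFIXES = ("ing", "s")                             # toy suffix rules (assumption)


def stem(term):
    """Strip one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[:-len(suffix)]
    return term


def index_terms(text):
    """Word separation, lowercasing, stop-word removal and suffix stripping."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(word) for word in words if word not in STOP_WORDS]
```

For example, "Indexing the documents and the images" is reduced to the terms index, document and image.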

Text Preprocessing Chart

[Figure from Baeza-Yates & Ribeiro-Neto, 1999]

Text Indexing Methods

Main indexing methods:
    inverted files
    signature files
    bitmaps

Size of index
    may exceed the size of the actual collection
    compress index to reduce storage
    typically 40% of the size of the collection

Inverted Files

Dictionary: terms stored alphabetically

Posting list: pointers to documents containing the term

Posting info: document id, frequency of occurrence, location in document, etc.
    dictionary indexed by B-trees or tries, or searched by binary search

Pros: fast and easy to implement

Cons: large space overhead (up to 300%)

Inverted Index

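[Figure: the index holds the terms in alphabetical order (άγαλμα “statue”, αγάπη “love”, …, δουλειά “work”, …, πρωί “morning”, …, ωκεανός “ocean”); each term points to a posting list of pairs such as (1,2)(3,4), (4,3)(7,5), (10,3), which refer to the documents numbered 1, 2, …, 11, …]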

Query Processing

1. Parse query and extract query terms
2. Dictionary lookup
    binary search
3. Get postings from inverted file
    one term at a time
4. Accumulate postings
    record matching document ids, frequencies, etc.
    compute weights
5. Rank answers by weights (e.g., frequencies)
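
The five steps can be illustrated with a small in-memory inverted file in Python. This is a sketch under simplifying assumptions: terms are split on whitespace, postings hold (document id, frequency) pairs, the weight of a document is the sum of the matching term frequencies, and all function names are made up for the example.

```python
from bisect import bisect_left
from collections import Counter, defaultdict


def build_index(docs):
    """Return (sorted dictionary, postings) for a {doc_id: text} collection."""
    postings = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for term, freq in Counter(text.lower().split()).items():
            postings[term].append((doc_id, freq))
    return sorted(postings), postings


def in_dictionary(dictionary, term):
    """Step 2: binary search in the alphabetically sorted dictionary."""
    pos = bisect_left(dictionary, term)
    return pos < len(dictionary) and dictionary[pos] == term


def search(query, dictionary, postings):
    """Steps 1 and 3 to 5: parse, fetch postings, accumulate, rank."""
    scores = Counter()
    for term in query.lower().split():            # one term at a time
        if in_dictionary(dictionary, term):
            for doc_id, freq in postings[term]:
                scores[doc_id] += freq            # accumulate weights
    # Highest weight first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Only the posting lists of the query terms are touched, which is what makes the inverted file faster than scanning every document.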

Signature Files

A filter for eliminating most of the non-qualifying documents
    signature: short hash-coded representation of document or query
    the signatures are stored and searched sequentially
    search returns all qualifying documents plus some false alarms
    the answer is searched to eliminate false alarms

Superimposed Coding

Each word yields a bit pattern of size F with m bits set to 1 and the rest left as 0
    size of F affects the false drop rate
    bit patterns are OR-ed to form the document signature
    find documents having 1 in the same bits as the query

    word                  signature
    data                  001 000 110 010
    base                  000 010 101 001
    document signature    001 010 111 011
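
A Python sketch of superimposed coding follows. The values F = 12 and m = 4 mirror the example above, but the way a word is hashed to bit positions (SHA-256 of the word plus a counter) is purely an assumption for illustration, so the generated patterns will not match the ones shown on the slide.

```python
import hashlib

F = 12  # signature size in bits
M = 4   # bits set to 1 per word


def word_signature(word, size=F, bits=M):
    """Bit pattern of the given size with exactly `bits` bits set to 1."""
    signature, counter = 0, 0
    while bin(signature).count("1") < bits:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        signature |= 1 << (int.from_bytes(digest[:4], "big") % size)
        counter += 1
    return signature


def superimpose(patterns):
    """OR the word patterns together to form the document signature."""
    signature = 0
    for pattern in patterns:
        signature |= pattern
    return signature


def may_contain(doc_signature, word):
    """True if the document has 1 in every bit set by the query word."""
    query = word_signature(word)
    return (doc_signature & query) == query
```

A True result from `may_contain` can be a false alarm, so qualifying documents still have to be checked against the actual text; a False result is always correct.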

Bitmaps

For every term, a bit vector is stored
    each bit corresponds to a document
    set to 1 if the term appears in the document, 0 otherwise
    e.g., “word” 10100: the “word” appears in documents 1 and 3 in a collection of 5 documents

Very efficient for Boolean queries
    retrieve bit-vectors for each query term
    combine bit vectors with Boolean operators
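
Bitmaps map naturally onto integers and bitwise operators, as in this Python sketch. Storing each bit vector as an integer whose leftmost bit stands for document 1 is an assumption chosen to match the “10100” example; the function names are made up.

```python
def build_bitmaps(docs):
    """One integer bit vector per term; the leftmost bit is document 1."""
    n = len(docs)
    bitmaps = {}
    for pos, text in enumerate(docs):
        for term in set(text.lower().split()):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << (n - 1 - pos))
    return bitmaps


def to_doc_ids(bits, n):
    """Translate a bit vector back into a list of 1-based document numbers."""
    return [k + 1 for k in range(n) if bits & (1 << (n - 1 - k))]
```

A query such as "word AND NOT one" then becomes `bitmaps["word"] & ~bitmaps["one"]`, masked to the collection size before converting back to document numbers.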

Comparison of Text Indexing Methods

Bitmaps:
    pros: intuitive, easy to implement and to process
    cons: space overhead

Signature files:
    pros: easy to implement, compact index, no lexicon
    cons: many unnecessary accesses to documents

Inverted files:
    pros: used by most search engines
    cons: large space overhead (the index can be compressed)

References

“Searching Multimedia Databases by Content”, C. Faloutsos, Kluwer Academic Publishers, 1996

“Modern Information Retrieval”, R. Baeza-Yates, B. Ribeiro-Neto, Addison Wesley, 1999

“Automatic Text Processing”, Gerard Salton, Addison Wesley, 1989

Information Retrieval Links:

http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.html