1 CS 430 / INFO 430 Information Retrieval Lecture 6 Boolean Methods

Dec 18, 2015

1

CS 430 / INFO 430 Information Retrieval

Lecture 6

Boolean Methods

2

Course Administration

Assignments

You are encouraged to use existing Java or C++ classes, e.g., to manage data structures.

If you do, you must acknowledge them in your report. In addition, for a data structure, you should explain why this structure is appropriate for the use you make of it.

3

Porter Stemmer

A multi-step, longest-match stemmer.

M. F. Porter, An algorithm for suffix stripping. (Originally published in Program, 14 no. 3, pp 130-137, July 1980.) http://www.tartarus.org/~martin/PorterStemmer/def.txt

Notation

v        vowel(s)
c        consonant(s)
(vc)^m   vowel(s) followed by consonant(s), repeated m times

Any word can be written: [c](vc)^m[v]

m is called the measure of the word.
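As a rough illustration of the notation, the sketch below computes the measure m by collapsing a word into runs of vowels and consonants and counting the "vc" pairs. The function names are chosen for illustration only; the treatment of "y" (a consonant unless it follows a consonant) follows Porter's paper rather than this slide.

```python
def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' counts as a consonant only at the start or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word: str) -> int:
    """Count the (vc) repetitions m in the form [c](vc)^m[v]."""
    word = word.lower()
    pattern = ""
    for i in range(len(word)):
        kind = "c" if is_consonant(word, i) else "v"
        # Collapse each run of vowels or consonants into one symbol.
        if not pattern or pattern[-1] != kind:
            pattern += kind
    return pattern.count("vc")
```

For example, `measure("tree")` is 0, `measure("trouble")` is 1 and `measure("troubles")` is 2.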

4

Porter's Stemmer

Multi-Step Stemming Algorithm

Complex suffixes

Complex suffixes are removed bit by bit in the different steps. Thus:

GENERALIZATIONS

becomes GENERALIZATION (Step 1)
becomes GENERALIZE (Step 2)
becomes GENERAL (Step 3)
becomes GENER (Step 4)

[In this example, note that Steps 3 and 4 appear to be unhelpful for information retrieval.]

5

Porter Stemmer: Step 1a

Suffix    Replacement    Examples

sses      ss             caresses -> caress
ies       i              ponies -> poni, ties -> ti
ss        ss             caress -> caress
s         null           cats -> cat

At each step, carry out the longest match only.
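A minimal sketch of Step 1a, assuming lower-case input. The rule table is taken from the slide; the function and constant names are hypothetical. Trying the suffixes from longest to shortest and stopping at the first hit gives the longest-match behaviour.

```python
# (suffix, replacement) pairs for Step 1a, as listed on the slide.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step_1a(word: str) -> str:
    """Apply only the longest matching Step 1a suffix rule."""
    rules = sorted(STEP_1A_RULES, key=lambda rule: len(rule[0]), reverse=True)
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```

Note that "caress" matches the rule ss -> ss before the rule s -> null is considered, which is why it is left unchanged.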

6

Porter Stemmer: Step 1b

Conditions    Suffix    Replacement    Examples

(m > 0)       eed       ee             feed -> feed, agreed -> agree
(*v*)         ed        null           plastered -> plaster, bled -> bled
(*v*)         ing       null           motoring -> motor, sing -> sing

*v* - the stem contains a vowel
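The three rules on this slide might be sketched as follows. This is an illustration under stated assumptions, not the full step: Porter's algorithm applies further clean-up after the ed and ing rules succeed, which is omitted here, and all function names are hypothetical. The conditions are tested on the stem, that is, the word with the suffix removed.

```python
def is_consonant(word: str, i: int) -> bool:
    """Porter's consonant test; 'y' is a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Number of (vc) repetitions in the form [c](vc)^m[v]."""
    pattern = ""
    for i in range(len(stem)):
        kind = "c" if is_consonant(stem, i) else "v"
        if not pattern or pattern[-1] != kind:
            pattern += kind
    return pattern.count("vc")


def contains_vowel(stem: str) -> bool:
    """The *v* condition: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step_1b(word: str) -> str:
    """Apply the three Step 1b rules from the slide (clean-up rules omitted)."""
    if word.endswith("eed"):
        # Longest match: once 'eed' matches, the 'ed' rule is not tried.
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem if contains_vowel(stem) else word
    return word
```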

7

Porter Stemmer: Step 5a

Conditions            Suffix    Replacement    Examples

(m > 1)               e         null           probate -> probat, rate -> rate
(m = 1 and not *o)    e         null           cease -> ceas

*o - the stem ends cvc, where the second c is not w, x or y (e.g. -wil, -hop).
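A sketch of Step 5a under the same assumptions (lower-case input, hypothetical function names). The *o test looks at the last three letters of the stem that remains after the final e is removed.

```python
def is_consonant(word: str, i: int) -> bool:
    """Porter's consonant test; 'y' is a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Number of (vc) repetitions in the form [c](vc)^m[v]."""
    pattern = ""
    for i in range(len(stem)):
        kind = "c" if is_consonant(stem, i) else "v"
        if not pattern or pattern[-1] != kind:
            pattern += kind
    return pattern.count("vc")


def ends_cvc(stem: str) -> bool:
    """The *o condition: stem ends cvc and the final c is not w, x or y."""
    n = len(stem)
    if n < 3:
        return False
    return (
        is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


def step_5a(word: str) -> str:
    """Remove a final 'e' when the stem satisfies one of the two conditions."""
    if not word.endswith("e"):
        return word
    stem = word[:-1]
    m = measure(stem)
    if m > 1 or (m == 1 and not ends_cvc(stem)):
        return stem
    return word
```

In "rate" the stem "rat" has m = 1 and ends cvc, so the e is kept; in "cease" the stem "ceas" has m = 1 but ends vvc, so the e is removed.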

8

Porter Stemmer: Results

Suffix stripping of a vocabulary of 10,000 words

Number of words reduced in
    step 1: 3597
    step 2: 766
    step 3: 327
    step 4: 2424
    step 5: 1373
Number of words not reduced: 3650

The resulting vocabulary of stems contained 6370 distinct entries. Thus the suffix stripping process reduced the size of the vocabulary by about one third.

9

Exact Matching (Boolean Model)

[Diagram: the query and the index database built from the documents feed a mechanism for determining whether a document matches a query. The output is the set of hits.]

10

Boolean Queries

Boolean query: two or more search terms, related by logical operators, e.g.,

and or not

Examples:

abacus and actor

abacus or actor

(abacus and actor) or (abacus and atoll)

not actor

11

Boolean Diagram

[Venn diagram of two sets A and B, with the regions A and B, A or B, and not (A or B) labelled.]

12

Adjacent and Near Operators

abacus adj actor

Terms abacus and actor are adjacent to each other as in the string

"abacus actor"

abacus near 4 actor

Terms abacus and actor are near to each other as in the string

"the actor has an abacus"

Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

13

Evaluation of Boolean Operators

Precedence of operators must be defined:

adj, near high

and, not

or low

Example

A and B or C and B

is evaluated as

(A and B) or (C and B)

14

Evaluating a Boolean Query

To evaluate the and operator, merge the two inverted lists with a logical AND operation.

Example: abacus and actor

Postings for abacus: 3 19 22

Postings for actor: 2 19 29

Document 19 is the only document that contains both terms, "abacus" and "actor".
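The merge can be sketched as a two-pointer walk over the two sorted postings lists. This is a minimal illustration that assumes each list holds document numbers in ascending order; the function name is hypothetical.

```python
def intersect(postings_a: list[int], postings_b: list[int]) -> list[int]:
    """Merge two sorted postings lists with a logical AND."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            # The document appears in both lists.
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result


abacus = [3, 19, 22]
actor = [2, 19, 29]
hits = intersect(abacus, actor)
```

Each list is read once from start to finish, so the cost is proportional to the combined length of the two lists.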

15

Evaluating an Adjacency Operation

Example: abacus adj actor

Postings for abacus (document, location): (3, 94) (19, 7) (19, 212) (22, 56)

Postings for actor (document, location): (2, 66) (19, 213) (29, 45)

Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent.
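With locations stored in the postings, adjacency can be checked during the same kind of merge. The sketch below assumes each postings list is sorted by (document, location) and that adj means the second term immediately follows the first; whether a given system also accepts the reverse order is not stated on the slide. The function name is hypothetical.

```python
def adjacent(first: list[tuple[int, int]], second: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return (document, location) of `first` wherever `second` follows directly."""
    i = j = 0
    hits = []
    while i < len(first) and j < len(second):
        doc, loc = first[i]
        target = (doc, loc + 1)  # where the second term would have to be
        if second[j] == target:
            hits.append((doc, loc))
            i += 1
            j += 1
        elif second[j] < target:
            j += 1
        else:
            i += 1
    return hits


abacus = [(3, 94), (19, 7), (19, 212), (22, 56)]
actor = [(2, 66), (19, 213), (29, 45)]
matches = adjacent(abacus, actor)
```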

16

Query Matching: Boolean Methods

Query: (abacus or asp*) and actor

1. From the index file (word list), find the postings file for:

• "abacus"
• every word that begins "asp"
• "actor"

2. Merge these posting lists. For each document that occurs in any of the postings lists, evaluate the Boolean expression to see if it is true or false.

Step 2 should be carried out in a single pass.
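One way to carry out the two steps for the query (abacus or asp*) and actor is sketched below. The in-memory index, the function names and the hard-coded query are assumptions made for illustration; the postings are reduced to document numbers. The sorted lists are merged with `heapq.merge`, and the Boolean expression is evaluated once per document as the merged stream goes by.

```python
import heapq
from itertools import groupby

# Toy index: term -> sorted list of document numbers (illustrative data).
INDEX = {
    "abacus": [3, 19, 22],
    "actor": [2, 19, 29],
    "aspen": [5],
    "atoll": [11, 34],
}


def expand_prefix(index: dict[str, list[int]], prefix: str) -> list[str]:
    """Step 1 helper: every indexed word that begins with `prefix`."""
    return sorted(term for term in index if term.startswith(prefix))


def evaluate_query(index: dict[str, list[int]]) -> list[int]:
    """Evaluate (abacus or asp*) and actor in one pass over merged postings."""
    asp_terms = expand_prefix(index, "asp")
    terms = ["abacus", "actor"] + asp_terms
    # Tag each posting with its term so the merged stream stays identifiable.
    streams = [[(doc, term) for doc in index.get(term, [])] for term in terms]
    hits = []
    for doc, group in groupby(heapq.merge(*streams), key=lambda pair: pair[0]):
        present = {term for _, term in group}
        has_asp = any(term in present for term in asp_terms)
        if ("abacus" in present or has_asp) and "actor" in present:
            hits.append(doc)
    return hits


result = evaluate_query(INDEX)
```

A real system would parse the query into an expression tree rather than hard-code it; the point here is only that every candidate document is visited once.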

17

Use of Postings File for Query Matching

Entry    Term      Postings (document, location)

1        abacus    (3, 94) (19, 7) (19, 212) (22, 56)
2        actor     (2, 66) (19, 213) (29, 45)
3        aspen     (5, 43)
4        atoll     (11, 3) (11, 70) (34, 40)

18

Query Matching: Vector Ranking Methods

Query: abacus asp*

1. From the index file (word list), find the postings file for:

• "abacus"
• every word that begins "asp"

2. Merge these posting lists. Calculate the similarity to the query for each document that occurs in any of the postings lists.

3. Sort the similarities to obtain the results in ranked order.

Steps 2 and 3 should be carried out in a single pass.

19

Contrast of Ranking with Matching

With matching, a document either matches a query exactly or not at all

• Encourages short queries
• Requires precise choice of index terms
• Requires precise formulation of queries (professional training)

With retrieval using similarity measures, similarities range from 0 to 1 for all documents

• Encourages long queries, to have as many dimensions as possible
• Benefits from large numbers of index terms
• Benefits from queries with many terms, not all of which need match the document

20

Problems with the Boolean model

Counter-intuitive results:

Query q = a and b and c and d and e
Document d has terms a, b, c and d, but not e

Intuitively, d is quite a good match for q, but it is rejected by the Boolean model.

Query q = a or b or c or d or e
Document d1 has terms a, b, c, d, and e
Document d2 has term a, but not b, c, d or e

Intuitively, d1 is a much better match than d2, but the Boolean model ranks them as equal.
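Both cases can be reproduced in a few lines. The sketch treats each document as a set of terms; the function and variable names are chosen for illustration.

```python
def matches_and(query_terms: list[str], doc_terms: set[str]) -> bool:
    """Boolean AND query: every query term must be present."""
    return all(term in doc_terms for term in query_terms)


def matches_or(query_terms: list[str], doc_terms: set[str]) -> bool:
    """Boolean OR query: at least one query term must be present."""
    return any(term in doc_terms for term in query_terms)


query = ["a", "b", "c", "d", "e"]

d = {"a", "b", "c", "d"}          # four of the five terms
d1 = {"a", "b", "c", "d", "e"}    # all five terms
d2 = {"a"}                        # one term only

rejected = not matches_and(query, d)                      # d is rejected outright
tied = matches_or(query, d1) == matches_or(query, d2)     # d1 and d2 look identical
```

The functions return only true or false, so there is nothing in the result that could distinguish d1 from d2 or give d partial credit.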

21

Problems with the Boolean model (continued)

Boolean is all or nothing

• Boolean model has no way to rank documents.

• Boolean model allows for no uncertainty in assigning index terms to documents.

• The Boolean model has no provision for adjusting the importance of query terms.

22

Extending the Boolean model

Term weighting

• Give weights to terms in documents and/or queries.

• Combine standard Boolean retrieval with vector ranking of results

Fuzzy sets

• Relax the boundaries of the sets used in Boolean retrieval

23

Ranking methods in Boolean systems

SIRE (Syracuse Information Retrieval Experiment)

Term weights

• Add term weights to documents

Weights calculated by the standard method of

term frequency * inverse document frequency.

Ranking

• Calculate results set by standard Boolean methods

• Rank results by vector distances
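The two stages might look like the sketch below. The slide does not give SIRE's exact formulas, so several choices here are assumptions: the weight is tf * log(N / df), the vector distance is cosine similarity, the query vector has weight 1 for each query term, and the Boolean stage is a plain and of the query terms. The collection and all names are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection: document number -> text.
DOCS = {
    1: "abacus actor actor stage",
    2: "abacus abacus abacus actor",
    3: "actor stage stage",
    4: "abacus atoll",
}


def tfidf_vectors(docs: dict[int, str]) -> dict[int, dict[str, float]]:
    """Weight each term by term frequency * log(N / document frequency)."""
    counts = {doc: Counter(text.split()) for doc, text in docs.items()}
    n = len(docs)
    df = Counter(term for c in counts.values() for term in c)
    return {
        doc: {term: tf * math.log(n / df[term]) for term, tf in c.items()}
        for doc, c in counts.items()
    }


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def boolean_then_rank(docs: dict[int, str], query_terms: list[str]) -> list[int]:
    """Stage 1: Boolean AND of the terms. Stage 2: rank the hits by cosine."""
    vectors = tfidf_vectors(docs)
    hits = [d for d, vec in vectors.items() if all(t in vec for t in query_terms)]
    query_vector = {term: 1.0 for term in query_terms}
    return sorted(hits, key=lambda d: cosine(query_vector, vectors[d]), reverse=True)


ranked = boolean_then_rank(DOCS, ["abacus", "actor"])
```

Documents 3 and 4 fail the Boolean stage; the remaining two are ordered by similarity instead of being returned as an unordered set.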

24

Relevance feedback in SIRE

SIRE (Syracuse Information Retrieval Experiment)

Relevance feedback is particularly important with Boolean retrieval because it allows the results set to be expanded.

• Results set is created by standard Boolean retrieval

• User selects one document from results set

• Other documents in collection are ranked by vector distance from this document

[Relevance feedback will be covered in a later lecture.]

25

Boolean model as sets

[Diagram: a set A drawn with a sharp boundary, and a document d.]

d is either in the set A or not in A.

26

Boolean model as fuzzy sets

[Diagram: a set A drawn with a blurred boundary, and a document d.]

d is more or less in A.

27

Fuzzy Sets: Basic concept

• A document has a term weight associated with each index term. The term weight measures the degree to which that term characterizes the document.

• Term weights are in the range [0, 1]. (In the standard Boolean model all weights are either 0 or 1.)

• For a given query, calculate the similarity between the query and each document in the collection.

• This calculation is needed for every document that has a non-zero weight for any of the terms in the query.

28

Fuzzy Sets

Fuzzy set theory

$d_A$ is the degree of membership of an element to set $A$

intersection (and)

$d_{A \cap B} = \min(d_A, d_B)$

union (or)

$d_{A \cup B} = \max(d_A, d_B)$

29

Fuzzy Sets

Fuzzy set theory example

              standard set theory    fuzzy set theory

dA            1    1    0    0       0.5   0.5   0     0
dB            1    0    1    0       0.7   0     0.7   0
d(A and B)    1    0    0    0       0.5   0     0     0
d(A or B)     1    1    1    0       0.7   0.5   0.7   0
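The rows of the example follow directly from the min and max rules. A short sketch, with names chosen for illustration, that recomputes both result rows from the membership values:

```python
def fuzzy_and(d_a: float, d_b: float) -> float:
    """Degree of membership in the intersection: the smaller of the two."""
    return min(d_a, d_b)


def fuzzy_or(d_a: float, d_b: float) -> float:
    """Degree of membership in the union: the larger of the two."""
    return max(d_a, d_b)


# (dA, dB) pairs: four standard set theory cases, then four fuzzy cases.
pairs = [(1, 1), (1, 0), (0, 1), (0, 0), (0.5, 0.7), (0.5, 0), (0, 0.7), (0, 0)]

and_row = [fuzzy_and(a, b) for a, b in pairs]
or_row = [fuzzy_or(a, b) for a, b in pairs]
```

With weights restricted to 0 and 1, min and max reduce to the ordinary Boolean and and or, which is why the first four columns match standard set theory.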

30

MMM: Mixed Min and Max model

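Terms: $a_1, a_2, \ldots, a_n$

Document: $d$, with index-term weights: $d_1, d_2, \ldots, d_n$

$q_{or} = (a_1 \text{ or } a_2 \text{ or } \ldots \text{ or } a_n)$

Query-document similarity:

$S(q_{or}, d) = \lambda_{or} \max(d_1, d_2, \ldots, d_n) + (1 - \lambda_{or}) \min(d_1, d_2, \ldots, d_n)$

where $\lambda_{or}$ is the coefficient of the or operator.

With regular Boolean logic, all $d_i$ = 1 or 0, and $\lambda_{or} = 1$.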

31

MMM: Mixed Min and Max model

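Terms: $a_1, a_2, \ldots, a_n$

Document: $d$, with index-term weights: $d_1, d_2, \ldots, d_n$

$q_{and} = (a_1 \text{ and } a_2 \text{ and } \ldots \text{ and } a_n)$

Query-document similarity:

$S(q_{and}, d) = \lambda_{and} \min(d_1, \ldots, d_n) + (1 - \lambda_{and}) \max(d_1, \ldots, d_n)$

where $\lambda_{and}$ is the coefficient of the and operator.

With regular Boolean logic, all $d_i$ = 1 or 0, and $\lambda_{and} = 1$.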
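The two MMM similarities are simple enough to write out directly. In this sketch the operator coefficients are passed as `lam_or` and `lam_and`; these parameter names, and the function names, are chosen for illustration.

```python
def mmm_or(weights: list[float], lam_or: float) -> float:
    """Similarity of a document to an or-query: mostly max, softened by min."""
    return lam_or * max(weights) + (1 - lam_or) * min(weights)


def mmm_and(weights: list[float], lam_and: float) -> float:
    """Similarity of a document to an and-query: mostly min, softened by max."""
    return lam_and * min(weights) + (1 - lam_and) * max(weights)


# Document with weights 0.2 and 0.9 for the two query terms.
doc_weights = [0.2, 0.9]
s_and = mmm_and(doc_weights, lam_and=0.7)   # 0.7 * 0.2 + 0.3 * 0.9
s_or = mmm_or(doc_weights, lam_or=0.7)      # 0.7 * 0.9 + 0.3 * 0.2
```

Setting the coefficient to 1 with 0/1 weights recovers the regular Boolean result: `mmm_and([1, 0, 1], 1.0)` is 0 and `mmm_or([1, 0, 1], 1.0)` is 1.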

32

MMM: Mixed Min and Max model

Experimental values:

all $d_i$ = 1 or 0

$\lambda_{and}$ in range [0.5, 0.8]

$\lambda_{or}$ > 0.2

Computational cost is low. Retrieval performance much improved.

33

Other Models

Paice model

The MMM model considers only the maximum and minimum document weights. The Paice model takes into account all of the document weights. Computational cost is higher than MMM.

P-norm model

Document $d$, with term weights: $d_{A_1}, d_{A_2}, \ldots, d_{A_n}$

Query terms are given weights, $a_1, a_2, \ldots, a_n$

Operators have coefficients that indicate degree of strictness

Query-document similarity is calculated by considering each document and query as a point in n space.
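The slide does not give the P-norm formulas. The sketch below uses the commonly cited form from the extended Boolean retrieval literature, included here as an assumption: an or-query measures the normalized distance of the document from the all-zeros point, and an and-query measures one minus its distance from the all-ones point. The parameter `p` plays the role of the strictness coefficient; all names are hypothetical.

```python
def p_norm_or(doc_weights: list[float], query_weights: list[float], p: float) -> float:
    """Or-query similarity: normalized p-norm distance from the origin."""
    num = sum((a * d) ** p for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return (num / den) ** (1.0 / p)


def p_norm_and(doc_weights: list[float], query_weights: list[float], p: float) -> float:
    """And-query similarity: one minus the p-norm distance from the all-ones point."""
    num = sum((a * (1.0 - d)) ** p for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return 1.0 - (num / den) ** (1.0 / p)


doc = [0.5, 0.7]
query = [1.0, 1.0]

loose_or = p_norm_or(doc, query, p=1)     # p = 1: plain average of the weights
loose_and = p_norm_and(doc, query, p=1)   # p = 1: and and or coincide
strict_or = p_norm_or(doc, query, p=50)   # large p: approaches max(doc)
strict_and = p_norm_and(doc, query, p=50) # large p: approaches min(doc)
```

At p = 1 the operators lose their distinction and behave like a vector-style average; as p grows they approach the max and min of fuzzy set theory.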

34

Test data

          CISI    CACM    INSPEC

P-norm    79      106     210
Paice     77      104     206
MMM       68      109     195

Percentage improvement over standard Boolean model (average best precision)

Lee and Fox, 1988

35

Reading

E. Fox, S. Betrabet, M. Koushik, W. Lee, Extended Boolean Models, Frakes, Chapter 15

Methods based on fuzzy set concepts