www.monash.edu.au CSE3201/CSE4500 Information Retrieval Systems Signature Based Text Retrieval Systems

Jan 12, 2016

Jennifer Melton

Signature File for Text Retrieval

• A “signature” is created as an abstraction of a document.

• All the signatures that represent the documents in the collection are kept in a file called “signature file”.

Word Signature (WS)

• A word signature
– is a fixed-length bit string that represents a word.
– is described by
> the length (N)
> the number of bits set to 1 (k)

• Example (N = 24, k = 7):

1 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0

Word Signature Generation

• Use a hash function to find the position of the bit(s) to be set to 1.

• Use triplets of characters to generate the word signature:

– Divide the word into overlapping triplets.

– For each triplet of characters:
> Convert the characters to a numeric value (for example, the ASCII code of each character).
> Use the number as the input to the hash function.
> The hash function produces a number that represents the bit position of the triplet in the word signature.
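
The triplet-splitting step can be sketched in Python as follows. This is only an illustration: the function name `overlapping_triplets` is hypothetical, and padding the word with "-" on each side is an assumption taken from the "-si ... re-" example in the simplified example slide.

```python
def overlapping_triplets(word, pad="-"):
    """Split a word into overlapping character triplets.

    The word is padded with one boundary marker on each side (an assumption),
    then a 3-character window slides over it one character at a time.
    """
    padded = pad + word + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


print(overlapping_triplets("signature"))
# ['-si', 'sig', 'ign', 'gna', 'nat', 'atu', 'tur', 'ure', 're-']
```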

Signature Generator Algorithm

Set hash_value to 0

for each character in the triplet do

    hash_value := (hash_value * 137 + ASCII value of character) mod 256

K values
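
A minimal Python sketch of this algorithm is given below. The recurrence is the one on the slide; everything else is an assumption made for illustration: the function names, the "-" padding of the word, and the reduction of the 0..255 hash value modulo N to obtain a bit position (the slide does not say how a value up to 255 maps onto an N-bit signature).

```python
def triplet_hash(triplet):
    """Hash one triplet with the recurrence from the slide (result in 0..255)."""
    hash_value = 0
    for character in triplet:
        hash_value = (hash_value * 137 + ord(character)) % 256
    return hash_value


def word_signature(word, n_bits=24, pad="-"):
    """Build an N-bit word signature as a list of 0/1 values.

    Assumptions (not stated on the slide): the word is padded with "-" on
    both sides, and the hash value is reduced modulo N to get a bit position.
    """
    padded = pad + word + pad
    signature = [0] * n_bits
    for i in range(len(padded) - 2):
        position = triplet_hash(padded[i:i + 3]) % n_bits
        signature[position] = 1
    return signature


print(triplet_hash("sig"))
print(word_signature("signature"))
```

Because two triplets can hash to the same position, the number of bits actually set can be smaller than the number of triplets.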

Word Signature Generation – simplified example

• Example:

– A signature 111000111001 is generated for the word “signature”.

• The position is read from left to right

Word: signature

Triplets:                                              -si  sig  ign  gna  nat  atu  tur  ure  re-
Hash function output (position of the bit set to 1):    1    2    7    3    2    3    9   12    8

Word signature: 1 1 1 0 0 0 1 1 1 0 0 1
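
The step from bit positions to the final signature can be sketched as follows. The position list is an assumed reading of the hash outputs shown on the slide (one value per triplet); it is consistent with the signature 111000111001 stated above. The function name is hypothetical.

```python
def signature_from_positions(positions, n_bits):
    """Set the bit at each 1-based position, counting from the left."""
    bits = ["0"] * n_bits
    for position in positions:
        bits[position - 1] = "1"  # a repeated position sets the same bit again
    return "".join(bits)


# Assumed reading of the hash outputs for -si sig ign gna nat atu tur ure re-
positions = [1, 2, 7, 3, 2, 3, 9, 12, 8]
print(signature_from_positions(positions, 12))  # 111000111001
```

Nine triplets set only seven distinct bits here because positions 2 and 3 each occur twice.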

Document Signature (DS)

• A document signature can be created using two methods:
– concatenation of word signatures.
– superimposed coding.

Document Signature – Concatenation of WS

• The length of a document signature (DS) can vary.
• A fixed number of bits may precede the document signature (DS) to indicate its length.
• It is also possible to fix the length of the document signature (DS).
– The length can be set to match the longest document in the collection.
– Shorter documents are padded with extra “0” bits.
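
A small sketch of the fixed-length variant, assuming word signatures are equal-length bit strings. The function name and the `max_words` parameter (the word count of the longest document in the collection) are hypothetical; the two sample signatures are the ones used for "free" and "text" in the superimposed coding slides.

```python
def concatenated_signature(word_signatures, max_words):
    """Concatenate word signatures and zero-pad to a fixed length.

    `word_signatures` is a non-empty list of equal-length bit strings;
    `max_words` is the word count of the longest document (hypothetical).
    """
    width = len(word_signatures[0])
    signature = "".join(word_signatures)
    return signature.ljust(max_words * width, "0")


doc = ["001000110010", "000010101001"]  # "free", "text"
print(concatenated_signature(doc, max_words=3))
```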

Document Signature – Superimposed Coding

• Each document is divided into blocks containing a constant number of distinct words.

• To create a block signature, perform an OR operation on the signatures of all the words in the block.

free              001 000 110 010
text              000 010 101 001
Block signature   001 010 111 011
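
The OR step can be sketched in Python as below, using the "free" and "text" signatures from the slide. Representing signatures as bit strings and the function name `block_signature` are choices made for this illustration.

```python
from functools import reduce


def block_signature(word_signatures):
    """Superimpose (bitwise OR) equal-length bit-string word signatures."""
    width = len(word_signatures[0])
    combined = reduce(lambda a, b: a | b, (int(s, 2) for s in word_signatures))
    return format(combined, "0{}b".format(width))


free = "001000110010"
text = "000010101001"
print(block_signature([free, text]))  # 001010111011
```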

Document Signature – Superimposed Coding

• To create the document signature, all the block signatures are superimposed.

Query Signature

• A query is converted to a block signature in the same way as a document.

• Example:

free          0 0 1 0 0 0 1 1 0 0 1 0
text          0 0 0 0 1 0 1 0 1 0 0 1
Block/Query   0 0 1 0 1 0 1 1 1 0 1 1

Matching the Query and Document Signature

• Premise:
– The positions of the bits set to 1 represent the existence of particular words in the query or document.

• A relevant document is a document whose signature has bits set to 1 at every position where the query’s signature has a bit set to 1.

• The relevant document’s signature does not have to be an exact match of the query’s signature.

• Example:
– Query: 0100
– Matching document signatures: 1111, 0111, 0110, 0100.
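
The matching rule can be sketched as a single bitwise test. The function name `matches` is hypothetical, and the last signature in the loop (1011) is an extra non-matching case added here for contrast; it is not on the slide.

```python
def matches(query, document):
    """True if every 1 bit of the query is also 1 in the document signature."""
    q, d = int(query, 2), int(document, 2)
    return (q & d) == q


query = "0100"
for doc in ["1111", "0111", "0110", "0100", "1011"]:
    print(doc, matches(query, doc))
```

Only the last signature fails, because its second bit is 0.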

Query on Signature File

Query: 001 010 111 011

Block signature            Match?
0 0 1 0 0 0 1 1 1 0 1 1    No
0 0 1 1 1 1 1 1 1 0 1 1    Yes
0 0 1 0 1 0 1 0 1 0 1 1    No
0 0 1 0 1 0 1 1 1 0 1 0    No
1 1 1 0 1 0 1 1 1 0 1 1    Yes
0 0 1 1 0 0 1 1 1 0 1 1    No
0 0 1 0 1 0 1 1 1 1 1 1    Yes

Match? Perform an AND operation between the query and the block signature; if (result - query) = 0, they match.
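
A sequential scan over the seven signatures on this slide can be sketched as follows. The constant and function names are hypothetical; the test `(signature AND query) == query` is equivalent to the slide's "(result - query) = 0".

```python
QUERY = "001010111011"
SIGNATURE_FILE = [
    "001000111011",
    "001111111011",
    "001010101011",
    "001010111010",
    "111010111011",
    "001100111011",
    "001010111111",
]


def sequential_search(query, signatures):
    """Return the 1-based indices of the signatures that cover the query."""
    q = int(query, 2)
    return [i for i, s in enumerate(signatures, start=1) if (int(s, 2) & q) == q]


print(sequential_search(QUERY, SIGNATURE_FILE))  # [2, 5, 7]
```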

Signature File Structure

• Sequential
– During searching, each signature is compared with the query signature.
– Time consuming because:
> Memory size is limited, so the signatures cannot all be loaded into memory at once.
> Searching may therefore require many I/O operations.

• We need a structure for the signature file that minimises the number of I/O operations.

• Bit-Sliced Signature
– At most N records (N being the size of the signature) need to be retrieved.

Matrix Transposed

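$$
\begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{pmatrix}^{T}
=
\begin{pmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \\ x_{13} & x_{23} \end{pmatrix},
\qquad x_{ij} \rightarrow x_{ji}
$$

$$
\begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix}^{T}
=
\begin{pmatrix} a & d \\ b & e \\ c & f \end{pmatrix}
$$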
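
The same transposition in Python, as a brief sketch for a matrix stored as a list of rows (the function name is hypothetical):

```python
def transpose(matrix):
    """Return the transpose of a list-of-rows matrix: x[i][j] -> x[j][i]."""
    return [list(column) for column in zip(*matrix)]


print(transpose([["a", "b", "c"], ["d", "e", "f"]]))
# [['a', 'd'], ['b', 'e'], ['c', 'f']]
```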

Bit-Sliced

Sequential (one record of N bits per document):

d1   0 0 1 0 0 0 1 1 1 0 1 1
d2   0 0 1 1 1 1 1 1 1 0 1 1
d3   0 0 1 0 1 0 1 0 1 0 1 1
d4   0 0 1 0 1 0 1 1 1 0 1 0
...
dn

Bit sliced (N records, one per bit position):

Bit   d1 d2 d3 d4 ... dn
 1    0  0  0  0
 2    0  0  0  0
 3    1  1  1  1
 4    0  1  0  0
 5    0  1  1  1
 6    0  1  0  0
 7    1  1  1  1
 8    1  1  0  1
 9    1  1  1  1
10    0  0  0  0
11    1  1  1  1
12    1  1  1  0

Query: 001 010 111 011
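
Building the bit-sliced file is a transposition of the sequential file. A minimal sketch, using the four document signatures d1 to d4 from this slide (the function name is hypothetical):

```python
def bit_slices(signatures):
    """Transpose document signatures into one record (slice) per bit position."""
    return ["".join(column) for column in zip(*signatures)]


documents = ["001000111011", "001111111011", "001010101011", "001010111010"]  # d1..d4
for position, record in enumerate(bit_slices(documents), start=1):
    print(position, record)
```

Each printed record holds the bit at that position for d1, d2, d3 and d4, in that order.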

Bit Sliced Signature File

• Retrieval
– If the ith bit in the query signature is set to 1, retrieve the ith signature block/record.
– If n bits are set to 1 in the query, only n records need to be retrieved.
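
The retrieval rule can be sketched as follows: read only the slices at the query's 1 positions and AND them together; a document matches if its column is 1 in every retrieved slice. The slice values are the ones shown in the bit-sliced slides; the function and variable names are hypothetical.

```python
def bit_sliced_search(query, slices):
    """Return (matching document numbers, number of slice records read)."""
    needed = [i for i, bit in enumerate(query) if bit == "1"]
    n_docs = len(slices[0])
    result = (1 << n_docs) - 1          # start with every document as a candidate
    for i in needed:                    # read only the slices the query needs
        result &= int(slices[i], 2)
    bits = format(result, "0{}b".format(n_docs))
    matching = [d for d, bit in enumerate(bits, start=1) if bit == "1"]
    return matching, len(needed)


SLICES = ["0000", "0000", "1111", "0100", "0111", "0100",
          "1111", "1101", "1111", "0000", "1111", "1110"]
print(bit_sliced_search("001010111011", SLICES))  # ([2], 7)
```

The query has seven bits set to 1, so seven of the twelve records are read, and only the second document survives the AND.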

Bit Sliced Signature File

Query: 001 010 111 011

Bit   Query   Record     Retrieved?
 1    0       0 0 0 0
 2    0       0 0 0 0
 3    1       1 1 1 1    Yes
 4    0       0 1 0 0
 5    1       0 1 1 1    Yes
 6    0       0 1 0 0
 7    1       1 1 1 1    Yes
 8    1       1 1 0 1    Yes
 9    1       1 1 1 1    Yes
10    0       0 0 0 0
11    1       1 1 1 1    Yes
12    1       1 1 1 0    Yes

Match: the 2nd block, because all bits in its column of the retrieved records are set to 1.

Bit Sliced Signature File

• Advantages:
– A smaller number of records is retrieved -> faster retrieval.
• Disadvantages:
– An update operation becomes very costly.
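
One way to see the update cost is to count the records touched when a document is added. The sketch below is a simplified in-memory model (lists of bit strings, hypothetical function names) that ignores the real disk layout: a sequential file appends one record, while a bit-sliced file must extend every one of its N slice records.

```python
def add_document_sequential(signature_file, signature):
    """Append one record; returns the number of records written."""
    signature_file.append(signature)
    return 1


def add_document_bit_sliced(slices, signature):
    """Append one bit to every slice; returns the number of records rewritten."""
    for i, bit in enumerate(signature):
        slices[i] += bit
    return len(slices)


sequential = ["001000111011", "001111111011", "001010101011", "001010111010"]
sliced = ["0000", "0000", "1111", "0100", "0111", "0100",
          "1111", "1101", "1111", "0000", "1111", "1110"]
new_doc = "111010111011"
print(add_document_sequential(sequential, new_doc))  # 1
print(add_document_bit_sliced(sliced, new_doc))      # 12
```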

False Drop

• False drop occurs when a document’s signature matches a query’s signature but the query’s word does not match any word in the document.

• It is possible because two distinct blocks may have the same signature due to:
– the hashing algorithm
– superimposed coding
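
A false drop caused by superimposed coding can be illustrated with a small sketch. The "free" and "text" signatures are taken from the earlier slides; the word "data" and its signature are hypothetical, chosen so that its 1 bits are all covered by the block signature even though the word is not in the block.

```python
BLOCK_WORDS = {"free": "001000110010", "text": "000010101001"}  # from the slides
OTHER_WORD = ("data", "001010000001")  # hypothetical signature, for illustration only


def superimpose(signatures):
    """Bitwise OR of bit-string signatures, returned as an integer."""
    combined = 0
    for s in signatures:
        combined |= int(s, 2)
    return combined


def is_false_drop(query_word, query_signature, block_words):
    """True when the signature test passes but the word is not in the block."""
    block = superimpose(block_words.values())
    q = int(query_signature, 2)
    signature_match = (block & q) == q
    return signature_match and query_word not in block_words


print(is_false_drop(*OTHER_WORD, BLOCK_WORDS))  # True
```

This is why documents returned by a signature match still have to be checked against their actual text.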

Relation Between the Signature Properties and False Drop

• The false drop rate depends on:
– The size of the signature (N bits)
> An increase in N decreases the false drop rate.
– The number of bits set to 1 (k bits)
> An increase in k, up to a certain level, decreases the false drop rate.
– The number of unique words per block
> A decrease in the number of unique words per block decreases the false drop rate.
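
These three trends can be checked numerically with a rough model. The sketch below is not a formula from the slides: it uses the usual Bloom-filter-style approximation and assumes each word sets k bit positions chosen independently and uniformly at random, with a single-word query. The function and parameter names are hypothetical.

```python
def false_drop_probability(n_bits, k, words_per_block):
    """Approximate false drop probability for a single-word query.

    Assumes each word sets k positions chosen independently and uniformly
    at random (an approximation, not taken from the slides).
    """
    p_bit_set = 1.0 - (1.0 - 1.0 / n_bits) ** (k * words_per_block)
    return p_bit_set ** k


# Larger N: lower false drop rate
print(false_drop_probability(256, 4, 20), false_drop_probability(512, 4, 20))
# Larger k helps only up to a point, then the signature saturates
print([false_drop_probability(256, k, 20) for k in (1, 8, 64)])
# Fewer unique words per block: lower false drop rate
print(false_drop_probability(256, 4, 20), false_drop_probability(256, 4, 10))
```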