Signature Files. Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992. (Chapter 4)
Signature Files

Feb 01, 2016

mieko

Transcript
Page 1: Signature Files

Signature Files

Information Retrieval: Data Structures and Algorithms

by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992.

(Chapter 4)

Page 2: Signature Files

Signature Files

Characteristics

» Word-oriented index structures based on hashing

» Low overhead (10% to 20% over the text size) at the cost of forcing a sequential search over the index

» Suitable for texts that are not very large

» Inverted files outperform signature files for most applications

Page 3: Signature Files

Structure

Use superimposed coding to create the signature.

» Each text is divided into logical blocks.

» A block contains n distinct non-common words.

» Each word yields a “word signature”: a B-bit pattern with m bits set to 1.

» Each word is divided into successive, overlapping triplets, e.g. free --> “ fr”, “fre”, “ree”, “ee ” (the first and last triplets include the blank that delimits the word).

» Each such triplet is hashed to a bit position.

» The word signatures are OR’ed to form the block signature.

» Block signatures are concatenated to form the document signature.
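The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's implementation: the hash function (`zlib.crc32`) and the helper names are assumptions, and this simple version sets at most one bit per triplet rather than enforcing exactly m bits per word.

```python
import zlib

def triplets(word):
    """Successive, overlapping triplets of the blank-delimited word."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_signature(word, B):
    """Hash each triplet to one of B bit positions (crc32 is an assumed hash)."""
    sig = 0
    for t in triplets(word):
        pos = zlib.crc32(t.encode("utf-8")) % B
        sig |= 1 << pos
    return sig

def block_signature(words, B):
    """OR the word signatures of a logical block together."""
    sig = 0
    for w in words:
        sig |= word_signature(w, B)
    return sig

print(triplets("free"))  # [' fr', 'fre', 'ree', 'ee ']
print(format(block_signature(["free", "text"], 12), "012b"))
```

A document signature would then be the concatenation of its block signatures; the bit patterns printed here depend on the assumed hash and will not match the slide's example.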

Page 4: Signature Files

Example

Example (n=2, B=12, m=4)

word              signature
free              001 000 110 010
text              000 010 101 001
block signature   001 010 111 011

Search

» Use the hash function to determine the m bit positions that the search word sets to 1.

» Examine each block signature: a block qualifies if it has a 1 in every bit position where the signature of the search word has a 1.
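The search test can be checked against the example signatures with a bitwise AND. In this small sketch the function names and the two extra query signatures (`other`, `absent`) are hypothetical, chosen only to show a false match and a rejection.

```python
def parse_bits(text):
    """Turn a spaced bit string such as '001 000 110 010' into an integer."""
    return int(text.replace(" ", ""), 2)

def qualifies(block_sig, word_sig):
    """A block qualifies if every 1 of the word signature is also set in the block."""
    return (block_sig & word_sig) == word_sig

free = parse_bits("001 000 110 010")
text = parse_bits("000 010 101 001")
block = free | text

print(format(block, "012b"))     # 001010111011
print(qualifies(block, free))    # True

# Hypothetical signature of a word that is NOT in the block but whose
# bits are all covered by the block signature.
other = parse_bits("001 010 001 001")
print(qualifies(block, other))   # True

# Hypothetical signature with a bit (the first one) that the block lacks.
absent = parse_bits("100 000 110 010")
print(qualifies(block, absent))  # False
```

The `other` case is exactly the situation in which the signature seems to qualify although the block does not contain the word.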

Page 5: Signature Files

False Drop

False drop (also called false alarm or false hit) probability, Fd:

the probability that a block signature seems to qualify, given that the block does not actually qualify.

Fd = Prob{signature qualifies | block does not}

For a given value of B, the value of m that minimizes the false drop probability is such that each row of the matrix contains “1”s with probability 0.5. In that case:

Fd = 2^{-m}

m = B ln 2 / n
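As a quick worked check, the two formulas can be evaluated for the parameters of the earlier example (B=12, n=2). The function names are illustrative only.

```python
import math

def optimal_m(B, n):
    """m = B ln 2 / n, the value that makes each bit a 1 with probability 0.5."""
    return B * math.log(2) / n

def false_drop(m):
    """Fd = 2^(-m) at the optimal setting."""
    return 2.0 ** (-m)

m = optimal_m(12, 2)
print(round(m, 2))           # 4.16
print(false_drop(round(m)))  # 0.0625
```

Rounding m to 4 agrees with the m=4 used in the example, giving a false drop probability of 1/16.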

Page 6: Signature Files

Sequential Signature File (SSF)

» Assume that documents span exactly one logical block.

» The size of the document signature F = the size of the block signature B.

[Figure: signature matrix with one row per document]

Page 7: Signature Files

Classification of Signature-Based Methods

» Compression: if the signature matrix is deliberately sparse, it can be compressed.

» Vertical partitioning: storing the signature matrix column-wise improves the response time at the expense of insertion time.

» Horizontal partitioning: grouping similar signatures together and/or providing an index on the signature matrix may result in better-than-linear search.

Page 8: Signature Files

Classification of Signature-Based Methods

Sequential storage of the signature matrix

» without compression

– sequential signature files (SSF)

» with compression

– bit-block compression (BC)

– variable bit-block compression (VBC)

Vertical partitioning

» without compression

– bit-sliced signature files (BSSF, B’SSF)

– frame-sliced (FSSF)

– generalized frame-sliced (GFSSF)

Page 9: Signature Files

Classification of Signature-Based Methods (Continued)

Vertical partitioning (continued)

» with compression

– compressed bit slices (CBS)

– doubly compressed bit slices (DCBS)

– no-false-drop method (NFD)

Horizontal partitioning

» data independent partitioning

– Gustafson’s method

– partitioned signature files

» data dependent partitioning

– 2-level signature files

– S-trees

Page 10: Signature Files

Criteria

» the storage overhead

» the response time on single-word queries

» the performance on insertion, as well as whether the insertion maintains the “append-only” property

Page 11: Signature Files

Compression

Idea

» Create sparse document signatures on purpose.

» Compress them before storing them sequentially.

Method

» Use a B-bit vector, where B is large.

» Hash each word into one (or k) bit position(s).

» Use run-length encoding (McIlroy 1982).

Page 12: Signature Files

Compression using run-length encoding

word              signature
data              0000 0000 0000 0010 0000
base              0000 0001 0000 0000 0000
management        0000 1000 0000 0000 0000
system            0000 0000 0000 0000 1000
block signature   0000 1001 0000 0010 1000

The 1s split the block signature into the intervals (runs of 0s) L1, L2, L3, L4, L5, and the signature is stored as

[L1] [L2] [L3] [L4] [L5]

where [x] is the encoded value of x.

Search: decode the encoded lengths of all the preceding intervals.

Example: search “data”

(1) data ==> 0000 0000 0000 0010 0000

(2) decode [L1]=0000, decode [L2]=00, decode [L3]=000000

Disadvantage: search becomes slow.
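The interval decoding in this example can be illustrated as follows. This is only a sketch: the interval lengths are kept as plain integers, because the slide does not say which code is used for [x], and the function names are assumptions.

```python
def run_lengths(bits):
    """Lengths of the runs of 0s delimited by the 1s (the last run is trailing 0s)."""
    lengths, run = [], 0
    for b in bits:
        if b == "1":
            lengths.append(run)
            run = 0
        else:
            run += 1
    lengths.append(run)
    return lengths

def contains_bit(lengths, pos):
    """Decode intervals in order until bit position pos (0-based) is reached or passed."""
    cursor = -1
    for length in lengths[:-1]:  # the trailing run is not followed by a 1
        cursor += length + 1     # position of the 1 that ends this interval
        if cursor == pos:
            return True
        if cursor > pos:
            return False
    return False

block = "0000 1001 0000 0010 1000".replace(" ", "")
lengths = run_lengths(block)
print(lengths)                    # [4, 2, 6, 1, 3]
print(contains_bit(lengths, 14))  # True: the bit set by "data"
print(contains_bit(lengths, 15))  # False
```

Finding the bit of “data” requires walking through L1, L2, and L3 first, which is why search slows down as signatures grow.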

Page 13: Signature Files

Bit-block Compression (BC)

Data Structure:

(1) The sparse vector is divided into groups of consecutive bits (bit-blocks).

(2) Each bit-block is encoded individually.

Algorithm:

Part I. It is one bit long, and it indicates whether there are any “1”s in the bit-block (1) or the bit-block is empty (0). In the latter case, the bit-block signature stops here.

    0000 1001 0000 0010 1000
    0    1    0    1    1

Part II. It indicates the number s of “1”s in the bit-block. It consists of s-1 “1”s and a terminating zero.

    10 0 0

Part III. It contains the offsets of the “1”s from the beginning of the bit-block.

    00 11 10 00

Note: with 4-bit blocks, the offsets 0, 1, 2, 3 are encoded as 00, 01, 10, 11.

block signature: 01011 | 10 0 0 | 00 11 10 00
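The three parts can be reproduced with a short encoder. This sketch assumes fixed-size bit-blocks and a fixed-width binary code for the offsets, as in the 4-bit example; `bc_encode` is an illustrative name.

```python
def bc_encode(bits, block_size=4):
    """Return (Part I, Part II, Part III) of the bit-block compressed signature."""
    offset_width = max(1, (block_size - 1).bit_length())
    part1, part2, part3 = [], [], []
    for start in range(0, len(bits), block_size):
        block = bits[start:start + block_size]
        ones = [i for i, b in enumerate(block) if b == "1"]
        if not ones:
            part1.append("0")  # empty bit-block: nothing else is stored
            continue
        part1.append("1")
        part2.append("1" * (len(ones) - 1) + "0")  # s-1 ones and a terminating zero
        part3.extend(format(i, f"0{offset_width}b") for i in ones)
    return "".join(part1), "".join(part2), "".join(part3)

signature = "0000 1001 0000 0010 1000".replace(" ", "")
print(bc_encode(signature))  # ('01011', '1000', '00111000')
```

The output matches the slide's block signature 01011 | 10 0 0 | 00 11 10 00 once the spaces are removed.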

Page 14: Signature Files

Bit-block Compression (BC)(Continued)

Search “data”

(1) data ==> 0000 0000 0000 0010 0000

(2) Check the 4th bit-block of the signature 01011 | 10 0 0 | 00 11 10 00

(3) OK, Part I shows that at least one bit is set in the 4th bit-block.

(4) Check further: the “0” in Part II tells us that only one bit is set in the 4th bit-block. Is it the 3rd bit?

(5) Yes, the offset “10” in Part III confirms the result.

Discussion:

(1) Bit-block compression requires less space than the Sequential Signature File for the same false drop probability.

(2) The response time of bit-block compression is slightly less than that of the Sequential Signature File.
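The search steps can be traced in code. This is a sketch under the same assumptions as the example (4-bit blocks, 2-bit offsets); `bc_contains` is a hypothetical helper, and the three parts are passed without the separating spaces.

```python
def bc_contains(part1, part2, part3, pos, block_size=4):
    """Test whether bit position pos (0-based) is set in a BC-encoded signature."""
    offset_width = max(1, (block_size - 1).bit_length())
    target_block, target_offset = divmod(pos, block_size)
    if part1[target_block] == "0":  # Part I: the bit-block is empty
        return False
    p2 = p3 = 0
    for block in range(target_block + 1):
        if part1[block] == "0":
            continue  # empty blocks have no Part II or Part III entries
        s = 1
        while part2[p2] == "1":  # Part II: s-1 ones, then a zero
            s += 1
            p2 += 1
        p2 += 1  # skip the terminating zero
        offsets = [
            int(part3[p3 + i * offset_width:p3 + (i + 1) * offset_width], 2)
            for i in range(s)
        ]
        p3 += s * offset_width
        if block == target_block:
            return target_offset in offsets  # Part III: compare the offsets
    return False

parts = ("01011", "1000", "00111000")
print(bc_contains(*parts, 14))  # True: 4th bit-block, 3rd bit ("data")
print(bc_contains(*parts, 15))  # False
```

Note that the entries of earlier non-empty bit-blocks still have to be skipped to locate the entries of the 4th one.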

Page 15: Signature Files

Vertical Partitioning

Idea

» avoid bringing useless portions of the document signature into main memory

Methods

» store the signature file in a bit-sliced form or in a frame-sliced form

» store the signature matrix column-wise to improve the response time at the expense of insertion time

Page 16: Signature Files

Bit-Sliced Signature Files (BSSF)

Transposed bit matrix

[Figure: the signature matrix, with one row per document (document signature), is transposed]

Page 17: Signature Files

The signature matrix is stored as F bit-files, one per bit position.

Search:

(1) Retrieve m bit-files. E.g., the word signature of “free” is 001 000 110 010; in a document that contains “free”, the 3rd, 7th, 8th, and 11th bits are set, i.e., only the 3rd, 7th, 8th, and 11th bit-files are examined.

(2) “AND” these vectors. The 1s in the resulting N-bit vector denote the qualifying logical blocks (documents).

(3) Retrieve the text file through the pointer file.

Insertion: requires F disk accesses for a new logical block (document), one for each bit-file, but no rewriting.
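A small in-memory sketch of the bit-sliced search: the three document signatures below are made up from the earlier example (free+text, text only, free only), and strings stand in for the bit-files that would live on disk.

```python
def to_bit_slices(signatures, F):
    """Transpose N document signatures (F-bit strings) into F bit-files of N bits."""
    return ["".join(sig[j] for sig in signatures) for j in range(F)]

def search(slices, word_sig):
    """AND the bit-files selected by the 1s of the word signature."""
    n_docs = len(slices[0])
    result = (1 << n_docs) - 1
    for j, bit in enumerate(word_sig):
        if bit == "1":
            result &= int(slices[j], 2)  # only these m bit-files are read
    # The leftmost bit of a slice belongs to document 0.
    return [d for d in range(n_docs) if (result >> (n_docs - 1 - d)) & 1]

docs = [
    "001010111011",  # contains "free" and "text"
    "000010101001",  # contains "text" only
    "001000110010",  # contains "free" only
]
slices = to_bit_slices(docs, 12)
print(search(slices, "001000110010"))  # [0, 2]
```

Only the 3rd, 7th, 8th, and 11th slices are touched for “free”; inserting a document, by contrast, would append one bit to each of the 12 slices.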

Page 18: Signature Files

Frame-Sliced Signature File (FSSF)

Ideas

» Random disk accesses are more expensive than sequential ones.

» Force each word to hash into bit positions that are close to each other in the document signature.

» These bit files are stored together and can be retrieved with a few random accesses.

Procedures

» The document signature (F bits long) is divided into k frames of s consecutive bits each.

» For each word in the document, one of the k frames will be chosen by a hash function.

» Using another hash function, the word sets m bits in that frame.

Page 19: Signature Files

Frame-Sliced Signature File (Cont.)

[Figure: signature matrix with documents as rows, divided vertically into frames]

Each frame will be kept in consecutive disk blocks.

Page 20: Signature Files

FSSF (Continued)

Example (n=2, B=12, s=6, k=2, m=3)

word             signature
free             000000 110010
text             010110 000000
doc. signature   010110 110010

Search

» Only one frame has to be retrieved for a single-word query, i.e., only one random disk access is required. E.g., to search for documents that contain the word “free”: because the word signature of “free” is placed in the 2nd frame, only the 2nd frame has to be examined.

» At most one frame per query word has to be scanned.

Insertion

» Only k frames have to be accessed instead of F bit-slices.
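The frame-sliced procedure can be sketched as below. Both hash functions are assumptions (a CRC picks the frame, and a CRC-seeded random sample picks m distinct bits inside it), so the bit patterns will differ from the slide's example.

```python
import random
import zlib

def fssf_word_signature(word, k, s, m):
    """Pick one of k frames, then set m distinct bits inside that s-bit frame."""
    data = word.encode("utf-8")
    frame = zlib.crc32(b"frame:" + data) % k  # first (assumed) hash
    offsets = random.Random(zlib.crc32(b"bits:" + data)).sample(range(s), m)
    sig = [0] * (k * s)
    for off in offsets:
        sig[frame * s + off] = 1
    return frame, sig

def fssf_doc_signature(words, k, s, m):
    """OR the word signatures into one document signature of k*s bits."""
    sig = [0] * (k * s)
    for w in words:
        _, wsig = fssf_word_signature(w, k, s, m)
        sig = [a | b for a, b in zip(sig, wsig)]
    return sig

def frame_matches(doc_sig, word, k, s, m):
    """Single-word query: inspect only the frame that the word hashes to."""
    frame, wsig = fssf_word_signature(word, k, s, m)
    lo, hi = frame * s, (frame + 1) * s
    return all(d >= w for d, w in zip(doc_sig[lo:hi], wsig[lo:hi]))

doc = fssf_doc_signature(["free", "text"], k=2, s=6, m=3)
print(frame_matches(doc, "free", k=2, s=6, m=3))  # True
```

In a disk-based layout, `frame_matches` would correspond to reading just the one frame file selected by the query word.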

Page 21: Signature Files

Vertical Partitioning with Compression

Idea

» create a very sparse signature matrix

» store it in a bit-sliced form

» compress each bit slice by storing the positions of the 1s in the slice

Page 22: Signature Files

Compressed Bit Slices (CBS)

Room for improvement

» Searching

– Each search word requires the retrieval of m bit files.

– The search time could be improved if m were forced to be 1.

» Insertion

– Requires too many disk accesses (equal to F, which is typically 600-1000).

Page 23: Signature Files

Compressed Bit Slices (CBS)(Continued)

Let m=1. To maintain the same false drop probability, F has to be increased.

To compress each bit file, we store only the positions of the “1”s.

Because the number of “1”s in a bit file is unpredictable, we store their positions in buckets of size Bp.

[Figure: sparse bit matrix; rows are documents, and the row width is the size of a signature]

Page 24: Signature Files

Example: h(“base”)=30

» Hash a word to obtain the bucket address.

» Obtain the pointers to the relevant documents from the buckets.

Differences with inversion

» The directory (hash table) is sparse.

» The actual word is stored nowhere.

» Simple structure
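A toy version of compressed bit slices with m=1 might look like the following. The class name, the CRC hash, and the use of Python lists in place of chained buckets of size Bp are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

class CompressedBitSlices:
    """Each word sets a single bit (m=1); a slice keeps only the positions of its 1s."""

    def __init__(self, size):
        self.size = size                    # number of slices (directory entries)
        self.directory = defaultdict(list)  # slice number -> document ids

    def _slot(self, word):
        return zlib.crc32(word.encode("utf-8")) % self.size  # assumed hash h

    def insert(self, doc_id, words):
        for w in set(words):
            postings = self.directory[self._slot(w)]
            if doc_id not in postings:
                postings.append(doc_id)

    def search(self, word):
        """Return candidate documents; words that collide cause false drops."""
        return list(self.directory.get(self._slot(word), []))

cbs = CompressedBitSlices(size=1024)
cbs.insert(0, ["data", "base"])
cbs.insert(1, ["base", "system"])
print(sorted(cbs.search("base")))  # [0, 1]
```

As on the slide, the words themselves are never stored: only the hash slot and the document pointers are kept, so two words that share a slot cannot be told apart.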

Page 25: Signature Files

Doubly Compressed Bit Slices

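Idea: compress the sparse directory.

» When S becomes smaller, collisions become more likely, so intermediate buckets are used.

» To distinguish true collisions from false ones, a second hash function is added.

Example: h1(“base”)=30, h2(“base”)=011

» Follow the pointers of the posting buckets to retrieve the qualifying documents.

Distinguish synonyms partially.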
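One way to picture the two-level lookup is sketched here. The structure is a simplification: nested dictionaries stand in for the intermediate and posting buckets, and both hash functions (`_h1`, `_h2`) are assumed CRC-based stand-ins.

```python
import zlib
from collections import defaultdict

class DoublyCompressedBitSlices:
    """h1 selects a directory entry; h2 tells apart most words that collide on h1."""

    def __init__(self, size, h2_bits=3):
        self.size = size
        self.h2_bits = h2_bits
        self.directory = defaultdict(dict)  # h1 value -> {h2 code -> document ids}

    def _h1(self, word):
        return zlib.crc32(word.encode("utf-8")) % self.size

    def _h2(self, word):
        return zlib.crc32(b"h2:" + word.encode("utf-8")) % (1 << self.h2_bits)

    def insert(self, doc_id, words):
        for w in set(words):
            postings = self.directory[self._h1(w)].setdefault(self._h2(w), [])
            if doc_id not in postings:
                postings.append(doc_id)

    def search(self, word):
        """Follow h1, then h2, to the posting list of candidate documents."""
        return list(self.directory.get(self._h1(word), {}).get(self._h2(word), []))

dcbs = DoublyCompressedBitSlices(size=1)  # size 1 forces every word to collide on h1
dcbs.insert(0, ["base"])
print(dcbs.search("base"))  # [0]
```

Two words are confused only if they agree on both h1 and h2, which is why synonyms are distinguished partially rather than completely.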

Page 26: Signature Files

No False Drops Method

» Use a pointer to the word in the text file.

» This distinguishes between synonyms completely.

Page 27: Signature Files

Horizontal Partitioning

[Figure: signature matrix; rows are documents]

1. Goal: group the signatures into sets, partitioning the signature matrix horizontally.

2. Grouping criterion

Page 28: Signature Files

Partitioned Signature Files

Using a portion of a document signature as a signature key to partition the signature file.

All signatures with the same key will be grouped into a so-called “module”.

When a query signature arrives,

» examine its signature key and look for the corresponding modules

» scan all the signatures within those modules that have been selected
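The two query steps can be sketched as follows. Using the first `key_bits` bits of each signature as the signature key is an assumption for illustration; the slide only says that a portion of the signature is used. The document signatures are made up from the earlier free/text example.

```python
from collections import defaultdict

def build_modules(signatures, key_bits):
    """Group document signatures into modules by their signature key."""
    modules = defaultdict(list)
    for doc_id, sig in enumerate(signatures):
        modules[sig[:key_bits]].append((doc_id, sig))
    return modules

def covers(sig, query):
    """True if sig has a 1 wherever query has a 1."""
    return all(s == "1" for s, q in zip(sig, query) if q == "1")

def search(modules, query_sig, key_bits):
    """Select the modules whose key covers the query key, then scan only those."""
    query_key = query_sig[:key_bits]
    hits = []
    for key, members in modules.items():
        if covers(key, query_key):
            hits.extend(doc_id for doc_id, sig in members if covers(sig, query_sig))
    return sorted(hits)

docs = ["001010111011", "000010101001", "001000110010"]
modules = build_modules(docs, key_bits=3)
print(search(modules, "001000110010", key_bits=3))  # [0, 2]
```

Here the module with key 000 is skipped entirely, so its signatures are never scanned.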