CS 430: Information Discovery, Lecture 4: Data Structures for Information Retrieval

Dec 21, 2015


CS 430: Information Discovery

Lecture 4

Data Structures for Information Retrieval


Course Administration

• The Wednesday evening classes have been moved to Hollister 110.

Introduction to Perl

• Classes will be held on Wednesday evenings, September 19 and October 3.

• Before the first class, look at the CS 430 web site and attempt the (optional) Assignment 0.

(These classes and Assignment 0 are optional.)


Inverted Files: Search for Keywords

Index file: Stores list of terms (keywords). Designed for rapid searching and processing range queries. May be held in memory.

Postings file: Stores list of postings for each term. Designed for rapid evaluation of Boolean operators. May be stored sequentially.

Document file: [Repositories for the storage of document collections are covered in CS 502.]
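The split between index file and postings file can be sketched in a few lines of Python. This is only an illustration under assumed names (`build_inverted_file`, `term_range`, `boolean_and`, and the toy `docs` collection are all hypothetical, not from the lecture): the index is a sorted term list that supports range queries by binary search, and each postings list is a sorted list of document ids that supports Boolean AND by a sequential merge.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_inverted_file(documents):
    """Return (index, postings) for a {doc_id: text} collection.

    index    -- sorted list of terms (the "index file")
    postings -- dict mapping each term to a sorted list of doc ids
    """
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    postings = {term: sorted(ids) for term, ids in postings.items()}
    return sorted(postings), postings


def term_range(index, low, high):
    """Range query on the sorted index: all terms t with low <= t <= high."""
    return index[bisect_left(index, low):bisect_right(index, high)]


def boolean_and(postings, term_a, term_b):
    """Sequential merge of two sorted postings lists (Boolean AND)."""
    a, b = postings.get(term_a, []), postings.get(term_b, [])
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical three-document collection, for illustration only.
docs = {1: "ant bee cat", 2: "bee dog", 3: "cat bee elk"}
index, postings = build_inverted_file(docs)
```

With this toy collection, `boolean_and(postings, "bee", "cat")` gives the documents containing both terms, and `term_range(index, "bee", "dog")` gives the terms in that alphabetical range.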


Index File Structures: Binary Tree

              elk
            /     \
         cat       hog
        /   \      /
     bee     dog fox
     /              \
  ant                gnu


Binary Tree

Advantages

Can be searched quickly

Convenient for batch updating

Easy to add an extra term

Economical use of storage

Disadvantages

Poor for sequential processing, e.g., comp*

Tree tends to become unbalanced

If the index is held on disk, it is important to minimize the number of disk accesses


Binary Tree

Calculation of maximum depth of tree.

Illustrates importance of balanced trees.

Worst case: depth = n, so a search is O(n)

Ideal case: depth = log(n + 1)/log 2, so a search is O(log n)
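The effect of insertion order on depth can be checked with a small experiment. The sketch below is an illustration, not part of the lecture: the helper names (`insert`, `build`, `depth`, `ideal_depth`) are assumptions, and the eight animal terms are borrowed from the binary tree example. Inserting the terms in sorted order produces the worst case, while the insertion order that reproduces the example tree reaches the ideal depth.

```python
import math


def insert(tree, key):
    """Insert key into an unbalanced binary search tree.

    A node is a list [key, left, right]; an empty tree is None.
    """
    if tree is None:
        return [key, None, None]
    node = tree
    while True:
        side = 1 if key < node[0] else 2   # 1 = left link, 2 = right link
        if node[side] is None:
            node[side] = [key, None, None]
            return tree
        node = node[side]


def build(keys):
    """Build a tree by inserting keys in the given order."""
    tree = None
    for key in keys:
        tree = insert(tree, key)
    return tree


def depth(tree):
    """Number of nodes on the longest root-to-leaf path (0 for an empty tree)."""
    best, stack = 0, [(tree, 1)]
    while stack:
        node, level = stack.pop()
        if node is not None:
            best = max(best, level)
            stack.append((node[1], level + 1))
            stack.append((node[2], level + 1))
    return best


def ideal_depth(n):
    """Smallest possible depth for n keys: log(n + 1)/log 2, rounded up."""
    return math.ceil(math.log2(n + 1))


terms = ["ant", "bee", "cat", "dog", "elk", "fox", "gnu", "hog"]
worst = build(terms)  # sorted insertion order degenerates into a chain
good = build(["elk", "cat", "hog", "bee", "dog", "fox", "ant", "gnu"])
```

Here `depth(worst)` equals the number of terms, while `depth(good)` equals `ideal_depth(len(terms))`.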


Right Threaded Binary Tree

Threaded tree:

A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor.

Right-threaded tree:

A variant of a threaded tree in which only the right thread, i.e. link to the successor, of each node is maintained.

Knuth vol 1, 2.3.1, page 325.
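A minimal sketch of a right-threaded tree follows, assuming a hypothetical `Node` class with an `is_thread` flag that says whether `right` is a real child or a thread to the in-order successor. It is not taken from Knuth or from the lecture; it only illustrates why the thread makes sequential processing possible without a stack or recursion.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None        # right child, or successor if is_thread
        self.is_thread = False   # True when `right` is a thread


def insert(root, key):
    """Insert key into a right-threaded binary search tree; return the root."""
    new = Node(key)
    if root is None:
        return new
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                # The parent is the in-order successor of a new left child.
                new.right, new.is_thread = node, True
                node.left = new
                return root
            node = node.left
        else:
            if node.right is None or node.is_thread:
                # A new right child inherits the parent's thread (if any).
                new.right, new.is_thread = node.right, node.is_thread
                node.right, node.is_thread = new, False
                return root
            node = node.right


def leftmost(node):
    """Follow left links as far as possible."""
    while node is not None and node.left is not None:
        node = node.left
    return node


def in_order(root):
    """Sequential scan that follows threads instead of using a stack."""
    keys = []
    node = leftmost(root)
    while node is not None:
        keys.append(node.key)
        if node.is_thread:
            node = node.right            # jump straight to the successor
        else:
            node = leftmost(node.right)  # successor is leftmost of right subtree
    return keys


root = None
for term in ["elk", "cat", "hog", "bee", "dog", "fox", "ant", "gnu"]:
    root = insert(root, term)
```

Calling `in_order(root)` returns the terms in alphabetical order, which is the kind of sequential processing that a plain binary tree handles poorly.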


Right Threaded Binary Tree

[Figure: example of a right-threaded binary tree. From: Robert F. Rossa]


B-trees

B-tree of order m:

A balanced, multiway search tree:

• Each node stores many keys

• Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys.

• If ki is the ith key in a given internal node

-> all keys in the (i-1)th child are smaller than ki

-> all keys in the ith child are bigger than ki

• All leaves are at the same depth
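The ordering rule on keys and children is what makes a multiway search work. The sketch below shows search only (no insertion or node splitting), with a hypothetical `BTreeNode` class and a small hand-built tree whose keys are chosen purely for illustration.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted keys stored in this node
        self.children = children or []    # len(keys) + 1 children, or [] for a leaf


def search(node, key):
    """Return True if key is stored in the subtree rooted at node."""
    while node is not None:
        # i = number of keys in this node that are smaller than key
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False
        # Child i holds the keys between keys[i-1] and keys[i].
        node = node.children[i]
    return False


# Hand-built two-level tree (illustrative keys, all leaves at the same depth).
root = BTreeNode(
    [50, 65],
    [BTreeNode([10, 25]), BTreeNode([55, 59]), BTreeNode([70, 81, 90])],
)
```

Each step inspects one node, so the number of node visits is bounded by the depth of the tree; when nodes are disk pages, that bound is the number of disk accesses.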


B+-tree

B+-tree:

• A B-tree is used as an index

• Data is stored in the leaves of the tree, known as buckets

Example: B+-tree of order 2, bucket size 4

Root keys: 50 65

Second-level keys: 10 25 | 55 59 | 70 81 90

Buckets (leaves): ... D9 D51 ... D54 D66 ... D81 ...


B-tree Discussion

For a discussion of B-trees, see Frakes, Section 2.3.1, pages 18-20.

• B-trees combine fast retrieval with moderately efficient updating.

• Bottom-up updating is usually fast, but may require recursive tree climbing to the root.

• The main weakness is poor storage utilization; typically buckets are only about 69% full.

• Various algorithmic improvements increase storage utilization at the expense of updating performance.


Signature Files: Sequential Search without Inverted File

Inexact filter: A quick test which discards many of the non-qualifying items.

Advantages

• Much faster than full text scanning -- 1 or 2 orders of magnitude

• Modest space overhead -- 10% to 15% of file

• Insertion is straightforward

Disadvantages

• Sequential searching no good for very large files

• Some hits are false hits


Signature Files

Signature size. Number of bits in a signature, F.

Word signature. A bit pattern of size F with m bits set to 1 and the others 0.

The word signature is calculated by a hash function.

Block. A sequence of text that contains D distinct words.

Block signature. The logical OR of all the word signatures in a block of text.


Signature Files

Example

Word               Signature

free               001 000 110 010
text               000 010 101 001

block signature    001 010 111 011

F = 12 bits in a signature

m = 4 bits per word

D = 2 words per block
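The example can be reproduced with integer bit operations. In this sketch the word signatures for "free" and "text" are copied from the slide rather than computed by a hash function, and the helper names `parse` and `fmt` are assumptions made for the illustration.

```python
F = 12  # signature size in bits
M = 4   # bits set per word signature


def parse(bits):
    """Turn a spaced bit pattern such as '001 000 110 010' into an int."""
    return int(bits.replace(" ", ""), 2)


def fmt(signature, size=F):
    """Format an int as a bit pattern in groups of three, as on the slide."""
    bits = format(signature, "0{}b".format(size))
    return " ".join(bits[i:i + 3] for i in range(0, size, 3))


# Word signatures copied from the example (not computed here).
free = parse("001 000 110 010")
text = parse("000 010 101 001")

# Block signature: logical OR of the word signatures in the block.
block = free | text
```

`fmt(block)` reproduces the block signature in the example, and `bin(free).count("1")` confirms that each word signature has `M` bits set.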


Signature Files

A query term is processed by matching its signature against the block signature.

(a) If the term is in the block, its word signature will always match the block signature.

(b) A word signature may match the block signature, but the word is not in the block. This is a false hit.
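Cases (a) and (b) can be sketched as a bitwise subset test. The block signature below is the one from the earlier example (words "free" and "text"); the signatures named `absent_but_matching` and `absent_not_matching` are invented for this illustration and do not correspond to any real hash function.

```python
def parse(bits):
    """Turn a spaced bit pattern into an int."""
    return int(bits.replace(" ", ""), 2)


def matches(word_signature, block_signature):
    """True if every bit set in the word signature is set in the block signature."""
    return word_signature & block_signature == word_signature


free = parse("001 000 110 010")
text = parse("000 010 101 001")
block = free | text                            # 001 010 111 011

# Hypothetical signatures of two words that are NOT in the block.
absent_but_matching = parse("001 010 101 000")  # all four bits happen to be set in block
absent_not_matching = parse("100 000 110 010")  # leading bit is not set in block

results = {
    "free": matches(free, block),                   # case (a): always matches
    "false hit": matches(absent_but_matching, block),  # case (b): matches, word absent
    "rejected": matches(absent_not_matching, block),   # filtered out without reading the text
}
```

Because a match is only a necessary condition, every matching block still has to be checked against the actual text to weed out false hits.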

The design challenge is to minimize the false drop probability, Fd.

Frakes, Section 4.2, page 47, discusses how to minimize Fd. The rest of that chapter discusses enhancements to the basic algorithm.


Search for Substring

In some information retrieval applications, any substring can be a search term.

Tries, implemented using suffix trees, provide lexicographical indexes for all the substrings in a document or set of documents.


Tries: Search for Substring

Basic concept

The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text, and continues to the right until it is unique.

The sistrings are stored in (the leaves of) a tree, the suffix tree. Common parts are stored only once.

Each sistring can be associated with a location within a document where the sistring occurs. Subtrees below a certain node represent all occurrences of the substring represented by that node.

Suffix trees have a size of the same order of magnitude as the input documents.


Tries: Suffix Tree

Example: suffix tree for the following words:

begin beginning between bread break

b
  e
    gin
      _      (begin)
      ning   (beginning)
    tween    (between)
  rea
    d        (bread)
    k        (break)


Tries: Sistrings

A binary example

String: 01 100 100 010 111

Sistrings:

1  01 100 100 010 111
2  11 001 000 101 11
3  10 010 001 011 1
4  00 100 010 111
5  01 000 101 11
6  10 001 011 1
7  00 010 111
8  00 101 11
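Generating the sistrings is a matter of taking every suffix of the text. The sketch below uses the binary string from the example with the spaces removed; the function name `sistrings` and the 1-based numbering convention are assumptions chosen to match the slide.

```python
def sistrings(text):
    """Map each 1-based starting position to the suffix that begins there."""
    return {start + 1: text[start:] for start in range(len(text))}


text = "01100100010111"   # the example string, spaces removed
sis = sistrings(text)

# The slide lists the first eight sistrings.
first_eight = [sis[i] for i in range(1, 9)]
```

In practice only the starting positions need to be stored, since each sistring can be read directly from the text.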


Tries: Lexical Ordering

7  00 010 111
4  00 100 010 111
8  00 101 11
5  01 000 101 11
1  01 100 100 010 111
6  10 001 011 1
3  10 010 001 011 1
2  11 001 000 101 11

On the original slide, the unique leading part of each sistring is indicated in blue.
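The lexical ordering, and the shortest prefix that distinguishes each sistring from the others, can be computed directly. This sketch considers only the eight sistrings listed in the example; `unique_prefix` is a hypothetical helper, and "unique" here means unique among those eight.

```python
text = "01100100010111"                       # example string, spaces removed
sis = {i: text[i - 1:] for i in range(1, 9)}  # the eight sistrings, by position

# Sistring numbers in lexical order of their bit strings.
order = sorted(sis, key=sis.get)


def unique_prefix(target, others):
    """Shortest prefix of target that no string in others starts with."""
    for length in range(1, len(target) + 1):
        prefix = target[:length]
        if not any(other.startswith(prefix) for other in others):
            return prefix
    return target


prefixes = {
    i: unique_prefix(sis[i], [sis[j] for j in sis if j != i])
    for i in order
}
```

The resulting `order` matches the ordering on the slide, and `prefixes` gives, for each sistring, the part that is enough to tell it apart from the other seven.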


Trie: Basic Concept

[Figure: binary trie over the eight sistrings. Each edge is labeled 0 or 1 and each leaf is a sistring number (7, 4, 8, 5, 1, 6, 3, 2).]


Patricia Tree

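[Figure: Patricia tree over the same eight sistrings. Each edge is labeled 0 or 1, each leaf is a sistring number (7, 4, 8, 5, 1, 6, 3, 2), and internal nodes carry bit numbers (2, 2, 3, 3, 4, 5).]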

Single-descendant nodes are eliminated.

Each internal node is labeled with the number of the bit to test at that node.
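A Patricia tree over the eight example sistrings can be sketched recursively: at each step, skip the bit positions on which all remaining strings agree, then branch on the first position where they differ and record its bit number. The tuple representation `(bit_number, zero_subtree, one_subtree)` and the names `build` and `lookup` are assumptions for this illustration, and the sketch assumes no string is a prefix of another.

```python
def build(items, depth=0):
    """Build a Patricia tree from (bits, label) pairs sharing their first `depth` bits.

    Returns a label for a single item, otherwise a tuple
    (bit_number, zero_subtree, one_subtree) with 1-based bit numbers.
    """
    if len(items) == 1:
        return items[0][1]
    # Skip positions where every string has the same bit: these would be
    # single-descendant nodes in the plain trie.
    while len({bits[depth] for bits, _ in items}) == 1:
        depth += 1
    zeros = [item for item in items if item[0][depth] == "0"]
    ones = [item for item in items if item[0][depth] == "1"]
    return (depth + 1, build(zeros, depth + 1), build(ones, depth + 1))


def lookup(tree, bits):
    """Follow the tested bits of a query down to a candidate leaf label.

    Skipped bits are not examined, so the caller must compare the candidate
    with the text. Assumes the query is long enough for every tested bit.
    """
    while isinstance(tree, tuple):
        bit_number, zero, one = tree
        tree = one if bits[bit_number - 1] == "1" else zero
    return tree


text = "01100100010111"                        # example string, spaces removed
sis = {i: text[i - 1:] for i in range(1, 9)}   # the eight sistrings
tree = build([(bits, i) for i, bits in sis.items()])

candidate = lookup(tree, "00101")
confirmed = sis[candidate].startswith("00101")
```

Because the tree stores only bit numbers and positions, the final comparison against the text is what confirms that the candidate really begins with the query.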


Oxford English Dictionary