Lecture4- Indexing and Searching I

Apr 07, 2018

Page 1: Lecture4- Indexing and Searching I

8/4/2019 Lecture4- Indexing and Searching I

http://slidepdf.com/reader/full/lecture4-indexing-and-searching-i 1/56

Indexing and Searching

The main techniques


Introduction

There are two ways to search a text.

• First: scan the text sequentially (online searching). This is appropriate:

 – when the text is small (i.e., a few megabytes),

 – when the text collection is very volatile (i.e., undergoes modifications very frequently), or

 – when the index space overhead cannot be afforded.


Introduction

• Second: build data structures over the text (called indices).

 – It speeds up the search.

 – It is worthwhile when the text collection is large and semi-static.

 – Most real databases are like this, e.g., dictionaries, Web search engines, journal archives.

Semi-static collections are collections that can be updated at reasonably regular intervals.


Introduction

• Nowadays, the most successful techniques for medium-size databases (say, up to 200 MB) combine online and indexed searching.


Introduction

• We cover two main indexing techniques

 – Inverted files

 – Suffix arrays


Introduction

• Before covering these topics you should be familiar with:

 – Sorted arrays

 – Binary search trees

 – B-trees

 – Hash tables

 – Tries


Introduction

• Sorted arrays

 – An array whose items are kept in sorted order,

 – so a search can use binary search, which needs O(log n) comparisons instead of a linear scan.
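As a minimal sketch of why a sorted array speeds up searching, the following uses Python's standard `bisect` module to do a binary search. The helper name `contains` and the sample words are assumptions made for illustration; they are not from the lecture.

```python
from bisect import bisect_left

def contains(sorted_items, target):
    """Return True if target is in sorted_items, using binary search."""
    # bisect_left returns the leftmost index where target could be inserted
    # while keeping the list sorted.
    i = bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target

# Hypothetical sample data, sorted once up front.
words = sorted(["good", "big", "gosh", "bill", "bigger"])

print(contains(words, "bill"))  # True
print(contains(words, "bad"))   # False
```

Each lookup halves the remaining range, so the array must be re-sorted (or updated in order) whenever items are added.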


Introduction

• Binary search trees

 – A binary tree

 – Each internal node x stores an element.

 – The elements stored in the left subtree of x are <= x, and the elements stored in the right subtree of x are >= x.

 – Both the left and right subtrees must also be binary search trees.
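The ordering property above can be sketched in a few lines of Python. This is an illustrative, unbalanced binary search tree; the names `Node`, `insert`, `search`, and `inorder`, and the sample keys, are assumptions and not part of the lecture.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key and return the (possibly new) root of the subtree."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        # Equal keys go right, consistent with "right subtree >= x".
        root.right = insert(root.right, key)
    return root

def search(root, key):
    """Return True if key is stored in the tree."""
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False

def inorder(root):
    """Yield keys in sorted order: left subtree, node, right subtree."""
    if root is not None:
        yield from inorder(root.left)
        yield root.key
        yield from inorder(root.right)

root = None
for k in [8, 3, 10, 1, 6, 14]:  # hypothetical sample keys
    root = insert(root, k)

print(list(inorder(root)))  # [1, 3, 6, 8, 10, 14]
print(search(root, 6))      # True
print(search(root, 7))      # False
```

An in-order traversal visiting the keys in sorted order is a direct consequence of the left/right ordering rule.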


Binary Tree

Each node has at most 2 children.


Binary Search Tree



Introduction

• B-trees

 – A B-tree is a specialized multiway tree designed especially for use on disk.

 – Used when part or all of the tree must be maintained in secondary storage such as a magnetic disk.

 – An indexing technique most commonly used in databases and file systems.


Introduction

• B-trees

 – A multiway tree of order m is an ordered tree where each node has at most m children.

 – [Figure: a multiway search tree of order 4]


Introduction

• B-trees (contd.)

 – Pointers to data are placed in a balanced tree structure so that any item can be reached in about the same amount of time.

 – Data in a B-tree is kept sorted, so that searching, inserting and deleting can be done in logarithmic amortized time.

 – A B-tree tries to minimize the number of disk accesses.
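A rough sketch of how a search walks a multiway tree follows. It covers lookup only (no insertion or node splitting), keeps everything in memory, and uses a small hand-built tree whose keys are made up for illustration; `BTreeNode` and `btree_search` are assumed names, not the lecture's.

```python
from bisect import bisect_left

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keys stored in this node
        self.children = children or []  # empty for a leaf, else len(keys) + 1 subtrees

def btree_search(node, key):
    """Return True if key is in the tree rooted at node."""
    while True:
        # Binary search inside the node; on disk, a node would be one page read.
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:     # reached a leaf without finding the key
            return False
        node = node.children[i]   # descend into the subtree that can hold key

# Hypothetical two-level tree: the root separates three leaves.
tree = BTreeNode(
    [20, 40],
    [BTreeNode([5, 12]), BTreeNode([21, 33]), BTreeNode([45, 50])],
)

print(btree_search(tree, 21))  # True
print(btree_search(tree, 22))  # False
```

Because each node holds many keys, the tree stays shallow, which is what keeps the number of node (disk) reads per search small.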


Introduction

• B-trees example

[Figure: example B-tree]


[Figure: searching a B-tree for key 21]


[Figure: inserting key 33 into a B-tree, with a node split]


Introduction

• Hash table

 – A data structure that uses a hash function to efficiently map certain identifiers or keys (e.g., person names) to associated values (e.g., their telephone numbers).

 – The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.

 – E.g.: the division method


Introduction

• Hash table: division method, dividing by 10

 – Keys: 123456, 123467, 123450

 – 123456 % 10 = 6 (the remainder is 6 when dividing by 10)

 – 123467 % 10 = 7 (the remainder is 7)

 – 123450 % 10 = 0 (the remainder is 0)
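The division method can be sketched as follows, using the three keys from the slide. The function name `division_hash` is an assumption, and storing colliding keys in a per-slot list (chaining) is also an assumption, since the slides do not say how collisions are handled.

```python
def division_hash(key, table_size=10):
    """Division method: the slot is the remainder of key divided by the table size."""
    return key % table_size

# Assumed collision handling: each slot holds a list of keys (chaining).
table = [[] for _ in range(10)]
for key in (123456, 123467, 123450):
    table[division_hash(key)].append(key)

print(division_hash(123456))  # 6
print(division_hash(123467))  # 7
print(division_hash(123450))  # 0
print(table[6])               # [123456]
```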


Tries

• A trie is an ordered tree data structure that is used to store an associative array where the keys are usually strings.

• It can be used to do a fast search in a large text.

• The term trie comes from the word "retrieval".

• Used to implement the dictionary abstract data type (ADT), where basic operations like search, insert, and delete can be performed.


Tries

• They can be used for encoding and compression.

• They can be used in regular expression search and approximate string matching.


Non Compact and Compact Tries

• A non-compact trie is one in which every edge of the underlying tree represents a symbol of the alphabet.

• Let's construct the trie from the following 5 strings: BIG, BIGGER, BILL, GOOD, GOSH.


[Figure: non-compact trie for BIG, BIGGER, BILL, GOOD, GOSH]


Non Compact Tries

• When we look for the string GOOD, we start at the root and follow the G, O, O, D edges.

• If we want to look for the string BAD, we start from the root, follow the B edge, and find that there is no A edge after it. Thus BAD is not in the text.

• The above structure is rather wasteful because each edge represents a single symbol.

• Not practical for huge texts.
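The GOOD and BAD lookups can be traced with a small non-compact trie, where each edge carries exactly one symbol. This is a sketch only: the names `TrieNode`, `trie_insert`, and `trie_contains`, and the `is_word` end-of-string flag, are assumptions for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # one outgoing edge per symbol
        self.is_word = False  # assumed flag: True if an input string ends here

def trie_insert(root, word):
    node = root
    for ch in word:
        # Follow the edge labeled ch, creating it if it does not exist yet.
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def trie_contains(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:      # no edge for this symbol: the string is absent
            return False
    return node.is_word

root = TrieNode()
for w in ["BIG", "BIGGER", "BILL", "GOOD", "GOSH"]:
    trie_insert(root, w)

print(trie_contains(root, "GOOD"))  # True
print(trie_contains(root, "BAD"))   # False: there is no A edge after B
```

The end-of-string flag is what lets the trie tell BIG (stored) apart from BI (only a prefix of stored strings).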


Compact Tries

• This type of trie resembles the one in the figure above, except that chains which lead to leaves are trimmed.

• This is illustrated in the next figure.


[Figure: the compact form of the trie]


Compact Tries

• The number of leaves is n+1 where n is the number of input strings.

• In the leaves, we may store either the strings themselves or pointers to the strings (that is, integers).


Tries called "PATRICIA"

• "PATRICIA" stands for "practical algorithm to retrieve information coded in alphanumeric".

• The difference is that an edge can be labeled with more than one character.

• All the unary nodes will be collapsed.


The very compact trie will look as follows:

[Figure: PATRICIA trie]


Tries called "PATRICIA"

• Binary PATRICIA tries use only 2 symbols (0 and 1), so each node has at most 2 outgoing edges.


Suffix Tree

• The suffix tree T(x) of a string x[1..n] is the compacted trie of all suffixes x[i..n] for i = 1, ..., n+1, i.e., including the empty suffix.

• Allows for a particularly fast implementation of many important string operations.

• The suffix tree for a string S is a tree (more specifically a trie) whose edges are labeled with strings, such that each suffix of S corresponds to exactly one path from the tree's root to a leaf.


Suffix Tree

• The idea behind the suffix tree is to assign to each symbol in a text an index corresponding to its position in the text.

 – I.e., the first symbol has index 1 and the last symbol has index n, where n is the number of symbols in the text.

• In the tree we use indices instead of the actual object.


Suffix tree

• The advantages are:

 – It requires less storage space.

 – We do not have to worry how the text is represented (binary, ASCII, etc.).

 – We do not have to store the same object twice (no duplicates).


Suffix trie

• We begin by giving a position to every suffix in the text. We can now build a suffix trie for all n suffixes of the text.

• E.g.

 – TEXT: G O O G O L $

 – POSITION: 1 2 3 4 5 6 7
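As a small illustration of "giving a position to every suffix", the following lists the suffixes of the example text keyed by their 1-based starting position. The variable names are assumptions; only the text GOOGOL$ and its positions come from the slide.

```python
text = "GOOGOL$"

# Map each 1-based starting position to the suffix that begins there.
suffixes = {i + 1: text[i:] for i in range(len(text))}

for pos, suffix in suffixes.items():
    print(pos, suffix)
# 1 GOOGOL$
# 2 OOGOL$
# 3 OGOL$
# 4 GOL$
# 5 OL$
# 6 L$
# 7 $
```

These n = 7 suffixes are the strings that get inserted into the suffix trie.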


Suffix trie

The resulting tree has n leaves and height n


Suffix tree

• The suffix tree is created by trimming the suffix trie (compacting it and collapsing every unary node).

• [Figure: a compact suffix tree]


Suffix tree

• In a suffix tree we can store pointers rather than words in the leaves.

• Also, we can replace every string by a pair of indices (a, b), where a is the index of the beginning of the string and b is the index of the end of the string.

• I.e., we write

 – (3,7) for OGOL$

 – (1,2) for GO

 – (7,7) for $
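The index pairs above can be checked against the example text GOOGOL$ with a one-line helper. The function name `label` is an assumption; positions are 1-based and inclusive, as on the slides.

```python
text = "GOOGOL$"

def label(a, b):
    """Return the substring of text at 1-based, inclusive positions a..b."""
    return text[a - 1:b]

print(label(3, 7))  # OGOL$
print(label(1, 2))  # GO
print(label(7, 7))  # $
```

Storing two integers per edge instead of the substring itself is what keeps the space per edge constant, regardless of how long the label is.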


Suffix tree

• The corresponding suffix tree looks like this


Search in suffix tree

• Pseudo-code for searching for a string S in a suffix tree:

 – Start at the root.

 – Go down the tree, each time taking the bifurcation that matches the next symbols of S.

 – If S corresponds to a node, then return all leaves in its subtree.

 – If a NIL pointer is encountered, then S is not in the tree.


Search in suffix tree

• If S = "GO" we take the GO bifurcation and return: GOOGOL$, GOL$.

• If S = "OR" we take the O bifurcation and then we hit a NIL pointer, so "OR" is not in the tree.
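The search procedure can be sketched on the GOOGOL$ example. To keep the code short, this sketch uses an uncompacted suffix trie (one symbol per edge, built from nested dictionaries) rather than a true compact suffix tree; the traversal idea is the same. The names `build_suffix_trie` and `search`, and the use of a `None` key to mark a leaf, are assumptions for illustration.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie made of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1  # leaf marker: 1-based start position of the suffix
    return root

def search(root, text, s):
    """Return all suffixes of text that begin with s, in text order."""
    node = root
    for ch in s:
        if ch not in node:   # the "NIL pointer" case: s is not in the tree
            return []
        node = node[ch]
    # s was fully matched: collect every leaf in the subtree below this node.
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                positions.append(child)
            else:
                stack.append(child)
    return [text[p - 1:] for p in sorted(positions)]

text = "GOOGOL$"
trie = build_suffix_trie(text)

print(search(trie, text, "GO"))  # ['GOOGOL$', 'GOL$']
print(search(trie, text, "OR"))  # []
```

This naive construction takes quadratic time and space in the text length, which is acceptable only for small examples.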


Applications of suffix tree

• Exact matching

• Common substrings, with applications

• Matching statistics

• Suffix arrays

• Genome-scale projects


Exact Matching

• Given string x and pattern y, report where y occurs in x 

• Pattern ata occurs at position 2 in tatat


Exact Matching

• Given string x and pattern y, report where y occurs in x 

• Pattern tatt does not occur in tatat
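Both exact-matching examples can be reproduced with a naive sequential scan. This is the simple online approach, not the suffix-tree method, and the function name `occurrences` is an assumption; positions are reported 1-based to match the slides.

```python
def occurrences(x, y):
    """Return the 1-based positions where pattern y occurs in string x."""
    m = len(y)
    return [i + 1 for i in range(len(x) - m + 1) if x[i:i + m] == y]

print(occurrences("tatat", "ata"))   # [2]
print(occurrences("tatat", "tatt"))  # []
```

With a suffix tree of x built in advance, the same question is answered by walking down from the root along y instead of rescanning x for every pattern.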


Assumptions in indexing and searching

• We make the following assumptions.

 – We call n the size of the text database.

 – Whenever a pattern is searched, we assume that it is of length m, which is much smaller than n.

 – We call M the amount of main memory available.

 – The modifications which a text database undergoes are additions, deletions, and replacements of pieces of text of size n' < n.

