Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis E1-215b [email protected]

people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf

Jun 22, 2020


Advanced Algorithm Design and Analysis (Lecture 5)

SW5 fall 2005 Simonas Šaltenis [email protected]

AALG, lecture 5, © Simonas Šaltenis, 2005 2

Text-Search Data Structures

Goals of the lecture:

Finish discussing text-searching algorithms:
• Boyer-Moore-Horspool

Dictionary ADT for strings:
• to understand the principles of tries, compact tries, and Patricia tries

Text-searching data structures:
• to understand and be able to analyze text-searching algorithms that use the suffix tree and the Pat tree


Reverse naïve algorithm

Why not search from the end of P? (Boyer and Moore)

Reverse-Naive-Search(T,P)
  for s ← 0 to n – m
    j ← m – 1 // start from the end
    // check if T[s..s+m–1] = P[0..m–1]
    while T[s+j] = P[j] do
      j ← j – 1
      if j < 0 return s
  return –1

The running time is exactly the same as that of the naïve algorithm…
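The reverse naïve search above can be sketched in Python as follows. This is only an illustrative transcription: the function name and the explicit `j >= 0` guard are assumptions made for the sketch, not part of the slides.

```python
def reverse_naive_search(T, P):
    """Return the first shift s with T[s:s+m] == P, or -1 if there is none."""
    n, m = len(T), len(P)
    for s in range(n - m + 1):
        j = m - 1                      # start from the end of the pattern
        while j >= 0 and T[s + j] == P[j]:
            j -= 1                     # characters agree, move left
        if j < 0:                      # every character of P matched
            return s
    return -1


print(reverse_naive_search("tea kettle", "kettle"))  # 4
```

Every alignment s is still tried in turn, so in the worst case the number of character comparisons is the same as for the left-to-right naïve search.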


Occurrence heuristic

Boyer and Moore added two heuristics to the reverse naïve algorithm to get an O(n+m) algorithm, but it is complex.

Horspool suggested using just the modified occurrence heuristic: after a mismatch, align T[s + m–1] with the rightmost occurrence of that letter in the pattern P[0..m–2].

Examples:
• T = “detective date” and P = “date”
• T = “tea kettle” and P = “kettle”


Shift table

In preprocessing, compute the shift table of size |Σ|:

\[
\mathrm{shift}[w] =
\begin{cases}
m - 1 - \max\{\, i < m-1 \mid P[i] = w \,\} & \text{if } w \text{ is in } P[0..m-2],\\
m & \text{otherwise.}
\end{cases}
\]

Example: P = “kettle”
• shift[e] = 4, shift[l] = 1, shift[t] = 2, shift[k] = 5
• shift[any other letter] = 6

Example: P = “pappar”
• What is the shift table?
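A minimal sketch of the shift-table computation in Python. It assumes a dictionary keyed by character in place of an array of size |Σ|, so letters that do not occur in P[0..m–2] are simply absent and are treated as having the default shift m. The name `shift_table` is made up for the example.

```python
def shift_table(P):
    """Map each letter of P[0..m-2] to m - 1 - (index of its rightmost occurrence)."""
    m = len(P)
    shift = {}
    for k in range(m - 1):          # the last character P[m-1] is skipped
        shift[P[k]] = m - 1 - k     # later (more rightward) occurrences overwrite earlier ones
    return shift


table = shift_table("kettle")
print(table)                 # {'k': 5, 'e': 4, 't': 2, 'l': 1}
print(table.get("x", 6))     # 6, the default m for any other letter
```

Scanning the pattern from left to right and overwriting is what selects the rightmost occurrence, which is the "max" in the definition.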


Boyer-Moore-Horspool Alg.

BMH-Search(T,P)
  // compute the shift table for P
  for c ← 0 to |Σ| – 1
    shift[c] ← m // default values
  for k ← 0 to m – 2
    shift[P[k]] ← m – 1 – k
  // search
  s ← 0
  while s ≤ n – m do
    j ← m – 1 // start from the end
    // check if T[s..s+m–1] = P[0..m–1]
    while T[s+j] = P[j] do
      j ← j – 1
      if j < 0 return s
    s ← s + shift[T[s + m–1]] // shift by last letter
  return –1
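The pseudocode above might be written in Python roughly as follows. As assumptions of this sketch, the shift table is a dictionary with an implicit default of m rather than an array over Σ, and an empty pattern is taken to match at position 0.

```python
def bmh_search(T, P):
    """Boyer-Moore-Horspool: first occurrence of P in T, or -1."""
    n, m = len(T), len(P)
    if m == 0:
        return 0                                   # assumption: empty pattern matches at 0
    # Preprocessing: rightmost occurrence of each letter in P[0..m-2].
    shift = {P[k]: m - 1 - k for k in range(m - 1)}
    s = 0
    while s <= n - m:
        j = m - 1                                  # compare from the end of P
        while j >= 0 and T[s + j] == P[j]:
            j -= 1
        if j < 0:
            return s
        s += shift.get(T[s + m - 1], m)            # shift by the last letter of the window
    return -1


print(bmh_search("detective date", "date"))   # 10
print(bmh_search("tea kettle", "kettle"))     # 4
```

On “detective date” the window moves through shifts 0, 4, 8, 10, so only four alignments are examined before the match is found.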


BMH Analysis

Worst-case running time:
• Preprocessing: O(|Σ|+m)
• Searching: O(nm) (What input gives this bound?)
• Total: O(nm)

Space: O(|Σ|), independent of m

Very fast on real-world data sets
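One way to see the O(nm) searching bound is to count character comparisons on an input such as T = aaa…a and P = baa…a. Each alignment matches m–1 letters before failing on the first one, and the shift for “a” is only 1. The instrumented function below is a sketch for this experiment; its name and return convention are assumptions.

```python
def bmh_count(T, P):
    """Run BMH and return (position or -1, number of character comparisons)."""
    n, m = len(T), len(P)
    shift = {P[k]: m - 1 - k for k in range(m - 1)}
    comparisons = 0
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0:
            comparisons += 1               # one comparison of T[s+j] with P[j]
            if T[s + j] != P[j]:
                break
            j -= 1
        if j < 0:
            return s, comparisons
        s += shift.get(T[s + m - 1], m)
    return -1, comparisons


n, m = 20, 4
print(bmh_count("a" * n, "b" + "a" * (m - 1)))   # (-1, 68), i.e. (n - m + 1) * m comparisons
```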


Comparison

Let’s compare the algorithms. Criteria:
• Worst-case running time (preprocessing, searching)
• Expected running time
• Space used
• Implementation complexity


Dictionary ADT for Strings

Dictionary ADT for strings – stores a set of text strings:
• search(x) – checks if string x is in the set
• insert(x) – inserts a new string x into the set
• delete(x) – deletes the string equal to x from the set of strings

Assumptions, notation:
• n strings, N characters in total
• m – length of x
• Size of the alphabet d = |Σ|


BST of Strings

We can, of course, use binary search trees. Some issues:
• Keys are of varying length
• A lot of strings share similar prefixes (beginnings) – potential for saving space

Let’s count comparisons of characters.
• What is the worst-case running time of searching for a string of length m?


Tries

Trie – a data structure for storing a set of strings (the name comes from the word “retrieval”). Let’s assume that all strings end with “$” (not in Σ).

[Figure: trie for the set of strings below, one character per edge, each string terminated by “$”]

Set of strings: {bear, bid, bulk, bull, sun, sunday}


Tries II

Properties of a trie:
• A multi-way tree.
• Each non-leaf node has from 1 to d children.
• Each edge of the tree is labeled with a character.
• Each leaf node corresponds to a stored string, which is the concatenation of the characters on the path from the root to this node.


Search and Insertion in Tries

The search algorithm just follows the path down the tree (starting with Trie-Search(root, P[0..m])).

Trie-Search(t, P[k..m]) // searches for P in t
  if t is leaf then return true
  else if t.child(P[k]) = nil then return false
  else return Trie-Search(t.child(P[k]), P[k+1..m])

How would the delete work?

Trie-Insert(t, P[k..m]) // inserts string P into t
  if t is not leaf then // otherwise P is already present
    if t.child(P[k]) = nil then
      create a new child of t and a “branch” starting with that child and storing P[k..m]
    else Trie-Insert(t.child(P[k]), P[k+1..m])
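A compact way to try out trie search and insertion is sketched below in Python. It is iterative rather than recursive and uses a dictionary of children per node (the hash-table variant of the node structure). The class and function names are assumptions for the illustration, and “$” is appended inside the functions as the terminator.

```python
class TrieNode:
    def __init__(self):
        self.children = {}          # character -> TrieNode


def trie_insert(root, word):
    """Insert word (terminated by '$') into the trie rooted at root."""
    node = root
    for ch in word + "$":
        # Follow the existing edge, or start a new branch if it is missing.
        node = node.children.setdefault(ch, TrieNode())


def trie_search(root, word):
    """Return True iff word was inserted before."""
    node = root
    for ch in word + "$":
        node = node.children.get(ch)
        if node is None:
            return False
    return True                     # reached the leaf below the '$' edge


root = TrieNode()
for w in ["bear", "bid", "bulk", "bull", "sun", "sunday"]:
    trie_insert(root, w)
print(trie_search(root, "sun"), trie_search(root, "sund"))   # True False
```

Because of the terminator, a proper prefix such as “sund” is rejected even though the path for it exists in the trie.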


Trie Node Structure

“Implementation detail”: What is the node structure? That is, what is the complexity of the t.child(c) operation?
• An array of child pointers of size d: waste of space, but child(c) is O(1)
• A hash table of child pointers: less waste of space, child(c) is expected O(1)
• A list of child pointers: compact, but child(c) is O(d) in the worst case
• A binary search tree of child pointers: compact, and child(c) is O(lg d) in the worst case


Analysis of the Trie

Size: O(N) in the worst case

Search, insertion, and deletion (string of length m), depending on the node structure: O(dm), O(m lg d), O(m)
• Compare with the string BST

Observation: having chains of one-child nodes is wasteful


Compact Tries

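Compact trie:
• Replace a chain of one-child nodes with an edge labeled with a string
• Each non-leaf node (except the root) has at least two children

[Figure: the trie for {bear, bid, bulk, bull, sun, sunday} and the corresponding compact trie. Root edges: “b” and “sun”; below “b”: “ear$”, “id$”, “ul”; below “ul”: “k$”, “l$”; below “sun”: “$”, “day$”]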
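The chain-merging step can be sketched on a trie stored as nested dictionaries (edge label mapped to subtree, an empty dictionary for a leaf). This representation and the function names are assumptions chosen to keep the example short. A practical implementation would store (from, to) indices instead of label strings.

```python
def build_trie(words):
    """Plain trie as nested dicts: one character per edge, '$' terminates a word."""
    root = {}
    for w in words:
        node = root
        for ch in w + "$":
            node = node.setdefault(ch, {})
    return root


def compact(node):
    """Merge every chain of one-child nodes into a single string-labeled edge."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:                 # child has exactly one outgoing edge
            (ch, grandchild), = child.items()
            label += ch                        # absorb that edge into the label
            child = grandchild
        out[label] = compact(child)
    return out


trie = build_trie(["bear", "bid", "bulk", "bull", "sun", "sunday"])
print(compact(trie))
# {'b': {'ear$': {}, 'id$': {}, 'ul': {'k$': {}, 'l$': {}}}, 'sun': {'$': {}, 'day$': {}}}
```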


Compact Tries II

Implementation:
• Strings are external to the structure, stored in one array; edges are labeled with indices into the array (from, to)

Can be used to do word matching: find where a given word appears in the text.
• Use the compact trie to “store” all words in the text
• Each leaf in the compact trie has a list of indices in the text where the corresponding word appears


Word Matching with Tries

To find a word P:
• At each node (after matching P[0..k–1]), follow the edge (i,j) such that P[k..k+j–i] = T[i..j]
• If there is no such edge, P does not occur in T; otherwise, find all starting indexes of P when a leaf is reached.

T (positions numbered from 1): they think that we were there and there

[Figure: compact trie of the words of T. Edge labels are index pairs into T: (1,2), (3,3), (4,5), (8,11), (14,16), (17,18), (19,19), (22,24), (28,30), (31,34). Leaf lists of starting positions: and: 31; that: 12; think: 6; there: 25, 35; they: 1; we: 17; were: 20]


Word Matching with Tries II

Building a compact trie for a given text:
• How do you do that? Describe the compact trie insertion procedure
• Running time: O(N)

Complexity of word matching: O(m)

What if the text is in external memory?
• In the worst case we do O(m) I/O operations just to access single characters in the text – not efficient


Patricia trie

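Patricia trie:
• a compact trie where each edge’s label (from, to) is replaced by (T[from], to – from + 1)

T (positions numbered from 1): they think that we were there and there

[Figure: Patricia trie of the words of T. Edge labels are (first character, length) pairs: (t,2), (e,1), (y,2), (i,4), (a,3), (w,2), (_,1), (r,3), (r,3), (a,4), where “_” denotes a space. Leaf lists of starting positions: and: 31; that: 12; think: 6; there: 25, 35; they: 1; we: 17; were: 20]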


Querying Patricia Trie

Word prefix query: find all words in T that start with P[0..m–1]

Patricia-Search(t, P, k) // searches for P in t
01 if t is leaf then
02   j ← the first index in the t.list
03   if T[j..j+m–1] = P[0..m–1] then
04     return t.list // exact match
05 else if there is a child-edge (P[k],s) then
06   if k + s < m then
07     return Patricia-Search(t.child(P[k]), P, k+s)
08   else go to any descendant leaf of t.child(P[k]) and do the check of line 03; if it is true, return the lists of all descendant leaves of t.child(P[k]), otherwise return nil
09 else return nil // nothing is found
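The query can be tried out with the following Python sketch. Several things here are assumptions of the sketch rather than details from the slides: nodes are plain dictionaries, words are terminated with “$” instead of the following space, positions are 0-based, the pattern is non-empty, and the trie is built recursively from a word index rather than by repeated insertion.

```python
import os
import re


def index_words(T):
    """Map each '$'-terminated word of T to the list of its 0-based start positions."""
    words = {}
    for match in re.finditer(r"\S+", T):
        words.setdefault(match.group() + "$", []).append(match.start())
    return words


def build_patricia(words, k=0):
    """Build a node for words that all agree on their first k characters."""
    if len(words) == 1:
        (positions,) = words.values()
        return {"list": positions}                      # leaf
    groups = {}
    for word, positions in words.items():
        groups.setdefault(word[k], {})[word] = positions
    edges = {}
    for first_char, group in groups.items():
        common = len(os.path.commonprefix(list(group)))  # depth where the group splits
        edges[first_char] = (common - k, build_patricia(group, common))
    return {"edges": edges}                             # edge = (skip length, child)


def leaf_lists(t):
    """Concatenate the position lists of all leaves below t."""
    if "list" in t:
        return list(t["list"])
    return [p for _, child in t["edges"].values() for p in leaf_lists(child)]


def patricia_search(t, T, P, k=0):
    """Start positions of all words of T that begin with P (P must be non-empty)."""
    m = len(P)
    if "list" in t:                                     # leaf: verify against the text
        j = t["list"][0]
        return list(t["list"]) if T[j:j + m] == P else []
    edge = t["edges"].get(P[k])
    if edge is None:
        return []                                       # nothing is found
    skip, child = edge
    if k + skip < m:
        return patricia_search(child, T, P, k + skip)
    positions = leaf_lists(child)                       # pattern used up: one check in T
    j = positions[0]
    return sorted(positions) if T[j:j + m] == P else []


T = "they think that we were there and there"
trie = build_patricia(index_words(T))
print(patricia_search(trie, T, "the"))     # [0, 24, 34]
print(patricia_search(trie, T, "thinx"))   # []
```

The query “thinx” follows the same edges as “think” would, and is rejected only by the single comparison against the text at the end, which is the point of postponing the check.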


Analysis of the Patricia Trie

Idea of the Patricia trie – postpone the actual comparison with the text to the end:
• If the text is in external memory, only O(1) I/Os are performed (if the trie fits in main memory)

Build a Patricia trie for word matching:

T (positions numbered from 1): føtex har haft en fødselsdag

Usually binary Patricia tries are used:
• Consider the binary encoding of the text (and queries)
• Each node in the tree has two children (left for 0, right for 1)
• Edges are labeled just with skip values (in bits)


Text-Search Problem

Input:
• Text T = “carrara”
• Pattern P = “ar”

Output:
• All occurrences of P in T

Reformulate the problem:
• Find all suffixes of T that have P as a prefix!
• We already saw how to do a word prefix query.

Suffixes of T: carrara, arrara, rrara, rara, ara, ra, a
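The reformulation can be checked directly with a brute-force Python sketch: enumerate the suffixes and keep those that start with P. Positions are 0-based here, and the function name is an assumption. This only illustrates the equivalence and is not an efficient search.

```python
def occurrences_via_suffixes(T, P):
    """Start positions i such that the suffix T[i:] has P as a prefix."""
    return [i for i in range(len(T)) if T[i:].startswith(P)]


print(occurrences_via_suffixes("carrara", "ar"))   # [1, 4]
```

The two hits correspond to the suffixes “arrara” and “ara”.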


Suffix Trees

Suffix tree – a compact trie (or similar structure) of all suffixes of the text
• A Patricia trie of suffixes is sometimes called a Pat tree

T (positions 1–8): carrara$

[Figure: the suffix tree of carrara$ with string edge labels, and the corresponding Pat tree with (first character, length) edge labels; the leaves hold the starting positions 1–7 of the suffixes]


Pat Trees: Analysis

Text search for P is then a prefix query.
• Running time: O(m+z), where z is the number of answers
• Just O(1) I/Os if the text is in external memory (independent of z)!

The size of the Pat tree: O(N)
• Why?

Advantage of compression: the size of the simple trie of suffixes would be, in the worst case, N + (N–1) + (N–2) + … + 1 = O(N²)
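The quadratic size of the uncompressed trie of suffixes can be observed with a small Python sketch. It builds the simple (not compact) suffix trie by inserting every suffix, answers a text search by walking down P, and counts nodes. The dictionary-based node layout, the per-node list of start positions, and the function names are assumptions made for the illustration. Positions are 0-based.

```python
def build_suffix_trie(T):
    """Insert all suffixes of T, one character per edge."""
    root = {"children": {}, "starts": []}
    for i in range(len(T)):
        node = root
        for ch in T[i:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(i)     # suffix i passes through this node
    return root


def find(root, P):
    """Start positions of all occurrences of a non-empty P."""
    node = root
    for ch in P:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["starts"]


def count_nodes(node):
    """Number of nodes below node (the root itself is not counted)."""
    return sum(1 + count_nodes(c) for c in node["children"].values())


print(find(build_suffix_trie("carrara$"), "ar"))   # [1, 4]
print(count_nodes(build_suffix_trie("abcd")))      # 10 = 4 + 3 + 2 + 1
```

For a text whose characters are all distinct, no two suffixes share a path, so the node count is exactly N + (N–1) + … + 1.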


Constructing Suffix Trees

The naïve algorithm:
• Insert all suffixes one after another: O(N²)

Clever algorithms: O(N)
• McCreight, Ukkonen
• Scan the text from left to right, use additional suffix links in the tree

Question: What does the Pat tree look like after inserting the first five suffixes using the naïve algorithm?

T (positions 1–9): Honolulu$