
11 String Matching

Nov 22, 2014


tieppv
Page 1: 11 String Matching

String Matching


Page 2: 11 String Matching

String Matching Algorithms

Finding Patterns in a given Text

Data structures: Tries, Suffix-Tries, Suffix Arrays

Algorithms:

Naive Approach

Boyer-Moore

Rabin-Karp

Knuth-Morris-Pratt (KMP)

Literature: Dan Gusfield, Algorithms on Strings, Trees, and Sequences

CLRS (Cormen et al.), Introduction to Algorithms


Page 3: 11 String Matching

Naive Approach


n = text.size();
m = pattern.size();
for s = 0 to n - m {
    // compare the pattern with the text window at shift s (1-based indices)
    if (pattern[1 .. m] == text[s+1 .. s+m]) add_result(s);
}

For T = a^n, P = a^m and m = n/2 the worst case occurs, yielding a running time of Θ(n^2).
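The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation; the function name `naive_search` is an assumption, and shifts are reported 0-based.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every 0-based shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    results = []
    for s in range(n - m + 1):
        # Comparing the window costs up to m character comparisons,
        # which gives the (n - m + 1) * m worst case.
        if text[s:s + m] == pattern:
            results.append(s)
    return results


print(naive_search("abracadabra", "abra"))  # [0, 7]
```

For the worst-case input from the slide (text and pattern consisting only of the letter a), every window matches in full, so no comparison can stop early.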


Page 4: 11 String Matching

Rabin-Karp


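n = text.length();
m = pattern.length();
hpattern = hash(pattern[0 .. m-1])
htext = hash(text[0 .. m-1])
for s = 0 to n - m {
    if (htext == hpattern)
        // hashes agree: verify to rule out a collision
        if (pattern[0 .. m-1] == text[s .. s+m-1])
            add_result(s);
    // hash of the next window; a rolling hash updates this in O(1)
    if (s < n - m)
        htext = hash(text[s+1 .. s+m])
}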
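The slide leaves the hash function open. One common choice is a polynomial rolling hash, sketched below in Python. The name `rabin_karp` and the parameters `base` and `mod` are assumptions made for this example, not values taken from the slides.

```python
def rabin_karp(text: str, pattern: str,
               base: int = 256, mod: int = 1_000_000_007) -> list[int]:
    """Return all 0-based shifts where pattern occurs, using a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the leading window character
    hpattern = htext = 0
    for i in range(m):
        hpattern = (hpattern * base + ord(pattern[i])) % mod
        htext = (htext * base + ord(text[i])) % mod
    results = []
    for s in range(n - m + 1):
        # Equal hashes may be a collision, so verify character by character.
        if htext == hpattern and text[s:s + m] == pattern:
            results.append(s)
        if s < n - m:
            # Drop text[s], shift the window, append text[s + m]: O(1) work.
            htext = ((htext - ord(text[s]) * high) * base
                     + ord(text[s + m])) % mod
    return results


print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```

The constant-time window update is what makes the expected running time linear; without it, each hash would cost O(m) and nothing would be gained over the naive approach.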


Page 5: 11 String Matching

Properties of Rabin-Karp

Properties of Rabin-Karp-Algorithm

Worst case running time (as for the naive approach) is O((n − m + 1)m).

Good on average, i.e. O(n + m) expected time.


Page 6: 11 String Matching

Boyer-Moore

Compare right → left.

It is possible that some text characters are never compared.

Good explanation in Dan Gusfield, Algorithms on Strings, Trees, and Sequences

Bad char shifts
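A minimal Python sketch of right-to-left comparison with bad character shifts only. It deliberately omits the good suffix rule, so it is a simplified variant rather than full Boyer-Moore. The name `bad_character_search` is an assumption for this example.

```python
def bad_character_search(text: str, pattern: str) -> list[int]:
    """Right-to-left matching using only the bad character rule."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Rightmost index at which each character occurs in the pattern.
    last = {c: i for i, c in enumerate(pattern)}
    results = []
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            results.append(s)
            s += 1
        else:
            # Align the mismatching text character with its rightmost
            # occurrence in the pattern; always move at least one step.
            s += max(1, j - last.get(text[s + j], -1))
    return results


print(bad_character_search("abracadabra", "abra"))  # [0, 7]
```

In this run the window jumps from shift 1 directly to shift 5, because the text character c does not occur in the pattern. Text characters skipped this way are never compared.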


Page 7: 11 String Matching

Boyer-Moore, strong good suffix rule

(strong) good suffix rule

T: prstabstubabvqxrst
            *
P:   qcabdabdab

Comparing right to left, the suffix "ab" of P matches and the next comparison fails at the marked position (d in P against b in T). The rightmost other occurrence of "ab" in P that is preceded by a character different from d starts at position 3 of P, so P is shifted right by 6 places:

T: prstabstubabvqxrst
P:         qcabdabdab
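The shift in this example can be checked with a brute-force sketch of the strong good suffix rule as described in Gusfield. Real implementations precompute a shift table in linear time; the function below is only an illustration, and its name and 0-based `mismatch` parameter are assumptions.

```python
def strong_good_suffix_shift(pattern: str, mismatch: int) -> int:
    """Shift after a mismatch at 0-based index `mismatch` of the pattern,
    given that pattern[mismatch + 1:] matched the text."""
    m = len(pattern)
    t = pattern[mismatch + 1:]  # the matched suffix
    k = len(t)
    if k == 0:
        return 1  # nothing matched yet: the rule gives no information
    # Case 1: rightmost other copy of t whose preceding character differs
    # from the mismatching pattern character (the "strong" condition).
    for start in range(m - k - 1, -1, -1):
        if pattern[start:start + k] == t and (
            start == 0 or pattern[start - 1] != pattern[mismatch]
        ):
            return (mismatch + 1) - start
    # Case 2: longest proper suffix of t that is a prefix of the pattern.
    for length in range(k - 1, 0, -1):
        if pattern[:length] == t[k - length:]:
            return m - length
    # Case 3: no overlap possible, move past the whole window.
    return m


# Slide example: mismatch at the second d (0-based index 7).
print(strong_good_suffix_shift("qcabdabdab", 7))  # 6
```

The copy of "ab" at 0-based index 5 is skipped because it is preceded by d, the same character that just failed. Skipping it is what distinguishes the strong rule from the weak one.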


Page 13: 11 String Matching

Properties of Boyer-Moore

Properties of Boyer-Moore-Algorithm

Worst case O(n) if the pattern does not occur in the text.

Best case O(n/m) running time.

In practice one of the best algorithms known for string matching.

For details see e.g. http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_s


Page 14: 11 String Matching

Properties of KMP

Properties of KMP-Algorithm

Worst case running time is O(n).

In practice usually slower than Boyer-Moore, but easier to code.

no details here

extension: Aho-Corasick for matching multiple strings in one pass
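Although the slides skip the details, a short Python sketch shows why KMP counts as easy to code. It follows the standard prefix-function formulation (as in CLRS); the names `prefix_function` and `kmp_search` are assumptions for this example.

```python
def prefix_function(pattern: str) -> list[int]:
    """pi[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    pi = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return all 0-based shifts where pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    pi = prefix_function(pattern)
    results = []
    k = 0  # number of pattern characters currently matched
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = pi[k - 1]  # fall back without re-reading the text
        if c == pattern[k]:
            k += 1
        if k == m:
            results.append(i - m + 1)
            k = pi[k - 1]
    return results


print(kmp_search("abracadabra", "abra"))  # [0, 7]
```

The text index only moves forward, which is where the O(n) worst-case bound comes from.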


Page 15: 11 String Matching

Tries


data structure for a set of strings

each node corresponds to a prefix of some string

each edge corresponds to a character

example stolen from wikipedia: to, tea, ten, i, in, and inn

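Trie figure (each node is labelled with its prefix, numbers are the values stored at the keys):

(root)
  t
    te
      tea (3)
      ten (12)
    to (7)
  i (11)
    in (5)
      inn (9)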
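A minimal Python sketch of such a trie, with one dictionary of child edges per node. The class names `TrieNode` and `Trie` are hypothetical, and the assignment of the numbers to the keys is an assumption based on the Wikipedia figure the slide refers to.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # edge character -> child node
        self.value = None   # set only if the prefix is a stored key


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key: str, value) -> None:
        node = self.root
        for ch in key:
            # Follow the edge for ch, creating the node if it is missing.
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def get(self, key: str):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None  # no stored string has this prefix
        return node.value


trie = Trie()
# Values assumed to match the figure.
for key, value in [("to", 7), ("tea", 3), ("ten", 12),
                   ("i", 11), ("in", 5), ("inn", 9)]:
    trie.insert(key, value)

print(trie.get("ten"))  # 12
print(trie.get("te"))   # None: "te" is a prefix, not a stored key
```

Insertion and lookup both take time proportional to the key length, independent of how many strings are stored.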


Page 16: 11 String Matching

Suffix-Trees/Tries/Arrays

Suffix-Tries/Trees

preprocessing the text not the pattern

tree containing every suffix of a text (size?)

Fast searching for any substring

trie→tree: one edge for paths without branches

there are linear-time construction algorithms for suffix trees (clearly linear size)

Suffix Arrays

array of length |S| listing the suffixes of S in ascending lexicographic order

(simple) search in O(m log n) time

a simple construction in O(n^2 log n) time and O(n) space is often sufficient
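The simple variant can be sketched in Python as a comparison sort of the suffix start positions followed by two binary searches. The function names are assumptions for this example; the comparator copies at most two suffixes at a time, so the extra memory stays linear.

```python
from functools import cmp_to_key


def build_suffix_array(s: str) -> list[int]:
    """Start positions of all suffixes of s in lexicographic order.
    O(n log n) comparisons of O(n) each, i.e. O(n^2 log n) time."""
    def compare(i: int, j: int) -> int:
        a, b = s[i:], s[j:]
        return (a > b) - (a < b)
    return sorted(range(len(s)), key=cmp_to_key(compare))


def find_occurrences(s: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of pattern in s, via binary search on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


sa = build_suffix_array("banana")
print(sa)                                     # [5, 3, 1, 0, 4, 2]
print(find_occurrences("banana", sa, "ana"))  # [1, 3]
```

Each binary search step compares at most m characters, which gives the O(m log n) search time stated above.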
