Data Structures and Algorithms (4)
Instructor: Ming Zhang
Textbook authors: Ming Zhang, Tengjiao Wang and Haiyan Zhao
Higher Education Press, 2008.6 (the "Eleventh Five-Year" national planning textbook)
https://courses.edx.org/courses/PekingX/04830050x/2T2014
// g is the cursor of T; compare pattern P with T starting at position g,
// and if that fails, continue the loop
for (int g = startindex; g <= T.length() - P.length(); g++) {
    int j; // declared outside the inner loop so it is still visible after it
    for (j = 0; (j < P.length()) && (T[g+j] == P[j]); j++) ;
    if (j == P.length())
        return g;
}
return (-1); // the loop ended, or startindex was too large: the match fails
}
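The fragment above starts after its function header. As a self-contained illustration, here is a minimal sketch of the same naive search. The name `naiveStrMatching`, its signature, and the signed-length guard are assumptions made for this example, not the textbook's exact code.

```cpp
#include <string>

// Hypothetical helper (name and signature are assumptions for illustration).
// Returns the index of the first occurrence of P in T at or after startIndex,
// or -1 if there is none.
int naiveStrMatching(const std::string& T, const std::string& P, int startIndex = 0) {
    const int n = static_cast<int>(T.length());
    const int m = static_cast<int>(P.length());
    // Signed lengths avoid the unsigned wrap-around of T.length() - P.length().
    if (startIndex < 0 || n - startIndex < m)
        return -1;
    for (int g = startIndex; g <= n - m; g++) {
        int j = 0;                        // position inside the pattern
        while (j < m && T[g + j] == P[j])
            j++;
        if (j == m)
            return g;                     // all m characters matched at g
    }
    return -1;
}
```

Calling `naiveStrMatching("abcabc", "abc", 1)` skips the occurrence at 0 and reports the one at index 3.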
4.3 Pattern Matching for Strings
Chapter Four
Strings
The original pattern matching algorithm: complexity analysis
• Assume the target string T has length n and the pattern P has length m, with m ≤ n
– In the worst case every pass of the outer loop fails, so there are (n − m + 1) passes.
– Each pass compares P with T character by character, which takes up to m comparisons in the worst case.
Thus, the worst-case time of the entire algorithm is:
O(mn)
Naïve matching algorithm: worst case
• The pattern is compared with every length-m substring of the target string, whose length is n
– Target string of the form $a^{n-1}b$
– Pattern of the form $a^{m-1}b$
• The total number of comparisons:
– m(n − m + 1)
• Time complexity:
– O(mn)

T = a a a a a a a a a a b
P = a a a a a a b              ✗
      a a a a a a b            ✗
        a a a a a a b          ✗
          a a a a a a b        ✗
            a a a a a a b      ✓
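To check the worst-case count m(n − m + 1) on a concrete input, the following sketch counts character comparisons while running the naive search. The helper name `countNaiveComparisons` is an assumption for this example; it stops at the first match, as the naive algorithm does.

```cpp
#include <string>

// Hypothetical helper: counts the character comparisons performed by naive
// matching of P against T, stopping at the first full match.
long countNaiveComparisons(const std::string& T, const std::string& P) {
    const int n = static_cast<int>(T.length());
    const int m = static_cast<int>(P.length());
    long comparisons = 0;
    for (int g = 0; g + m <= n; g++) {
        int j = 0;
        while (j < m) {
            comparisons++;                // one comparison of T[g+j] with P[j]
            if (T[g + j] != P[j])
                break;
            j++;
        }
        if (j == m)
            break;                        // pattern found, stop searching
    }
    return comparisons;
}
```

For a target of ten `a` followed by `b` (n = 11) and a pattern of six `a` followed by `b` (m = 7), each of the n − m + 1 = 5 alignments costs 7 comparisons, so the function returns 35 = m(n − m + 1).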
Naïve matching algorithm: best case, pattern found
• The pattern is found at the first m positions of the target string; take m = 5
• The total number of comparisons: m
• Time complexity: O(m)

T = AAAAAAAAAAAAAAAAAAAAAH
P = AAAAA                      five comparisons
Naïve matching algorithm: best case, pattern not found
• Every alignment mismatches on the first character
• The total number of comparisons:
– n − m + 1
• Time complexity:
– O(n)

T = AAAAAAAAAAAAAAAAAAAAAH
P = OOOOH                      one comparison
     OOOOH                     one comparison
      OOOOH                    one comparison
       OOOOH                   one comparison
    …………
                     OOOOH     one comparison
Thinking: redundant operations of the naïve algorithm
• The naïve algorithm is slow because it performs redundant operations.
• e.g.,
– From step 1) we know $p_5 \neq t_5$, $p_0 = t_0$ and $p_1 = t_1$; at the same time, from $p_0 \neq p_1$ we get $p_0 \neq t_1$. So when P moves one place right, the comparison in step 2) is sure to be unequal. That comparison is redundant.
– How many positions should P move right to eliminate the redundant operations without losing any match?

T a b a c a a b a c c a b a c a b a a
P a b a c a b
1) $p_5 \neq t_5$: P moves one place right
T a b a c a a b a c c a b a c a b a a
P   a b a c a b
2) $p_0 \neq t_1$: P moves one place right
T a b a c a a b a c c a b a c a b a a
P     a b a c a b
3) $p_1 \neq t_3$: P moves one place right
T a b a c a a b a c c a b a c a b a a
P       a b a c a b
…….
Outline
• The Basic Concept of Strings
• The Storage Structure of Strings
• The Implementation of Strings’ Operations
• Pattern Matching for Strings
– Naïve algorithm
– KMP algorithm
Match without Backtracking
• During matching, suppose $p_j$ turns out not to equal $t_i$, that is:
P.substr(0, j) == T.substr(i-j, j)
but $p_j \neq t_i$
– Which character $p_k$ of P should be compared with $t_i$ next?
– This determines how many positions the pattern moves right
– It is clear that k < j, and when j changes, k changes too
• Knuth-Morris-Pratt (KMP) algorithm
– The value of k depends only on the pattern P itself; it does not depend on the target string T
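KMP algorithm

$$
\begin{array}{cccccccccc}
T: & t_0\,t_1 \dots t_{i-j-1} & t_{i-j} & t_{i-j+1} & t_{i-j+2} & \dots & t_{i-2} & t_{i-1} & t_i & \dots\,t_{n-1} \\
 & & \| & \| & \| & & \| & \| & \neq & \\
P: & & p_0 & p_1 & p_2 & \dots & p_{j-2} & p_{j-1} & p_j & \\
\text{naive, next trip:} & & & p_0 & p_1 & \dots & p_{j-3} & p_{j-2} & p_{j-1} &
\end{array}
$$

We have
$$t_{i-j}\,t_{i-j+1}\,t_{i-j+2} \dots t_{i-1} = p_0\,p_1\,p_2 \dots p_{j-1} \qquad (1)$$
If
$$p_0\,p_1 \dots p_{j-2} \neq p_1\,p_2 \dots p_{j-1} \qquad (2)$$
you can immediately conclude:
$$p_0\,p_1 \dots p_{j-2} \neq t_{i-j+1}\,t_{i-j+2} \dots t_{i-1}$$
The next trip of naive matching will not match, so jump over it.

Example: T = a b c d e f a b c d e f f, P = a b c d e f f; the first mismatch is $t_6 \neq p_6$.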
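In the same way, if
$$p_0\,p_1 \dots p_{j-3} \neq p_2\,p_3 \dots p_{j-1}$$
the step after that does not match either, because
$$p_0\,p_1 \dots p_{j-3} \neq t_{i-j+2}\,t_{i-j+3} \dots t_{i-1}$$
Continue until a "k" appears (the length of the head and tail substrings) such that
$$p_0\,p_1 \dots p_k \neq p_{j-k-1}\,p_{j-k} \dots p_{j-1}$$
and
$$p_0\,p_1 \dots p_{k-1} = p_{j-k}\,p_{j-k+1} \dots p_{j-1}$$

$$
\begin{array}{ccccc}
t_{i-k} & t_{i-k+1} & \dots & t_{i-1} & t_i \\
\| & \| & & \| & \neq \\
p_{j-k} & p_{j-k+1} & \dots & p_{j-1} & p_j \\
\| & \| & & \| & \\
p_0 & p_1 & \dots & p_{k-1} & p_k
\end{array}
$$

So
$$p_0\,p_1 \dots p_{k-1} = t_{i-k}\,t_{i-k+1} \dots t_{i-1}$$
The pattern moves j − k positions right, and whether $p_k$ equals $t_i$ is the next comparison (?).

Example: T = a b c d e f a b c d e f f, P = a b c d e f f; the first mismatch is $t_6 \neq p_6$.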
String feature vector: N
Assume that P consists of m characters, written as
$$P = p_0\,p_1\,p_2\,p_3 \dots p_{m-1}$$
Use the feature vector N to represent the distribution characteristics of pattern P. It consists of m feature numbers $n_0 \dots n_{m-1}$, written as
$$N = n_0\,n_1\,n_2\,n_3 \dots n_{m-1}$$
N is also called the next array; each element $n_j$ corresponds to next[j]
Feature vector N of a string: construction
• The feature number of the j-th position of P is $n_j$, the length k of the longest matching head and tail substrings
– Head substring: $p_0\,p_1 \dots p_{k-2}\,p_{k-1}$
– Tail substring: $p_{j-k}\,p_{j-k+1} \dots p_{j-2}\,p_{j-1}$

$$
\mathit{next}[j] =
\begin{cases}
-1, & j = 0 \\
\max\{k : 0 < k < j \ \text{and}\ P[0 \dots k-1] = P[j-k \dots j-1]\}, & \text{if such a } k \text{ exists} \\
0, & \text{otherwise}
\end{cases}
$$
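The definition can be turned into code directly, which is useful for checking faster constructions against it. The sketch below is a deliberately slow brute-force version. The name `nextByDefinition` and the use of `std::vector<int>` are assumptions for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: computes next[] straight from the definition.
// For each j it tries k = j-1, j-2, ..., 1 and keeps the first (largest) k
// with P[0..k-1] == P[j-k..j-1]; if none exists, next[j] stays 0.
std::vector<int> nextByDefinition(const std::string& P) {
    const int m = static_cast<int>(P.length());
    std::vector<int> next(m, 0);
    if (m == 0)
        return next;
    next[0] = -1;
    for (int j = 1; j < m; j++) {
        for (int k = j - 1; k > 0; k--) {
            // compare the head substring of length k with the tail substring
            if (P.compare(0, k, P, j - k, k) == 0) {
                next[j] = k;
                break;
            }
        }
    }
    return next;
}
```

For P = `abababb` this yields -1 0 0 1 2 3 4.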
j =  0  1  2  3  4  5  6  7  8  9
P =  a  a  a  a  b  a  a  a  a  c
N = -1  0  1  2  3  0  1  2  3  4

Matching P against a target T (positions 0 to 14):
• i=2, j=2: mismatch (✗), N[j]=1; then i=2, j=1: mismatch (✗), N[j]=0
• i=7, j=4: mismatch (✗). Taking N[4]=1 here is wrong; it should be 3. With the smaller value the pattern moves too far right and the match is missed.
Example for Pattern Matching for Strings

j =  0  1  2  3  4  5  6
P =  a  b  a  b  a  b  b
N = -1  0  0  1  2  3  4

i =  0  1  2  3  4  5  6  7  8  9 10 11 12
T =  a  b  a  b  a  b  a  b  a  b  a  b  b
P =  a  b  a  b  a  b  b                       ✗  i=6, j=6, N[j]=4
           a  b  a  b  a  b  b                 ✗  i=8, j=6, N[j]=4
                 a  b  a  b  a  b  b           ✗  i=10, j=6, j'=4
                       a  b  a  b  a  b  b     √
KMP matching algorithm
int KMPStrMatching(string T, string P, int *N, int start) {
    int j = 0;                 // subscript variable of pattern
    int i = start;             // subscript variable of target string
    int pLen = P.length( );    // length of pattern
    int tLen = T.length( );    // length of target string
    if (tLen - start < pLen)   // if the target is shorter than the pattern,
        return (-1);           // matching cannot succeed
    while ( j < pLen && i < tLen) { // repeat comparisons to match
        if ( j == -1 || T[i] == P[j])
            i++, j++;
        else
            j = N[j];
    }
    if (j >= pLen)
        return (i-pLen);       // be careful with the subscript
    else
        return (-1);
}
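To try the matching loop without raw pointers, here is a minimal sketch that takes the next array as a `std::vector<int>`. The name `kmpFind`, the size check on `N`, and the choice to return -1 for an empty pattern are assumptions for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical wrapper around the same loop as KMPStrMatching.
// N must be the next array of P (N[0] == -1); start is the first target index tried.
int kmpFind(const std::string& T, const std::string& P,
            const std::vector<int>& N, int start = 0) {
    const int tLen = static_cast<int>(T.length());
    const int pLen = static_cast<int>(P.length());
    if (pLen == 0 || static_cast<int>(N.size()) != pLen ||
        start < 0 || tLen - start < pLen)
        return -1;                        // nothing sensible to search for
    int i = start;                        // cursor in the target, never moves back
    int j = 0;                            // cursor in the pattern
    while (j < pLen && i < tLen) {
        if (j == -1 || T[i] == P[j]) {
            i++;
            j++;
        } else {
            j = N[j];                     // slide the pattern right by j - N[j]
        }
    }
    return (j >= pLen) ? i - pLen : -1;
}
```

With the worked example above, `kmpFind("ababababababb", "abababb", {-1, 0, 0, 1, 2, 3, 4})` returns 6.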
Algorithm Framework for Seeking the Feature Value
• The feature value $n_j$ ($j > 0$, $0 \le n_{j+1} \le j$) is defined recursively, as follows:
1. $n_0 = -1$. For $n_{j+1}$ with $j \ge 0$, assume that the feature value of the previous position is $n_j$, and let $k = n_j$;
2. While $k \ge 0$ and $p_j \neq p_k$, let $k = n_k$; repeat Step 2 until the condition is no longer satisfied.
3. $n_{j+1} = k + 1$; // here k == -1 or $p_j = p_k$
Feature vector of String: N (non-optimized version)
int *findNext(string P) {
    int j, k;
    int m = P.length( );      // m is the length of pattern P
    assert( m > 0);           // if m=0, exit
    int *next = new int[m];   // allocate an integer array in dynamic storage
    assert( next != 0);       // if the allocation fails, exit
    next[0] = -1;
    j = 0; k = -1;
    while (j < m-1) {
        while (k >= 0 && P[k] != P[j]) // if not equal, use KMP to look for the head and tail substring
            k = next[k];      // k recursively looks forward
        j++; k++;
        next[j] = k;
    }
    return next;              // the caller must release it with delete[]
}
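The same construction can be written with `std::vector<int>` so that the caller does not have to free the array. This is a sketch under that assumption; the name `buildNext` is hypothetical and the loop is otherwise the one shown above.

```cpp
#include <string>
#include <vector>

// Hypothetical vector-based variant of findNext (non-optimized next array).
std::vector<int> buildNext(const std::string& P) {
    const int m = static_cast<int>(P.length());
    std::vector<int> next(m, -1);         // next[0] stays -1
    int j = 0;
    int k = -1;                           // k == next[j] at the top of each pass
    while (j < m - 1) {
        while (k >= 0 && P[k] != P[j])
            k = next[k];                  // fall back to a shorter head substring
        j++;
        k++;
        next[j] = k;
    }
    return next;
}
```

For P = `abababb` the result is -1 0 0 1 2 3 4, and for P = `abcaababc` it is -1 0 0 0 1 1 2 1 2.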
Seeking feature vector N

j =  0  1  2  3  4  5  6  7  8  9
P =  a  a  a  a  b  a  a  a  a  c
N = -1  0  1  2  3  0  1  2  3  4

j    k                                 head substring
1    0
2    1                                 a
3    2                                 a a
4    3, then falls back 2, 1, 0, -1    a a a
5    0
6    1                                 a
7    2                                 a a
8    3                                 a a a
9    4, then falls back 3, 2, 1, 0     a a a a
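Pattern moves j − k positions right

$$
\begin{array}{cccccccc}
t_{i-j} & t_{i-j+1} & t_{i-j+2} & \dots & t_{i-k} & t_{i-k+1} & \dots\ t_{i-1} & t_i \\
\| & \| & \| & & \| & \| & \| & \neq \\
p_0 & p_1 & p_2 & \dots & p_{j-k} & p_{j-k+1} & \dots\ p_{j-1} & p_j \\
 & & & & \| & \| & \| & ? \\
 & & & & p_0 & p_1 & \dots\ p_{k-1} & p_k
\end{array}
$$

$$p_0\,p_1 \dots p_{k-1} = t_{i-k}\,t_{i-k+1} \dots t_{i-1}$$
$t_i \neq p_j$; is $p_j = p_k$?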
KMP Matching

T = a a b c b a b c a a b c a a b a b c
P = a b c a a b a b c                        ✗  N[1] = 0
      a b c a a b a b c                      ✗  N[3] = 0
            a b c a a b a b c                ✗  N[0] = -1 (this line is redundant)
              a b c a a b a b c              ✗  N[6] = 2
                      a b c a a b a b c      √

P[3]==P[0] and P[3] ≠ T[4], so one more comparison (of P[0] with T[4]) is redundant

j    0  1  2  3  4  5  6  7  8
P    a  b  c  a  a  b  a  b  c
K       0  0  0  1  1  2  1  2
Feature vector of String: N (optimized version)
int *findNext(string P) {
    int j, k;
    int m = P.length( );      // m is the length of pattern P
    int *next = new int[m];   // allocate an integer array in dynamic storage
    next[0] = -1;
    j = 0; k = -1;
    while (j < m-1) {         // with j < m the index would go out of bounds
        while (k >= 0 && P[k] != P[j]) // if not equal, use KMP to look for the head and tail substring
            k = next[k];      // k looks forward recursively
        j++; k++;
        if (P[k] == P[j])
            next[j] = next[k]; // finding the value k is not affected by the optimization
        else
            next[j] = k;      // without the "if" test there is no optimization
    }
    return next;
}
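As with the non-optimized version, the optimized construction can be sketched with `std::vector<int>`. The name `buildNextOptimized` is an assumption for this example; the logic mirrors the loop above.

```cpp
#include <string>
#include <vector>

// Hypothetical vector-based variant of the optimized findNext.
// When P[k] == P[j], a mismatch at j would be followed by the same mismatch
// at k, so next[j] skips straight to next[k].
std::vector<int> buildNextOptimized(const std::string& P) {
    const int m = static_cast<int>(P.length());
    std::vector<int> next(m, -1);
    int j = 0;
    int k = -1;
    while (j < m - 1) {
        while (k >= 0 && P[k] != P[j])
            k = next[k];
        j++;
        k++;
        next[j] = (P[k] == P[j]) ? next[k] : k;
    }
    return next;
}
```

For P = `abcaababc` this gives -1 0 0 -1 1 0 2 0 0, and for P = `aaaab` it gives -1 -1 -1 -1 3, so a mismatch inside the run of `a` moves past the whole run in one step.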
Comparison of next arrays

j                              0   1   2   3   4   5   6   7   8
P                              a   b   c   a   a   b   a   b   c
k (non-optimized version)          0   0   0   1   1   2   1   2
$p_k = p_j$?                       ≠   ≠   ==  ≠   ==  ≠   ==  ==
next[j] (optimized version)   -1   0   0  -1   1   0   2   0   0
Time Analysis for the KMP algorithm
• The statement “j = N[j];” in the loop executes at most n times. The reasoning:
– Every time the statement “j = N[j];” is executed, j inevitably decreases (by at least one)
– Only “j++” can increase j, and it runs at most n times because it always accompanies “i++”
– Thus, if the statement “j = N[j];” were executed more than n times, j would become smaller than -1. That is impossible (sometimes j becomes -1, but it is then immediately increased by 1 and becomes 0)
• The time for constructing the N array is O(m)
Therefore, the time complexity of KMP is O(n+m)
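The bound can be observed by instrumenting the matching loop. The sketch below counts how often each branch runs. The struct `KmpTrace` and the function `kmpTrace` are hypothetical names for this example, and the next array is passed in by the caller.

```cpp
#include <string>
#include <vector>

struct KmpTrace {
    int position;      // match position, or -1 if P does not occur in T
    long increments;   // executions of "i++, j++"
    long fallbacks;    // executions of "j = N[j]"
};

// Hypothetical instrumented copy of the KMP matching loop.
// N must be the next array of P (N[0] == -1).
KmpTrace kmpTrace(const std::string& T, const std::string& P,
                  const std::vector<int>& N) {
    const int tLen = static_cast<int>(T.length());
    const int pLen = static_cast<int>(P.length());
    KmpTrace trace{-1, 0, 0};
    int i = 0;
    int j = 0;
    while (j < pLen && i < tLen) {
        if (j == -1 || T[i] == P[j]) {
            i++;
            j++;
            trace.increments++;           // bounded by n, since i only grows
        } else {
            j = N[j];
            trace.fallbacks++;            // each one undoes at least one increment
        }
    }
    if (pLen > 0 && j >= pLen)
        trace.position = i - pLen;
    return trace;
}
```

For T = `aabcbabcaabcaababc` (n = 18) and P = `abcaababc` with the non-optimized next array -1 0 0 0 1 1 2 1 2, the match is found at position 9 after 18 increments and 4 fallbacks, well within the bound of n.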
Summary: single-pattern matching algorithms

Algorithm                   Time efficiency for preprocessing   Time efficiency for matching
Naïve matching algorithm    0                                   Θ(nm)
KMP                         Θ(m)                                Θ(n)
BM                          Θ(m)                                Best (n/m), Worst Θ(nm)
shift-or, shift-and         Θ(m+|Σ|)                            Θ(n)
Rabin-Karp                  Θ(m)                                Average (n+m), Worst Θ(nm)
Finite state automaton      Θ(m|Σ|)                             Θ(n)
Different Versions of the Feature Vector
If match fails on the j-th character, let j=next[j]
If match fails on the j-th character, let j=next[j-1]
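The slide does not spell out how the second version is defined. One common reading, assumed here, is the prefix function, where the value at position j is the longest head and tail substring length of P[0..j], so a mismatch at j falls back to the value stored at j − 1. The names `prefixFunction` and `kmpFindWithPrefixFunction` are hypothetical.

```cpp
#include <string>
#include <vector>

// Assumed definition: pi[j] = length of the longest proper prefix of P[0..j]
// that is also a suffix of P[0..j]. Under this reading, next[j] == pi[j-1] for j >= 1.
std::vector<int> prefixFunction(const std::string& P) {
    const int m = static_cast<int>(P.length());
    std::vector<int> pi(m, 0);
    for (int j = 1, k = 0; j < m; j++) {
        while (k > 0 && P[j] != P[k])
            k = pi[k - 1];
        if (P[j] == P[k])
            k++;
        pi[j] = k;
    }
    return pi;
}

// Matching with the "j = next[j-1]" rule; returns the first match position or -1.
int kmpFindWithPrefixFunction(const std::string& T, const std::string& P) {
    const int n = static_cast<int>(T.length());
    const int m = static_cast<int>(P.length());
    if (m == 0 || n < m)
        return -1;
    const std::vector<int> pi = prefixFunction(P);
    int j = 0;                            // number of pattern characters matched
    for (int i = 0; i < n; i++) {
        while (j > 0 && T[i] != P[j])
            j = pi[j - 1];                // mismatch at j: fall back via position j-1
        if (T[i] == P[j])
            j++;
        if (j == m)
            return i - m + 1;
    }
    return -1;
}
```

For P = `abababb` the prefix function is 0 0 1 2 3 4 0, which is the next array -1 0 0 1 2 3 4 shifted by one position, and the search in `ababababababb` again reports position 6.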