Page 1
JASS04 - Sequential Pattern Matching, Tobias Reichl
Joint Advanced Student School 2004
Complexity Analysis of String Algorithms
Sequential Pattern Matching: Analysis of Knuth-Morris-Pratt type algorithms
using the Subadditive Ergodic Theorem
19 April 2023
Page 2
Overview
1. Pattern Matching
   • Sequential Algorithms
   • Knuth-Morris-Pratt Algorithm
2. Probabilistic tools
   • Subadditive Ergodic Theorem
   • Martingales and Azuma's Inequality
3. Analysis of KMP-Algorithms
   • Properties of KMP
   • Establishing subadditivity
   • Analysis
Page 3
Pattern Matching
• Text $t_1^n$, pattern $p_1^m$
• Comparison: $M(l,k) = 1$ if $t_l$ is compared to $p_k$, and $M(l,k) = 0$ otherwise
• Alignment Position: $AP$ such that $M(AP + (k-1), k) = 1$ for some $k$.
[Figure: pattern p = abcde aligned against text t = xxxxxabxxxabcxxxabcde, marking a pattern-text comparison $M(l,k) = 1$ and the alignment position AP.]
Page 4
Sequential Algorithms - Definition
i. Semi-sequential: AP are non-decreasing.
ii. Strongly semi-sequential: (i) and comparisons $M(l_i, k_i)$ define non-decreasing text positions $l_i$.
iii. Sequential: (i) and $M(l,k) = 1 \Rightarrow t_{l-(k-1)}^{l-1} = p_1^{k-1}$
iv. Strongly sequential: (i), (ii) and (iii)
Text is compared only if following a prefix of the pattern. Example:
[Figure: pattern abcde aligned against text xxxxxabxxxabcxxxabcde.]
Page 5
Example: Naive / brute force algorithm
• Every text position is alignment position.
• Text is scanned until...
  – pattern is found - then done.
  – mismatch occurs - then shift by one and retry.
• Sequential algorithm.
[Figure: pattern abcde slid along text xxxxxabxxxabcxxxabcde, shifting by +1 after each mismatch.]
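The brute force procedure above can be sketched as follows. This is a minimal illustration, not the presenter's code: the function name `naive_search` and the 0-based alignment positions are assumptions made for the example. It counts every text-pattern comparison, which is the quantity analysed later in the talk.

```python
def naive_search(text, pattern):
    """Return (0-based alignment position of the first match or -1, comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for ap in range(n - m + 1):          # every text position is tried as an AP
        k = 0
        while k < m:
            comparisons += 1             # one text-pattern comparison
            if text[ap + k] != pattern[k]:
                break                    # mismatch: shift by one and retry
            k += 1
        if k == m:
            return ap, comparisons       # pattern found: done
    return -1, comparisons


position, count = naive_search("xxxxxabxxxabcxxxabcde", "abcde")
print(position, count)
```

On the slide's example the pattern is found at the last possible alignment, so almost every text position is visited as an alignment position.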
Page 6
Knuth-Morris-Pratt type algorithms (1)
• Idea: (Morris-Pratt) Disregard APs already known not to be followed by a prefix of p.
• Knowledge:
  – Already processed pattern
  – Pre-processing of p.
• Strongly sequential algorithm.
[Figure: pattern ababcde aligned against text xxxxxabxxxabcxxxabcde, then shifted by +S.]
Page 7
Knuth-Morris-Pratt type algorithms (2)
• Morris-Pratt: $S = \min\{k;\ \min\{s > 0 : p_{1+s}^{k-1} = p_1^{k-1-s}\}\}$
• Knuth-Morris-Pratt: $S = \min\{k;\ \min\{s : p_{1+s}^{k-1} = p_1^{k-1-s} \text{ and } p_k \neq p_{k-s}\}\}$
[Figure: the Morris-Pratt and Knuth-Morris-Pratt shifts of pattern ababcde against text xxxxxabxxxabcxxxabcde.]
(KMP also skips mismatching letters)
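The two shift rules can be evaluated directly from their definitions. The sketch below assumes that a mismatch occurred at pattern position k (1-indexed) after $p_1 \dots p_{k-1}$ matched the text, and that s ranges over 1, ..., k-1 in both rules; the names `mp_shift` and `kmp_shift` are made up for this example. It is a definition-level check, not the linear-time preprocessing used in practice.

```python
def mp_shift(p, k):
    """Morris-Pratt shift after a mismatch at pattern position k (1-indexed)."""
    for s in range(1, k):
        # p_{1+s}^{k-1} == p_1^{k-1-s}, written with 0-based slices
        if p[s:k - 1] == p[:k - 1 - s]:
            return s
    return k


def kmp_shift(p, k):
    """Knuth-Morris-Pratt shift: additionally requires p_k != p_{k-s}."""
    for s in range(1, k):
        if p[s:k - 1] == p[:k - 1 - s] and p[k - 1] != p[k - 1 - s]:
            return s
    return k


pattern = "ababcde"
for k in range(1, len(pattern) + 1):
    print(k, mp_shift(pattern, k), kmp_shift(pattern, k))
```

For the pattern ababcde and a mismatch at k = 3, the Morris-Pratt rule shifts by 2 while the Knuth-Morris-Pratt rule shifts by 3, because the shift by 2 would again compare the letter a that just mismatched.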
Page 8
Pattern Matching - Complexity
• Overall complexity: $c_{r,s}(t,p) = \sum_{l \in [r,s],\, k \in [1,m]} M(l,k)$, with $c_n := c_{1,n}$
• If pattern or text is a realization of a random sequence, we write $C_n$ instead of $c_n$.
• Question: complexity of KMP?
Page 9
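Subadditivity – Deterministic Sequence
Fekete (1923)
• Subadditivity: $x_{m+n} \le x_m + x_n \;\Rightarrow\; \lim_{n\to\infty} \frac{x_n}{n} = \inf_{m \ge 1} \frac{x_m}{m}$
• Superadditivity: $x_{m+n} \ge x_m + x_n \;\Rightarrow\; \lim_{n\to\infty} \frac{x_n}{n} = \sup_{m \ge 1} \frac{x_m}{m}$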
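A small worked instance of the subadditive case, added for illustration (the sequence $x_n = an + b$ with $b \ge 0$ is an assumed example, not taken from the talk):

```latex
x_n = an + b, \quad b \ge 0:
\qquad
x_{m+n} = a(m+n) + b \;\le\; (am + b) + (an + b) = x_m + x_n ,
\qquad
\lim_{n\to\infty} \frac{x_n}{n}
  = \lim_{n\to\infty} \Bigl(a + \frac{b}{n}\Bigr)
  = a
  = \inf_{m \ge 1} \frac{x_m}{m} .
```

The infimum is approached but not attained when $b > 0$, which is why the statement is about the limit and not about any finite $m$.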
Page 10
Example: Longest Common Subsequence
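• $L_{1,n} = \max\{K : X_{i_k} = Y_{j_k} \text{ for } 1 \le k \le K, \text{ where } 1 \le i_1 < i_2 < \dots < i_K \le n \text{ and } 1 \le j_1 < j_2 < \dots < j_K \le n\}$
• Superadditive: $L_{1,n} \ge L_{1,m} + L_{m,n}$
• Hence: $\lim_{n\to\infty} \frac{a_n}{n} = \sup_{m \ge 1} \frac{E[L_{1,m}]}{m} \overset{?}{=} 0.8284$, where $a_n = E[L_{1,n}]$
(Conjectured by Steele in 1982)
[Figure: X = abcdeabcdfabcab and Y = ababcafbcdabcde with LCS "abcabcdabc" (10); split into abcdeabc / ababcafb with LCS "abcab" (5) and dfabcab / cdabcde with LCS "dabc" (4).]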
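Superadditivity of the longest common subsequence can be checked on the example strings with a standard dynamic program. This is an illustrative sketch; the function name `lcs_length` and the split point 8 (read off the figure) are assumptions of the example.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                cur.append(prev[j - 1] + 1)          # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))  # drop a letter of x or of y
        prev = cur
    return prev[-1]


x, y, split = "abcdeabcdfabcab", "ababcafbcdabcde", 8
whole = lcs_length(x, y)
left = lcs_length(x[:split], y[:split])
right = lcs_length(x[split:], y[split:])
print(whole, left, right, whole >= left + right)
```

Concatenating a common subsequence of the left halves with one of the right halves always gives a common subsequence of the whole strings, which is exactly the superadditive inequality.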
Page 11
Subadditivity – "Almost subadditive"
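De Bruijn and Erdős (1952)
• $c_n$: positive and non-decreasing sequence with $\sum_{k=1}^{\infty} \frac{c_k}{k^2} < \infty$
• "Almost subadditive": $x_{m+n} \le x_m + x_n + c_{m+n}$
• Then the limit $\lim_{n\to\infty} \frac{x_n}{n}$ exists; unlike in the exactly subadditive case, it need not equal $\inf_{m \ge 1} \frac{x_m}{m}$.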
Page 12
Subadditive Ergodic Theorem
Kingman (1976), Liggett (1985)
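i. $X_{0,n} \le X_{0,m} + X_{m,n}$
ii. $\{X_{nk,(n+1)k},\ n \ge 1\}$ is a stationary sequence for each $k$
iii. The distribution of $\{X_{m,m+k},\ k \ge 1\}$ does not depend on $m$
iv. $E[X_{0,1}^{+}] < \infty$ and $E[X_{0,n}] \ge c_0 n$, where $c_0 > -\infty$
Then:
$\lim_{n\to\infty} \frac{E[X_{0,n}]}{n} = \inf_{m \ge 1} \frac{E[X_{0,m}]}{m} =: \alpha$
$\lim_{n\to\infty} \frac{X_{0,n}}{n} = \alpha$ (a.s.)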
Page 13
Almost Subadditive Ergodic Theorem
Derriennic (1983)
• Subadditivity can be relaxed to $X_{0,n} \le X_{0,m} + X_{m,n} + A_n$ with $\lim_{n\to\infty} \frac{E[A_n]}{n} = 0$
• Then, too: $\lim_{n\to\infty} \frac{X_{0,n}}{n} = \alpha$ (a.s.)
Page 14
Martingales
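• A sequence $Y_n = f(X_1, \dots, X_n)$, $n \ge 0$, is a martingale with respect to the filtration $F_n = \sigma(X_0, \dots, X_n)$ if for all $n \ge 0$: $E|Y_n| < \infty$ and
  $E[Y_{n+1} \mid X_0, X_1, \dots, X_n] = E[Y_{n+1} \mid F_n] = Y_n$
• $E[Y_{n+1} \mid F_n]$ defines a random variable depending on the knowledge contained in $X_1, \dots, X_n$.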
Page 15
Martingale Differences
• The martingale difference is defined as $D_n = Y_n - Y_{n-1}$, so that: $Y_n = Y_0 + \sum_{i=1}^{n} D_i$
• Observe: $E[D_{n+1} \mid F_n] = E[Y_{n+1} \mid F_n] - E[Y_n \mid F_n] = Y_n - Y_n = 0$
Page 16
Azuma's Inequality (1)
• Let $Y_n = f_n(X_1, \dots, X_n)$ be a martingale
• Define the martingale difference as $D_i = E[Y_n \mid F_i] - E[Y_n \mid F_{i-1}]$
  (The mean of the same element but depending on different knowledge)
• Observe: $E[Y_n \mid F_n] = Y_n$ and $E[Y_n \mid F_0] = E[Y_n]$, so
  $\sum_{i=1}^{n} D_i = E[Y_n \mid F_n] - E[Y_n \mid F_0] = Y_n - E[Y_n]$
  (Deviation from the mean)
Page 17
Hoeffding's Inequality
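• Let $\{Y_n\}_{n \ge 0}$ be a martingale
• Let there exist constants $c_n$ such that $|Y_n - Y_{n-1}| = |D_n| \le c_n$
• Then: $\Pr\{|Y_n - Y_0| \ge x\} = \Pr\{|\sum_{i=1}^{n} D_i| \ge x\} \le 2 \exp\left(-\frac{x^2}{2 \sum_{i=1}^{n} c_i^2}\right)$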
Page 18
Azuma's Inequality (2)
• Summary:
  – If $D_i$ is bounded, we know how to assess the deviation from the mean.
  – So now we need a bound on $D_i$.
• Trick: Let $\hat{X}_i$ be an independent copy of $X_i$.
• Then: $E[f_n(X_1, \dots, X_i, \dots, X_n) \mid F_{i-1}] = E[f_n(X_1, \dots, \hat{X}_i, \dots, X_n) \mid F_i]$
Page 19
Azuma's Inequality (3)
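• Hence:
  $D_i = E[f_n(X_1, \dots, X_i, \dots, X_n) \mid F_i] - E[f_n(X_1, \dots, X_i, \dots, X_n) \mid F_{i-1}]$
  $\quad = E[f_n(X_1, \dots, X_i, \dots, X_n) \mid F_i] - E[f_n(X_1, \dots, \hat{X}_i, \dots, X_n) \mid F_i]$
• And we can postulate: $|D_i| \le c_i$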
Page 20
Azuma's Inequality (4)
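• Let $Y_n = f_n(X_1, \dots, X_n)$ be a martingale
• If there exist constants $c_i$ such that $|f_n(X_1, \dots, X_i, \dots, X_n) - f_n(X_1, \dots, \hat{X}_i, \dots, X_n)| \le c_i$, where $\hat{X}_i$ is an independent copy of $X_i$
• Then: $\Pr\{|Y_n - E[Y_n]| \ge x\} = \Pr\{|f_n(X_1, \dots, X_n) - E[f_n(X_1, \dots, X_n)]| \ge x\} \le 2 \exp\left(-\frac{x^2}{2 \sum_{i=1}^{n} c_i^2}\right)$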
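To get a feeling for the size of this bound, it can be evaluated numerically. The helper `azuma_bound` below is a hypothetical name introduced for this sketch, and the example (a sum of n variables taking values in {-1, +1}, where replacing one variable changes the sum by at most 2) is an assumed illustration, not part of the talk.

```python
import math


def azuma_bound(x, c):
    """Right-hand side 2 * exp(-x^2 / (2 * sum(c_i^2))) of the inequality."""
    return 2.0 * math.exp(-x * x / (2.0 * sum(ci * ci for ci in c)))


# Sum of n = 100 variables in {-1, +1}: changing one of them moves the sum
# by at most 2, so c_i = 2 for every i (assumption of this example).
n = 100
for x in (10, 20, 40, 60):
    print(x, azuma_bound(x, [2.0] * n))
```

The bound is only informative once x is of larger order than $\sqrt{\sum_i c_i^2}$; for smaller deviations the right-hand side exceeds 1.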
Page 21
KMP: Unavoidable alignment positions
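• A position $i$ in the text is called unavoidable AP if for any $r, l$ with $r \le i$ and $l \ge i + m$ it is an AP when the algorithm is run on $t_r^l$.
• KMP-like algorithms have the same set of unavoidable alignment positions $U = \bigcup_{l=1}^{n} \{U_l\}$, where $U_l = \min\{\min_{1 \le k \le l}\{k : t_k^l \preceq p\},\ l+1\}$ and $t_k^l \preceq p$ means that $t_k^l$ is a prefix of $p$.
• Example:
[Figure: pattern abcde aligned against text xxxxxabxxxabcxxxabcde, with the text position $l$ and the unavoidable position $U_l$ marked.]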
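The set U can be computed by brute force for the example. The sketch assumes the reading $U_l = \min\{\min_{1 \le k \le l}\{k : t_k^l \text{ is a prefix of } p\},\ l+1\}$ with 1-indexed text positions; the function name `unavoidable_positions` is made up for this illustration.

```python
def unavoidable_positions(text, pattern):
    """Sorted set {U_l : 1 <= l <= n}, using 1-indexed text positions."""
    positions = set()
    for l in range(1, len(text) + 1):
        u = l + 1                                  # default when no t_k^l is a prefix
        for k in range(1, l + 1):
            if pattern.startswith(text[k - 1:l]):  # t_k^l is a prefix of p
                u = k                              # smallest such k
                break
        positions.add(u)
    return sorted(positions)


U = unavoidable_positions("xxxxxabxxxabcxxxabcde", "abcde")
print(U)
print(max(b - a for a, b in zip(U, U[1:])))        # largest jump between them
```

On this example the largest gap between consecutive unavoidable positions stays below the pattern length m = 5.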
Page 22
Pattern Matching: l-convergence
• An algorithm is l-convergent if there exists an increasing sequence of unavoidable alignment positions $\{U_i\}_{i=1}^{n}$ satisfying $U_{i+1} - U_i \le l$.
• l-convergence indicates the maximum size of "jumps" for an algorithm.
Page 23
KMP: Establishing m-convergence
• Let AP be an alignment position
• Define: $l = AP + m$
• $|p| = m \;\Rightarrow\; l - (m - 1) \le U_l \le l$
• Hence: $U_l - AP \le m$ and so KMP-like algorithms are m-convergent.
Page 24
KMP: Establishing subadditivity (1)
• If $c_n$ (number of comparisons) is subadditive we can prove linear complexity of KMP-like algorithms.
• We have to show: $c_n$ is (almost) subadditive: $c_{1,n} \le c_{1,r} + c_{r,n} + a$
• Approach: An l-convergent sequential algorithm satisfies: $|c_{1,n} - (c_{1,r} + c_{r,n})| \le m^2 + lm$
Page 25
KMP: Establishing subadditivity (2)
• Proof:
  – $U_r$: the smallest unavoidable AP greater than $r$.
  – We split $c_{1,n} - (c_{1,r} + c_{r,n})$ into $c_{1,n} - (c_{1,r} + c_{U_r,n})$ and $c_{r,n} - c_{U_r,n}$.
[Figure: text with positions $r$ and $U_r$ marked, showing the two differences.]
Page 26
KMP: Establishing subadditivity (3)
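• Comparisons done after $r$ with AP before $r$: $S_1 = \sum_{AP < r} \sum_{i \ge r} M(i, i - AP + 1) \le m^2$
• Comparisons with AP between $r$ and $U_r$: $S_2 = \sum_{r \le AP < U_r} \sum_{i \le m} M(AP + (i - 1), i) \le lm$
• No more than $m$ comparisons can be saved at $U_r$.
[Figure: text with positions $r$ and $U_r$ marked and the comparisons $S_1$ and $S_2$ highlighted; regions are labelled "contributing to $c_{1,n}$ and $c_{1,r}$", "contributing to $c_{1,n}$ only" and "contributing to $c_{1,n}$ and $c_{U_r,n}$".]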
Page 27
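KMP: Establishing subadditivity (4)
• Comparisons with AP between $r$ and $U_r$: $S_3 = \sum_{AP = r}^{U_r - 1} \sum_{i} M(AP + (i - 1), i) \le lm$
• No more than $m$ comparisons can be saved at $U_r$.
[Figure: text with positions $r$ and $U_r$ marked and the comparisons $S_3$ highlighted; regions are labelled "contributing to $c_{r,n}$ only" and "contributing to $c_{r,n}$ and $c_{U_r,n}$".]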
Page 28
KMP: Establishing subadditivity (5)
• So we are able to bound, using $S_1$, $S_2$ and $S_3$: $|c_{1,n} - (c_{1,r} + c_{r,n})| \le m^2 + lm$
• We have shown: $c_n$ is (almost) subadditive: $c_{1,n} \le c_{1,r} + c_{r,n} + a$
• Now we are able to apply the Subadditive Ergodic Theorem.
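The step from this bound to the almost subadditive ergodic theorem can be spelled out as follows. This is a sketch under two assumptions taken from the earlier slides: the bound $m^2 + lm$ for l-convergent sequential algorithms, and m-convergence of KMP-like algorithms (so $l = m$). The pattern length $m$ is treated as fixed while $n$ grows.

```latex
% m-convergence: l = m
|c_{1,n} - (c_{1,r} + c_{r,n})| \le m^2 + lm = 2m^2
\;\Longrightarrow\;
c_{1,n} \le c_{1,r} + c_{r,n} + a, \qquad a = 2m^2 .

% The correction term A_n = 2m^2 does not depend on n, hence
\lim_{n\to\infty} \frac{E[A_n]}{n} = \lim_{n\to\infty} \frac{2m^2}{n} = 0 ,
% which is the condition required for the relaxed (almost subadditive) theorem.
```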
Page 29
KMP: Different Modeling Assumptions
• Deterministic Model: Text and pattern are non-random.
• Semi-Random Model: Text is a realization of a stationary and ergodic sequence, pattern is given.
• Stationary Model: Both text and pattern are realizations of a stationary and ergodic sequence.
Page 30
KMP: Applying the Subadditive Ergodic Theorem
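• We have shown: $c_n$ is (almost) subadditive
• Deterministic Model: $\lim_{n\to\infty} \frac{\max_t c_n(t,p)}{n} = \alpha_1(p)$
• Semi-Random Model: $\lim_{n\to\infty} \frac{E_t[C_n(p)]}{n} = \alpha_2(p)$ and $\lim_{n\to\infty} \frac{C_n(p)}{n} = \alpha_2(p)$ (a.s.)
• Stationary Model: $\lim_{n\to\infty} \frac{E_{t,p}[C_n]}{n} = \alpha_3$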
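The semi-random statement can be illustrated empirically by counting comparisons on one random text. The sketch below uses a textbook Morris-Pratt scan that reports no matches and only counts comparisons; the names `failure_table` and `mp_comparisons`, the pattern abab, the uniform i.i.d. text over {a, b} and the fixed seed are all assumptions of this example. It shows one realization of $C_n / n$, not a proof of convergence and not a way to compute the constant.

```python
import random


def failure_table(p):
    """f[k] = length of the longest proper border of p[:k]; f[0] = -1 is a sentinel."""
    f = [-1] + [0] * len(p)
    k = -1
    for i, ch in enumerate(p):
        while k >= 0 and p[k] != ch:
            k = f[k]
        k += 1
        f[i + 1] = k
    return f


def mp_comparisons(text, p):
    """Number of text-pattern comparisons of a Morris-Pratt scan over the whole text."""
    f = failure_table(p)
    m, k, count = len(p), 0, 0
    for ch in text:
        while True:
            count += 1              # compare the text letter with p[k]
            if p[k] == ch:
                k += 1
                break
            if k == 0:
                break               # mismatch at the first pattern letter
            k = f[k]                # shift the pattern, keep the text position
        if k == m:
            k = f[m]                # full match: continue with the longest border
    return count


rng = random.Random(0)
for n in (1_000, 10_000, 100_000):
    text = "".join(rng.choice("ab") for _ in range(n))
    print(n, mp_comparisons(text, "abab") / n)
```

Each text letter is compared at least once and each extra comparison follows a shift, so the printed ratios always lie between 1 and 2.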
Page 31
KMP: Applying Azuma's Inequality
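• $C_n$ satisfies: $|C_n(T_1, \dots, T_i, \dots, T_n) - C_n(T_1, \dots, \hat{T}_i, \dots, T_n)| \le 2m^2$, where $\hat{T}_i$ is an independent copy of $T_i$.
• So, using Azuma's Inequality with $c_i = 2m^2$: $\Pr\{|C_n - n\alpha_2| \ge \varepsilon n\} \le 2 \exp\left(-\frac{\varepsilon^2 n}{8 m^4}(1 + o(1))\right)$
• $C_n$ is concentrated around its mean: $E[C_n] = \alpha_2 n (1 + o(1))$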
Conclusion
• Using the Subadditive Ergodic Theorem we can show that a linearity constant exists for both the worst and the average case, i.e. KMP has linear complexity.
• The Subadditive Ergodic Theorem proves the existence of this constant but says nothing about how to compute it.
• Using Azuma's Inequality we can show that the number of comparisons is well concentrated around its mean.