Efficient algorithms for (δ,γ,α)-matching
Szymon Grabowski, Computer Engineering Dept., Tech. Univ. of Łódź, Poland
Kimmo Fredriksson, Dept. of Computer Science, Univ. of Joensuu, Finland
PSC, Prague, August 2006
Problem setting
String matching in its classic form: given a text T = t_0 t_1 ... t_{n–1} and a pattern P = p_0 p_1 ... p_{m–1} over a finite alphabet Σ of size σ, report all occurrences of P in T.
Such a simple problem variant (exact matching) is not very useful for many applications. For example, it is of little use in music information retrieval (MIR) and molecular biology.
Several approximate matching models have thus been developed...
K. Fredriksson & Sz. Grabowski, Efficient algorithms for (δ,γ,α)-matching
Models & applications – music information retrieval
We allow classes of characters: the classes are continuous intervals (of equal width, 2δ+1, for all pattern positions).
This corresponds to handling small distortions of the melody (singer / whistler unskilled or under the influence...).
A limit γ (< mδ) on the sum of the individual errors.
Gaps are also allowed – this is to skip ornamentation (esp. in classical music). We assume all gaps are in the [0, α] range.
Transposition invariance – the key of the melody can be arbitrary, i.e. everything can be shifted up or down by a fixed value.
(Future work, hopefully...)
Problem we consider here
(δ,γ,α)-matching
Two symbols a, b ∈ Σ δ-match (we write a =δ b) iff |a – b| ≤ δ.
We say that a pattern P (δ,γ,α)-matches the text substring t_{i_0} t_{i_1} ... t_{i_{m–1}} if p_j =δ t_{i_j} for j ∈ {0 ... m–1}, where 0 < i_{j+1} – i_j ≤ α+1, and
$\sum_{j=0}^{m-1} |p_j - t_{i_j}| \le \gamma$.
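The definition can be checked directly for one candidate sequence of text positions. The sketch below is only an illustration: the function name `is_match`, the use of integer pitch values, and the `positions` argument are assumptions introduced here, not part of the talk.

```python
def is_match(pattern, text, positions, delta, gamma, alpha):
    """Check whether pattern (delta,gamma,alpha)-matches text at the given positions.

    positions[j] plays the role of i_j in the definition (hypothetical helper).
    """
    if len(positions) != len(pattern):
        return False
    if any(not (0 <= i < len(text)) for i in positions):
        return False
    # Gap condition: 0 < i_{j+1} - i_j <= alpha + 1.
    for prev, cur in zip(positions, positions[1:]):
        if not (0 < cur - prev <= alpha + 1):
            return False
    # Delta condition on every symbol, gamma condition on the sum of errors.
    errors = [abs(p - text[i]) for p, i in zip(pattern, positions)]
    return all(e <= delta for e in errors) and sum(errors) <= gamma
```

For instance, with pattern [60, 62, 64] and text [60, 70, 63, 64], the positions [0, 2, 3] skip one text symbol and carry a total error of 1, so they form a match for δ = 1, γ = 1, α = 1 but not for γ = 0 or α = 0.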
Previous work on similar models
(δ,α)-matching:
Crochemore et al., 2002: O(mn) time (worst, avg, and best case).
Cantone et al., 2005a: also O(mn) in every case, but finds not only the end positions of the occurrences but also all the matching sequences.
Cantone et al., 2005b: O(n) on avg (for constant α), retaining O(mn) in the worst case.
Navarro & Raffinot, 2003; Cantone et al., 2005b: nondeterministic finite automaton with O(n mα / w) worst case time.
Along these lines: Fredriksson & Grabowski, 2006: more compact automaton with O(n m log(α) / w) worst case time.
Fredriksson & Grabowski, 2006: bit-parallel alg with O(nδ + ⌈n/w⌉ m) worst case time.
Surprisingly little work specifically on the (δ,γ,α)-matching problem...
Crochemore et al., 2002: dynamic programming alg, runs in O(mn) worst-case time. Uses a min-queue.
Of course, a brute-force DP alg is also possible: O(mnα) time, but it may be faster in practice than the more sophisticated alg above (as α is usually small).
Our contributions
We improve the basic dynamic programming based algorithm to run in O(nα δ/σ) average time.
We propose a simple sparse DP alg with O(n) avg time and O(min(mn, |M|α)) worst-case time, where M = { (i,j) | p_i =δ t_j }.
We develop a bit-parallel algorithm that runs in O(nδ + mn log γ / w) worst case time. Its avg time complexity is close to O(n log γ α (δ/σ) / w + n), assuming small α.
Basic dynamic programming
Let us have a matrix D, with each cell (i, j) corresponding to the search state of the pattern prefix p_0 ... p_i in text T.
More precisely, a γ-bounded value of D_{i,j} (i.e. D_{i,j} ≤ γ) denotes that p_0 ... p_i matches T with end position j.
Brute-force computation takes O(mnα) time and O(n) space (it is enough to store only the current and previous row).
We can also proceed column-wise: same time but O(αm) space instead.
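A minimal sketch of the brute-force row-wise DP may help here. The recurrence is not spelled out on this slide, so the one used below is an assumption inferred from the problem definition: D_{i,j} is |p_i – t_j| plus the smallest D_{i–1,j'} over the α+1 preceding columns, and a cell is treated as dead (infinite) when the δ or γ condition fails. The function name `dp_brute_force` is hypothetical.

```python
import math

def dp_brute_force(pattern, text, delta, gamma, alpha):
    """Return the end positions j at which the whole pattern matches (O(mn*alpha))."""
    n = len(text)
    INF = math.inf
    # Row 0: p_0 may start anywhere; keep only cells within delta and gamma.
    prev = []
    for t in text:
        err = abs(pattern[0] - t)
        prev.append(err if err <= delta and err <= gamma else INF)
    # Only the previous and the current row are stored: O(n) space.
    for p in pattern[1:]:
        cur = [INF] * n
        for j in range(1, n):
            err = abs(p - text[j])
            if err > delta:
                continue
            # Brute-force scan of the window j-alpha-1 .. j-1 in the previous row.
            best = min(prev[max(0, j - alpha - 1):j])
            if best + err <= gamma:
                cur[j] = best + err
        prev = cur
    return [j for j, d in enumerate(prev) if d <= gamma]
```

The inner `min` over a window of at most α+1 cells is what gives the extra α factor in the running time.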
Cut-off trick for improving the avg time (Ukkonen, 1985; Cantone et al., 2005)
Usually, calculating all the matrix cells is overkill.
Observation: if D_{i...m–1, j–α...j} > γ then D_{i+1...m–1, j+1} > γ.
Read: it’s not so easy to get out of a ‘dead zone’.
DP-CO, cont’d
The avg time is O(n (αδ/σ)2). (Pessimistic analysis; we weren’t able to take the γ restriction into account.)
The worst case remains O(mnα), but as in (Crochemore et al., 2002) it can be improved to O(mn). The difference is that we handle m queues as we proceed column-wise.
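The cut-off idea can be sketched as a column-wise DP that only computes rows up to one past the deepest live cell in the previous α+1 columns. This is an illustrative sketch under assumptions: the name `dp_cutoff` and the bookkeeping via a deque of (column, deepest live row) pairs are choices made here, and the window is scanned by brute force, so the worst case stays O(mnα) (the m min-queues are not implemented).

```python
import math
from collections import deque

def dp_cutoff(pattern, text, delta, gamma, alpha):
    """Column-wise DP with cut-off; returns end positions of matches."""
    m = len(pattern)
    INF = math.inf
    # The previous alpha+1 columns, each stored with its deepest live row (-1 if none).
    window = deque(maxlen=alpha + 1)
    matches = []
    for j, t in enumerate(text):
        top = max((col_top for _, col_top in window), default=-1)
        rows = min(m, top + 2)          # rows below top+1 are in the dead zone
        col = [INF] * rows
        col_top = -1
        for i in range(rows):
            err = abs(pattern[i] - t)
            if err > delta:
                continue
            if i == 0:
                best = 0
            else:
                # Some window column has a live row >= i-1, so this is non-empty.
                best = min(c[i - 1] for c, _ in window if i - 1 < len(c))
            if best + err <= gamma:
                col[i] = best + err
                col_top = i
        if rows == m and col[m - 1] <= gamma:
            matches.append(j)
        window.append((col, col_top))
    return matches
```

On inputs where few prefixes stay alive, most columns are only a cell or two tall, which is where the average-case saving comes from.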
Simple algorithm (ingenious name, eh?)
In a few words: the naïve brute-force DP algorithm, but applied only locally.
We work on lists L_i, corresponding to individual rows.
We start with L_0 = { j | t_j =δ p_0 } (obtained in O(n) time).
For i = 1...m–1: L_i = { j | t_j =δ p_i AND D_{i–1,j’} + |p_i – t_j| ≤ γ AND 0 < j – j’ ≤ α+1 for some j’ ∈ L_{i–1} }
We put each j only once into L_i (if there are many j’ that can cause it, we choose the one that minimizes the new D_{i,j}).
Obtaining list L_i takes O(α|L_{i–1}|) time.
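The list-based scheme above might look roughly as follows. This is a sketch, not the authors' implementation: each list L_i is represented here as a dictionary from text position to error sum, and the name `simple_sparse` is made up for the example.

```python
def simple_sparse(pattern, text, delta, gamma, alpha):
    """Sparse DP over lists L_i; returns end positions of matches."""
    n = len(text)
    # L_0: positions that delta-match p_0 (and do not already exceed gamma).
    current = {}
    for j, t in enumerate(text):
        err = abs(pattern[0] - t)
        if err <= delta and err <= gamma:
            current[j] = err
    for p in pattern[1:]:
        nxt = {}
        # Each live cell (i-1, j') tries to extend to the next alpha+1 columns.
        for jp, d in current.items():
            for j in range(jp + 1, min(n, jp + alpha + 2)):
                err = abs(p - text[j])
                if err <= delta and d + err <= gamma:
                    # Keep each j once, with the smallest error sum.
                    if j not in nxt or d + err < nxt[j]:
                        nxt[j] = d + err
        current = nxt
    return sorted(current)
```

Each row costs O(α|L_{i–1}|) because every entry of the previous list is extended to at most α+1 positions.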
Simple algorithm, cont’d
Complexity
All lists have length |M| in total in the worst case, which implies O(|M|α) worst case time.
But: (i) on average this is much better, (ii) we can somewhat improve the worst case.
Simple algorithm, cont’d
Average case analysis
The length of list L_0 is O(n δ/σ) on avg. Hence L_1 is computed in O(n α δ/σ) avg time. But its avg length is only O(n δ/σ · α δ/σ).
...
In general, computing L_i takes O(n (α δ/σ)^i) avg time.
The total time is the sum of m such components.
Note that α, δ, σ are fixed for a given problem instance. In other words, α δ/σ can be considered a constant.
If the constant α (2δ+1)/σ is less than 1, we have a geometric series with O(n) sum.
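The geometric-series argument can be written out as follows. This is a sketch under the assumption (not stated explicitly on the slide) that text symbols are uniformly random, so that a pattern symbol δ-matches a text symbol with probability at most (2δ+1)/σ. The symbols q and c are names introduced here for illustration only.

```latex
\[
q = \frac{\alpha(2\delta+1)}{\sigma} < 1
\quad\Longrightarrow\quad
\sum_{i=0}^{m-1} c\,n\,q^{i}
\;\le\; c\,n \sum_{i=0}^{\infty} q^{i}
\;=\; \frac{c\,n}{1-q}
\;=\; O(n).
\]
```

The i = 0 term covers the O(n) scan that builds L_0.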
Simple algorithm, cont’d
Improving the worst case
Idea: avoid brute-force handling of overlapping windows of size α+1.
We make use of a min-queue (Gajewska & Tarjan, 1986), similarly to the concept from (Crochemore et al., 2002).
The queue always keeps up to α+1 integers, namely the error sums corresponding to the sliding window area in the previous row. For each processed cell, 0 or 1 values are inserted at the front of the queue (O(1) time) and from 0 to α+1 values are deleted from the tail. But we can’t remove more than we’ve inserted, hence O(1) amortized cost per cell.
This improves the worst-case time complexity to O(min(mn, |M|α)).
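For readers unfamiliar with min-queues, here is a minimal sliding-window-minimum sketch built on a monotonic deque. It is a common textbook variant and is only assumed to behave like the structure cited above. The class name `MinQueue` and the helper `sliding_min` are hypothetical.

```python
from collections import deque

class MinQueue:
    """Minimum over the last `width` pushed values, O(1) amortized per push."""

    def __init__(self, width):
        self.width = width
        self.items = deque()  # (index, value) pairs with increasing values
        self.count = 0        # number of values pushed so far

    def push(self, value):
        # Values no smaller than the new one can never be the minimum again.
        while self.items and self.items[-1][1] >= value:
            self.items.pop()
        self.items.append((self.count, value))
        self.count += 1
        # Drop the oldest entry once it has slid out of the window.
        if self.items[0][0] <= self.count - 1 - self.width:
            self.items.popleft()

    def minimum(self):
        return self.items[0][1] if self.items else None

def sliding_min(values, width):
    """Minimum of each window ending at every position (illustrative helper)."""
    q = MinQueue(width)
    out = []
    for v in values:
        q.push(v)
        out.append(q.minimum())
    return out
```

Every value is appended once and removed at most once, which is the "can't remove more than we've inserted" argument in code form. With width = α+1 this would replace the brute-force window scan in the previous row.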
Bit-parallelism technique (in stringology)
Baeza-Yates (1989) noticed that CPU registers are usually longer than 1 bit...
And he made use of this fact.
In O(1) time we can perform operations like logical and (&), or (|), shifts (<<, >>), etc. on a whole machine word (usu. 32 or 64 bits).
Nowadays, bit-parallelism is a very popular technique in string matching algorithms, in theory and in practice.
It is also useful for many approximate matching variants.
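As a concrete illustration of the technique, here is a sketch of the classic Shift-And algorithm for exact matching (Baeza-Yates & Gonnet). It is not one of the algorithms of this talk. It assumes a non-empty pattern, and Python integers stand in for machine words, so the pattern length is not limited to w here.

```python
def shift_and(pattern, text):
    """Exact matching with Shift-And; returns start positions of occurrences."""
    m = len(pattern)  # assumed to be at least 1
    # masks[c] has bit i set iff pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)
    state = 0  # bit i set iff pattern[0..i] matches the text ending here
    out = []
    for j, c in enumerate(text):
        # One shift, one OR, and one AND update all m prefix states at once.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            out.append(j - m + 1)
    return out
```

The whole automaton state lives in one word, so each text symbol costs a constant number of word operations when m ≤ w.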
Bit-parallel dynamic programming
Modified DP alg: let the cells of D be chunks of O(log γ) bits. We’ll be able to compute O(w / log γ) cells in parallel.
More precisely, each cell will use l + 1 bits, where l = ⌈log2(2γ+1)⌉.
Error sum zero will be encoded as 2^{l–1} – (γ+1); γ+1 (the lowest ‘illegal’ value) will thus be 2^{l–1}.
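The biased encoding can be tried out with a few lines of code. The sketch below assumes γ ≥ 1 and error sums no larger than 2γ+1. The helper names (`counter_layout`, `pack`, `overflow_flags`) are made up for the example, and the extra (l+1)-th bit of each field is simply left at zero, since its role is not described in the text above.

```python
def counter_layout(gamma):
    """Return (l, bias) with l = ceil(log2(2*gamma + 1)), assuming gamma >= 1."""
    l = (2 * gamma).bit_length()
    bias = (1 << (l - 1)) - (gamma + 1)   # encoding of error sum zero
    return l, bias

def pack(error_sums, gamma):
    """Pack biased error sums into one integer, l + 1 bits per cell."""
    l, bias = counter_layout(gamma)
    word = 0
    for k, e in enumerate(error_sums):
        word |= (e + bias) << (k * (l + 1))
    return word

def overflow_flags(word, count, gamma):
    """Keep only bit l-1 of each cell; it is set iff that cell exceeds gamma."""
    l, _ = counter_layout(gamma)
    mask = 0
    for k in range(count):
        mask |= 1 << (k * (l + 1) + l - 1)
    return word & mask
```

With γ = 3 we get l = 3 and a bias of 0, so the error sum 4 = γ+1 is the first value whose bit 2 is set. A single AND with the mask then tests all packed cells against γ at once.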