Efficient algorithms for (δ,γ,α)-matching


Szymon Grabowski, Computer Engineering Dept., Tech. Univ. of Łódź, Poland


PSC, Prague, August 2006

Kimmo Fredriksson, Dept. of Computer Science, Univ. of Joensuu, Finland



Problem setting

String matching in its classic form: given a text T = t0 t1 ... tn–1 and a pattern P = p0 p1 ... pm–1 over a finite alphabet Σ of size σ, report all occurrences of P in T.

Such a simple problem variant (exact matching) is not very useful for many applications. For example, it is not relevant for music information retrieval (MIR) and molecular biology.

Several approximate matching models have thus been developed...

K. Fredriksson & Sz. Grabowski, Efficient algorithms for (δ,γ,α)-matching


Models & applications – music information retrieval

We allow classes of characters: the classes are continuous intervals (of equal width, 2δ+1, for all pattern positions).

This corresponds to handling small distortions of the melody (singer / whistler unskilled or under the influence...).

A limit γ (< mδ) on the sum of the individual errors.

Gaps also allowed – this is to skip ornamentation (esp. in classical music). We assume all gaps are in [0, α] range.

Transposition invariance: the key of the melody can be arbitrary, i.e. everything can be shifted up or down by a fixed value. (Future work, hopefully...)


Problem we consider here

(δ,γ,α)-matching

Two symbols a, b ∈ Σ δ-match (we write a =δ b) iff |a – b| ≤ δ.

We say that a pattern P (δ,γ,α)-matches the text substring $t_{i_0} t_{i_1} \dots t_{i_{m-1}}$ if $p_j =_\delta t_{i_j}$ for $j \in \{0, \dots, m-1\}$, where $0 < i_{j+1} - i_j \le \alpha + 1$, and

$$\sum_{j=0}^{m-1} |p_j - t_{i_j}| \le \gamma.$$
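To make the definition concrete, here is a minimal Python sketch that checks whether a given sequence of text positions is a (δ,γ,α)-match. The function name and the representation of symbols as integers (e.g. MIDI pitches) are assumptions made for illustration only.

```python
def is_dga_match(pattern, text, positions, delta, gamma, alpha):
    """Check whether text[positions[0]], ..., text[positions[m-1]]
    is a (delta, gamma, alpha)-match of pattern.

    Hypothetical helper: symbols are assumed to be integers and
    positions are assumed to be valid indices into text.
    """
    if len(positions) != len(pattern):
        return False
    # Gap condition: 0 < i_{j+1} - i_j <= alpha + 1
    for a, b in zip(positions, positions[1:]):
        if not 0 < b - a <= alpha + 1:
            return False
    errors = [abs(p - text[i]) for p, i in zip(pattern, positions)]
    # delta condition on every symbol, gamma condition on the error sum
    return all(e <= delta for e in errors) and sum(errors) <= gamma
```

With pattern [60, 62, 64] and text [60, 70, 63, 64, 50], the positions [0, 2, 3] skip one text symbol (a gap of 1) and incur a total error of 1.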


Previous work on similar models

(δ,α)-matching:

Crochemore et al., 2002: O(mn) time (worst, avg, and best case).

Cantone et al., 2005a: also O(mn) in every case, but finds not only the end positions of the occurrences but also all the matching sequences.

Cantone et al., 2005b: O(n) on avg (for constant α), retaining O(mn) in the worst case.

Navarro & Raffinot, 2003; Cantone et al., 2005b: nondeterministic finite automaton with O(n ⌈mα / w⌉) worst case time.

Along these lines: Fredriksson & Grabowski, 2006: more compact automaton with O(n ⌈m log(α) / w⌉) worst case time.

Fredriksson & Grabowski, 2006: bit-parallel alg with O(nδ + ⌈n / w⌉ m) worst case time.



Surprisingly little work specifically on the (δ,γ,α)-matching problem...

Crochemore et al., 2002: dynamic programming alg, runs in O(mn) worst-case time. Uses a min-queue.

Of course, a brute-force DP alg is also possible: O(mnα) time, but it may be faster in practice than the more sophisticated alg above (as α is usually small).



Our contributions

We improve the basic dynamic programming based algorithm to run in O(nα δ/σ) average time.

We propose a simple sparse DP alg with O(n) avg time and O(min(mn, |M|α)) worst-case time,

where M = { (i,j) | pi =δ tj }.

We develop a bit-parallel algorithm that runs in O(nδ + mn log γ / w) worst case time.

Its avg time complexity is close to O(n log γ α (δ/σ) / w + n), assuming small α.



Basic dynamic programming

Let us have matrix D, with each cell (i, j) corresponding to the search state of pattern prefix p0 ... pi in text T.

More precisely, a value Di,j ≤ γ denotes that p0 ... pi matches T with end position j.

Brute-force computation in O(mnα) time and O(n) space (it is enough to store only the current and previous rows).

We can also proceed column-wise: same time but O(αm) space instead.
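The row-wise brute-force computation can be sketched as follows. This is a minimal illustration under assumptions (integer symbols, a non-empty pattern, hypothetical function name), keeping only the previous and current rows as described above; it is not the authors' C implementation.

```python
INF = float("inf")  # marks cells whose error sum exceeds gamma

def dp_match_ends(pattern, text, delta, gamma, alpha):
    """Return the end positions j at which pattern (delta,gamma,alpha)-matches.

    Brute-force row-wise DP: O(m * n * alpha) time, O(n) space.
    """
    n = len(text)

    def err(p, t):
        d = abs(p - t)
        return d if d <= delta else INF

    # Row 0: a prefix of length 1 may start anywhere.
    prev = [err(pattern[0], t) for t in text]
    prev = [d if d <= gamma else INF for d in prev]
    for p in pattern[1:]:
        curr = [INF] * n
        for j in range(n):
            e = err(p, text[j])
            if e == INF:
                continue
            # Best predecessor among the alpha+1 cells j-alpha-1 .. j-1
            # of the previous row (gap between 0 and alpha).
            lo = max(0, j - alpha - 1)
            best = min(prev[lo:j], default=INF)
            if best + e <= gamma:
                curr[j] = best + e
        prev = curr
    return [j for j, d in enumerate(prev) if d <= gamma]
```

The slice `prev[lo:j]` is the source of the α factor in the running time.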


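Cut-off trick for improving the avg time (Ukkonen, 1985; Cantone et al., 2005)

Usually, calculating all the matrix cells is overkill.

Observation: if $D_{i \dots m-1,\; j-\alpha \dots j} > \gamma$ then $D_{i+1 \dots m-1,\; j+1} > \gamma$. That is, if rows i ... m–1 are all above γ in columns j–α ... j, then rows i+1 ... m–1 are above γ in column j+1.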

Read: it’s not so easy to get out of a ‘dead zone’.
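One way to exploit this observation, sketched here under assumptions: proceed column-wise, keep the last α+1 columns trimmed to their deepest cell that is ≤ γ, and in the next column compute only one row beyond that depth. The function name and data layout are hypothetical, and this is an illustrative reading of the cut-off idea rather than the authors' implementation.

```python
from collections import deque

INF = float("inf")  # marks cells whose error sum exceeds gamma

def dp_cutoff_match_ends(pattern, text, delta, gamma, alpha):
    """Column-wise DP with cut-off; returns the matching end positions."""
    m = len(pattern)
    # The previous alpha+1 columns. Each column is trimmed so that its
    # length equals (index of its deepest cell <= gamma) + 1.
    window = deque(maxlen=alpha + 1)
    ends = []
    for j, t in enumerate(text):
        # Rows at or below `depth` are dead in every window column, so by
        # the observation only rows 0 .. depth can be alive in column j.
        depth = max((len(c) for c in window), default=0)
        col = []
        for i in range(min(m, depth + 1)):
            e = abs(pattern[i] - t)
            if e > delta:
                col.append(INF)
                continue
            if i == 0:
                best = 0
            else:
                best = min((c[i - 1] for c in window if len(c) >= i),
                           default=INF)
            d = best + e
            col.append(d if d <= gamma else INF)
        while col and col[-1] == INF:   # trim trailing dead cells
            col.pop()
        if len(col) == m:               # last pattern row is alive
            ends.append(j)
        window.append(col)
    return ends
```

In regions of the text where nothing δ-matches, the trimmed columns are empty and each new column costs only one cell.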


DP-CO, cont’d

The avg time is O(n (αδ/σ)2). (Pessimistic analysis, we weren’t able to take the gamma restriction

into account.)

The worst case remains O(mnα), but as in (Crochemore et al., 2002) it can be improved to O(mn). The difference is that we handle m queues, as we proceed column-wise.



Simple algorithm (ingenious name, eh?)


In a few words: naïve brute force DP algorithm but applied only locally.

We work on lists Li, corresponding to individual rows.

We start with L0 = { j | tj =δ p0 } (obtained in O(n) time).

For i = 1 ... m–1: Li = { j | tj =δ pi AND Di–1,j’ + |pi – tj| ≤ γ AND 0 < j – j’ ≤ α+1, for some j’ in Li–1 }

We put each j only once into Li (if there are many j’ that can cause it, we choose the one that minimizes the new Di,j).

Obtaining list Li takes O(α|Li–1|) time.
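The list-based scheme above can be sketched as follows. Each list Li is represented by a dictionary mapping position j to the smallest error sum Di,j; this representation and the function name are assumptions made for illustration, and a non-empty pattern of integer symbols is assumed.

```python
def simple_sparse_match_ends(pattern, text, delta, gamma, alpha):
    """Sparse DP over lists L_i; returns the sorted matching end positions."""
    n = len(text)
    # L_0: positions that delta-match p_0, with their error.
    prev = {}
    for j, t in enumerate(text):
        e = abs(pattern[0] - t)
        if e <= delta and e <= gamma:
            prev[j] = e
    for p in pattern[1:]:
        curr = {}
        for jp, d in prev.items():
            # Brute force over the alpha+1 positions reachable from jp.
            for j in range(jp + 1, min(n, jp + alpha + 2)):
                e = abs(p - text[j])
                if e <= delta and d + e <= gamma:
                    # Keep each j once, with the minimal error sum.
                    if d + e < curr.get(j, gamma + 1):
                        curr[j] = d + e
        prev = curr
    return sorted(prev)
```

The inner loop visits at most α+1 positions per list element, which corresponds to the O(α|Li–1|) cost of building Li.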


Simple algorithm, cont’d

Complexity

All lists have length |M| in total in the worst case, which implies O(|M|α) worst case time.

But: (i) on average this is much better, (ii) we can somewhat improve the worst case.




Average case analysis

The length of list L0 is O(n δ/σ) on avg. Hence L1 is computed in O(n α δ/σ) avg time. But its avg length is only O(n δ/σ · α δ/σ).

...

In general, computing Li takes O(n (α δ/σ)^i) avg time.

The total time will be summation over m such components.

Note that α, δ, σ are fixed for a given problem instance. In other words, α δ/σ can be considered a constant.

If the constant α (2δ+1)/σ is less than 1, we have a geometric series with O(n) sum.
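Spelled out, with c standing for the constant α(2δ+1)/σ (the symbol c is introduced here only for illustration, and it upper-bounds α δ/σ), the geometric-series argument reads roughly as follows:

```latex
\[
  \sum_{i=0}^{m-1} O\!\left(n\,c^{\,i}\right)
  \;\le\; O\!\left(n \sum_{i=0}^{\infty} c^{\,i}\right)
  \;=\; O\!\left(\frac{n}{1-c}\right)
  \;=\; O(n)
  \qquad \text{for constant } c = \frac{\alpha(2\delta+1)}{\sigma} < 1 .
\]
```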




Improving the worst case

Idea: avoid brute-force handling of overlapping windows of α+1 size.

We make use of a min-queue (Gajewska & Tarjan, 1986), similarly to the concept from (Crochemore et al., 2002).

The queue always keeps up to α+1 integers, namely the error sums corresponding to the sliding window area in the previous row. For each processed cell, 0 or 1 values are inserted at the front of the queue (O(1) time) and from 0 to α+1 values are deleted from the tail. But we can’t remove more than we’ve inserted. Hence O(1) amortized cost per cell.

This improves the worst-case time complexity to O(min(mn, |M|α)).
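The amortized argument can be illustrated with a common monotonic-deque formulation of a sliding-window minimum. This is a sketch of the general idea, not necessarily the exact structure of Gajewska & Tarjan; the function name is hypothetical, and `width` would be α+1 in the setting above.

```python
from collections import deque

def sliding_window_min(values, width):
    """For each k, return min(values[max(0, k - width + 1) : k + 1]).

    Every index is appended once and removed at most once, hence
    O(1) amortized time per element.
    """
    q = deque()   # indices whose values increase from left to right
    out = []
    for k, v in enumerate(values):
        # Drop candidates that can never be a minimum again.
        while q and values[q[-1]] >= v:
            q.pop()
        q.append(k)
        # Drop the oldest candidate once it leaves the window.
        if q[0] <= k - width:
            q.popleft()
        out.append(values[q[0]])
    return out
```

Replacing the brute-force scan over the α+1 predecessor cells with such a queue removes the α factor from the per-cell cost.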


Bit-parallelism technique (in stringology)

Baeza–Yates (1989) noticed that CPU registers are usually longer than 1 bit...

And he made use of this fact.

In O(1) time we can perform operations like logical and (&), or (|), shifts (<<, >>) etc. on a whole machine word (usually 32 or 64 bits).

Nowadays, bit-parallelism is a very popular technique in string matching algorithms, in theory and in practice.

Also useful for many approximate matching variants.
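As a small illustration of the technique in general (not of the algorithms in this talk), here is a sketch of the classic Shift-And scheme for exact matching, commonly attributed to Baeza-Yates and Gonnet. Python integers stand in for machine words, and the function name is an assumption.

```python
def shift_and_search(pattern, text):
    """Exact matching with Shift-And; returns end positions of occurrences.

    Bit i of `state` is set iff pattern[0..i] matches the text ending at
    the current position. Assumes a non-empty pattern.
    """
    m = len(pattern)
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)
    state = 0
    ends = []
    for j, c in enumerate(text):
        # One shift, one OR and one AND update all m prefixes at once.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            ends.append(j)
    return ends
```

For patterns no longer than the word size w, each text character costs a constant number of word operations.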



Bit-parallel dynamic programming

Modified DP alg: let the cells of D be chunks of O(log γ) bits. We’ll be able to compute O(w / log γ) cells in parallel.

More precisely, each cell will use l + 1 bits, where l = ⌈log2(2γ+1)⌉.

Error sum zero will be encoded as 2^(l–1) – (γ+1); γ+1 (the lowest ‘illegal’ value) will thus be 2^(l–1) (old trick, e.g., Fredriksson & Navarro, 2004; Crochemore et al., 2005).

This representation can solve 3 issues:
(i) checking in parallel if some counters exceed γ,
(ii) parallel handling of counter overflows,
(iii) computing pairwise minima over two sets of counters in parallel.
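A minimal sketch of how such biased counters might be updated and checked in parallel, assuming l = ⌈log2(2γ+1)⌉, a bias of 2^(l–1) – (γ+1), and γ ≥ 1. All function names, the clamping of exceeded counters back to γ+1, and the use of Python integers as machine words are assumptions made for illustration, not the authors' code.

```python
def layout(gamma, cells):
    """Field layout for `cells` counters packed into one word (gamma >= 1)."""
    l = (2 * gamma).bit_length()            # ceil(log2(2*gamma + 1))
    width = l + 1                           # bits per counter
    bias = (1 << (l - 1)) - (gamma + 1)     # stored value of error sum 0
    ones = sum(1 << (k * width) for k in range(cells))  # bit 0 of each field
    return l, width, bias, ones

def pack(values, width):
    word = 0
    for k, v in enumerate(values):
        word |= v << (k * width)
    return word

def unpack(word, width, cells):
    mask = (1 << width) - 1
    return [(word >> (k * width)) & mask for k in range(cells)]

def add_and_clamp(counters, errors, gamma, cells):
    """Add per-cell errors to biased counters, all cells at once.

    counters: packed biased error sums, each at most gamma+1 (unbiased).
    errors:   packed raw errors, each at most gamma+1.
    Cells whose sum exceeds gamma are reset to the encoding of gamma+1.
    """
    l, width, bias, ones = layout(gamma, cells)
    s = counters + errors                   # a single addition for all cells
    flag = ones << (l - 1)                  # 'gamma exceeded' bit of each field
    over = (s | (s >> 1)) & flag            # bit l-1 or bit l set: sum > gamma
    full = (over << 2) - (over >> (l - 1))  # every bit of each flagged field
    return (s & ~full) | over               # flagged fields become 2^(l-1)
```

For γ = 4 this gives l = 4, fields of 5 bits, and a bias of 3, so the unbiased sums 0..4 are stored as 3..7 and anything larger sets bit 3 of its field.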


BP-DP, cont’d

Figure: tiling the DP matrix with C × 1 vectors, where C = w / (l+1) (C = 8 in the figure). The dark gray cell of the current tile depends on the light gray cells of the two tiles in the previous row (α = 4).

We are in row i. Thanks to preprocessing, we know the delta-errors between all chars in the current tile (C cells) and P[i].

Problem: how to calculate the new values of Di,*?



BP-DP, cont’d

Solution #1. Naïve shifts (chunk by chunk) and minimizations, with an O(α) factor.

Solution #2. Similar, but with a halving technique: first shift by α / 2 counter positions, then by α / 4 etc., performing the minimization at each step. It yields an O(log α) time factor.

Solution #3. Use a precomputed function. This is the one we choose, as it gives O(1) time for an O(w)-bit chunk (in practice some w’, e.g. w’ = w / 4).



Pre-empting the computation in the BP-DP search

The cut-off trick can again be used, with some modification since we now calculate C cells in parallel. (Read: the picture on the cut-off trick slide will be less jagged, and the trick is somewhat less efficient here.)

Avg search time is (upper bound estimation, maybe not tight):

O((n / C) α δ/σ + n).



How to find minima in parallel for the O(w / log γ)-sized chunks

Precomputing as usual (ugly...) or an old trick (Paul & Simon, 1980)
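A sketch of a field-wise minimum computed with a few word operations, in the spirit of the subtract-and-mask trick referred to above. It assumes that every counter field has a spare top bit that is 0 in both operands; the function names and the use of Python integers as machine words are illustrative assumptions.

```python
def pack(values, width):
    word = 0
    for k, v in enumerate(values):
        word |= v << (k * width)
    return word

def unpack(word, width, cells):
    mask = (1 << width) - 1
    return [(word >> (k * width)) & mask for k in range(cells)]

def parallel_min(x, y, width, cells):
    """Field-wise minimum of two packed words.

    The top bit of every field must be 0 in both x and y.
    """
    ones = sum(1 << (k * width) for k in range(cells))
    high = ones << (width - 1)            # top bit of every field
    # Setting the top bit before subtracting prevents borrows between
    # fields; the top bit survives exactly where x_field >= y_field.
    ge = ((x | high) - y) & high
    take_y = ge - (ge >> (width - 1))     # low width-1 bits of those fields
    return (y & take_y) | (x & ~take_y)
```

The number of word operations is constant, independent of how many counters fit into the word.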



Preprocessing in BP-DP

Preprocessing is simple. We build a helper bit-matrix V such that Vi,j = |pi – tj| if pi =δ tj, and γ+1 otherwise.

Note that the number of rows in V can be reduced to the number of unique symbols in P (why store identical rows?), which is σP. We call this terse representation V’.

First we fill V’ with γ+1 values in O((n / C) σP) time. Then we scan T and set 0..δ in at most 2δ+1 rows of V’ (those that δ-match the current char from T). Worst case time of the latter phase: O(nδ). Less on avg.
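The two preprocessing phases can be sketched as follows. For readability this version stores V’ as plain lists of integers rather than packed counters, which is a simplification; the function name and the dictionary keyed by pattern symbol are assumptions.

```python
def build_v_prime(pattern, text, delta, gamma):
    """Build V': one row per distinct pattern symbol c, with
    V'[c][j] = |c - t_j| if c delta-matches t_j, and gamma+1 otherwise.
    """
    n = len(text)
    symbols = set(pattern)
    # Phase 1: fill every row with the 'illegal' value gamma+1.
    v = {c: [gamma + 1] * n for c in symbols}
    # Phase 2: scan the text; each character touches at most
    # 2*delta+1 rows (the symbols within distance delta of it).
    for j, t in enumerate(text):
        for c in range(t - delta, t + delta + 1):
            if c in symbols:
                v[c][j] = abs(c - t)
    return v
```

In the packed variant described above, phase 1 writes whole words at a time, which is where the n / C term comes from.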



Lazy preprocessing

Note that in the previous scheme (with cut-off) the avg search time may even be O(n), but the preprocessing is typically superlinear (even if not by much).

To avoid costly preprocessing when the search will be fast (i.e. when the cut-off trick works efficiently), we can interleave the preprocessing and search phases.

This leads to O((n / C) α δ/σ + n) avg preprocessing time (pessimistic analysis), i.e. it matches the avg search time.



Multiple patterns

The bit-parallel alg has a relatively high preprocessing cost: O(nδ + σP n / (w / log γ)) in the worst case.

If, however, we are about to search for r patterns, the search time is multiplied by r, but the good news is that the preprocessing cost increases much more mildly: to O(nδ + σP n / (w / log γ) + rm), where σP is now the number of distinct symbols in the whole pattern set.

Practical (well-known) trick for r patterns, if r is small compared to σ / δ: superimpose the patterns (then verify).



Test methodology

All algorithms implemented in C, compiled with icc 9.0.

Test machine: P4 2.4 GHz, 512 MB, running GNU/Linux 2.4.20.

Avg times reported over 100 trials (randomly extracted patterns).

Text files:

1. Concatenation of 7543 music pieces (MIDI, stripped of anything except pitch values), totalling 1.8 MB. Alphabet: [0..127] range, but far from random: only 55 values actually occur, and the 6 most frequent symbols alone cover ~50% of the whole text.

2. Uniformly random data in the 0..127 range.



Compared algorithms

BP Cut-off: bit-parallel dynamic programming with cut-off (without the lazy preprocessing).

BP Filter: the (δ,α)-matching version of BP Cut-off (Fredriksson & Grabowski, 2006)

used as a filter, and DP-CO used for verifications.

DP Cut-off: dynamic programming with cut-off.

Simple: simple sparse DP (in the O(|M|α) worst case time version).



Experimental results, MIDI δ = 1, γ = 4, α = 1



Experimental results, MIDI δ = 4, γ = 16, α = 2



Experimental results, random δ = 4, γ = 16, α = 2


Conclusions

Bit-parallelism works well also for the (δ,γ,α) search problem...

...But it works even better if regions of text where matches cannot be extended are quickly discarded.

Still, BP-DP for (δ,γ,α) disappoints compared to BP-DP for (δ,α) used as a filter.

(Problem: the γ counters need many bits...)

Consistently the best alg in the tests was a simple heuristic (called the Simple alg). Fortunately, it doesn’t have a competitive worst-case time.



Future plans

Research on extended models: most importantly with transposition invariance.

Some purely theoretical variants (e.g., better complexity for large alpha).

Injecting compression to represent bit vectors more succinctly and thus speed up the search?

Can we replace the log γ factor in the bit-parallel alg with log δ?

(Hint: in each step we increase the counters by at most δ only.)

