On the suffix automaton with mismatches Maxime Crochemore, Chiara Epifanio, Alessandra Gabriele, and Filippo Mignosi

Dec 17, 2015


On the suffix automaton with mismatches

Maxime Crochemore, Chiara Epifanio, Alessandra Gabriele, and Filippo Mignosi

Prague, 17/07/2007 CIAA 2007

Outline

1. Motivations and basic definitions

2. Nerode’s congruence …with mismatches

3. Suffix automata with mismatches

4. Conclusions and open problems

In the literature, several data structures have been studied for storing the suffixes of a text. Each is designed to give fast access to all factors of the text. Among them:

• suffix tries: representation of all the suffixes of a word by an ordinary tree; quadratic size in the length of the word;
• suffix trees: compact representations of suffix tries; linear size in the length of the word;
• suffix automata: minimization (in the automata sense) of suffix tries; linear size in the length of the word;
• compact suffix automata: compact representations of suffix automata; linear size in the length of the word.

Why suffix automata?

Suffix automata, compact suffix automata and suffix trees have many applications, such as indexing, pattern matching, and data compression.

They all have linear size, but suffix trees and compact suffix automata represent strings by pointers into the text, while suffix automata work without needing to access it.


Why mismatches?

1. Data structures recognizing languages with mismatches are used for approximate string matching and its applications, such as:
• recovering the original signals after their transmission over noisy channels;
• finding DNA subsequences after possible mutations;
• text searching where there are typing or spelling errors;
• retrieving musical passages.

2. Independent theoretical interest, for instance the modelling of some evolutionary events in molecular biology.

In Blumer et al. (1985):

1. a linear-time algorithm for building the suffix automaton of a word w over a fixed alphabet is given (based on Nerode’s congruence);

2. it is shown that this suffix automaton has at least |w|+1 states and at most 2|w| states [Carpi and de Luca proved in 2001 that the lower bound is attained by every prefix of the Fibonacci word].


In this paper we focus on the minimal deterministic finite automaton, denoted by S_k, that recognizes the set Suff(w,k) of suffixes of a word w up to k errors.

1. First main result: a characterization of Nerode's right-invariant congruence relative to S_k, and a conjecture on the size of S_k.

2. Second main result: an algorithm that uses S_k to accept efficiently the language of all suffixes of w up to k errors in every window of size r, where r is the repetition index.


Basic definitions

The distance d(x,y) between two strings x and y is the minimal cost of a sequence of operations that transforms x into y (and ∞ if no such sequence exists).

We consider the Hamming distance, which allows only substitutions, each with cost 1 (simplified definition). It is finite whenever |x|=|y|, and in that case 0 ≤ d(x,y) ≤ |x|.

Ex.: x=acgtatct, y=aggttact, d(x,y)=3 (in the simplified definition).

A string x k-occurs in w if it occurs in w at position l, 1≤l≤|w|, up to k errors. A string x that k-occurs in w as a suffix of w is a k-suffix of w.
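
As an illustration, the definitions above can be written as a short Python sketch. The function names `hamming`, `k_occurs_at` and `is_k_suffix`, and the use of 1-based positions to match the slides, are choices made here and are not part of the original talk.

```python
import math


def hamming(x: str, y: str) -> float:
    """Hamming distance: number of substitutions turning x into y.

    Returns math.inf when the lengths differ, since substitutions alone
    cannot transform x into y in that case.
    """
    if len(x) != len(y):
        return math.inf
    return sum(1 for a, b in zip(x, y) if a != b)


def k_occurs_at(x: str, w: str, l: int, k: int) -> bool:
    """True if x occurs in w at (1-based) position l with at most k mismatches."""
    start = l - 1
    if start < 0 or start + len(x) > len(w):
        return False
    return hamming(x, w[start:start + len(x)]) <= k


def is_k_suffix(x: str, w: str, k: int) -> bool:
    """True if x k-occurs in w as a suffix of w."""
    return len(x) <= len(w) and k_occurs_at(x, w, len(w) - len(x) + 1, k)
```

With the strings of the example, `hamming("acgtatct", "aggttact")` evaluates to 3.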


Suffixes with One Mismatch

“a”: Suff(a,1) = {ε,a,b}. The minimal automaton has 2 states.

“ab”: Suff(ab,1) = {ε,a,b,aa,ab,bb}. The minimal automaton has 4 states.

“aba”: Suff(aba,1) = {ε,a,b,aa,ba,bb,aaa,aba,abb,bba}. The minimal automaton has 6 states.

“abaa”: Suff(abaa,1) = {ε,a,b,aa,ab,ba,aaa,baa,bab,bba,aaaa,abaa,abab,abba,bbaa}. The minimal automaton has 11 states.
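
The four examples above can be reproduced by brute force. The sketch below is only an illustration, assuming the binary alphabet {a, b}: it enumerates Suff(w,k) directly and counts the states of the minimal partial DFA by merging trie nodes that have identical right languages. The function names are hypothetical, and this is not the construction discussed in the talk.

```python
from itertools import product


def k_suffixes(w, k, alphabet="ab"):
    """Brute-force Suff(w, k): strings within Hamming distance k of a suffix of w."""
    result = {""}
    for n in range(1, len(w) + 1):
        suffix = w[len(w) - n:]
        for letters in product(alphabet, repeat=n):
            mismatches = sum(1 for a, b in zip(letters, suffix) if a != b)
            if mismatches <= k:
                result.add("".join(letters))
    return result


def minimal_dfa_size(language):
    """Number of states of the minimal partial DFA (no sink state) of a finite language."""
    prefixes = {x[:i] for x in language for i in range(len(x) + 1)}
    letters = sorted({c for x in language for c in x})
    state_of = {}   # prefix -> id of its class
    class_ids = {}  # (is_accepting, outgoing transitions) -> class id
    # Longest prefixes first, so that children are classified before parents.
    for p in sorted(prefixes, key=len, reverse=True):
        transitions = tuple((c, state_of[p + c]) for c in letters if p + c in prefixes)
        signature = (p in language, transitions)
        state_of[p] = class_ids.setdefault(signature, len(class_ids))
    return len(class_ids)
```

For instance, `[minimal_dfa_size(k_suffixes(w, 1)) for w in ("a", "ab", "aba", "abaa")]` gives `[2, 4, 6, 11]`, the state counts listed above.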


On Nerode’s congruence… with mismatches

Definition 1. Let w ∈ Σ*. For every y ∈ Σ*, y ≠ ε, end-set_w(y,k) = {i | y k-occurs in w with final position i}.

Notice that end-set_w(ε,k) = {0, 1, …, |w|}.

Definition 2. x, y ∈ Σ* are end_k-equivalent on w, written x ≡_{w,k} y, if
1. end-set_w(x,k) = end-set_w(y,k);
2. for every i ∈ end-set_w(x,k) = end-set_w(y,k), the number of errors available in the suffix of w having i+1 as first position is the same after the reading of x and of y, i.e.

min{|w|-i, k-err_i(x)} = min{|w|-i, k-err_i(y)},

where err_i(u) is the number of mismatches of u in w with final position i.

[x]_{w,k} denotes the equivalence class of x with respect to ≡_{w,k}.

In other words …

x ≡_{w,k} y if
1. x and y have the same end-set in w up to k mismatches, as in the exact case [Blumer et al.];
2. the number of errors still available in the suffix of w after the reading of x and of y is the same.

The definition includes two cases, depending on the considered final position i ∈ end-set_w(x,k) = end-set_w(y,k):

2.a) |w|-i ≥ max{k-err_i(x), k-err_i(y)} ⇒ k-err_i(x) = k-err_i(y) ⇒ err_i(x) = err_i(y). (In this case min{|w|-i, k-err_i(x)} = k-err_i(x) = k-err_i(y) = min{|w|-i, k-err_i(y)}.)

2.b) |w|-i ≤ min{k-err_i(x), k-err_i(y)} ⇒ it is possible to have mismatches in any position of the suffix of w having length |w|-i. This does not necessarily imply that err_i(x) = err_i(y). (In this case min{|w|-i, k-err_i(x)} = |w|-i = min{|w|-i, k-err_i(y)}.)

Example

Let w = abaababaab, k=2.

i: 1 2 3 4 5 6 7 8 9 10
w: a b a a b a b a a b

1. x = baba, y = babb:

end-set_w(x,2) = {5, 6, 8, 10} = end-set_w(y,2)

but x ≢_{w,k} y: for i = 5, err_5(x) = 2 and err_5(y) = 1, so

min{|w|-5, 2-err_5(x)} = 0 ≠ 1 = min{|w|-5, 2-err_5(y)}.

Example (contd)

Let w = abaababaab, k=2.

i: 1 2 3 4 5 6 7 8 9 10
w: a b a a b a b a a b

2. x = abaababa, y = baababa, x ≡_{w,k} y:

end-set_w(x,2) = {8} = end-set_w(y,2)

i = 8: err_8(x) = 0 = err_8(y)

min{|w|-8, 2-err_8(x)} = 2 = min{|w|-8, 2-err_8(y)}

Example (contd2)

Let w = abaababaab, k=2.

i: 1 2 3 4 5 6 7 8 9 10
w: a b a a b a b a a b

3. x = abaababaa, y = baababab, x ≡_{w,k} y:

end-set_w(x,2) = {9} = end-set_w(y,2)

i = 9: err_9(x) = 0 ≠ 1 = err_9(y), but

min{|w|-9, 2-err_9(x)} = 1 = min{|w|-9, 2-err_9(y)}
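
The three examples can be re-checked mechanically. The following Python sketch transcribes Definitions 1 and 2 directly (a quadratic brute force, not the automaton construction); the function names are chosen here for illustration only.

```python
def end_set(w, x, k):
    """Map each final position i (1-based) at which x k-occurs in w to err_i(x)."""
    n, m = len(w), len(x)
    if m == 0:
        # The empty string ends at every position 0..|w| with no error.
        return {i: 0 for i in range(n + 1)}
    errors = {}
    for i in range(m, n + 1):
        err = sum(1 for a, b in zip(x, w[i - m:i]) if a != b)
        if err <= k:
            errors[i] = err
    return errors


def signature(w, x, k):
    """Pairs (i, min{|w|-i, k-err_i(x)}) for every i in end-set_w(x, k)."""
    return tuple((i, min(len(w) - i, k - err))
                 for i, err in sorted(end_set(w, x, k).items()))


def end_equivalent(w, x, y, k):
    """True if x and y are end_k-equivalent on w (Definition 2)."""
    return signature(w, x, k) == signature(w, y, k)
```

Two strings have equal signatures exactly when both conditions of Definition 2 hold, so for w = abaababaab and k = 2 the call `end_equivalent(w, "baba", "babb", 2)` is false, while the pairs of examples 2 and 3 are equivalent.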


Results

In Blumer et al. (exact case):

• ≡_w is a right-invariant equivalence relation on Σ*.
• x ≡_w y ⇒ x is a suffix of y (or vice versa).
• xy ≡_w y ⇔ every occurrence of y is immediately preceded by an occurrence of x.

Lemma 1 (approximate case):

• ≡_{w,k} is a right-invariant equivalence relation on Σ*.
• x ≡_{w,k} y ⇒ x is a suffix of y up to 2k errors (or vice versa).
• xy ≡_{w,k} y ⇔ for every i ∈ end-set_w(xy,k) = end-set_w(y,k), the k-occurrence of y with final position i is immediately preceded by a t-occurrence of x, where t = max{(k-err_i(y)) - (|w|-i), 0}.

Results (contd)

Theorem 1. x ≡_{w,k} y ⇔ (∀ z ∈ Σ*, xz is a k-suffix of w ⇔ yz is a k-suffix of w), i.e. x and y have the same future in w.

Corollary 1. For every w ∈ Σ*, the (partial) DFA S_k = (Σ, Q, q_0, F, δ) having

• input alphabet Σ,
• state set Q = {[x]_{w,k} | x is a k-occurrence of w},
• initial state q_0 = [ε]_{w,k},
• accepting states F: those equivalence classes that include the k-suffixes of w (i.e., whose end-sets include the position |w|),
• transition function δ([x]_{w,k}, a) = [xa]_{w,k}, where x and xa are k-occurrences of w,

is the minimal deterministic finite automaton that recognizes the set Suff(w,k).
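
Corollary 1 can be illustrated by building S_k naively from the equivalence classes. The sketch below is a brute-force illustration under the assumption of a binary alphabet {a, b}; it is not the efficient construction. Each class is represented by its tuple of pairs (i, min{|w|-i, k-err_i(x)}), and one representative per class is extended, which relies on the right-invariance stated in Lemma 1. The names `signature`, `build_sk` and `accepts` are hypothetical.

```python
def signature(w, x, k):
    """Class [x]_{w,k} as a tuple of (i, min{|w|-i, k-err_i(x)}); empty if x does not k-occur."""
    n, m = len(w), len(x)
    sig = []
    for i in range(m, n + 1):
        err = sum(1 for a, b in zip(x, w[i - m:i]) if a != b)
        if err <= k:
            sig.append((i, min(n - i, k - err)))
    return tuple(sig)


def build_sk(w, k, alphabet="ab"):
    """Return (initial state, accepting states, transition dict) of S_k, by brute force."""
    initial = signature(w, "", k)
    delta, accepting = {}, set()
    seen, frontier = {initial}, [("", initial)]
    while frontier:
        x, q = frontier.pop()
        # A class is accepting when its end-set contains position |w|.
        if any(i == len(w) for i, _ in q):
            accepting.add(q)
        for a in alphabet:
            target = signature(w, x + a, k)
            if not target:  # xa is not a k-occurrence: no transition
                continue
            delta[(q, a)] = target
            if target not in seen:
                seen.add(target)
                frontier.append((x + a, target))
    return initial, accepting, delta


def accepts(automaton, x):
    """Run the partial DFA on x."""
    state, accepting, delta = automaton
    for a in x:
        if (state, a) not in delta:
            return False
        state = delta[(state, a)]
    return state in accepting
```

For w = abaa and k = 1, the set `{initial} | set(delta.values())` has 11 elements, and the automaton accepts bba but rejects bb.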

What about the size of S_k?

Gad Landau asked for a data structure of size “close” to |w| that allows approximate pattern matching in time proportional to the length of the query plus the number of occurrences.

In the exact (non-approximate) case, suffix trees and (compact) suffix automata do the job.

What about the approximate case?

Prefixes of Fibonacci word

Size a_n of S_1 for the prefix of length n, n = 1, 2, …:

2, 4, 6, 11, 15, 18, 23, 28, 33, 36, 39, 45, 50, 56, 61, 64, 67, 70, 73, 79, 84, 90, 96, 102, 107, 110, 113, 116, 119, 122, 125, 128, 134, 139, 145, 151, 157, 163, 169, 175, 180, 183, 186, 189, 192, 195, 198, 201, 204, 207, 210, 213, 216, 222, 227, 233, 239, 245, 251, 257, 263, 269, x? …

It is not in the Sloane et al. database.

Writing the differences {a_{n+1} - a_n}_n we obtain

2, 2, 5, 4, 3, 5, 5, 5, 3, 3, 6, 5, 6, 5, 3, 3, 3, 3, 6, 5, 6, 6, 6, 5, 3, 3, 3, 3, 3, 3, 3, 6, 5, 6, 6, 6, 6, 6, 6, 5, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 3, 3, 3, 3, 3, 3, 3, 3, x? …

It seems easier. Let us run-length encode.

Run-Length encode

two 2, one 5, one 4, one 3, three 5, two 3,

one 6, one 5, one 6, one 5, four 3,

one 6, one 5, three 6, one 5, seven 3,

one 6, one 5, six 6, one 5, twelve 3,

one 6, one 5, eleven 6, one 5, twenty 3,

one 6, one 5, nineteen 6, one 5, x? 3, ......

What is the rule?
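
The two steps used so far (taking consecutive differences, then run-length encoding them) are easy to replay. The snippet below is a small sketch using the 62 sizes listed on the "Prefixes of Fibonacci word" slide; the variable names are chosen here, and the last run is truncated because the listed sequence stops at 269.

```python
from itertools import groupby

# Sizes a_1, a_2, ... as listed on the "Prefixes of Fibonacci word" slide.
sizes = [
    2, 4, 6, 11, 15, 18, 23, 28, 33, 36, 39, 45, 50, 56, 61, 64, 67, 70, 73,
    79, 84, 90, 96, 102, 107, 110, 113, 116, 119, 122, 125, 128, 134, 139,
    145, 151, 157, 163, 169, 175, 180, 183, 186, 189, 192, 195, 198, 201,
    204, 207, 210, 213, 216, 222, 227, 233, 239, 245, 251, 257, 263, 269,
]

# Consecutive differences a_{n+1} - a_n.
differences = [b - a for a, b in zip(sizes, sizes[1:])]

# Run-length encoding as (value, run length) pairs.
runs = [(value, len(list(group))) for value, group in groupby(differences)]
```

The list `runs` starts with `(2, 2), (5, 1), (4, 1), (3, 1), (5, 3), (3, 2)`, that is "two 2, one 5, one 4, one 3, three 5, two 3".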

Conjecture on the size of S_1 for prefixes of Fibonacci word

An initial part, and then, for i = 4, 5, …:

one 6, one 5, (fib_{i-1} - 2) 6, one 5, (fib_i - 1) 3, …

Conjecture 1: The size of the suffix automaton with one mismatch of the prefixes of the Fibonacci word grows according to

a_{fib_n} = a_{fib_{n-1}} + 3(fib_{n-3} - 1) + 10 + 6(fib_{n-4} - 1).

We did not prove the rule; it has been checked to hold for prefixes of length up to 2000. That the rule describes this sequence is a conjecture.
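
The recurrence can be spot-checked against the sizes listed earlier in the talk. The sketch below reads the rule as a_{fib_n} = a_{fib_{n-1}} + 3(fib_{n-3} - 1) + 10 + 6(fib_{n-4} - 1) and assumes the indexing fib_1 = 1, fib_2 = 2, which is a choice made here to match the listed values; it only covers the Fibonacci lengths 8 to 55 that appear on the slides.

```python
# Sizes of S_1 for prefixes whose length is a Fibonacci number, read off the
# listed sequence (a_n is the size for the prefix of length n).
size_at = {5: 15, 8: 28, 13: 50, 21: 84, 34: 139, 55: 227}

# fib[n] is fib_n with the assumed indexing fib_1 = 1, fib_2 = 2; index 0 is unused.
fib = [None, 1, 2, 3, 5, 8, 13, 21, 34, 55]


def predicted(n):
    """Right-hand side of the conjectured recurrence for a_{fib_n}."""
    return size_at[fib[n - 1]] + 3 * (fib[n - 3] - 1) + 10 + 6 * (fib[n - 4] - 1)


# Compare prediction and listed value for fib_5 = 8 up to fib_9 = 55.
checks = {fib[n]: predicted(n) == size_at[fib[n]] for n in range(5, 10)}
```

Every entry of `checks` is true for these five lengths, which is consistent with the conjecture but of course does not prove it.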


Other experiments and Final Conjecture

• Words bba^n, n ≥ 4: a_{n+1} - a_n = 19 + 6(n-4).
• Prefixes of the Thue-Morse word: |S_1| ≤ 2|w| log|w|.
• Random words generated by memoryless sources: |S_k| = O(|w| log^k |w|) [Epifanio, Gabriele, Mignosi, Restivo, Sciortino 2003, 2005; Maas, Nowak 2005].

Conjecture: The suffix automaton with k mismatches of any text w has size O(|w| log^k |w|).


Allowing more mismatches

Definition: Let w ∈ Σ* and k, r ∈ Z⁺ ∪ {0}, k ≤ r. A string x occurs in w at position l up to k errors in a window of size r, or simply kr-occurs in w at position l, if:

• if |x| < r: d(x, w(l, l+|x|-1)) ≤ k;
• if |x| ≥ r: for every i, 1 ≤ i ≤ |x|-r+1, d(x(i, i+r-1), w(l+i-1, l+i+r-2)) ≤ k.

A string x satisfying the above property is a kr-occurrence of w. A string x that kr-occurs in w as a suffix of w is a kr-suffix of w.

L(w,k,r) = {x | x kr-occurs in w at position l, 1 ≤ l ≤ |w|-|x|+1}.
Suff(w,k,r) = {x | x is a kr-suffix of w}.

Remark: Suff(w,k) = Suff(w,k,r) when r = |w|.


Example

w = abaa, k = 1, r = 2

• L(w,1,2) = {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, aaaa, aaab, abaa, abab, abba, bbaa, bbab, bbba}
Remark: bbba ∈ L(w,1,2), but bbba ∉ L(w,1,4) = L(w,1).

• Suff(w,1,2) = {ε, a, b, aa, ab, ba, aaa, aab, baa, bab, bba, aaaa, aaab, abaa, abab, abba, bbaa, bbab, bbba}
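
The sets in this example can be recomputed from the definition of kr-occurrence. The following Python sketch is a plain brute force over all strings on the assumed alphabet {a, b}; the function names are illustrative and positions are 1-based as on the slides.

```python
from itertools import product


def hamming(x, y):
    """Number of mismatching positions of two strings of equal length."""
    return sum(1 for a, b in zip(x, y) if a != b)


def kr_occurs_at(x, w, l, k, r):
    """True if x kr-occurs in w at 1-based position l."""
    start = l - 1
    if start < 0 or start + len(x) > len(w):
        return False
    segment = w[start:start + len(x)]
    if len(x) < r:
        return hamming(x, segment) <= k
    # Every window of size r may contain at most k mismatches.
    return all(hamming(x[i:i + r], segment[i:i + r]) <= k
               for i in range(len(x) - r + 1))


def language(w, k, r, alphabet="ab"):
    """Brute-force L(w, k, r)."""
    return {"".join(t)
            for n in range(len(w) + 1)
            for t in product(alphabet, repeat=n)
            if any(kr_occurs_at("".join(t), w, l, k, r)
                   for l in range(1, len(w) - n + 2))}


def kr_suffixes(w, k, r, alphabet="ab"):
    """Brute-force Suff(w, k, r)."""
    return {"".join(t)
            for n in range(len(w) + 1)
            for t in product(alphabet, repeat=n)
            if kr_occurs_at("".join(t), w, len(w) - n + 1, k, r)}
```

For w = abaa, `language("abaa", 1, 2)` has 23 elements and `kr_suffixes("abaa", 1, 2)` has 19, matching the two sets above; bbba belongs to the first set but not to `language("abaa", 1, 4)`.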


The Repetition Index R(w,k,r) of w is the smallest integer h such that all strings of length h kr-occur at most once in w.

Remarks:
1. R(w,k,r) is well defined because the integer h=|w| is an element of the set described above;
2. If k/r ≥ 1/2 then R(w,k,r)=|w|;
3. The equation r = R(w,k,r) admits a unique solution.

Lemma 2: Given S_k, there exists a linear-time algorithm that returns r = R(w,k,r).

Remark: This algorithm labels each state of S_k with an integer that represents a distance from this state to the end.
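
For small words the repetition index can be computed directly from its definition. The sketch below is an exponential brute force over the assumed alphabet {a, b}, not the linear-time algorithm of Lemma 2; `repetition_index` and `kr_occurs_at` are names chosen here.

```python
from itertools import product


def kr_occurs_at(x, w, l, k, r):
    """True if x kr-occurs in w at 1-based position l."""
    start = l - 1
    if start < 0 or start + len(x) > len(w):
        return False
    segment = w[start:start + len(x)]

    def mismatches(u, v):
        return sum(1 for a, b in zip(u, v) if a != b)

    if len(x) < r:
        return mismatches(x, segment) <= k
    return all(mismatches(x[i:i + r], segment[i:i + r]) <= k
               for i in range(len(x) - r + 1))


def repetition_index(w, k, r, alphabet="ab"):
    """Smallest h such that every string of length h kr-occurs at most once in w."""
    for h in range(1, len(w) + 1):
        if all(sum(kr_occurs_at("".join(t), w, l, k, r)
                   for l in range(1, len(w) - h + 2)) <= 1
               for t in product(alphabet, repeat=h)):
            return h
    return len(w)
```

For example, `repetition_index("abaa", 1, 2)` returns 4 = |w|, in line with Remark 2 (here k/r = 1/2), while `repetition_index("ababa", 1, 4)` returns 4, so r = 4 solves r = R(w,k,r) for w = ababa and k = 1.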

Algorithm that lets S_k recognize Suff(w,k,r)

Algorithm (x, r, S_k)

• |x| ≤ r = R(w,k,r):
  if x is accepted by S_k then x ∈ Suff(w,k,r)
  else x ∉ Suff(w,k,r)

• |x| > r = R(w,k,r):
  let x’ = prefix of x such that |x’| = r = R(w,k,r);
  let q be the state reached after reading x’ and i the integer associated to q;
  j = |w|-i-r+1 is the unique possible initial position of x;
  check if x kr-occurs at position j in w.
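
The two cases of the algorithm can be mimicked without the automaton. In the sketch below, the lookups that the talk performs on S_k (acceptance of x, and the integer attached to the state reached by x') are replaced by direct scans of w, so it illustrates the case analysis only and not the efficiency. It assumes that the caller passes r = R(w,k,r); the function name `in_kr_suffixes` and the explicit check that the occurrence ends at |w| are choices made here.

```python
def hamming(x, y):
    """Number of mismatching positions of two strings of equal length."""
    return sum(1 for a, b in zip(x, y) if a != b)


def kr_occurs_at(x, w, l, k, r):
    """True if x kr-occurs in w at 1-based position l."""
    start = l - 1
    if start < 0 or start + len(x) > len(w):
        return False
    segment = w[start:start + len(x)]
    if len(x) < r:
        return hamming(x, segment) <= k
    return all(hamming(x[i:i + r], segment[i:i + r]) <= k
               for i in range(len(x) - r + 1))


def in_kr_suffixes(x, w, k, r):
    """Membership test for Suff(w, k, r), assuming r = R(w, k, r)."""
    n = len(w)
    if len(x) > n:
        return False
    if len(x) <= r:
        # Stand-in for "x is accepted by S_k": x is a k-suffix of w.
        return hamming(x, w[n - len(x):]) <= k
    # |x| > r: the prefix x' of length r k-occurs at most once in w.
    prefix = x[:r]
    positions = [l for l in range(1, n - r + 2)
                 if hamming(prefix, w[l - 1:l - 1 + r]) <= k]
    if len(positions) > 1:
        raise ValueError("r is not the repetition index of w")
    if not positions:
        return False
    j = positions[0]  # unique possible initial position of x
    # x must end exactly at |w| to be a suffix, and kr-occur at position j.
    return j + len(x) - 1 == n and kr_occurs_at(x, w, j, k, r)
```

With w = ababa, k = 1 and r = 4, the string bbabb is accepted although it has two mismatches in total, because no window of size 4 contains more than one of them.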


Conclusions and open problems

1. S_k can be useful for approximate indexing.

2. If Conjecture 2 (the O(|w| log^k |w|) size bound) is true and the constants hidden in the O-notation are small, our data structure is useful for some classical applications of approximate indexing.

3. We think that it is possible to connect S_k with S_{k,r}, and we conjecture that |S_{k,r}| = O(|S_k|).

4. We think that it is possible to obtain an online algorithm even when dealing with mismatches. It would probably be more complex than the classical one. How to define it remains an open problem.