Tries
(Compacted) Trie
1
2 2
0
4
5
6
7
2 3
y
s
1z
stile zyg
5
etic
ialygy
aibelyite
czecin
omo
systile syzygetic syzygial syzygy szaibelyite szczecin szomo
[Fredkin, CACM 1960]
(2; 3,5)
Performance:• Search ≈ O(|P|) time
• Space ≈ O(K + N)
(Compacted) Trie
1
2 2
0
4
5
6
7
2 3
y
s
1z
stile zyg
5
etic
ialygy
aibelyite
czecin
omo
systile syzygetic syzygial syzygy szaibelyite szczecin szomo
[Fredkin, CACM 1960]
(2; 3,5)
... But in practice…• Search: random memory accesses
• Space: len + pointers + strings
Performance:• Search ≈ O(|P|) time
• Space ≈ O(K + N)
….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo….
systile szaielyite
CTon a sample
2-level indexing
Disk
InternalMemory 2 limitations:
• Sampling rate ≈ lengths of sampled strings
• Trade-off ≈ speed vs space (because of bucket size)
2 advantages:• Search ≈ typically 1 I/O
• Space ≈ Front-coding over buckets
An old idea: Patricia Trie
1
2 2
0
4
5
6
7
2 3
y
s1
z
stile zyg
5
etic
ial
ygy
aibelyite
czecin
omo
[Morrison, J.ACM 1968]
A new search
….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….
2 2
0
y
s
1 z
sz
5
e
i
y
a
c
o
Search(P):• Phase 1: tree navigation• Phase 2: Compute LCP• Phase 3: tree navigation
Three-phase search:P = syzyyea
01
2 5 g < y
P’s positionOnly 1 string is checked
Trie Space ≈ #strings, NOT their
length
[Ferragina-Grossi, J.ACM 1999]
….Locality Preserving Front Coding….
PTon all strings
2-level indexing
Disk
InternalMemory
A limitation is n < M
Typically 1 I/O
What about n > M
The String B-tree
29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23
29 2 26 13 20 25 6 18 3 14 21 23
29 13 20 18 3 23
PT PT PT
PT PT PT PT PT PT
PTSearch(P) •O((p/B) logB n) I/Os•O(occ/B) I/Os
It is dynamic...
1 string checked : O(p/B)
O(logB n) levels
+
Lexicographic position of P
[Ferragina-Grossi, J.ACM 1999]
Knuth, vol 3°, pag. 489: “elegant”
GA
1
65
3
6
4
0
4 5 6 72 3
AGA GCGC
GGAGAG C
A G A G A
On Front-Coding…
AGAAGA5 G3 C0 GCGCAGA6 G4 GGA6 GA
Knuth
In-order visit+
Path covering
Front Coding
3
0
Compacted Trie =
FC + tree structureWhat about other traversals ?
FC + ... is searchable
Why pre-order visit
In Front-coding theLcp information is encoded many times
GA
1
65
3
6
4
0
4 5 6 72 3
AGA GCGC
GGAGAG C
A G A G A
3
0
AGAAGA1 G3 C4 GCGCAGA1 G3 GGA1 GA
RearCoding
What do we mean by “Indexing” ?
Word-based indexes, here a notion of “word” must be devised !» Inverted lists, Signature files, Bitmaps.
Full-text indexes, no constraint on text and queries !» Suffix Array, Suffix tree, String B-tree,...
Basic notation and facts
Occurrences of P in T = All suffixes of T having P as a prefix
SUF(T) = Sorted set of suffixes of T
T = mississippi mississippi 4,7P = si
T[i,N]
iff P is a prefix of the i-th suffix of T (ie. T[i,N])
TPi
Pattern P occurs at position i of T
From substring searchTo prefix search
Reduction
The Suffix Tree
T# = mississippi# 2 4 6 8 10
12
11 8
5 2 1 10 9
7 4
6 3
0
4
#
i
ppi#
ssi
mississip
pi#
1
p
i# pi#
2
1
s
i
ppi#
ssippi#
3
si
ssippi#
ppi#
1
#
ppi#ssip
pi#
Label = <pos,len>
Space: #nodes
Search pattern P
Maximal repeatedsubstring = node
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
P=si
T = mississippi#
#i#ippi#issippi#ississippi#mississippi#pi#ppi#sippi#sissippi#ssippi#ssissippi#
SUF(T)
Suffix Array• SA: N log2 N) bits
• Text T: N chars In practice, a total of 5N bytes
SA
121185211097463
T = mississippi#
suffix pointer
5
Prop 2. Starting position is the lexicographic one of P.
Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp
T = mississippi#SA
121185211097463
P = si
P is larger
2 accesses per step
Searching a patternIndirected binary search on SA: O(p) time per suffix cmp
T = mississippi#SA
121185211097463
P = si
P is smaller
Suffix Array search• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
overall, O(p log2 N) time
+[Manber-Myers, ’90]
Locating the occurrences
T = mississippi# 4 7SA
121185211097463
si#
occ=2
121185211097463
121185211097463 si$
Suffix Array search• O (p + log2 N + occ) time
where # < < $
sissippisippi
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Text mining
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=jLcp
01140010213
#i#ippi#issippi#ississippi#mississippipi#ppi#sippi#sissippi#ssippi#ssissippi#
12
1185211097463
SA
Lcp(7,3) = 1= min{2,1,3}
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Text mining
• Does it exist a repeated substring of length ≥ L ?
• Maximal Lcp of a suffix is with its adjacent
• Search for Lcp[i] ≥ L
Lcp
01140010213
#i#ippi#issippi#ississippi#mississippipi#ppi#sippi#sissippi#ssippi#ssissippi#
12
1185211097463
SA
Lcp
01140010213
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
Text mining
Does exist a substring of length ≥ L occurring ≥ C
times ?
• Exist ≥ C equal substrings of length ≥ L chars
• Exist ≥ C suffixes sharing a prefix of ≥ L chars
• These suffixes may be not contiguous, but...
• Their “block” has a common prefix of ≥ L
chars
• Search for Lcp[i,i+C-2] whose entries are
≥ L
#i#ippi#issippi#ississippi#mississippipi#ppi#sippi#sissippi#ssippi#ssissippi#
12
1185211097463
SA
L = 1, C = 4
How to construct SA from T ?
#i#ippi#issippi#ississippi#mississippipi#ppi#sippi#sissippi#ssippi#ssissippi#
12
1185211097463
SA
Elegant but inefficient
Obvious inefficiencies:• (n2 log n) time in the worst-case• (n log n) cache misses or I/O faults
Input: T = mississippi#
The skew algorithm The key problem: Compare efficiently two
suffixes Brute-force = (n) time per cmp, (n2 log n)
total
In order to sort the suffixes of S1. Divide the suffixes of S in two groups
S0,2 = suffixes starting at positions 0 mod 3 or 2 mod 3
S1 = suffixes starting at positions 1 mod 3
2a. Sort recursively S0,2 (they are 2n/3)
2b. Sort S1: suffix(3i+1) = S[3i+1] suff(3i+2)
3. Merge the sorted S0,2 with the sorted S1
T(n) = O(split) + T(2n/3) + O(|S1|) + O(merge) = O(n)
Sort recursively S0,2
We turn this problem into the SA-construction of a shorter string of length (2/3)n.
S=AAT GTG AGA TGA $$$
RadixSort all triplets that start at positions 0,2 mod
3 T = {ATG, TGT, TGA, GAG, GAT, ATG, GA$, A$$} Sort(T) = (A$$, ATG, GA$, GAG, GAT, TGA, TGT)
Assign lexicographic names (log n bits) A$$=1, ATG=2, GA$=3,…
Build s0,2 and encode it: ATG TGA GAT GA$ TGT GAG ATG A$$ 2 6 5 3 7 4 2 1
1 2 3 4 5 6 7 8 9 10 11 12 13
Sort recursively S0,2
Given
S=AAT GTG AGA TGA $$$
We have built:
s0,2 = ATG TGA GAT GA$ TGT GAG ATG A$$ enc(s0,2) = 2 6 5 3 7 4 2 1
It is SA0,2 = [12, 9, 2, 11, 6, 8, 5, 3]
1 2 3 4 5 6 7 8 9 10 11 12 13
A suffix of s0,2
A suffix of enc(s0,2)
SA(enc(s0,2)) gives SA0,2
Lex-order ispreserved
Sort S1
We turn this problem into the sort of pairs
S=AAT GTG AGA TGA $$$
Key observation: suff(1) = <A, pos(2)> = <A,3>
suff(7) = <A, pos(8)> = <A,6>
SA0,2 = [12, 9, 2, 11, 6, 8, 5, 3]
Suffix of S1
1 2 3 4 5 6 7 8 9 10 11 12 13
SA1 = [1, 7, 4, 10]
The merge step
To merge suffix si in S0,2 with suffix sk in S1, note that
If (i mod 3) = 2 si+1 and sk+1 belong to S0,2
If (i mod 3) = 0 si+2 and sk+2 belong to S0,2
their order can be derived from SA0,2 in O(1) time
SA1 SA0,2
SA
T(n) = T(2n/3) + O(n) + O(merge) = O(n)
S=AAT GTG AGA TGA $$$1 2 3 4 5 6 7 8 9 10 11 12 13