Tries
27

Tries. (Compacted) Trie 1 2 2 0 4 5 6 7 2 3 y s 1 z stile zyg 5 etic ial ygy aibelyite czecin omo systile syzygetic syzygial syzygy szaibelyite szczecin.

Dec 21, 2015

Page 1:

Tries

Page 2:

(Compacted) Trie [Fredkin, CACM 1960]

[Figure: compacted trie over the strings systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo. Edges are labeled with substrings (s, y, z, stile, zyg, etic, ial, aibelyite, czecin, omo, ...), internal nodes with their string depths (0, 1, 2, 2, 5), and leaves with the numbers 1 to 7. An edge label can be stored as a triple, e.g. zyg = (2; 3,5).]

Performance:
• Search ≈ O(|P|) time
• Space ≈ O(K + N)
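A minimal sketch of a compacted trie, to make the O(|P|) search concrete. The names `Node`, `insert` and `contains` are illustrative assumptions, not taken from the slides; edge labels are stored as explicit substrings here rather than as the compact triples shown in the figure.

```python
class Node:
    def __init__(self):
        self.children = {}    # first char of edge label -> (label, child)
        self.is_key = False   # True if a stored string ends at this node

def insert(root, key):
    node, pos = root, 0
    while pos < len(key):
        entry = node.children.get(key[pos])
        if entry is None:                       # no edge: hang a new leaf
            leaf = Node()
            leaf.is_key = True
            node.children[key[pos]] = (key[pos:], leaf)
            return
        label, child = entry
        k = 0                                   # chars of the label matched by key
        while k < len(label) and pos + k < len(key) and label[k] == key[pos + k]:
            k += 1
        if k < len(label):                      # split the edge at offset k
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[label[0]] = (label[:k], mid)
            child = mid
        node, pos = child, pos + k
    node.is_key = True

def contains(root, pattern):
    node, pos = root, 0
    while pos < len(pattern):                   # each pattern char is read once
        entry = node.children.get(pattern[pos])
        if entry is None or not pattern.startswith(entry[0], pos):
            return False
        pos += len(entry[0])
        node = entry[1]
    return node.is_key

words = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
root = Node()
for w in words:
    insert(root, w)
```

With the seven strings of the figure, the root ends up with a single outgoing edge labeled s, as on the slide.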

Page 3:

(Compacted) Trie [Fredkin, CACM 1960]

[Figure: the same compacted trie as on Page 2.]

Performance:
• Search ≈ O(|P|) time
• Space ≈ O(K + N)

... But in practice…
• Search: random memory accesses
• Space: len + pointers + strings

Page 4:

2-level indexing

Internal memory: CT on a sample of the strings (systile, szaibelyite)

Disk: front-coded buckets
….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo….

2 limitations:
• Sampling rate ≈ lengths of sampled strings
• Trade-off ≈ speed vs space (because of bucket size)

2 advantages:
• Search ≈ typically 1 I/O
• Space ≈ Front-coding over buckets
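The bucketed front coding shown on the disk can be sketched as follows. This is an illustration under assumptions: the function names and the bucket size of 4 are chosen so that the output reproduces the sequence on the slide; the first string of each bucket is stored in full (count 0), and it is the string that would be sampled into the in-memory trie.

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def front_code(sorted_strings, bucket_size):
    """Encode each string as (chars shared with its predecessor, remaining suffix)."""
    out = []
    for i, s in enumerate(sorted_strings):
        if i % bucket_size == 0:          # bucket head: stored uncompressed
            out.append((0, s))
        else:
            k = lcp(sorted_strings[i - 1], s)
            out.append((k, s[k:]))
    return out

def decode_bucket(pairs):
    """Rebuild the strings of one bucket by a left-to-right scan."""
    strings, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        strings.append(prev)
    return strings

words = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
coded = front_code(words, bucket_size=4)
```

Larger buckets compress better but make each bucket scan longer, which is the speed vs space trade-off listed above.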

Page 5:

An old idea: Patricia Trie [Morrison, J.ACM 1968]

[Figure: the compacted trie of Page 2, with its edge labels (s, y, z, stile, zyg, etic, ial, aibelyite, czecin, omo, ...), node string depths (0, 1, 2, 2, 5) and leaves 1 to 7.]

Page 6:

A new search [Ferragina-Grossi, J.ACM 1999]

….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….

[Figure: Patricia trie over the strings above. Each edge keeps only its branching character (s; y, z; s, z; e, i, y; a, c, o) and each internal node keeps its string depth (0, 1, 2, 2, 5).]

Search(P):
• Phase 1: tree navigation
• Phase 2: compute LCP
• Phase 3: tree navigation

Three-phase search: P = syzyyea
[Figure annotations: the mismatch g < y fixes P's position.]
Only 1 string is checked

Trie space ≈ #strings, NOT their length
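The three phases can be sketched as below. This is a simplified illustration, not the authors' implementation: it assumes a sorted, prefix-free set of strings, stores in each node the range of leaves below it, and uses hypothetical names (`PNode`, `build`, `blind_search`). The function returns the lexicographic position of P, i.e. the number of strings smaller than P.

```python
from bisect import bisect_left

class PNode:
    def __init__(self, depth, lo, hi):
        self.depth = depth          # string depth of the node (leaf: full length)
        self.lo, self.hi = lo, hi   # the subtree covers strings[lo:hi]
        self.children = {}          # branching char -> PNode, in sorted order

def lcp(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def build(strings, lo, hi):
    """Patricia trie over the sorted, prefix-free list strings[lo:hi]."""
    if hi - lo == 1:
        return PNode(len(strings[lo]), lo, hi)
    node = PNode(lcp(strings[lo], strings[hi - 1]), lo, hi)
    i = lo
    while i < hi:                   # group the strings by their branching char
        j = i
        while j < hi and strings[j][node.depth] == strings[i][node.depth]:
            j += 1
        node.children[strings[i][node.depth]] = build(strings, i, j)
        i = j
    return node

def blind_search(root, strings, P):
    # Phase 1: navigate by branching chars only; no string is accessed.
    node = root
    while node.children:
        child = node.children.get(P[node.depth]) if node.depth < len(P) else None
        node = child or next(iter(node.children.values()))
    candidate = strings[node.lo]
    # Phase 2: the single string access, to compute the LCP with P.
    l = lcp(P, candidate)
    # Phase 3: walk down again towards the candidate to locate the mismatch.
    v = root
    while v.depth < l:
        v = v.children[candidate[v.depth]]
    if l == len(P):                 # P is a prefix of every string below v
        return v.lo
    if v.depth > l:                 # mismatch inside the edge entering v
        return v.lo if P[l] < candidate[l] else v.hi
    if not v.children:              # leaf: the candidate is a proper prefix of P
        return v.hi
    pos = v.lo                      # mismatch at node v: P[l] is no branching char
    for c, child in v.children.items():
        if c < P[l]:
            pos = child.hi
    return pos

words = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
root = build(words, 0, len(words))
```

For P = syzyyea, phase 1 reaches the leaf syzygetic, phase 2 finds an LCP of 4 characters, and phase 3 uses g < y to place P right after syzygy.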

Page 7:

2-level indexing

Internal memory: PT on all strings

Disk: ….Locality Preserving Front Coding….

Typically 1 I/O

A limitation is n < M

What about n > M?

Page 8:

The String B-tree [Ferragina-Grossi, J.ACM 1999]

[Figure: a B-tree over string pointers in which every node is indexed by a Patricia trie (PT).
Leaf level: 29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23
Middle level: 29 2 26 13 20 25 6 18 3 14 21 23
Root: 29 13 20 18 3 23]

Search(P):
• O((p/B) logB n) I/Os to find the lexicographic position of P: O(logB n) levels, 1 string checked per level at O(p/B) I/Os
• + O(occ/B) I/Os to report the occurrences

It is dynamic...

Knuth, vol. 3, p. 489: “elegant”

Page 9:

On Front-Coding…

[Figure: compacted trie over seven DNA strings (leaves 1 to 7), with edge labels such as AGA and GCGC and node string depths 0, 3, 4, 5, 6, 6.]

Front Coding (Knuth): AGAAGA, 5 G, 3 C, 0 GCGCAGA, 6 G, 4 GGA, 6 GA

In-order visit + path covering = Front Coding

Compacted Trie = FC + tree structure

What about other traversals?

FC + ... is searchable

Page 10:

Why pre-order visit

In Front-coding the Lcp information is encoded many times

[Figure: the same compacted trie as on Page 9.]

Rear Coding: AGAAGA, 1 G, 3 C, 4 GCGCAGA, 1 G, 3 GGA, 1 GA
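A small sketch of rear coding, under the reading that each number counts the characters to drop from the end of the previous string before appending the new suffix. The seven DNA strings are reconstructed from the coded sequences on the slides, and the function names are illustrative.

```python
def lcp(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def rear_code(sorted_strings):
    """Encode each string as (chars to drop from the previous string, new suffix)."""
    out, prev = [], ""
    for s in sorted_strings:
        k = lcp(prev, s)
        out.append((len(prev) - k, s[k:]))
        prev = s
    return out

def rear_decode(pairs):
    strings, prev = [], ""
    for drop, suffix in pairs:
        prev = prev[:len(prev) - drop] + suffix
        strings.append(prev)
    return strings

dna = ["AGAAGA", "AGAAGG", "AGAC", "GCGCAGA", "GCGCAGG", "GCGCGGA", "GCGCGGGA"]
coded = rear_code(dna)
```

Compared with front coding, the stored numbers are usually smaller, because a shared prefix length is not repeated for every string below the same trie node.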

Page 11:

Text Indexing

Page 12:

What do we mean by “Indexing”?

Word-based indexes: here a notion of “word” must be devised!
» Inverted lists, Signature files, Bitmaps.

Full-text indexes: no constraint on text and queries!
» Suffix Array, Suffix tree, String B-tree, ...

Page 13:

Basic notation and facts

Pattern P occurs at position i of T iff P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

Example: T = mississippi, P = si occurs at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search
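The reduction can be checked directly on the example. This brute-force sketch (the function name is an assumption) tests every suffix for the prefix P and reports 1-based positions, as on the slide.

```python
def occurrences(T, P):
    """Positions i (1-based) such that P is a prefix of the suffix T[i,N]."""
    return [i + 1 for i in range(len(T)) if T[i:].startswith(P)]

hits = occurrences("mississippi", "si")
```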

Page 14:

The Suffix Tree

T# = mississippi#

[Figure: suffix tree of T#. The 12 leaves are labeled with suffix starting positions (12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3), the internal nodes with their string depths (0, 1, 4, 1, 1, 2, 3), and the edges with substrings such as #, i, ssi, ppi#, ssippi#, mississippi#, p, pi#, i#, s, si.]

• Label = <pos,len>
• Space: #nodes
• Search pattern P
• Maximal repeated substring = node

Page 15:

The Suffix Array

Prop 1. All suffixes in SUF(T) with prefix P are contiguous.

Prop 2. Starting position is the lexicographic one of P.

T = mississippi#, P = si

SA   SUF(T)
12   #
11   i#
8    ippi#
5    issippi#
2    ississippi#
1    mississippi#
10   pi#
9    ppi#
7    sippi#
4    sissippi#
6    ssippi#
3    ssissippi#

Each SA entry is a suffix pointer into T (e.g. 5 points to issippi#).

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes

Page 16:

Searching a pattern

Indirect binary search on SA: O(p) time per suffix cmp

T = mississippi#, SA = 12 11 8 5 2 1 10 9 7 4 6 3, P = si

[Figure: a binary-search step in which P is larger than the probed suffix.]

2 accesses per step

Page 17:

Searching a pattern

Indirect binary search on SA: O(p) time per suffix cmp

T = mississippi#, SA = 12 11 8 5 2 1 10 9 7 4 6 3, P = si

[Figure: a binary-search step in which P is smaller than the probed suffix.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
Overall, O(p log2 N) time; improvable to O(p + log2 N) [Manber-Myers, ’90]

Page 18:

Locating the occurrences

T = mississippi#, P = si, occurrences at positions 4 and 7

SA = 12 11 8 5 2 1 10 9 7 4 6 3

Binary search for si# and for si$, where # < Σ < $: the occ = 2 suffixes between the two positions are sippi# (7) and sissippi# (4).

Suffix Array search
• O(p + log2 N + occ) time
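A sketch of the indirect binary search that delimits the block of suffixes prefixed by P. As an assumption of this sketch, the two searches for si# and si$ are realized as a lower-bound and an upper-bound search on the first p characters of each suffix, which yields the same block; the SA is copied from the slide and positions are 1-based.

```python
T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]   # from the slide, 1-based

def sa_range(T, SA, P):
    """Return (first, last) such that SA[first:last] holds the occurrences of P."""
    p = len(P)

    def bound(strict):
        lo, hi = 0, len(SA)
        while lo < hi:                          # O(log N) steps
            mid = (lo + hi) // 2
            start = SA[mid] - 1                 # access 1: SA, access 2: T
            prefix = T[start:start + p]         # O(p) chars compared per step
            if prefix < P or (strict and prefix == P):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return bound(False), bound(True)

first, last = sa_range(T, SA, "si")
positions = sorted(SA[first:last])
```

The two bounds cost O(p log N) time, and the occ occurrences are then listed by scanning SA[first:last].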

Page 19:

Text mining

Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

SA   SUF(T)          Lcp with next suffix
12   #               0
11   i#              1
8    ippi#           1
5    issippi#        4
2    ississippi#     0
1    mississippi#    0
10   pi#             1
9    ppi#            0
7    sippi#          2
4    sissippi#       1
6    ssippi#         3
3    ssissippi#

• How long is the common prefix between T[i,...] and T[j,...]?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j

Example: Lcp(7,3) = min{2,1,3} = 1
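A sketch that recomputes the Lcp array of the example and answers the query by a range minimum. The function names are assumptions, the Lcp array is built by direct character comparison, and the minimum is found by a plain scan; a range-minimum data structure could replace the scan.

```python
T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]   # from the slide, 1-based

def lcp_array(T, SA):
    """Lcp[r] = common prefix length of the suffixes SA[r] and SA[r+1]."""
    def lcp(i, j):
        n = 0
        while i + n <= len(T) and j + n <= len(T) and T[i + n - 1] == T[j + n - 1]:
            n += 1
        return n
    return [lcp(SA[r], SA[r + 1]) for r in range(len(SA) - 1)]

def lcp_of(i, j, SA, Lcp):
    """Common prefix length of T[i,...] and T[j,...] for i != j."""
    h, k = sorted((SA.index(i), SA.index(j)))
    return min(Lcp[h:k])

Lcp = lcp_array(T, SA)
answer = lcp_of(7, 3, SA, Lcp)
```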

Page 20:

Text mining

Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

• Does there exist a repeated substring of length ≥ L?
• The maximal Lcp of a suffix is with one of its adjacent suffixes in SA
• Search for Lcp[i] ≥ L

T = mississippi#
SA = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp = 0 1 1 4 0 0 1 0 2 1 3

Page 21:

Text mining

Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Does there exist a substring of length ≥ L occurring ≥ C times?
• There exist ≥ C equal substrings of length ≥ L chars
• There exist ≥ C suffixes sharing a prefix of ≥ L chars
• These suffixes may not be contiguous, but...
• Their “block” has a common prefix of ≥ L chars
• Search for Lcp[i,i+C-2] whose entries are ≥ L

T = mississippi#
SA = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp = 0 1 1 4 0 0 1 0 2 1 3

Example: L = 1, C = 4
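The window test can be sketched as a single scan that tracks the current run of Lcp entries ≥ L. The function name is an assumption; the Lcp array is the one on the slide, and C = 2 gives the plain repeated-substring question.

```python
Lcp = [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]   # for T = mississippi#, from the slide

def has_frequent_substring(Lcp, L, C):
    """True if some substring of length >= L occurs >= C times (C >= 2)."""
    run = 0                                # consecutive entries >= L seen so far
    for value in Lcp:
        run = run + 1 if value >= L else 0
        if run >= C - 1:                   # C adjacent suffixes share >= L chars
            return True
    return False

found = has_frequent_substring(Lcp, L=1, C=4)
```

On the example the answer is positive because of the window 1, 1, 4 (the four suffixes starting with i) and the window 2, 1, 3 (the four suffixes starting with s).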

Page 22:

How to construct SA from T?

Input: T = mississippi#

Output: SA = 12 11 8 5 2 1 10 9 7 4 6 3, the starting positions of the suffixes in lexicographic order (#, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#)

Elegant but inefficient

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
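Assuming the "elegant but inefficient" construction is a plain comparison-based sort of the suffixes, it fits in one line of Python. The function name is illustrative; note that this version also materializes every suffix as a separate string.

```python
def naive_suffix_array(T):
    """Sort the 1-based suffix positions by comparing the suffixes directly."""
    # Each comparison may scan Θ(n) chars, hence Θ(n² log n) time in the worst case.
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

SA = naive_suffix_array("mississippi#")
```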

Page 23:

The skew algorithm

The key problem: compare two suffixes efficiently
Brute-force = Θ(n) time per cmp, Θ(n² log n) total

In order to sort the suffixes of S:

1. Divide the suffixes of S in two groups
   S0,2 = suffixes starting at positions 0 mod 3 or 2 mod 3
   S1 = suffixes starting at positions 1 mod 3

2a. Sort recursively S0,2 (they are 2n/3)

2b. Sort S1: suffix(3i+1) = S[3i+1] suff(3i+2)

3. Merge the sorted S0,2 with the sorted S1

T(n) = O(split) + T(2n/3) + O(|S1|) + O(merge) = O(n)

Page 24:

Sort recursively S0,2

We turn this problem into the SA-construction of a shorter string of length (2/3)n.

S = AAT GTG AGA TGA $$$ (positions 1 2 3 ... 13)

RadixSort all triplets that start at positions 0,2 mod 3
   T = {ATG, TGT, TGA, GAG, GAT, ATG, GA$, A$$}
   Sort(T) = (A$$, ATG, GA$, GAG, GAT, TGA, TGT)

Assign lexicographic names (log n bits)
   A$$=1, ATG=2, GA$=3,…

Build s0,2 and encode it:
   s0,2      = ATG TGA GAT GA$ TGT GAG ATG A$$
   enc(s0,2) = 2 6 5 3 7 4 2 1
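The naming step can be sketched as follows. Two assumptions of this sketch: Python's built-in sort stands in for the RadixSort of the slide, and the helper name `name_triplets` is hypothetical. Positions are 1-based and S is given without its $ padding.

```python
def name_triplets(S):
    """Return (positions, triplets, names) for s0,2, in the slide's order."""
    n = len(S)
    padded = S + "$$$"                       # '$' sorts before the letters
    order = list(range(2, n + 1, 3)) + list(range(3, n + 1, 3))

    def triple(i):
        return padded[i - 1:i + 2]           # the 3 chars starting at position i

    distinct = sorted({triple(i) for i in order})
    names = {t: r + 1 for r, t in enumerate(distinct)}
    return order, [triple(i) for i in order], [names[triple(i)] for i in order]

order, s02, enc = name_triplets("AATGTGAGATGA")
```

Equal triplets receive equal names (ATG = 2 appears twice), so the recursion on enc(s0,2) is what resolves their relative order.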

Page 25:

Sort recursively S0,2

Given
   S = AAT GTG AGA TGA $$$ (positions 1 2 3 ... 13)

we have built:
   s0,2      = ATG TGA GAT GA$ TGT GAG ATG A$$
   enc(s0,2) = 2 6 5 3 7 4 2 1

A suffix of s0,2 corresponds to a suffix of enc(s0,2): lex-order is preserved

SA(enc(s0,2)) gives SA0,2

It is SA0,2 = [12, 9, 2, 11, 6, 8, 5, 3]

Page 26:

Sort S1

We turn this problem into the sort of pairs

S = AAT GTG AGA TGA $$$ (positions 1 2 3 ... 13)

SA0,2 = [12, 9, 2, 11, 6, 8, 5, 3]

Key observation: a suffix of S1 is one character followed by a suffix of S0,2, so it is identified by the pair <character, pos(next suffix)>, where pos(j) is the rank of suffix j in SA0,2
   suff(1) = <A, pos(2)> = <A,3>
   suff(7) = <A, pos(8)> = <A,6>

SA1 = [1, 7, 4, 10]
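The pair sort can be sketched on the running example. The function name is an assumption, and Python's built-in sort on pairs stands in for a radix sort; positions past the end of S get rank 0.

```python
S = "AATGTGAGATGA"
SA02 = [12, 9, 2, 11, 6, 8, 5, 3]            # from the slide, 1-based

def sort_S1(S, SA02):
    """Sort the suffixes at positions 1 mod 3 via <first char, rank of next suffix>."""
    rank = {p: r + 1 for r, p in enumerate(SA02)}   # pos(j) on the slide
    positions = range(1, len(S) + 1, 3)
    return sorted(positions, key=lambda i: (S[i - 1], rank.get(i + 1, 0)))

SA1 = sort_S1(S, SA02)
```

The keys are <A,3>, <G,7>, <A,6>, <T,4> for positions 1, 4, 7, 10, which sort into the order shown on the slide.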

Page 27:

The merge step

S = AAT GTG AGA TGA $$$ (positions 1 2 3 ... 13)

To merge suffix s_i in S0,2 with suffix s_k in S1, note that
• If (i mod 3) = 2, then s_{i+1} and s_{k+1} belong to S0,2
• If (i mod 3) = 0, then s_{i+2} and s_{k+2} belong to S0,2
so their order can be derived from SA0,2 in O(1) time

[Figure: SA1 and SA0,2 are merged into SA.]

T(n) = T(2n/3) + O(n) + O(merge) = O(n)
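A sketch of the merge on the running example, using the two comparison rules above. The names are illustrative; positions are 1-based, positions past the end of S read as '$' and get rank 0.

```python
S = "AATGTGAGATGA"
SA02 = [12, 9, 2, 11, 6, 8, 5, 3]
SA1 = [1, 7, 4, 10]

def merge(S, SA02, SA1):
    padded = S + "$$$"
    rank = {p: r + 1 for r, p in enumerate(SA02)}

    def ch(i):
        return padded[i - 1]

    def rk(i):
        return rank.get(i, 0)

    def precedes(i, k):
        """True if suffix i (in S0,2) is smaller than suffix k (in S1)."""
        if i % 3 == 2:       # compare 1 char, then ranks of i+1 and k+1
            return (ch(i), rk(i + 1)) < (ch(k), rk(k + 1))
        # i mod 3 = 0: compare 2 chars, then ranks of i+2 and k+2
        return (ch(i), ch(i + 1), rk(i + 2)) < (ch(k), ch(k + 1), rk(k + 2))

    out, a, b = [], 0, 0
    while a < len(SA02) and b < len(SA1):    # O(1) work per output position
        if precedes(SA02[a], SA1[b]):
            out.append(SA02[a])
            a += 1
        else:
            out.append(SA1[b])
            b += 1
    return out + SA02[a:] + SA1[b:]

SA = merge(S, SA02, SA1)
```

Every comparison touches at most two characters and two ranks, so the merge runs in O(n) time overall.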