Suffix arrays

Suffix arrays. We lose some of the functionality but we save space. Let s = abab. Sort the suffixes lexicographically: ab, abab, b, bab.

Dec 20, 2015

Page 1:

Suffix arrays

Page 2:

Suffix array

• We lose some of the functionality but we save space.

Let s = abab

Sort the suffixes lexicographically: ab, abab, b, bab

The suffix array gives the indices of the suffixes in sorted order

2 0 3 1
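The definition above can be checked with a few lines of Python. This is only a minimal sketch that sorts the suffixes directly (the function name is made up for illustration); it is far slower than the constructions discussed later and is meant just to make the definition concrete.

```python
def naive_suffix_array(s):
    """Return the start indices of the suffixes of s in lexicographic order."""
    # Sorting the suffix strings themselves: simple, but O(n^2 log n) in the worst case.
    return sorted(range(len(s)), key=lambda i: s[i:])


print(naive_suffix_array("abab"))  # [2, 0, 3, 1]
```

Indices are 0-based, matching the array 2 0 3 1 shown for s = abab.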

Page 3:

How do we build it?

• Build a suffix tree

• Traverse the tree in DFS, lexicographically picking edges outgoing from each node, and fill the suffix array.

• O(n) time

Page 4:

How do we search for a pattern?

• If P occurs in T then all its occurrences are consecutive in the suffix array.

• Do a binary search on the suffix array

• Takes O(m log n) time
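The binary search described above can be sketched as follows. The function name and the choice to return text positions are assumptions made for illustration; each step compares at most m characters, which gives the O(m log n) bound.

```python
def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text, using the suffix array sa."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the suffix at rank k.
        return text[sa[k]:sa[k] + m]

    # First rank whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First rank whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All occurrences are consecutive ranks start .. lo-1.
    return sorted(sa[start:lo])


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction, for the demo only
print(find_occurrences(text, sa, "issi"))  # [1, 4]
print(find_occurrences(text, sa, "issa"))  # []
```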

Page 5:

Example

Let S = mississippi

SA   suffix
10   i
7    ippi
4    issippi
1    ississippi
0    mississippi
9    pi
8    ppi
6    sippi
3    sissippi
5    ssippi
2    ssissippi

Let P = issa

[Figure: L, M and R mark the left end, the middle and the right end of the current search interval in the suffix array]

Page 6:

How do we accelerate the search?

[Figure: the search interval with its ends L and R and its middle M]

Maintain l = LCP(P,L)

Maintain r = LCP(P,R)

Assume l ≥ r

Page 7:

[Figure: L, M and R, with l and r marked]

If l = r then start comparing M to P at l + 1

Page 8:

[Figure: L, M and R, with l and r marked]

l > r

Page 9:

[Figure: L, M and R, with l and r marked]

Someone whispers LCP(L,M)

LCP(L,M) > l

Page 10:

[Figure: L, M and R, with l and r marked]

LCP(L,M) > l

Continue in the right half

Page 11:

[Figure: L, M and R, with l and r marked]

LCP(L,M) < l

Page 12:

[Figure: L, M and R, with l and r marked]

LCP(L,M) < l

Continue in the left half

Page 13:

[Figure: L, M and R, with l and r marked]

LCP(L,M) = l

start comparing M to P at l + 1

Page 14:

Analysis

If we do more than a single comparison in an iteration then max(l, r) grows by 1 for each comparison

O(m + log n) time
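The case analysis of the accelerated search can be sketched as below. This is an illustration under assumptions, not the slides' code: the names are invented, the symmetric case r > l (using LCP(M,R)) is filled in by analogy, and the "whispered" values LCP(L,M) and LCP(M,R) are computed naively here, whereas the real structure would look them up in a precomputed table. The function returns the number of suffixes strictly smaller than P.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def lower_bound(text, sa, pattern):
    """Rank of the first suffix >= pattern (equals the number of smaller suffixes)."""
    n, m = len(sa), len(pattern)
    suffix = lambda k: text[sa[k]:]
    if n == 0 or pattern <= suffix(0):
        return 0
    if pattern > suffix(n - 1):
        return n

    def compare_from(mid, k):
        # pattern and suffix(mid) already agree on k characters; extend the match.
        start = sa[mid]
        while k < m and start + k < len(text) and pattern[k] == text[start + k]:
            k += 1
        not_larger = k == m or (start + k < len(text) and pattern[k] < text[start + k])
        return not_larger, k

    left, right = 0, n - 1  # invariant: suffix(left) < pattern <= suffix(right)
    l, r = lcp(pattern, suffix(left)), lcp(pattern, suffix(right))
    while right - left > 1:
        mid = (left + right) // 2
        if l >= r:
            x = lcp(suffix(left), suffix(mid))   # stand-in for the whispered LCP(L,M)
            if x > l:
                left = mid                       # continue in the right half, l unchanged
            elif x < l:
                right, r = mid, x                # continue in the left half
            else:
                go_left, k = compare_from(mid, l)
                if go_left:
                    right, r = mid, k
                else:
                    left, l = mid, k
        else:
            x = lcp(suffix(mid), suffix(right))  # symmetric case with LCP(M,R)
            if x > r:
                right = mid                      # continue in the left half, r unchanged
            elif x < r:
                left, l = mid, x                 # continue in the right half
            else:
                go_left, k = compare_from(mid, r)
                if go_left:
                    right, r = mid, k
                else:
                    left, l = mid, k
    return right


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(lower_bound(text, sa, "issa"))  # 2
```

Character comparisons against P only happen in `compare_from`, and each successful one increases max(l, r), which is where the O(m + log n) bound comes from once the LCP values are table lookups.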

Page 15:

Construct the suffix array without the suffix tree

Page 16:

Linear time construction

Recursively?

Say we want to sort only suffixes that start at even positions?

Page 17:

Change the alphabet

You in fact sort suffixes of a string shorter by a factor of 2!

Every pair of characters is now a character

Page 18:

Change the alphabet

a$  0
aa  1
ab  2
b$  3
ba  4
bb  5

a b a a a b $

2 1 2
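The renaming of pairs can be sketched like this. The helper names are hypothetical, and the code assumes the table lists every pair over the alphabet {a, b} with $ allowed only as the second character, as in the table above.

```python
def pair_alphabet(alphabet, end="$"):
    """Map every possible pair (second character may be the end marker) to its rank."""
    pairs = sorted(x + y for x in alphabet for y in end + alphabet)
    return {pair: code for code, pair in enumerate(pairs)}


def encode_pairs(s, alphabet, end="$"):
    """Rewrite s as a sequence of pair codes, taking pairs at even positions."""
    codes = pair_alphabet(alphabet, end)
    if len(s) % 2:          # pad an odd-length string with the end marker
        s += end
    return [codes[s[i:i + 2]] for i in range(0, len(s), 2)]


print(pair_alphabet("ab"))           # {'a$': 0, 'aa': 1, 'ab': 2, 'b$': 3, 'ba': 4, 'bb': 5}
print(encode_pairs("abaaab", "ab"))  # [2, 1, 2]
```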

Page 19:

But we do not gain anything…

Page 20:

Divide into triples

y a b b a d a b b a d o $

Triples starting at positions 1, 4, 7, 10: abb ada bba do$

Page 21:

Divide into triples

y a b b a d a b b a d o $

Triples starting at positions 1, 4, 7, 10: abb ada bba do$

Triples starting at positions 2, 5, 8, 11: bba dab bad o$$

Page 22:

Sort recursively 2/3 of the suffixes

position:  0 1 2 3 4 5 6 7 8 9 10 11 12
string:    y a b b a d a b b a d  o  $

index:    0   1   2   3   4   5   6   7
triple:   abb ada bba do$ bba dab bad o$$
name:     1   2   4   6   4   5   3   7

Suffix array of the renamed string: 0 1 6 4 2 5 3 7

The same suffixes as positions in the original string: 1 4 8 2 7 5 10 11

Ranks of the suffixes at positions 1, 2, 4, 5, 7, 8, 10, 11: 1 4 2 6 5 3 7 8
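The naming of the triples can be reproduced with a short sketch. The function name is invented, and the names are assigned here by sorting the distinct triples with Python's built-in sort rather than by the radix sort a linear-time implementation would use.

```python
def name_triples(s, end="$"):
    """Return (start positions, triples, names) for positions i with i % 3 != 0."""
    padded = s + end * 3                     # so every triple has three characters
    starts = list(range(1, len(s), 3)) + list(range(2, len(s), 3))
    triples = [padded[i:i + 3] for i in starts]
    # Equal triples get equal names; names follow lexicographic order, starting at 1.
    name = {t: k for k, t in enumerate(sorted(set(triples)), start=1)}
    return starts, triples, [name[t] for t in triples]


starts, triples, names = name_triples("yabbadabbado")
print(starts)   # [1, 4, 7, 10, 2, 5, 8, 11]
print(triples)  # ['abb', 'ada', 'bba', 'do$', 'bba', 'dab', 'bad', 'o$$']
print(names)    # [1, 2, 4, 6, 4, 5, 3, 7]
```

Sorting the suffixes of the name sequence 1 2 4 6 4 5 3 7 is the recursive call on a string of about 2/3 the length.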

Page 23:

Sort the remaining third

y a b b a d a b b a d o $

Ranks of the suffixes at positions 1, 2, 4, 5, 7, 8, 10, 11: 1 4 2 6 5 3 7 8

Pairs (char, rank of the following suffix) for positions 0, 3, 6, 9:

(y, 1) (b, 2) (a, 5) (a, 7)

Sorted pairs:

(a, 5) (a, 7) (b, 2) (y, 1)

Sorted suffixes with index 0 (mod 3): 6 9 3 0

Sorted suffixes with index 1 or 2 (mod 3): 1 4 8 2 7 5 10 11

Pages 24 to 34:

Merge

y a b b a d a b b a d o $

Ranks of the suffixes at positions 1, 2, 4, 5, 7, 8, 10, 11: 1 4 2 6 5 3 7 8

Sorted suffixes with index 1 or 2 (mod 3): 1 4 8 2 7 5 10 11

Sorted suffixes with index 0 (mod 3): 6 9 3 0

Step   Compare     Output so far
1      1 and 6     1
2      4 and 6     1 6
3      4 and 9     1 6 4
4      8 and 9     1 6 4 9
5      8 and 3     1 6 4 9 3
6      8 and 0     1 6 4 9 3 8
7      2 and 0     1 6 4 9 3 8 2
8      7 and 0     1 6 4 9 3 8 2 7
9      5 and 0     1 6 4 9 3 8 2 7 5
10     10 and 0    1 6 4 9 3 8 2 7 5 10
11     11 and 0    1 6 4 9 3 8 2 7 5 10 11
12     0 is left   1 6 4 9 3 8 2 7 5 10 11 0

Page 35:

Summary

y a b b a d a b b a d o $

Ranks of the suffixes at positions 1, 2, 4, 5, 7, 8, 10, 11: 1 4 2 6 5 3 7 8

Suffix array: 1 6 4 9 3 8 2 7 5 10 11 0

When comparing to a suffix with index 1 (mod 3) we compare the char and break ties by the ranks of the following suffixes

When comparing to a suffix with index 2 (mod 3) we compare the char, the next char if there is a tie, and finally the ranks of the following suffixes
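The two comparison rules, together with the sort of the remaining third, can be sketched as follows. The function name is invented and the sorted list of suffixes with index 1 or 2 (mod 3) is passed in, standing in for the result of the recursive call; the end marker $ is not stored, positions past the end simply get rank 0.

```python
def merge_thirds(s, sample_sorted):
    """Merge the sorted suffixes with i % 3 != 0 with those with i % 3 == 0."""
    n = len(s)
    rank = [0] * (n + 2)                     # rank 0 for positions past the end
    for r, i in enumerate(sample_sorted, start=1):
        rank[i] = r
    ch = lambda i: s[i] if i < n else ""     # "" sorts before every character

    # Remaining third: sort by (char, rank of the following suffix).
    rest = sorted(range(0, n, 3), key=lambda i: (ch(i), rank[i + 1]))

    def sample_first(i, j):
        """True if suffix i (i % 3 != 0) is smaller than suffix j (j % 3 == 0)."""
        if i % 3 == 1:
            return (ch(i), rank[i + 1]) < (ch(j), rank[j + 1])
        return (ch(i), ch(i + 1), rank[i + 2]) < (ch(j), ch(j + 1), rank[j + 2])

    out, a, b = [], 0, 0
    while a < len(sample_sorted) and b < len(rest):
        if sample_first(sample_sorted[a], rest[b]):
            out.append(sample_sorted[a])
            a += 1
        else:
            out.append(rest[b])
            b += 1
    return out + sample_sorted[a:] + rest[b:]


print(merge_thirds("yabbadabbado", [1, 4, 8, 2, 7, 5, 10, 11]))
# [1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0]
```

Every comparison touches at most two characters and two ranks, so the merge takes linear time.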

Page 36:

Compute LCP's

y a b b a d a b b a d o $

SA   suffix
1    abbadabbado$
6    abbado$
4    adabbado$
9    ado$
3    badabbado$
8    bado$
2    bbadabbado$
7    bbado$
5    dabbado$
10   do$
11   o$
0    yabbadabbado$

Page 37:

Crucial observation

[Figure: the same string, suffix array and sorted suffixes as on page 36]

LCP(i,j) = min {LCP(i,i+1), LCP(i+1,i+2), …, LCP(j-1,j)}

(here i and j are positions in the suffix array)
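A small sketch of the observation, with i and j taken as positions (ranks) in the suffix array. The function name is invented, and the list of adjacent LCP values for yabbadabbado is written out by hand from the sorted suffixes.

```python
def lcp_of_ranks(adjacent, i, j):
    """LCP of the suffixes at ranks i and j, given adjacent[k] = LCP(rank k, rank k+1)."""
    if i == j:
        raise ValueError("ranks must differ")
    if i > j:
        i, j = j, i
    return min(adjacent[i:j])


# Sorted suffixes of yabbadabbado start at 1 6 4 9 3 8 2 7 5 10 11 0.
adjacent = [5, 1, 2, 0, 3, 1, 4, 0, 1, 0, 0]
print(lcp_of_ranks(adjacent, 0, 2))  # 1  (abbadabbado vs adabbado)
print(lcp_of_ranks(adjacent, 4, 6))  # 1  (badabbado vs bbadabbado)
```

Taking `min` over a slice costs time linear in j - i; the point of the slides is that only linearly many of these ranges are needed for the search, so they can be precomputed.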

Pages 38 to 48:

Find LCP's of consecutive suffixes

[Figure: the same string, suffix array and sorted suffixes as on page 36; one LCP value is filled in per page]

Page   Pair         Value
38     LCP(11,0)    0
39     LCP(8,2)     1
40     LCP(9,3)     0
41     LCP(6,4)     1
42     LCP(7,5)     0
43     LCP(1,6)     5
44     LCP(2,7)     4
45     LCP(3,8)     3
46     LCP(4,9)     2
47     LCP(5,10)    1
48     LCP(10,11)   0

Page 49:

y a b b a d a b b a d o $

Suffix array: 1 6 4 9 3 8 2 7 5 10 11 0

LCP's of consecutive suffixes: 5 1 2 0 3 1 4 0 1 0 0

We need more LCPs for search

Linearly many, calculate them all bottom up
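The order in which the consecutive LCP values are filled in (by text position, reusing the previous value minus one) looks like the linear-time method usually attributed to Kasai et al. The sketch below is written under that assumption and uses invented names.

```python
def consecutive_lcps(s, sa):
    """lcp[r] = LCP of the suffixes at ranks r-1 and r in sa (lcp[0] = 0)."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0                                   # length already known to match
    for i in range(n):                      # visit suffixes in text order
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]                 # predecessor of suffix i in sorted order
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1                          # suffix i+1 shares at least h-1 characters
    return lcp


s = "yabbadabbado"
sa = [1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0]
print(consecutive_lcps(s, sa))  # [0, 5, 1, 2, 0, 3, 1, 4, 0, 1, 0, 0]
```

Since h drops by at most one per step and never exceeds n, the total number of character comparisons is linear.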

Page 50:

Another example

position:  1 2 3 4 5 6 7 8 9
string:    a b c a b b c a $

(in this figure $ is ordered after the letters)

SA   suffix       LCP with the next suffix
4    abbca$       2
1    abcabbca$    1
8    a$           0
5    bbca$        1
2    bcabbca$     3
6    bca$         0
3    cabbca$      2
7    ca$          0
9    $

Page 51:

Burrows-Wheeler (bzip2)

Currently best algorithm for text

Basic Idea:

• Sort the characters by their full context (typically done in blocks). This is called the block sorting transform.

• Use move-to-front encoding to encode the sorted characters.

The ingenious observation is that the decoder only needs the sorted characters and a pointer to the first character of the original sequence.

Page 52:

S = abraca

Step I: Create the matrix M built from all the cyclic shifts of S:

M =

a b r a c a #
b r a c a # a
r a c a # a b
a c a # a b r
c a # a b r a
a # a b r a c
# a b r a c a

Page 53:

Step II: Sort the rows lexicographically (here # is taken to be smaller than every letter):

F           L
# a b r a c a
a # a b r a c
a b r a c a #
a c a # a b r
b r a c a # a
c a # a b r a
r a c a # a b

F is the first column and L is the last column.

L is the Burrows Wheeler Transform
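The two steps (build all cyclic shifts, sort them, read off the last column) translate directly into a sketch. The function name is invented, and # is assumed to be the end marker and to sort before every letter, which is how Python orders it.

```python
def bwt(s, end="#"):
    """Return (F, L): first and last columns of the sorted matrix of cyclic shifts of s + end."""
    t = s + end
    rows = sorted(t[i:] + t[:i] for i in range(len(t)))  # Step I and Step II
    first = "".join(row[0] for row in rows)
    last = "".join(row[-1] for row in rows)
    return first, last


print(bwt("abraca"))       # ('#aaabcr', 'ac#raab')
print(bwt("mississippi"))  # ('#iiiimppssss', 'ipssm#pissii')
```

Building the whole matrix costs quadratic space; it is only meant to mirror the definition.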

Page 54:

[Figure: the matrix M and the sorted matrix, with its first column F and last column L]

Claim: Every column contains all chars.

You can obtain F from L by sorting

Page 55:

[Figure: the sorted matrix, with its first column F and last column L]

The “a’s” are in the same order in L and in F,

Similarly for every other char.

Page 56:

From L you can reconstruct the string

[Figure: the columns F and L of the sorted matrix]

What is the first char of S?

Page 57:

From L you can reconstruct the string

What is the first char of S? a

Page 58:

From L you can reconstruct the string

ab

Page 59:

From L you can reconstruct the string

abr
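The reconstruction can be sketched by using the fact that equal characters appear in the same relative order in F and in L. The function name is invented; the end marker # and the example transform ac#raab (which assumes # sorts before the letters) are assumptions of this sketch.

```python
def inverse_bwt(last, end="#"):
    """Rebuild the original string (without the end marker) from the column L."""
    n = len(last)
    # Stable sort of the positions of L: order[f] is the position in L of the
    # character occurrence that sits at position f in F.
    order = sorted(range(n), key=lambda i: last[i])
    first = [last[i] for i in order]        # F is L sorted
    row = last.index(end)                   # the row that ends with the marker is S itself
    out = []
    for _ in range(n - 1):
        out.append(first[row])              # next character of S
        row = order[row]                    # row that starts one character later in S
    return "".join(out)


print(inverse_bwt("ac#raab"))       # abraca
print(inverse_bwt("ipssm#pissii"))  # mississippi
```

The first character output is F in the row where L holds the end marker, which answers the question on the slide: a.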

Page 60:

Compression?

[Figure: the column L]

Compress the transform to a string of integers using move to front

0 2 3 2 0 3

Then use Huffman to code the integers
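Move-to-front itself is short enough to sketch. The function name is invented and the initial table order is an assumption; the integers produced depend on that order, which the slide does not state, so the output below is not meant to reproduce the numbers shown on the slide.

```python
def move_to_front(text, alphabet):
    """Encode text as the positions of its characters in a self-adjusting table."""
    table = list(alphabet)
    out = []
    for ch in text:
        idx = table.index(ch)
        out.append(idx)
        table.insert(0, table.pop(idx))     # move the character just seen to the front
    return out


# The transform of abraca# when # sorts first, with the table initially sorted.
print(move_to_front("ac#raab", "#abcr"))  # [1, 3, 2, 4, 3, 0, 4]
print(move_to_front("aaabbb", "ab"))      # [0, 0, 0, 1, 0, 0]
```

Runs of equal characters, which the transform tends to create, turn into runs of zeros, and those are cheap to code with Huffman.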

Page 61:

[Figure: the matrix M and the sorted matrix, with its first column F and last column L]

Why is it good?

Characters with the same (right) context appear together

Page 62:

[Figure: the matrix M and the sorted matrix, with its first column F and last column L]

Sorting is equivalent to computing the suffix array.

Can encode and decode in linear time

Page 63:

Substring search using the BWT (Count the pattern occurrences)

T = mississippi, P = si

row  sorted rotation  L
1    #mississipp      i
2    i#mississip      p
3    ippi#missis      s
4    issippi#mis      s
5    ississippi#      m
6    mississippi      #
7    pi#mississi      p
8    ppi#mississ      i
9    sippi#missi      s
10   sissippi#mi      s
11   ssippi#miss      i
12   ssissippi#m      i

Available info, C: # 1, i 2, m 7, p 8, s 10

First step: fr and lr delimit the rows prefixed by char “i”

Inductive step: Given fr, lr for P[j+1,p]

• Take c = P[j]

• Find the first c in L[fr, lr]

• Find the last c in L[fr, lr]

• L-to-F mapping of these chars gives the new fr and lr

occ = 2 [lr-fr+1]

slide stolen from Paolo Ferragina

Page 64:

Substring search using the BWT (Count the pattern occurrences)

[Figure: the same sorted rotations, column L and table C as on page 63, with P = si]

slide stolen from Paolo Ferragina

What if someone whispers how many “s” we have up to index 2 and up to index 5:

occ(s,2), occ(s,5)?

fr = C[s] + occ(s,2) + 1

lr = C[s] + occ(s,5)
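The counting procedure can be sketched as below. This is an illustration under assumptions: the names are invented, rows are numbered from 1, C[c] is defined here as the number of characters in the text smaller than c (so C[s] = 8 for mississippi#), and occ(c, i) counts c among the first i characters of L, so the new fr uses the count strictly before the old fr. occ is computed by scanning, where the slides would have it whispered.

```python
def count_occurrences(text, pattern, end="#"):
    """Count occurrences of pattern in text by backward search on the BWT."""
    t = text + end
    rows = sorted(t[i:] + t[:i] for i in range(len(t)))
    last = "".join(row[-1] for row in rows)              # the column L
    C = {c: sum(1 for x in t if x < c) for c in set(t)}  # characters smaller than c

    def occ(c, i):
        return last[:i].count(c)                         # c among the first i chars of L

    fr, lr = 1, len(t)                                   # all rows, 1-based and inclusive
    for c in reversed(pattern):                          # process P from its last character
        if c not in C:
            return 0
        fr = C[c] + occ(c, fr - 1) + 1
        lr = C[c] + occ(c, lr)
        if fr > lr:
            return 0
    return lr - fr + 1


print(count_occurrences("mississippi", "si"))    # 2
print(count_occurrences("mississippi", "issi"))  # 2
```

For P = si the first step gives fr = 2, lr = 5 (rows starting with i), and the second gives fr = 9, lr = 10, so occ = lr - fr + 1 = 2.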

Page 65:

occ(a, j)

L = ipssm#pissii

occ(s,4) = 2

Page 66:

Make a bit vector for each character

L:  i p s s m # p i s s i i
s:  0 0 1 1 0 0 0 0 1 1 0 0

occ(s,4) = rank(4)

rank(i) = how many ones are there before position i?

Page 67:

How do you answer rank queries ?

0 0 1 1 0 0 0 0 1 1 0 0

rank(i) = how many ones are there before position i ?

We can prepare a vector with all answers

0 0 1 2 2 2 2 2 3 4 4 4
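The "vector with all answers" is a running sum of the bit vector. A minimal sketch with an invented function name, where entry k of the table counts the ones up to and including position k+1 (positions numbered from 1, as in occ(s,4) = 2):

```python
from itertools import accumulate


def rank_table(text, ch):
    """Return (bit vector of ch in text, running count of ones)."""
    bits = [1 if c == ch else 0 for c in text]
    return bits, list(accumulate(bits))


bits, table = rank_table("ipssm#pissii", "s")
print(bits)       # [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
print(table)      # [0, 0, 1, 2, 2, 2, 2, 2, 3, 4, 4, 4]
print(table[3])   # 2, i.e. occ(s,4)
```

Each table entry needs about log(n) bits, so this costs on the order of n log(n) bits per character, which is what the following pages reduce.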

Page 68:

Let's do it with O(n) bits per character

Partition in 2n/log(n) blocks of size log(n)/2

[Figure: the bit vector partitioned into blocks of size log(n)/2, with the counts 2, 5, 7 kept at block boundaries]

Keep the answer for each prefix of the blocks

There are √n “kinds” of blocks, prepare a table with all answers for each block

Page 69:

[Figure: the bit vector partitioned into blocks of size log(n)/2, with the counts 2, 5, 7 kept at block boundaries]

In our solution the bit vector takes Θ(n) bits and also the “additionals” take Θ(n) bits

Page 70:

Can we do it with smaller overhead, so the additionals would take o(n)?

superblocks of size log²(n)

[Figure: the bit vector partitioned into superblocks of size log²(n), with the counts 7 and 13 kept at superblock boundaries]

Each block keeps the number of ones in previous blocks that are in the same superblock

[Figure: the blocks inside one superblock, with the counts 2 and 4 kept at block boundaries]

Page 71:

Analysis

[Figure: the bit vector with its superblock counts and block counts, as on page 70]

The superblock table is of size n/log(n)

The block table is of size loglog(n) * n/log(n)

The tables for the blocks: √n log(n) loglog(n)

So the additionals take o(n) space
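The two-level layout can be sketched as a small class. The class and attribute names are invented, block sizes are rounded so that a superblock is a whole number of blocks, and the per-block lookup table is replaced here by a direct scan of at most one block; a faithful implementation would index a precomputed table by the block's content instead.

```python
import math


class RankDirectory:
    """Two-level rank index over a list of 0/1 values (a sketch, sizes rounded)."""

    def __init__(self, bits):
        self.bits = list(bits)
        log_n = max(2, math.ceil(math.log2(max(len(self.bits), 2))))
        self.block = log_n // 2                    # about log(n)/2
        self.superblock = self.block * 2 * log_n   # about log^2(n), a multiple of block
        self.super_counts = []   # ones before each superblock
        self.block_counts = []   # ones before each block, within its superblock
        total = inside = 0
        for start in range(0, len(self.bits), self.block):
            if start % self.superblock == 0:
                self.super_counts.append(total)
                inside = 0
            self.block_counts.append(inside)
            ones = sum(self.bits[start:start + self.block])
            total += ones
            inside += ones

    def rank(self, i):
        """Number of ones among the first i bits."""
        i = min(i, len(self.bits))
        if i <= 0:
            return 0
        b = (i - 1) // self.block
        s = (i - 1) // self.superblock
        tail = sum(self.bits[b * self.block:i])    # stand-in for the block lookup table
        return self.super_counts[s] + self.block_counts[b] + tail


directory = RankDirectory([0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0])
print([directory.rank(i) for i in range(1, 13)])  # [0, 0, 1, 2, 2, 2, 2, 2, 3, 4, 4, 4]
```

Block counts stay below log²(n), so each needs only about loglog(n) bits, which is the source of the o(n) bound for the additional tables.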

Page 72:

Next step

Do it without keeping the bit vectors themselves

Instead keep only the compressed version of the text

Saves a lot of space for compressible strings