A New Compressed Suffix Tree Supporting Fast Search and its Construction Algorithm Using Optimal Working Space Dong Kyue Kim 1 and Heejin Park 2 1 School of Electrical and Computer Engineering, Pusan National Univ. 2 College of Information and Communications, Hanyang Univ.

Dec 17, 2015


A New Compressed Suffix Tree Supporting Fast Search and its

Construction Algorithm Using Optimal Working Space

Dong Kyue Kim 1 and Heejin Park 2

1 School of Electrical and Computer Engineering, Pusan National Univ.

2 College of Information and Communications, Hanyang Univ.


Contents

Preliminaries

Previous results

Our contribution

Conclusion


Suffix Tree

The suffix tree (ST) of a text T: a compacted trie for all the suffixes of T.

An example for accagat#.

[Figure: the suffix tree for accagat#, with one leaf per suffix: #, accagat#, agat#, at#, cagat#, ccagat#, gat#, t#.]

We assume that # is the lexicographically smallest special symbol.


Suffix Array

The suffix array (SA) of a text T consists of:

pos array

lcp array


Suffix Array

The suffix array (SA) of a text T consists of a pos array and an lcp array.

The pos array of T stores the starting positions of the lexicographically sorted suffixes of T.

T = accagat#

index  pos  suffix
1      8    #
2      1    accagat#
3      4    agat#
4      6    at#
5      3    cagat#
6      2    ccagat#
7      5    gat#
8      7    t#


Suffix Array

The suffix array (SA) of a text T consists of a pos array and an lcp array.

The pos array of T stores the starting positions of the lexicographically sorted suffixes of T.

The lcp array of T stores the length of the longest common prefix of each pair of adjacent suffixes in the pos array.

For example, lcp[3] stores 1, which is the length of the longest common prefix of accagat# and agat#.

T = accagat#

index  pos  lcp  suffix
1      8         #
2      1    0    accagat#
3      4    1    agat#
4      6    1    at#
5      3    0    cagat#
6      2    1    ccagat#
7      5    0    gat#
8      7    0    t#
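The pos and lcp columns for T = accagat# can be reproduced with a short sketch. It is a deliberately naive illustration (plain sorting of suffix strings), not the construction used in the talk; the function names are made up for this example, and it relies on # sorting before the letters in ASCII, which matches the assumption that # is the smallest symbol.

```python
def common_prefix_length(a, b):
    """Length of the longest common prefix of strings a and b."""
    length = 0
    while length < min(len(a), len(b)) and a[length] == b[length]:
        length += 1
    return length


def naive_pos_and_lcp(text):
    """Return (pos, lcp) for text, built by sorting all suffixes.

    pos[k-1] is the 1-based starting position of the k-th smallest suffix.
    lcp[k-1] is the longest common prefix length of the k-th smallest suffix
    and its predecessor; the first entry is None because it has none.
    """
    order = sorted(range(len(text)), key=lambda start: text[start:])
    pos = [start + 1 for start in order]
    lcp = [None]
    for k in range(1, len(order)):
        previous_suffix = text[order[k - 1]:]
        current_suffix = text[order[k]:]
        lcp.append(common_prefix_length(previous_suffix, current_suffix))
    return pos, lcp


pos, lcp = naive_pos_and_lcp("accagat#")
print(pos)  # [8, 1, 4, 6, 3, 2, 5, 7]
print(lcp)  # [None, 0, 1, 1, 0, 1, 0, 0]
```

Python lists are 0-based, so row k of the slide's table corresponds to pos[k-1] and lcp[k-1] here.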


Storing Suffix Trees in Arrays

Suffix trees can be stored in arrays if they are used as static data structures.

If a suffix tree is used as a static data structure, it can be implemented using arrays instead of nodes and pointers, in a similar way to how a complete binary tree is stored in an array.

Array-based data structures storing suffix trees

Enhanced suffix arrays (ESA)

Linearized suffix trees (LST)


Enhanced Suffix Array

Enhanced suffix array: developed by Abouelhoda et al. [SPIRE ’02, WABI ’02, JDA ’04]

a pos array + an lcp array + a child table

The child table is an array implementation of the suffix tree topology whose node branching is implemented by a linked list.

Pattern search takes O(m|Σ|) time (m: pattern length, |Σ|: alphabet size).


Linearized Suffix Tree

Linearized suffix tree: an improvement on ESA developed by Kim et al. [SPIRE ’04]

a pos array + an lcp array + a new child table

The new child table is an array implementation of the suffix tree topology whose node branching is implemented by a complete binary tree.

Pattern search takes O(m log |Σ|) time (m: pattern length, |Σ|: alphabet size).
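The difference between the |Σ| factor and the log |Σ| factor comes from how one child is picked at each node. The sketch below is only an analogy under stated assumptions: it models a node's children as a sorted list of first characters and compares a linked-list style scan with a binary search. The function names and the step counters are hypothetical, and the real ESA and LST child tables are laid out differently.

```python
def find_child_linear(labels, symbol):
    """Scan sibling labels one by one, as with a linked list of children.

    Returns (index or None, number of labels inspected).
    """
    steps = 0
    for index, label in enumerate(labels):
        steps += 1
        if label == symbol:
            return index, steps
    return None, steps


def find_child_binary(labels, symbol):
    """Binary search over sorted sibling labels.

    Returns (index or None, number of labels inspected).
    """
    steps = 0
    low, high = 0, len(labels)
    while low < high:
        middle = (low + high) // 2
        steps += 1
        if labels[middle] < symbol:
            low = middle + 1
        else:
            high = middle
    if low < len(labels) and labels[low] == symbol:
        return low, steps
    return None, steps


# First characters of the root's children in the suffix tree for accagat#.
root_labels = ["#", "a", "c", "g", "t"]
print(find_child_linear(root_labels, "t"))  # (4, 5)
print(find_child_binary(root_labels, "t"))  # (4, 3)
```

With up to |Σ| children per node, the scan inspects up to |Σ| labels per pattern character, while the binary search inspects about log |Σ|.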


Compressed Full-text Indices

Compressed full-text indices occupy O(n log|Σ|)-bit space. All the full-text indices (ST, SA, ESA, LST) we just introduced occupy O(n)-word space.

Compressed suffix array (CSA): a succinct representation of a pos array.

Compressed suffix tree (CST): a succinct representation of a pos array, an lcp array, and a suffix tree topology.


Previous Results

Munro et al. [1998], Sadakane [2002]: a succinct representation of a suffix tree topology

Grossi and Vitter [2000]: a succinct representation of a pos array

Sadakane [2002]: a succinct representation of an lcp array

These data structures occupy O(n log|Σ|)-bit space; however, when they were introduced, the working space needed to construct them was more than O(n log|Σ|) bits.


Previous Results

Hon et al. [2002][2003] developed O(n log|Σ|)-bit working space algorithms for constructing CSTs and CSAs that run in O(n log^ε n) time.

Their construction algorithm for CSTs can construct CSTs supporting O(m log^ε n |Σ|)-time pattern search (m: pattern length).

However, it cannot construct CSTs supporting O(m log^ε n log|Σ|)-time pattern search.


Our Contribution

We first present a new CST supporting O(m log^ε n log|Σ|)-time pattern search (m: pattern length).

Then, we present an algorithm for constructing the new CST that runs in optimal O(n log|Σ|)-bit working space and O(n log^ε n) time.


New Compressed Suffix Tree

Our new compressed suffix tree is a succinct representation of the linearized suffix tree (LST). It consists of:

a succinct representation of a pos array,

a succinct representation of an lcp array, and

a succinct representation of a child table, which stores a suffix tree topology.


New Compressed Suffix Tree

The succinct representations of the pos array and the lcp array are the same as before:

a succinct representation of a pos array (Grossi & Vitter)

a succinct representation of an lcp array (Sadakane)

The succinct representation of the child table, which stores a suffix tree topology, is new.


Previous Compressed Suffix Tree Topology

The previous succinct representation of a suffix tree is a parentheses representation. In this representation, every node is represented by a pair of parentheses. The pair of parentheses of a node encloses its children's parentheses.

[Figure: the suffix tree for accagat# with its leaves numbered 1 to 8 from left to right, above its parentheses representation.]

( () (() () ()) (() ()) () ())
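As a small illustration of the parentheses representation, the sketch below encodes the shape of the example suffix tree (a root with five children, of which the second has three leaves and the third has two) and then recovers a node's children by scanning for matching parentheses. The nested-list encoding and both function names are assumptions made for this example, and the linear scan only stands in for the succinct navigation structures, which the sketch does not implement.

```python
def to_parentheses(node):
    """Encode a tree given as nested lists; a leaf is an empty list."""
    return "(" + "".join(to_parentheses(child) for child in node) + ")"


def child_openings(bp, open_index):
    """Positions of the opening parentheses of the children of the node
    whose opening parenthesis is at open_index (found by a linear scan)."""
    depth = 0
    openings = []
    for k in range(open_index + 1, len(bp)):
        if bp[k] == "(":
            if depth == 0:
                openings.append(k)
            depth += 1
        else:
            if depth == 0:
                break  # this is the closing parenthesis of the node itself
            depth -= 1
    return openings


# Shape of the suffix tree for accagat#:
# leaf 1, node with leaves 2-4, node with leaves 5-6, leaf 7, leaf 8.
tree_shape = [[], [[], [], []], [[], []], [], []]
bp = to_parentheses(tree_shape)
print(bp)                     # (()(()()())(()())()())
print(child_openings(bp, 0))  # [1, 3, 11, 17, 19]
print(child_openings(bp, 3))  # [4, 6, 8]
```

The output string is the slide's sequence without the spaces. Nothing in it names a node's children directly; they have to be derived from the nesting.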


Previous Compressed Suffix Tree Topology

In this representation, the parent-child relationship is stored implicitly.

To find a child of a node, a range-minima query is required.


New compressed tree topology

Our succinct representation differs from the previous one in that we store the parent-child relationship explicitly rather than implicitly.

A range-minima query is not required.


Child Table

We first describe a child table and then the succinct representation of a child table, i.e., the compressed child table.

A child table stores an lcp-interval tree, which is a modification of a suffix tree. We first show how to modify a suffix tree into an lcp-interval tree, and then how to store an lcp-interval tree in a child table.


Child Table

suffix tree -> lcp-interval tree -> child table

[Figure: the suffix tree for accagat# (left) and the suffix tree for accagat# whose node branching is a complete binary tree (right).]


Child Table

suffix tree -> lcp-interval tree -> child table

Each node in the suffix tree is replaced by the interval in the pos array which stores the suffixes in the subtree rooted at the node.

lcp-interval tree (children are indented under their parent):

[1..8]
  [1..6]
    [1..4]
      [1]
      [2..4]
        [2..3]
          [2]
          [3]
        [4]
    [5..6]
      [5]
      [6]
  [7..8]
    [7]
    [8]
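The intervals that correspond to the internal nodes of the original (not yet binarized) suffix tree can be read off the lcp array with a single stack-based scan, in the style described for enhanced suffix arrays. The following is a minimal sketch under that assumption; the function name and the (lcp value, left, right) tuple layout are choices made for this example. The extra intervals such as [1..6] and [7..8] come from arranging each node's children as a complete binary tree, a step this sketch does not perform.

```python
def lcp_intervals(lcp, n):
    """Return (lcp value, left, right) for every internal-node interval.

    lcp is indexed from 1 as on the slides: lcp[k] is defined for k = 2..n,
    and lcp[0] and lcp[1] are unused placeholders.
    """
    found = []
    stack = [(0, 1)]  # pairs of (lcp value, left boundary); the root is first
    for k in range(2, n + 1):
        left = k - 1
        # Close every open interval whose lcp value exceeds lcp[k].
        while lcp[k] < stack[-1][0]:
            value, left = stack.pop()
            found.append((value, left, k - 1))
        # Open a new interval if lcp[k] is larger than the current one.
        if lcp[k] > stack[-1][0]:
            stack.append((lcp[k], left))
    # Close whatever is still open at the end of the array.
    while stack:
        value, left = stack.pop()
        found.append((value, left, n))
    return found


example_lcp = [None, None, 0, 1, 1, 0, 1, 0, 0]  # lcp[2..8] for accagat#
print(lcp_intervals(example_lcp, 8))
# [(1, 2, 4), (1, 5, 6), (0, 1, 8)]
```

The three results are the node for the suffixes starting with a ([2..4]), the node for those starting with c ([5..6]), and the root ([1..8]).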


Child Table

suffix tree -> lcp-interval tree -> child table

[Figure: the lcp-interval tree for accagat#, rooted at [1..8].]

index        1 2 3 4 5 6 7
child table  7 4 3 2 6 5 8

Each interval [i..j] only has to store the first index of its right child, denoted by child(i,j), from which both of its children can be computed.

Interval [1..8] only has to store 7 to compute its two children [1..6] and [7..8].

Interval [1..6] stores 5 to compute its two children [1..4] and [5..6].


Child Table

suffix tree -> lcp-interval tree -> child table

[Figure: the lcp-interval tree for accagat#, rooted at [1..8].]

index   1 2 3 4 5 6 7
cldtab  7 4 3 2 6 5 8

Where is child(i,j) stored?

We store child(i,j) in cldtab[i] or cldtab[j].

If [i..j] is a right child, child(i,j) is stored in cldtab[i].

If [i..j] is a left child, child(i,j) is stored in cldtab[j].

Interval [7..8] is a right child, so child(7,8) = 8 is stored in cldtab[7].

Interval [1..6] is a left child, so child(1,6) = 5 is stored in cldtab[6].
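The storage rule can be exercised on the example table. The sketch below looks up child(i,j) according to whether the interval is a left or a right child and walks the whole lcp-interval tree from the root. One point is an assumption: the slides do not say where the root's entry lives, but the example table has 7 in cldtab[1], so the root [1..8] is looked up like a right child. The names CLDTAB, children and all_intervals are made up for this example.

```python
# The example child table, keyed by the 1-based index used on the slides.
CLDTAB = {1: 7, 2: 4, 3: 3, 4: 2, 5: 6, 6: 5, 7: 8}


def children(i, j, is_left_child):
    """Return the two child intervals of the non-leaf interval [i..j]."""
    first_of_right = CLDTAB[j] if is_left_child else CLDTAB[i]
    return (i, first_of_right - 1), (first_of_right, j)


def all_intervals(n):
    """List every interval of the tree in preorder, starting from [1..n]."""
    result = []
    # Assumption: the root is treated like a right child (entry in cldtab[1]).
    stack = [(1, n, False)]
    while stack:
        i, j, is_left_child = stack.pop()
        result.append((i, j))
        if i < j:
            left, right = children(i, j, is_left_child)
            stack.append((right[0], right[1], False))
            stack.append((left[0], left[1], True))
    return result


print(children(1, 8, False))  # ((1, 6), (7, 8))
print(children(1, 6, True))   # ((1, 4), (5, 6))
print(all_intervals(8))
```

The traversal visits 15 intervals: the 7 internal ones, one per table entry, and the 8 leaves.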






Compressed Child Table

[Figure: the lcp-interval tree for accagat#, rooted at [1..8].]

index   1 2 3 4 5 6 7
cldtab  7 4 3 2 6 5 8   (child table)
diff    1 0 0 1 0 1 0   (difference child table)
sign    0 0 0 1 0 0 0   (difference child table)

Difference child table: diff array, sign array

In the diff array, instead of storing child(i,j), we store min{j-child(i,j), child(i,j)-i}. For the interval [1..4], whose child(1,4) = 2, we compute 4-2=2 and 2-1=1, and the minimum, 1, is stored in diff[4].

Whether a diff entry stores j-child(i,j) or child(i,j)-i is indicated by the corresponding sign entry: it stores 0 if j-child(i,j) is stored in the diff entry and 1 if child(i,j)-i is stored. Since diff[4] stores child(1,4)-1, sign[4] stores 1.

child table -> difference child table -> compressed child table
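The diff and sign rows can be recomputed from the child table entries. In the sketch below, each entry lists an interval [i..j], its child(i,j), and the cldtab index where it is stored, all taken from the example. The names are made up for this example, and the tie-breaking rule (prefer j-child(i,j) when both differences are equal) is an assumption, since the example contains no tie.

```python
# (i, j, child(i,j), index in cldtab) for the seven internal intervals.
ENTRIES = [
    (1, 8, 7, 1),
    (2, 4, 4, 2),
    (2, 3, 3, 3),
    (1, 4, 2, 4),
    (5, 6, 6, 5),
    (1, 6, 5, 6),
    (7, 8, 8, 7),
]


def encode(i, j, child):
    """Return (diff, sign) for child(i,j) of the interval [i..j]."""
    from_right = j - child
    from_left = child - i
    if from_right <= from_left:  # assumption: ties go to j - child(i,j)
        return from_right, 0
    return from_left, 1


def decode(i, j, diff, sign):
    """Recover child(i,j) from its (diff, sign) pair."""
    return j - diff if sign == 0 else i + diff


diff = {}
sign = {}
for i, j, child, slot in ENTRIES:
    diff[slot], sign[slot] = encode(i, j, child)

print([diff[k] for k in range(1, 8)])  # [1, 0, 0, 1, 0, 1, 0]
print([sign[k] for k in range(1, 8)])  # [0, 0, 0, 1, 0, 0, 0]
```

Because the stored value is the smaller of the two distances, it never exceeds half the interval length, which is what keeps the encoded integers short.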


Compressed Child Table

Compressed child table: compressed diff array, sign array

child table -> difference child table -> compressed child table


Compressed Child Table

Compressed child table: compressed diff array, sign array

The compressed diff array consists of:

C array: a concatenated bit string of the integers in the diff array

D array: a bit string of the same length as the C array in which every bit is 0 except the starting bit of each integer in the diff array

Data structures for rank and select on the D array, used to find the i-th leftmost 1 in the D array

diff     4    0  3   1  0  1  0
C array  100  0  11  1  0  1  0
D array  100  1  10  1  1  1  1

child table -> difference child table -> compressed child table
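The C and D arrays for the slide's diff values 4 0 3 1 0 1 0 can be built and queried as follows. This is a sketch under stated assumptions: bit strings are ordinary Python strings, and select is a linear scan standing in for the o(n)-bit rank and select structures, which are not implemented here. All function names are made up for this example.

```python
def compress_diff(diff):
    """Return (C, D): concatenated binary codes and start-bit markers."""
    c_parts = []
    d_parts = []
    for value in diff:
        code = format(value, "b")  # binary without leading zeros; 0 -> "0"
        c_parts.append(code)
        d_parts.append("1" + "0" * (len(code) - 1))
    return "".join(c_parts), "".join(d_parts)


def select1(bits, i):
    """0-based position of the i-th leftmost 1 (i counted from 1).

    Returns len(bits) if there are fewer than i ones.
    """
    seen = 0
    for position, bit in enumerate(bits):
        if bit == "1":
            seen += 1
            if seen == i:
                return position
    return len(bits)


def access(c, d, i):
    """Decode the i-th integer (i counted from 1) of the diff array."""
    start = select1(d, i)
    end = select1(d, i + 1)
    return int(c[start:end], 2)


C, D = compress_diff([4, 0, 3, 1, 0, 1, 0])
print(C)  # 1000111010
print(D)  # 1001101111
print([access(C, D, i) for i in range(1, 8)])  # [4, 0, 3, 1, 0, 1, 0]
```

Two select queries on D delimit the i-th code inside C, which is why the integers can be stored without fixed-width padding.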


Compressed Child Table

Space consumption of a compressed child table: the compressed child table requires 5n + o(n) bits.

C array: 2n bits

D array: 2n bits

Data structures for rank and select: o(n) bits

sign array: n bits


Construction Algorithm

We construct the compressed child table directly from the lcp array without building a suffix tree or an lcp-interval tree as intermediate data structures.

The child table can be constructed directly from the lcp array in O(n) time due to Kim et al. [SPIRE ’04].

They first construct the extended lcp array and then compute the child table.

We modify their construction algorithm so that it constructs the compressed child table directly from the compressed lcp array.


Construction Algorithm

The construction algorithm consists of two procedures EXTLCP and CHILD.

Procedure EXTLCP constructs the compressed extended lcp array from the compressed lcp array.

Procedure CHILD constructs the compressed child table, i.e., the C, D, and sign arrays, from the compressed extended lcp array.


Construction Algorithm

Pseudo-code for EXTLCP


Construction Algorithm

To obtain the O(n log|Σ|)-bit working space, the size of the temporary data structures should be O(n log|Σ|) bits.


Construction Algorithm

To obtain the O(n log|Σ|)-bit working space, the size of the temporary data structures should be O(n log|Σ|) bits.

The arrays ranking and numchild are of size O(n log|Σ|) bits because a node may have at most |Σ| children and each entry of the arrays consumes log|Σ| bits.


Construction Algorithm

To obtain the O(n log|Σ|)-bit working space, the size of the temporary data structures should be O(n log|Σ|) bits.

The size of the stack is O(n log|Σ|) bits because it can be encoded by the δ-code.


Construction Algorithm

Pseudo-code for CHILD

We also developed some techniques to reduce the working space.


Conclusion

We presented a new compressed suffix tree supporting O(m log^ε n log|Σ|)-time pattern search (m: pattern length), whose compressed child table consumes 5n + o(n) bits of space.

We also presented a construction algorithm for our compressed suffix tree that runs in O(n log|Σ|)-bit working space and O(n log^ε n) time.


Compressed Child Table

Space consumption of a compressed child table: the compressed child table requires 5n + o(n) bits.

C array: 2n bits

S(n) = max_{1 ≤ k ≤ n/2} { S(k) + S(n-k) + log(k+1) }

D array: 2n bits

Data structures for rank and select: o(n) bits

sign array: n bits
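The recurrence for S(n) can be evaluated numerically to see that it stays below the stated 2n bits for the C array. The sketch below assumes the base case S(1) = 0 and a base-2 logarithm, neither of which is stated on the slide, and it only checks small n; it is a sanity check, not a proof. The function name is made up for this example.

```python
from math import log2


def worst_case_bits(limit):
    """Evaluate S(n) = max over 1 <= k <= n/2 of S(k) + S(n-k) + log2(k+1).

    Assumes S(1) = 0. Returns a list s where s[n] = S(n) for 1 <= n <= limit;
    s[0] is an unused placeholder.
    """
    s = [0.0] * (limit + 1)
    for n in range(2, limit + 1):
        s[n] = max(s[k] + s[n - k] + log2(k + 1) for k in range(1, n // 2 + 1))
    return s


table = worst_case_bits(200)
print(table[2])  # 1.0
print(all(table[n] <= 2 * n for n in range(1, 201)))  # True
```

Under the same assumptions, induction on n gives the slightly stronger bound S(n) ≤ 2n - log2(n+1) - 1, which explains why the check passes.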