Time-Space Trade-Offs for the Longest Common Substring Problem

Tatiana Starikovskaya¹ and Hjalte Wedel Vildhøj²

¹ Moscow State University, Department of Mechanics and Mathematics, [email protected]
² Technical University of Denmark, DTU Compute, [email protected]

CPM 2013, Bad Herrenalb, Germany
June 17, 2013

The Longest Common Substring Problem
Definition

Problem: Given strings T1, T2, ..., Tm of total length n and a parameter d with 2 ≤ d ≤ m, compute the longest substring that appears in at least d of the strings.

Example

T1 = a g g c t a g c t a c c t   (positions 1 to 13)
T2 = a c a c c t a c c c t a g
T3 = a c t a g t a a t g c a t

d = 3 ⇒ LCS = c t a g
d = 2 ⇒ LCS = c t a c c

The Longest Common Substring Problem
A patented solution

A Textbook Solution

Build Generalized Suffix Tree

T1 = a g g c t a g c t a c c t   (positions 1 to 13)
T2 = a c a c c t a c c c t a g

[Figure: generalized suffix tree of T1 and T2, with leaves labeled by suffix start positions and edges labeled by substrings ending in $.]

Space: Θ(n)
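The linear-space baseline can be illustrated with a small sketch. It does not build a generalized suffix tree: as a stand-in it sorts the suffixes of T1 # T2 $ naively (a suffix array built in roughly quadratic time) and takes the longest common prefix of neighbouring suffixes that come from different strings. The function name and the sentinel characters '#' and '$' are assumptions made for this illustration.

```python
def longest_common_substring(t1: str, t2: str) -> str:
    """Longest common substring of t1 and t2 via a naively sorted suffix array.

    Assumes the sentinels '#' and '$' occur in neither input string.
    """
    text = t1 + "#" + t2 + "$"
    boundary = len(t1)  # index of '#'; suffixes starting before it belong to t1
    # Naive construction: sort suffix start indices by the suffix itself.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    best_len, best_start = 0, 0
    for a, b in zip(order, order[1:]):
        # Only neighbours from different strings witness a common substring.
        if (a < boundary) == (b < boundary):
            continue
        k = 0
        # The unique sentinels guarantee a mismatch before the text ends.
        while text[a + k] == text[b + k]:
            k += 1
        if k > best_len:
            best_len, best_start = k, a
    return text[best_start:best_start + best_len]


print(longest_common_substring("aggctagctacct", "acacctaccctag"))
```

On the example strings this prints the d = 2 answer from the definition slide. Like the suffix tree, the index covers all n suffixes, so the space is linear in n.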

Our Results

Question
Can the LCS problem be solved (deterministically) in \(O(n^{1-\varepsilon})\) space and \(O(n^{1+\varepsilon})\) time for \(0 \le \varepsilon \le 1\)?

Our Answer
Yes if \(0 \le \varepsilon \le 1/3\). More precisely,

For two strings (d = m = 2), the problem can be solved in:

Time: \(O(n^{1+\varepsilon})\)   Space: \(O(n^{1-\varepsilon})\)   for any \(0 < \varepsilon \le 1/3\).

In the general case (2 ≤ d ≤ m), the problem can be solved in:

Time: \(O(n^{1+\varepsilon} \log^2 n \,(d \log^2 n + d^2))\)   Space: \(O(n^{1-\varepsilon})\)   for any \(0 \le \varepsilon < 1/3\).

A Solution for Two Strings
When the LCS is long

Idea: Preprocess a sparse sample of the n suffixes for LCP queries.

T = a g g c t a g c t a c c t $1 a c a c c t a c c c t a g $2   (positions 1 to 28; $1 is at position 14, $2 at position 28)

[Figure: the sampled positions of T are marked DCτ.]

Difference Covers
A difference cover modulo τ is a set of integers DCτ ⊆ {0, 1, ..., τ − 1} such that for any distance d ∈ {0, 1, ..., τ − 1}, DCτ contains two elements separated by distance d modulo τ.
Ex: The set DCτ = {1, 2, 4} is a difference cover modulo 5.

d      0     1     2     3     4
i, j   1,1   2,1   1,4   4,1   1,2

- Number of sampled suffixes: \(O\!\left(\frac{n}{\tau}\,|DC_\tau|\right) = O\!\left(\frac{n}{\sqrt{\tau}}\right)\).
- The LCS is the LCP of two suffixes.
- If |LCS| ≥ τ, one of the first τ characters of the LCS is sampled in both strings.
- Hence the LCS corresponds to a pair \((p_1^*, p_2^*)\) maximizing
  \(\mathrm{lcp}(RB(p_1), RB(p_2)) + \mathrm{lcp}(T[p_1..], T[p_2..]) - 1\)

Example: RB(11) = (g c t a c)^R = c a t c g
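The difference cover property and the pair-maximization formula can be checked on the running example with a brute-force sketch. It assumes that RB(p) is the reversed block of at most τ characters ending at position p (which matches RB(11) above), that positions are 1-based, and that a position p is sampled when p mod τ lies in DCτ. All function names are hypothetical, '#' and '$' play the roles of $1 and $2, and every sampled pair is tried, so this is the slow baseline rather than the algorithm from the talk.

```python
from itertools import product


def is_difference_cover(dc, tau):
    """True if every residue 0..tau-1 is a difference of two elements of dc."""
    diffs = {(i - j) % tau for i, j in product(dc, repeat=2)}
    return diffs == set(range(tau))


def common_shift(i, j, dc, tau):
    """Smallest s in [0, tau) such that i+s and j+s are both sampled."""
    for s in range(tau):
        if (i + s) % tau in dc and (j + s) % tau in dc:
            return s
    raise ValueError("dc is not a difference cover modulo tau")


def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    k = 0
    for x, y in zip(a, b):
        if x != y:
            break
        k += 1
    return k


def lcs_from_samples(t1, t2, dc, tau):
    """Brute force over sampled pairs; exact whenever |LCS| >= tau."""
    t = t1 + "#" + t2 + "$"

    def rb(p):
        # Reversed block of at most tau characters ending at 1-based position p.
        return t[max(0, p - tau):p][::-1]

    first2 = len(t1) + 2  # 1-based position of the first character of t2
    s1 = [p for p in range(1, len(t1) + 1) if p % tau in dc]
    s2 = [p for p in range(first2, first2 + len(t2)) if p % tau in dc]
    best = ""
    for p1, p2 in product(s1, s2):
        back = lcp(rb(p1), rb(p2))
        fwd = lcp(t[p1 - 1:], t[p2 - 1:])
        length = back + fwd - 1  # position p itself is counted twice
        if length > len(best):
            start = p1 - back  # 0-based start of the match inside t
            best = t[start:start + length]
    return best


print(is_difference_cover({1, 2, 4}, 5))
print(lcs_from_samples("aggctagctacct", "acacctaccctag", {1, 2, 4}, 5))
```

In the example the LCS c t a c c starts at position 8 of T and at position 19 of T; `common_shift(8, 19, {1, 2, 4}, 5)` finds the shift 3, so the pair (11, 22) is sampled in both strings and attains the maximum.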

A Solution for Two Strings
When the LCS is long

How to compute the pair \((p_1^*, p_2^*)\) faster than \(O(n^2/\tau)\)?

T = a g g c t a g c t a c c t $1 a c a c c t a c c c t a g $2   (positions 1 to 28; $1 is at position 14, $2 at position 28)

SAτ     = [14, 21, 17, 26, 6, 1, 16, 22, 11, 12, 19, 24, 4, 27, 7, 2, 9]
LCPτ    = [0, 3, 1, 2, 2, 0, 1, 2, 1, 2, 3, 4, 0, 1, 1, 0]
SA^R_τ  = [14, 1, 17, 21, 26, 6, 16, 22, 11, 19, 12, 24, 4, 2, 27, 7, 9]
LCP^R_τ = [0, 1, 1, 4, 3, 0, 2, 4, 1, 3, 2, 1, 0, 2, 4, 0]

Main observation: \(\mathrm{lcp}(T[p_1^*..], T[p_2^*..]) \in [\ell_{\max} - \tau + 1;\ \ell_{\max}]\), so we can ignore all pairs with lcp values smaller than \(\ell_{\max} - \tau + 1\).
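The four arrays on this slide can be reproduced with a few lines of code, which also makes their definitions concrete. The sketch below assumes τ = 5 and DCτ = {1, 2, 4}, samples every 1-based position p with p mod τ in DCτ, and sorts naively. The sparse suffix array orders the sampled suffixes, while the reversed array orders the reversed blocks of at most τ characters ending at each sampled position. The names are hypothetical, and '#' and '$' stand for $1 and $2 (both sort before the letters, with '#' first).

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    k = 0
    for x, y in zip(a, b):
        if x != y:
            break
        k += 1
    return k


def sparse_arrays(t, dc, tau):
    """Return (SA, LCP, SA^R, LCP^R) restricted to the sampled positions of t."""
    samples = [p for p in range(1, len(t) + 1) if p % tau in dc]

    def suffix(p):
        return t[p - 1:]

    def rblock(p):
        # Reversed block of at most tau characters ending at position p.
        return t[max(0, p - tau):p][::-1]

    sa = sorted(samples, key=suffix)
    sar = sorted(samples, key=rblock)
    # LCP arrays hold the values of neighbouring entries in sorted order.
    lcp_sa = [lcp(suffix(a), suffix(b)) for a, b in zip(sa, sa[1:])]
    lcp_sar = [lcp(rblock(a), rblock(b)) for a, b in zip(sar, sar[1:])]
    return sa, lcp_sa, sar, lcp_sar


T = "aggctagctacct#acacctaccctag$"
for name, arr in zip(("SA", "LCP", "SA^R", "LCP^R"), sparse_arrays(T, {1, 2, 4}, 5)):
    print(name, arr)
```

Seventeen of the 28 positions are sampled here; in general the arrays have \(O(n/\sqrt{\tau})\) entries.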

A Solution for Two Strings
When the LCS is long

How to compute the pair \((p_1^*, p_2^*)\) faster than \(O(n^2/\tau)\)?

[Figure: T, SAτ and LCPτ as on the previous slide, with an interval annotated O(τ).]

Analysis (sketch): \(O(\tau)\) rounds, each using \(O(n/\sqrt{\tau})\) time and space:

Time: \(O(n\sqrt{\tau})\)   Space: \(O(n/\sqrt{\tau})\)

With \(\tau = n^{2\varepsilon}\):

Time: \(O(n^{1+\varepsilon})\)   Space: \(O(n^{1-\varepsilon})\)   for \(0 < \varepsilon \le 1/2\).
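The step from the τ-based bounds to the ε-based bounds is a direct substitution, written out here for reference. The numeric instance is only an illustration that ignores constant factors.

```latex
\[
\tau = n^{2\varepsilon}
\;\Longrightarrow\;
\sqrt{\tau} = n^{\varepsilon},
\qquad
n\sqrt{\tau} = n^{1+\varepsilon},
\qquad
\frac{n}{\sqrt{\tau}} = n^{1-\varepsilon}.
\]
\[
\text{For example, } n = 10^{6},\ \varepsilon = \tfrac{1}{3}:
\quad
\tau = 10^{4},
\quad
n\sqrt{\tau} = 10^{8},
\quad
\frac{n}{\sqrt{\tau}} = 10^{4}.
\]
```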

A Solution for Two Strings
When the LCS is shorter than τ

[Figure: T1 with a batch Si marked.]

- The LCS is a substring of one of the strings of length 2τ.
- Build the generalized suffix tree for a batch Si of strings of total length \(O(n/\sqrt{\tau})\).
- Traverse the suffix tree with T2 in O(n) time to find the node of greatest string depth.
- Repeat for all \(O(\sqrt{\tau})\) batches.

Time: \(O(n\sqrt{\tau})\)   Space: \(O(n/\sqrt{\tau})\)

With \(\tau = n^{2\varepsilon}\):

Time: \(O(n^{1+\varepsilon})\)   Space: \(O(n^{1-\varepsilon})\)   for \(0 \le \varepsilon \le 1/3\).

The bound on ε comes from the condition \(\tau = O(n/\sqrt{\tau})\), which is equivalent to \(\tau = O(n^{2/3})\).
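The batching idea for the short case can be sketched as follows. The sketch assumes that the "strings of length 2τ" are overlapping windows of T1 starting at multiples of τ, so that every substring of T1 of length at most τ lies inside one window. In place of the generalized suffix tree it uses a plain Python set of all window substrings of length at most τ, which is simple but does not meet the space bound of the talk. The function name and the `batch_size` parameter are hypothetical.

```python
def short_lcs(t1: str, t2: str, tau: int, batch_size: int) -> str:
    """Longest common substring of t1 and t2, assuming it is shorter than tau."""
    # Overlapping windows of length 2*tau, one starting at every multiple of tau.
    windows = [t1[i:i + 2 * tau] for i in range(0, len(t1), tau)]
    best = ""
    for b in range(0, len(windows), batch_size):
        batch = windows[b:b + batch_size]
        # Stand-in for the suffix tree of the batch: all substrings of length <= tau.
        index = {
            w[i:j]
            for w in batch
            for i in range(len(w))
            for j in range(i + 1, min(len(w), i + tau) + 1)
        }
        # Scan t2; the index is prefix-closed, so greedy extension is safe.
        for i in range(len(t2)):
            k = 0
            while i + k < len(t2) and k < tau and t2[i:i + k + 1] in index:
                k += 1
            if k > len(best):
                best = t2[i:i + k]
    return best


print(short_lcs("aggctagctacct", "acacctaccctag", tau=6, batch_size=2))
```

Only one batch index is alive at a time, which is the point of processing T1 in batches: the working space depends on the batch size rather than on the whole of T1.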

A General Solution for m Strings
When the LCS is long

Challenge: The difference cover property only holds for pairs.

[Figure: T = T1 T2 T3 T4 with three occurrences of the LCS marked, a window of length τ, and the values 5 4 1 and 5 3 2.]

Algorithm: Extract the maximum head until we have d − 1 distinct strings. Repeat everything for all n possible positions of the LCS.
Computing the next element in a list can be done in \(O(\log n\,(\log^2 n + d))\). Extracting it costs \(O(\sqrt{\tau})\). At most \(O(d\sqrt{\tau})\) extractions.

Time: \(O(n\sqrt{\tau}\,\log^2 n\,(\log^2 n + d))\)   Space: \(O(n/\sqrt{\tau})\)

Conclusion

Results
For two strings (d = m = 2), the LCS problem can be solved in:

Time: \(O(n^{1+\varepsilon})\)   Space: \(O(n^{1-\varepsilon})\)   for any \(0 < \varepsilon \le 1/3\).

In the general case (2 ≤ d ≤ m), the LCS problem can be solved in:

Time: \(O(n^{1+\varepsilon} \log^2 n \,(d \log^2 n + d^2))\)   Space: \(O(n^{1-\varepsilon})\)   for any \(0 \le \varepsilon < 1/3\).

Open Problems
Can the generalized solution be improved? Can the trade-off interval of our solutions be extended to \(0 \le \varepsilon \le 1/2\)? Can the problem be solved in \(O(n^{1+\varepsilon})\) time and \(O(n^{1-\varepsilon})\) space for any \(0 \le \varepsilon \le 1\)?