On the Length of the Longest Common Subsequence Peter Rabinovitch

On the Length of the Longest Common Subsequence Seminar Presentation.pdf

Jul 06, 2020


On the Length of the Longest Common Subsequence

Peter Rabinovitch


Summary

● Consider two sequences of coin tosses and extract from them a longest common subsequence. It is known that, as the length n of the sequences increases, the ratio of the expected length of the longest common subsequence to n converges to a limit of about 0.81, but the exact value of the limit is not known.

● In this talk, we will survey some key results related to the problem, as well as look at several potential approaches to determining the limit.


A Simple Example

H T H H T

T T H T T


Applications

● DNA (alphabet size=4)
● Proteins (alphabet size=20)
● Computer security (alphabet size=256)

● And all these are more complicated, and more interesting, and more useful with more than two strings.


Formally

● Let $X=(X_1,X_2,\dots,X_n)$ and $Y=(Y_1,Y_2,\dots,Y_n)$ be two sequences of iid Bernoulli r.v.s
● $P[X_i=H]=P[X_i=T]=P[Y_i=H]=P[Y_i=T]=1/2$
● $L_n$ = length of a longest common subsequence of X and Y

We seek to understand the r.v. $L_n$, in particular $\lim_{n\to\infty} E[L_n]/n$


For small n, things can be calculated explicitly

● By explicit enumeration we find
– $E[L_2]=9/8$, $V[L_2]=23/64$
– $E[L_3]=29/16$, $V[L_3]=119/256$

● But it gets messy for larger n
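
The explicit enumeration can be reproduced by brute force. The following is a minimal sketch, not the author's code: the function names `lcs_length` and `exact_moments` are made up for illustration, and exact fractions are used so the results can be compared with hand calculations.

```python
from fractions import Fraction
from itertools import product


def lcs_length(x, y):
    """Length of a longest common subsequence, by the standard dynamic program."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def exact_moments(n):
    """Exact mean and variance of L_n over all 4**n equally likely pairs."""
    total = total_sq = 0
    for x in product("HT", repeat=n):
        for y in product("HT", repeat=n):
            length = lcs_length(x, y)
            total += length
            total_sq += length * length
    pairs = 4 ** n
    mean = Fraction(total, pairs)
    variance = Fraction(total_sq, pairs) - mean * mean
    return mean, variance


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        mean, variance = exact_moments(n)
        print(n, mean, variance)
```

The work grows like 4^n pairs times n^2 per pair, which is one concrete sense in which larger n gets messy.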


Properties

[Figure: E[L_n]/n and Sd[L_n]/n plotted against n, for n from 0 to 100]

● Appears monotonic, but not yet proved

[Figure: histogram of L_100 (frequency against value, values roughly 70 to 85)]

● Could be Gaussian

● $L_n$, as a function of the sequences X and Y, satisfies several symmetries
  • Globally switch H & T
  • Reverse both sequences
  • Etc.


An Algorithm

Write T T H T T along the horizontal axis and H T H H T along the vertical axis (read bottom to top), and mark every square where the two symbols agree. Label each marked square with the length of the longest common subsequence that ends with that match ("." marks a square where the symbols disagree):

        T  T  H  T  T
    T   1  2  .  3  3
    H   .  .  2  .  .
    H   .  .  2  .  .
    T   1  1  .  2  2
    H   .  .  1  .  .

The largest label is 3, so a longest common subsequence of the two sequences has length 3.
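
The grid on these slides corresponds to the usual dynamic program for LCS. Below is a minimal sketch of that standard algorithm (the function names are illustrative and not taken from the talk), applied to the two example sequences HTHHT and TTHTT.

```python
def lcs_table(x, y):
    """table[i][j] = LCS length of the prefixes x[:i] and y[:j]."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, start=1):
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def longest_common_subsequence(x, y):
    """Return one longest common subsequence by walking the table backwards."""
    table = lcs_table(x, y)
    i, j, out = len(x), len(y), []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(lcs_table("HTHHT", "TTHTT")[-1][-1])
    print(longest_common_subsequence("HTHHT", "TTHTT"))
```

The table has (n+1)^2 entries, so the running time is quadratic in the sequence length.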


Subadditivity, etc.

● A sequence $\{a_n\}$ is subadditive if $a_{m+n}\le a_m+a_n$ for all positive integers m & n
● A sequence $\{a_n\}$ is superadditive if $\{-a_n\}$ is subadditive
● Fekete's lemma: if $\{a_n\}$ is subadditive then

$$\lim_{n\to\infty}\frac{a_n}{n}=\inf_n\frac{a_n}{n}=\gamma, \quad\text{where } -\infty\le\gamma<\infty$$


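Fekete's Lemma (γ>-∞)

● For any ε>0 we can find a k s.t. $a_k\le(\gamma+\varepsilon)k$ because $\gamma=\inf_n a_n/n$.

● Any m>0 can be written m=nk+j for the same k with 0≤j<k, so it follows that

$$a_m=a_{nk+j}\le a_{nk}+a_j\le n a_k+a_j\le n(\gamma+\varepsilon)k+\max_{0\le l<k}a_l$$

so

$$\limsup_m\frac{a_m}{m}\le\limsup_m\frac{n(\gamma+\varepsilon)k}{m}+\limsup_m\frac{\max_{0\le l<k}a_l}{m}$$

and then, since $nk/m\to 1$ and the maximum is a fixed finite number, $\limsup_m a_m/m\le\gamma+\varepsilon$

● Also $\gamma+\varepsilon\le\liminf_m a_m/m+\varepsilon$

● So $\limsup_m a_m/m\le\gamma+\varepsilon\le\liminf_m a_m/m+\varepsilon$

● As ε>0 was arbitrary, we have $\lim_n a_n/n=\inf_n a_n/n=\gamma$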
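
A small numerical illustration of the lemma may help. The sequence a_n = n + sqrt(n) used here is an assumption chosen only for illustration (it is not from the talk): it is subadditive because sqrt(m+n) ≤ sqrt(m) + sqrt(n), and a_n/n = 1 + 1/sqrt(n) decreases towards its infimum 1.

```python
import math


def a(n):
    """An illustrative subadditive sequence: a_n = n + sqrt(n)."""
    return n + math.sqrt(n)


def is_subadditive(seq, limit):
    """Check seq(m + n) <= seq(m) + seq(n) for all 1 <= m, n <= limit."""
    return all(
        seq(m + n) <= seq(m) + seq(n)
        for m in range(1, limit + 1)
        for n in range(1, limit + 1)
    )


if __name__ == "__main__":
    print(is_subadditive(a, 100))
    for n in (1, 10, 100, 10_000, 1_000_000):
        # The ratio a_n / n approaches inf_n a_n / n = 1 from above.
        print(n, a(n) / n)
```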


Existence of the Limit

● $a_n=E[L_n]$ is superadditive (by concatenation)

● So applying Fekete's lemma to $-a_n$ shows that the limit of $E[L_n]/n$ exists (Chvatal & Sankoff, 1975)

● Deken (1979) shows that $L_n/n$ converges a.s.
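
The convergence can be watched numerically. This is a rough Monte Carlo sketch under stated assumptions: the helper names, the seed, and the sample sizes are arbitrary choices for illustration, and the estimates carry sampling error.

```python
import random


def lcs_length(x, y):
    """Length of a longest common subsequence (row-by-row dynamic program)."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def estimate_ratio(n, trials, seed=0):
    """Monte Carlo estimate of E[L_n] / n for two fair coin-toss sequences."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = [rng.getrandbits(1) for _ in range(n)]
        y = [rng.getrandbits(1) for _ in range(n)]
        total += lcs_length(x, y)
    return total / (trials * n)


if __name__ == "__main__":
    for n in (10, 50, 100, 200):
        print(n, round(estimate_ratio(n, trials=50), 4))
```

Each trial costs on the order of n^2 operations, so this approach only reaches modest n; it illustrates the trend rather than pinning down the limit.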


Other Results

● Arratia & Steele conjecture that c=2√2-2~0.8284
● Alexander (1994) proves a rate of convergence using methods of percolation
● Steele (1997?) proves a concentration of measure result using the Azuma-Hoeffding inequality
● Bundschuh (2001) shows that c~0.812653 using simulation, demonstrating that A&S were wrong
● Lueker (2005) bounds 0.788071≤c≤0.826280


A Heuristic (1)

● What is the longest run of heads you will see in n tosses of a fair coin?
● The probability of a length m run is $p^m$ (with p=1/2 for a fair coin), and there are approximately n places where this run could start, so E[# of length m head runs] ≈ $np^m$
● If the longest one is unique, then $1=np^m$, so the length of the longest head run is about $\log_{1/p}n$
● Note: this can be made precise, e.g. Durrett's book has a proof.
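
The heuristic is easy to test by simulation. The sketch below is illustrative only (function names, seed and trial counts are assumptions): it averages the longest head run over many simulated sequences and compares it with log base 1/p of n.

```python
import math
import random


def longest_head_run(tosses):
    """Length of the longest run of consecutive True (heads) values."""
    best = current = 0
    for is_head in tosses:
        current = current + 1 if is_head else 0
        best = max(best, current)
    return best


def average_longest_run(n, trials, p=0.5, seed=0):
    """Average longest head run over `trials` simulated sequences of n tosses."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += longest_head_run(rng.random() < p for _ in range(n))
    return total / trials


if __name__ == "__main__":
    for n in (64, 1024, 16384):
        print(n, average_longest_run(n, trials=200), math.log(n, 2))
```

The simulated average and the heuristic value should differ only by a bounded amount as n grows, which is the sense in which the heuristic gives the right order.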


A Heuristic (2)

● What is the largest red square in an n by n grid where each square is coloured red or black by flipping a coin?


A Heuristic (3) Arratia & Steele's Conjecture

● Call any pair of subsequences of length k where the $X_i$ and $Y_i$ agree a 'good k pair'
● Let Z be the total number of good k pairs of the two length n strings.
● Then $E[Z]=\binom{n}{k}^2/2^k$ because there are $\binom{n}{k}$ ways to choose each of the subsequences, which have to agree in k places.
● The mode of this sequence (as k varies) is approximately $n/(1+\sqrt{2})$
● Since every length k common subsequence yields a good k pair, there are at least $\binom{L_n}{k}$ such good k pairs. This sequence has mode $L_n/2$.
● Now equate the two, $L_n/2=n/(1+\sqrt{2})$, to get $L_n/n\sim 2/(1+\sqrt{2})=2\sqrt{2}-2$
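
The two modes in this heuristic can be checked numerically. This is a small sketch with illustrative function names; it uses exact integer arithmetic (scaling E[Z] by 2^n, which does not move the maximiser) to locate the mode.

```python
import math


def mode_of_expected_good_pairs(n):
    """k maximising C(n, k)^2 / 2^k.

    Every term is multiplied by 2**n so that it is an exact integer;
    this rescaling leaves the argmax unchanged.
    """
    return max(range(n + 1), key=lambda k: math.comb(n, k) ** 2 * 2 ** (n - k))


def mode_of_binomial(length):
    """k maximising C(length, k)."""
    return max(range(length + 1), key=lambda k: math.comb(length, k))


if __name__ == "__main__":
    n = 1000
    print(mode_of_expected_good_pairs(n), n / (1 + math.sqrt(2)))
    print(mode_of_binomial(800), 800 / 2)
    # Equating L/2 with n/(1 + sqrt(2)) gives the conjectured ratio.
    print(2 / (1 + math.sqrt(2)), 2 * math.sqrt(2) - 2)
```

This only confirms the arithmetic of the heuristic; as the later slides note, simulation suggests the resulting value is not the true limit.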


Solution Methods

● We'll focus on two
– Patience sorting
  ● Which has connections to the symmetric group, Young tableaux, the Tracy Widom distribution (see Aldous & Diaconis' AMS paper “Longest Increasing Subsequences: From Patience Sorting to the Baik-Deift-Johansson Theorem”)
– Directed last passage percolation in a disordered medium
  ● Which has connections to percolation (see Grimmett's book) as well as (in a suitably relaxed version of the problem) the Tracy Widom distribution


Aside on the Tracy Widom Distribution ('94)

● Arises in many new places
– LIS of a uniform random permutation
– Largest eigenvalue of a random matrix in the Gaussian Unitary Ensemble (GUE), i.e. complex Hermitian matrices
– Growth models in the plane (one of our “Other Related Models”, later)

$$F(s)=\exp\left(-\int_s^\infty (x-s)\,q^2(x)\,dx\right)$$

$$q''(s)=s\,q(s)+2q^3(s)$$


Patience Sorting

● Example sequence: (3,5,2,1,7,8,9,4,6)

● Put the next element at the bottom of the first (leftmost) column whose bottom element is greater than or equal to it.

● If no such column exists, start a new column to the right.

Columns after each element is placed (each column is listed top to bottom):

    after 3:  [3]
    after 5:  [3] [5]
    after 2:  [3,2] [5]
    after 1:  [3,2,1] [5]
    after 7:  [3,2,1] [5] [7]
    after 8:  [3,2,1] [5] [7] [8]
    after 9:  [3,2,1] [5] [7] [8] [9]
    after 4:  [3,2,1] [5,4] [7] [8] [9]
    after 6:  [3,2,1] [5,4] [7,6] [8] [9]

Thus we see that a LIS is of length 5, the number of columns.
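
The dealing rule can be sketched directly in code. The function name `patience_columns` is illustrative; the rule implemented is the one stated on the slide (place each element on the first column whose bottom element is at least as large, otherwise start a new column).

```python
def patience_columns(sequence):
    """Deal the sequence into columns using the rule on the slide."""
    columns = []
    for value in sequence:
        for column in columns:
            if value <= column[-1]:  # first column whose bottom element is >= value
                column.append(value)
                break
        else:
            # No existing column can take the value: start a new one.
            columns.append([value])
    return columns


if __name__ == "__main__":
    columns = patience_columns([3, 5, 2, 1, 7, 8, 9, 4, 6])
    print(columns)
    print(len(columns))  # number of columns = length of a longest increasing subsequence
```

This version scans the columns linearly; because the bottom elements are always in increasing order from left to right, a binary search could replace the inner loop.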


Patience Sorting Applied to LCS

● Let X=(HTHT), Y=(THHT)
● Form $y_T=(0,3)$ and $y_H=(1,2)$, the positions of T and of H in Y
● Reverse them: $y_T=(3,0)$ and $y_H=(2,1)$
● Replace the ith element of X with $y_T$ or $y_H$ depending on the value of $X_i$. Call this list z. z=(2,1,3,0,2,1,3,0), and do patience sorting on z.

    2 3 3
    1 2
    0 1
    0

There are three columns, so the LCS of X and Y has length 3.
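
The reduction from LCS to a longest strictly increasing subsequence can be sketched as follows. The function names are illustrative, and the binary-search form of patience sorting is a standard variant swapped in for brevity (it tracks only the bottom element of each column), not necessarily how the talk's simulations were done.

```python
from bisect import bisect_left


def lcs_to_sequence(x, y):
    """Build z: for each symbol of x, the positions of that symbol in y, in decreasing order."""
    positions = {}
    for index, symbol in enumerate(y):
        positions.setdefault(symbol, []).append(index)
    z = []
    for symbol in x:
        z.extend(reversed(positions.get(symbol, [])))
    return z


def strict_lis_length(z):
    """Number of patience-sorting columns, i.e. the length of a longest
    strictly increasing subsequence of z."""
    bottoms = []  # bottoms[c] = bottom element of column c; kept strictly increasing
    for value in z:
        c = bisect_left(bottoms, value)  # first column whose bottom is >= value
        if c == len(bottoms):
            bottoms.append(value)
        else:
            bottoms[c] = value
    return len(bottoms)


if __name__ == "__main__":
    z = lcs_to_sequence("HTHT", "THHT")
    print(z)
    print(strict_lis_length(z))
```

Listing each symbol's positions in decreasing order is what prevents a strictly increasing subsequence of z from using two positions of Y for the same element of X.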


Patience Sorting Applied to LCS

● So why is this interesting?
– LIS has been solved.
● Why isn't this a solution?
– In the LIS case, the distribution is uniform over all possible permutations
– In the LCS case, we don't have permutations, but rather words (i.e. repeated elements)
  ● The work on LIS has been largely extended to the random word case
– In the LCS case, the distribution is NOT uniform – there are forbidden words, etc.


Patience Sorting Applied to LCS

● But...
● This seems likely to be true
● This is unknown, simulations are slooooooowwwww...


Percolation

● Percolation is a huge area of probability
– See, for example, books by Grimmett, as well as Bollobas


Directed Last Passage Percolation

● At each vertex, there is a passage time (or weight)
– typically iid exponential or geometric rvs
● There is a set of allowed paths
– typically up-right, or strictly up-right
● The question is what is the maximum time (or weight) path from the origin to (x,y)


Directed Last Passage Percolation

[Figure: a grid of squares, each with weight 1 (green) or -∞ (red); the highlighted path has last passage time = 4]

● Strictly up-right paths
● Weights chosen by flipping a coin on each square
  ● H->Green, weight = 1
  ● T->red, weight = -∞


Directed Last Passage Percolation

● Strictly up-right paths
● Weights chosen by flipping a coin on each axis
  ● Coordinate flips agree->Green, weight = 1
  ● Coordinate flips disagree->red, weight = -∞

[Figure: the grid for T T H T T (horizontal axis) and T H H T H (vertical axis, listed top to bottom); the 11 squares where the coordinate flips agree have weight 1 and the other 14 have weight -∞]

Last passage time = 3

This is LCS
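
The identification of LCS with a last passage time can be sketched in code. The names `match_weights` and `last_passage_time` are illustrative, and the sketch assumes "strictly up-right" means both coordinates increase at every step and that the empty path has weight 0.

```python
NEG_INF = float("-inf")


def match_weights(x, y):
    """Weight 1 where the coordinate flips agree, -inf where they disagree."""
    return [[1 if xi == yj else NEG_INF for yj in y] for xi in x]


def last_passage_time(weights):
    """Maximum total weight of a strictly up-right path through the grid."""
    rows = len(weights)
    cols = len(weights[0]) if weights else 0
    # best[i][j] = best weight of a path using only squares (i', j')
    # with i' < i and j' < j; the empty path contributes 0.
    best = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            ending_here = best[i][j] + weights[i][j]
            best[i + 1][j + 1] = max(best[i][j + 1], best[i + 1][j], ending_here)
    return best[rows][cols]


if __name__ == "__main__":
    grid = match_weights("HTHHT", "TTHTT")
    print(last_passage_time(grid))
```

The same `last_passage_time` function also accepts a grid whose squares were coloured by independent coin flips, which is the iid setting of the previous slide.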


Directed Last Passage Percolation

[Figure sequence (slides 40 to 45): grids of squares labelled 0, 1, 2 and 3, built up from slide to slide; the layout is not recoverable from the extracted text]


Why doesn't it work?

● Most (all?) the DLPP results are for iid (in fact geometric and exponential) weights. In the LCS case, the weights are related.

● So apply techniques from statistical mechanics of disordered systems, i.e. spin glasses?


Other Related Models

● Bernoulli Model
– Seppalainen's result
  ● $\lim E[L_n]/n=2\sqrt{2}-2$
– Majumdar's result
  ● $L_n\approx(2\sqrt{2}-2)n+2^{1/6}(\sqrt{2}-1)^{4/3}n^{1/3}\,TW$
● Large alphabet
– Kiwi et al: $\gamma_k\sqrt{k}\to 2$ as $k\to\infty$
● Many strings
– Not much other than Dancik showing that the limit exists (same proof as in 2 string case)
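
The Bernoulli model can be simulated in the same way as LCS. This sketch assumes that "Bernoulli Model" means every square of the n by n grid is a match independently with probability 1/2 (the iid analogue of the LCS grid); the function names, seed and sample sizes are illustrative.

```python
import math
import random


def bernoulli_matching_length(n, rng, p=0.5):
    """Longest strictly up-right chain of matches when each of the n*n
    squares is a match independently with probability p."""
    prev = [0] * (n + 1)
    for _ in range(n):
        cur = [0]
        for j in range(1, n + 1):
            if rng.random() < p:
                cur.append(max(prev[j - 1] + 1, prev[j], cur[j - 1]))
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def estimate_bernoulli_ratio(n, trials, seed=0):
    """Monte Carlo estimate of E[L_n] / n in the Bernoulli matching model."""
    rng = random.Random(seed)
    total = sum(bernoulli_matching_length(n, rng) for _ in range(trials))
    return total / (trials * n)


if __name__ == "__main__":
    print(estimate_bernoulli_ratio(200, trials=10))
    print(2 * math.sqrt(2) - 2)  # the limit quoted for this model
```

For finite n the estimate is expected to sit somewhat below the limiting value, in line with the n^(1/3) correction term quoted above.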


“We have lots of bricks, but we don't know what the building looks like yet.”