Non-breaking Similarity of Genomes with Gene Repetitions

04/19/231


Binhai Zhu

Computer Science Department, Montana State University

Joint work with Zhixiang Chen, Bin Fu, Jinhui Xu, Boting Yang and Zhiyu Zhao


Background

Computing the genomic distance between genomes is important in evolutionary molecular biology. The problem was first studied by Sturtevant and Dobzhansky in 1936.

A lot of research has been done on computing genomic distances since 1990, assuming that each gene appears in a genome once, e.g., the famous result by Hannenhalli and Pevzner on sorting signed permutations by reversals.


Background (cont.)

On the other hand, gene repetition is very common in genomes. So computing genomic distances with gene repetition is a more realistic problem.

This is a typical optimization problem, so it makes sense to study its approximability.


Definitions

Given a set F of n gene families (an alphabet), a genome G’ is a sequence of elements of F such that each element has a (+/-) sign.

Example. F={a,b,c,d},

G’=-bd-cab-d-c

We will focus on unsigned sequences in this work.

A genome G is said to be exemplar if every gene appears exactly once in G.


Definitions (cont.)

Given exemplar genomes G and H over the same set of gene families, if the gene pair ab is a substring of G but not of H, then ab constitutes a breakpoint in G.

Example. G=abcdefg

H=efgdcab

There are 3 breakpoints in G (bc, cd and de), and symmetrically 3 in H.

The number of breakpoints between G and H is called the breakpoint distance between G and H.
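The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `adjacencies` and `breakpoint_distance` are hypothetical, and adjacent pairs are treated as ordered, which is what the example with 3 breakpoints implies.

```python
def adjacencies(genome):
    """Return the set of ordered adjacent gene pairs of an exemplar genome."""
    return {(genome[i], genome[i + 1]) for i in range(len(genome) - 1)}


def breakpoint_distance(g, h):
    """Count adjacent pairs of g that are not adjacent pairs of h.

    Assumes g and h are exemplar genomes over the same gene families,
    so each adjacent pair occurs at most once per genome.
    """
    return len(adjacencies(g) - adjacencies(h))


print(breakpoint_distance("abcdefg", "efgdcab"))  # 3
```

Swapping the arguments gives the same count on this example, matching the remark that the breakpoints appear symmetrically in H.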


Exemplar Breakpoint Distance Problem

• Given two genomes G’ and H’ over n gene families, compute two exemplar genomes G and H such that the breakpoint distance between G and H is minimized.

• We call this the exemplar breakpoint distance problem (between G’ and H’). Denote this distance by eb(G’,H’)=b(G,H).


Approximation Algorithms

• Given a minimization (maximization) problem Π, let OPT be the value of an optimal solution of Π. An approximation algorithm A provides a performance guarantee of α for Π if, for every instance of Π, the solution value returned by A is at most α × OPT (at least OPT/α).

• Usually we say that A is a factor-α approximation for Π.


Prior Results (1)

• We showed that the exemplar breakpoint distance problem does not admit any approximation unless P=NP (that is, deciding whether eb(G’,H’)=0 is NP-complete) [Chen, Fu and Zhu, 2006].

• This result holds for any genomic distance d( ) satisfying G=H implies d(G,H)=0.

• Based on the above result, even under a weaker model of approximation, we showed that the exemplar conserved interval distance problem does not admit any WEAK approximation of a superlinear factor [Chen, Fowler, Fu and Zhu, 2007].


Prior Results (2)

• On the other hand, for the exemplar breakpoint distance problem, Sankoff [1999] used branch-and-bound and Nguyen, Tay and Zhang [2005] used divide-and-conquer on practical datasets to obtain good empirical results.

• In a related but slightly different effort, Chauve, et al. [2006] studied exemplar genomic similarity problems that do not satisfy the condition that G=H implies d(G,H)=0, e.g., the exemplar common interval measure problem.


Background for this work

• We look at the complement of the breakpoint distance under the gene duplication model.

• As the problem is still hard to approximate, we follow Nguyen, et al. by considering genomes satisfying some practical conditions.


Definitions

• Given exemplar genomes G and H drawn from the same alphabet, ab is a non-breaking point if ab appears in both G and H.

Example. G = abcdefg

H = fegcdab

G and H share two non-breaking points (ab and cd). The number of non-breaking points is called the non-breaking similarity of G and H, denoted nbs(G,H).

Note that when |G|=|H|=n, if G=H, nbs(G,H)=n-1.

• Given genomes G’ and H’ drawn from the same alphabet, possibly with gene repetitions, the exemplar non-breaking similarity problem is to delete redundant genes to obtain exemplar genomes G and H such that nbs(G,H) is maximized. The maximum value is denoted enbs(G’,H’).


Example

G’ = abcadcefg

H’ = cfegcdabf

We have 4 possible exemplar genomes for G’: abcdefg, abdcefg, bcadefg, badcefg.

We have 4 possible exemplar genomes for H’: cfegdab, cegdabf, fegcdab, egcdabf.

enbs(G’,H’)=nbs(abcdefg,fegcdab)=2.
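The example can be checked by exhaustive search. The sketch below is only a brute-force illustration (exponential in the number of repeated genes), not the algorithm from this work. The helper names `nbs`, `exemplars` and `enbs` are chosen here for illustration, and adjacent pairs are treated as ordered, as in the definition of a non-breaking point.

```python
from itertools import product


def nbs(g, h):
    """Number of ordered adjacent pairs that occur in both exemplar genomes."""
    pairs_g = {(g[i], g[i + 1]) for i in range(len(g) - 1)}
    pairs_h = {(h[i], h[i + 1]) for i in range(len(h) - 1)}
    return len(pairs_g & pairs_h)


def exemplars(genome):
    """Yield every exemplar genome obtained by keeping one copy of each gene."""
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    # Pick one kept position per gene, then read the kept genes left to right.
    for choice in product(*positions.values()):
        yield tuple(genome[i] for i in sorted(choice))


def enbs(g_prime, h_prime):
    """Brute-force exemplar non-breaking similarity."""
    return max(nbs(g, h) for g in exemplars(g_prime) for h in exemplars(h_prime))


print(enbs("abcadcefg", "cfegcdabf"))  # 2
```

Each genome has two genes with two copies, so `exemplars` yields the 4 candidates listed above for each, and the search compares 16 pairs.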


Inapproximability Result

Theorem 1. Given an exemplar genome G and another genome H’ such that the genes are all from the same alphabet of size n and each gene appears in H’ at most twice, the Exemplar Non-breaking Similarity Problem over G and H’ does not admit any approximation of factor n^(1-ε), unless P=NP.

Proof Idea: A linear reduction from Independent Set (IS).


[Figure: example input graph with vertices v1, …, v5 and edges e1, …, e5]

G: v1 v’1 v2 v’2 v3 v’3 v4 v’4 v5 v’5 x1 e1 x’1 x2 e2 x’2 x3 e3 x’3 x4 e4 x’4 x5 e5 x’5

H’: Y_(N+M-1) Y_(N+M-3) … Y_1 Y_(N+M) Y_(N+M-2) … Y_2 =

x4 x’4 x2 x’2 v5 e4 e5 v’5 v3 e1 v’3 v1 e1 e2 v’1 x5 x’5 x3 x’3 x1 x’1 v4 e3 e5 v’4 v2 e2 e3 e4 v’2

Y_i = v_i A_i v’_i if i ≤ N, where A_i lists the edges incident to v_i; Y_(N+i) = x_i x’_i if i ≤ M.

H: x4 x’4 x2 x’2 v5 e5 v’5 v3 v’3 v1 e1 e2 v’1 x5 x’5 x3 x’3 x1 x’1 v4 v’4 v2 e3 e4 v’2 corresponds to the optimal independent set {v3,v4}.

The input graph has a maximum independent set of size K iff enbs(G,H’)=K.

Here N=5 vertices and M=5 edges, so N+M is even.
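The construction on this slide can be reproduced and checked by brute force. The sketch below is an illustration under stated assumptions, not the authors' code: the edge endpoints are inferred from the edge lists that appear between v_i and v’_i in H’, the function names are hypothetical, primes are written with an ASCII apostrophe, and N+M is assumed even as on the slide.

```python
from itertools import product


def build_instance(n, edges):
    """Build (G, H') for a graph on vertices 1..n; edge j is edges[j - 1]."""
    m = len(edges)
    assert (n + m) % 2 == 0  # assumption taken from the slide's example
    g = [name for i in range(1, n + 1) for name in (f"v{i}", f"v'{i}")]
    g += [name for j in range(1, m + 1) for name in (f"x{j}", f"e{j}", f"x'{j}")]

    def block(k):
        if k <= n:  # Y_k = v_k A_k v'_k, A_k = edges incident to v_k
            incident = [f"e{j}" for j, e in enumerate(edges, 1) if k in e]
            return [f"v{k}", *incident, f"v'{k}"]
        return [f"x{k - n}", f"x'{k - n}"]  # Y_(N+i) = x_i x'_i

    # Odd-indexed blocks in decreasing order, then even-indexed blocks.
    order = [*range(n + m - 1, 0, -2), *range(n + m, 0, -2)]
    return g, [gene for k in order for gene in block(k)]


def nbs(g, h):
    pairs = lambda s: {(s[i], s[i + 1]) for i in range(len(s) - 1)}
    return len(pairs(g) & pairs(h))


def exemplars(genome):
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    for choice in product(*positions.values()):
        yield [genome[i] for i in sorted(choice)]


def enbs(g_prime, h_prime):
    return max(nbs(g, h) for g in exemplars(g_prime) for h in exemplars(h_prime))


def max_independent_set(n, edges):
    """Size of a largest independent set, by trying all vertex subsets."""
    best = 0
    for mask in range(1 << n):
        chosen = {i + 1 for i in range(n) if mask >> i & 1}
        if not any(u in chosen and v in chosen for u, v in edges):
            best = max(best, len(chosen))
    return best


edges = [(1, 3), (1, 2), (2, 4), (2, 5), (4, 5)]  # e1..e5, inferred from H'
g, h_prime = build_instance(5, edges)
print(enbs(g, h_prime), max_independent_set(5, edges))  # 2 2
```

The only adjacent pairs G can share with an exemplar of H’ are v_i v’_i, and such a pair survives only when every edge copy between v_i and v’_i is deleted, which is why the count matches the independent set size on this instance.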


Positive Results

Our motivation comes from Nguyen, Tay and Zhang [2005], who observed that for certain bacterial genome pairs (Baphi-Wigg, Pmult-Hinft, Ecoli-Styphi, Xaxo-Xcamp and Ypes), repeated genes are usually pegged, e.g.,

…xyx…aba…


Positive Results

Definition:

occ(g,G’) is the number of occurrences of g in G’.

span(g,G’) is the maximum distance between two copies of g in G’.

totalocc(c,G’) = ∑ occ(g,G’), where the sum is over all genes g in G’ with span(g,G’) ≥ c.

Example. G’=abcdaebd

span(a,G’)=4, span(b,G’)=5, span(d,G’)=4,

totalocc(4,G’)=6
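These three quantities are straightforward to compute. The sketch below is a minimal illustration that reproduces the example values. It assumes (this is not stated on the slide) that a gene with a single copy has span 0, so that totalocc(1,G’) counts exactly the copies of duplicated genes.

```python
def occ(gene, genome):
    """Number of occurrences of gene in genome."""
    return sum(1 for x in genome if x == gene)


def span(gene, genome):
    """Largest distance between two copies of gene (0 for a single copy: assumption)."""
    idx = [i for i, x in enumerate(genome) if x == gene]
    return idx[-1] - idx[0] if idx else 0


def totalocc(c, genome):
    """Total number of copies of all genes whose span is at least c."""
    return sum(occ(g, genome) for g in set(genome) if span(g, genome) >= c)


g_prime = "abcdaebd"
print(span("a", g_prime), span("b", g_prime), span("d", g_prime))  # 4 5 4
print(totalocc(4, g_prime))  # 6
```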


Positive Results

Theorem 2. Let G’ and H’ be two genomes with t = totalocc(1,G’) + totalocc(c,H’), for a constant c. Then enbs(G’,H’) can be computed in O(3^⌊t/3⌋ · n^(c+2+ε)) time.




Idea 1: Given an exemplar genome G and another genome H” satisfying span(g,H”) ≤ c for every g in H”, we can use divide and conquer to compute enbs(G,H”) in O(n^(c+2+ε)) time.

Roughly speaking, write H” = H1 H2 H3 with |H2| = c, then enumerate all solutions on H2 and recurse.

T(n) ≤ 2^(c+1) [2T(n/2+c)] + O(n) ≤ O(n^(c+2+ε))
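A rough unrolling of the recurrence shows where the exponent c+2 comes from. This is only a back-of-the-envelope sketch: it drops the additive c inside T(n/2+c), and the assumption here is that the ε in the exponent is what absorbs that shift.

```latex
\begin{aligned}
T(n) &\le 2^{c+1}\bigl[\,2\,T(n/2 + c)\,\bigr] + O(n)
      = 2^{c+2}\,T(n/2 + c) + O(n) \\
     &\approx 2^{c+2}\,T(n/2) + O(n)
      \quad\text{(ignoring the additive } c\text{)} \\
     &= O\!\bigl((2^{c+2})^{\log_2 n}\bigr)
      = O\!\bigl(n^{c+2}\bigr)
\end{aligned}
```

The last step holds because the branching factor 2^(c+2) is at least 4, so the leaves of the recursion tree dominate the O(n) work per level.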




Idea 2: As t is considered a constant, we enumerate all possibilities for deleting duplicated genes in G’ (to obtain G) and for deleting genes with span greater than c in H’ (to obtain H”). By Lemma 6, there are at most 4 · 3^⌊t/3⌋ such combinations. Therefore, the total running time is 4 · 3^⌊t/3⌋ · O(n^(c+2+ε)) = O(3^⌊t/3⌋ · n^(c+2+ε)).
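Lemma 6 is not shown on these slides, but the bound can be sanity-checked numerically. The sketch below assumes the number of combinations is at most the product of the copy counts of the affected genes (keep one copy of each), where the copy counts sum to t. It computes the largest possible such product and compares it with 4 · 3^⌊t/3⌋. The function names are hypothetical, and this is a numeric check, not a proof.

```python
from math import prod


def count_choices(copy_counts):
    """Ways to keep exactly one copy of each gene, given its number of copies."""
    return prod(copy_counts)


def max_product(t):
    """Largest product of positive integers that sum to t (simple DP)."""
    best = [1] * (t + 1)
    for total in range(1, t + 1):
        best[total] = max(part * best[total - part] for part in range(1, total + 1))
    return best[t]


def bound(t):
    """The bound quoted from Lemma 6: 4 * 3^floor(t/3)."""
    return 4 * 3 ** (t // 3)


# G' = abcadcefg has two genes with two copies each, hence 4 exemplar genomes.
print(count_choices([2, 2]))  # 4
print(all(max_product(t) <= bound(t) for t in range(1, 40)))  # True
```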



Let G’ and H’ be two genomes with a total of t genes g satisfying shift(g,G’,H’) > c, for some constant c. Then enbs(G’,H’) can be computed in O(3^⌊t/3⌋ · n^(2c+1+ε)) time.

Example. G’=abcadef

H’=bcedefad

shift(a,G’,H’) = 6
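The slide does not define shift. One reading consistent with the example value is the largest difference between a position of g in G’ and a position of g in H’. The sketch below implements that assumed definition. The helper `genes_with_large_shift` is hypothetical, and both genomes are assumed to contain every gene.

```python
def shift(gene, g_prime, h_prime):
    """Assumed definition: max |i - j| over positions i of gene in G', j in H'."""
    pos_g = [i for i, x in enumerate(g_prime) if x == gene]
    pos_h = [j for j, x in enumerate(h_prime) if x == gene]
    return max(abs(i - j) for i in pos_g for j in pos_h)


def genes_with_large_shift(c, g_prime, h_prime):
    """Genes whose shift exceeds c; their number plays the role of t."""
    return sorted(g for g in set(g_prime) if shift(g, g_prime, h_prime) > c)


g_prime, h_prime = "abcadef", "bcedefad"
print(shift("a", g_prime, h_prime))  # 6
print(genes_with_large_shift(3, g_prime, h_prime))  # ['a']
```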


Conclusion

1. We introduce non-breaking similarity, which is the complement of the famous breakpoint distance, for genome comparison.

2. The general exemplar non-breaking similarity problem is hard to approximate.

3. For some special cases, we can obtain polynomial-time solutions.
