03/25/22 1 Non-breaking Similarity of Genomes with Gene Repetitions Binhai Zhu Computer Science Department, Montana State University Joint work with Zhixiang Chen, Bin Fu, Jinhui Xu, Boting Yang and Zhiyu Zhao
Dec 31, 2015
04/19/231
Background
Computing the genomic distance between genomes is important in evolutionary molecular biology. The problem was first studied by Sturtevant and Dobzhansky in 1936.
A lot of research has been done on computing genomic distances since 1990, assuming that each gene appears in a genome once, e.g., the famous result by Hannenhalli and Pevzner on sorting signed permutations by reversals.
Background (cont.)
On the other hand, gene repetition is very common in genomes, so computing genomic distances with gene repetition is a more realistic problem.
This is a typical optimization problem, so it makes sense to study its approximability.
Definitions
Given an alphabet F of n gene families, a genome G’ is a sequence of elements of F such that each element has a (+/-) sign.
Example. F={a,b,c,d},
G’=-bd-cab-d-c
We will focus on unsigned sequences in this work.
A genome G is said to be exemplar if every gene appears exactly once in G.
Definitions (cont.)
Given exemplar genomes G and H over the same set of gene families, if ab is a substring of G but not of H, then ab constitutes a breakpoint in G.
Example. G=abcdefg
H=efgdcab
There are 3 breakpoints in G (bc, cd and de) and, symmetrically, 3 in H.
The number of breakpoints between G and H is called the breakpoint distance between G and H.
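The definition above can be checked mechanically. The following is a minimal Python sketch, assuming genomes are unsigned exemplar strings with one character per gene; the function names `adjacencies` and `breakpoints` are illustrative and do not come from the slides.

```python
def adjacencies(genome):
    """Return the set of ordered adjacent gene pairs (substrings of length 2)."""
    return {genome[i:i + 2] for i in range(len(genome) - 1)}


def breakpoints(g, h):
    """Count the adjacencies of g that do not occur as a substring of h."""
    return len(adjacencies(g) - adjacencies(h))


print(breakpoints("abcdefg", "efgdcab"))  # 3: the adjacencies bc, cd and de
print(breakpoints("efgdcab", "abcdefg"))  # 3: the adjacencies gd, dc and ca
```

Because both exemplar genomes have the same number of adjacencies, the count is the same in either direction, which is the symmetry noted in the example.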
Exemplar Breakpoint Distance Problem
• Given two genomes G’ and H’ over n gene families, compute two exemplar genomes G and H such that the breakpoint distance between G and H is minimized.
• We call this the exemplar breakpoint distance problem (between G’ and H’). Denote this distance by eb(G’,H’)=b(G,H).
Approximation Algorithms
• Given a minimization (maximization) problem Π, let OPT be the value of an optimal solution of Π. An approximation algorithm A provides a performance guarantee of α for Π if, for every instance of Π, the solution value returned by A is at most α × OPT (at least OPT/α).
• Usually we say that A is a factor-α approximation for Π.
Prior Results (1)
• We showed that the exemplar breakpoint distance problem does not admit any approximation unless P=NP (in fact, deciding whether eb(G’,H’)=0 is NP-complete) [Chen, Fu and Zhu, 2006].
• This result holds for any genomic distance d( ) satisfying G=H implies d(G,H)=0.
• Based on the above result, even under a weaker model of approximation, we showed that the exemplar conserved interval distance problem does not admit any WEAK approximation of a superlinear factor [Chen, Fowler, Fu and Zhu, 2007].
Prior Results (2)
• On the other hand, for the exemplar breakpoint distance problem, Sankoff [1999] used branch-and-bound and Nguyen, Tay and Zhang [2005] used divide-and-conquer on practical datasets to obtain good empirical results.
• In a related but slightly different effort, Chauve et al. [2006] studied exemplar genomic similarity problems, which do not satisfy the condition that G=H implies d(G,H)=0, e.g., the exemplar common interval measure problem.
Background for this work
• We look at the complement of the breakpoint distance under the gene duplication model.
• As the problem is still hard to approximate, we follow Nguyen et al. by considering genomes satisfying some practical conditions.
Definitions
• Given exemplar genomes G and H drawn from the same alphabet, ab is a non-breaking point if ab appears as a substring in both G and H.
Example. G = abcdefg
H = fegcdab
There are two non-breaking points (ab and cd) between G and H. The number of non-breaking points is called the non-breaking similarity of G and H, denoted nbs(G,H); here nbs(G,H)=2.
Note that when |G|=|H|=n and G=H, nbs(G,H)=n-1.
• Given genomes G’ and H’ drawn from the same alphabet, possibly with gene repetitions, the exemplar non-breaking similarity problem is to delete redundant genes to obtain exemplar genomes G and H such that nbs(G,H) is maximized. The maximum value is denoted enbs(G’,H’).
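As a small illustration of the non-breaking similarity defined above, here is a Python sketch, assuming exemplar genomes are given as unsigned strings with one character per gene. The function name `nbs` mirrors the notation of the slides, but the code itself is only an illustration.

```python
def nbs(g, h):
    """Non-breaking similarity: number of ordered adjacencies shared by g and h."""
    pairs_g = {g[i:i + 2] for i in range(len(g) - 1)}
    pairs_h = {h[i:i + 2] for i in range(len(h) - 1)}
    return len(pairs_g & pairs_h)


print(nbs("abcdefg", "fegcdab"))  # 2: the shared adjacencies ab and cd
print(nbs("abcdefg", "abcdefg"))  # 6, i.e. n-1 for n=7
```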
Example
G’ = abcadcefg
H’ = cfegcdabf
We have 4 possible exemplar genomes for G’: abcdefg, abdcefg, bcadefg, badcefg.
We have 4 possible exemplar genomes for H’: cfegdab, cegdabf, fegcdab, egcdabf.
enbs(G’,H’)=nbs(abcdefg,fegcdab)=2.
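The example can be reproduced by brute force. The sketch below assumes unsigned genomes given as strings and simply enumerates every way of keeping one copy of each gene; it takes exponential time in general and is meant only to check small instances, not to stand in for the algorithms discussed later. The helper names are illustrative.

```python
from itertools import product


def exemplars(genome):
    """Return all exemplar genomes obtained by keeping exactly one copy of each gene."""
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    result = set()
    for kept in product(*positions.values()):
        result.add("".join(genome[i] for i in sorted(kept)))
    return result


def nbs(g, h):
    """Number of ordered adjacencies shared by exemplar genomes g and h."""
    pairs = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    return len(pairs(g) & pairs(h))


def enbs(g_prime, h_prime):
    """Exemplar non-breaking similarity by exhaustive search (small inputs only)."""
    return max(nbs(g, h) for g in exemplars(g_prime) for h in exemplars(h_prime))


print(sorted(exemplars("abcadcefg")))  # ['abcdefg', 'abdcefg', 'badcefg', 'bcadefg']
print(enbs("abcadcefg", "cfegcdabf"))  # 2
```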
Inapproximability Result
Theorem 1. Given an exemplar genome G and another genome H’ such that the genes are all from the same alphabet of size n and each gene appears in H’ at most twice, the Exemplar Non-breaking Similarity Problem over G and H’ does not admit any approximation of factor n^{1-ε}, unless P=NP.
Proof Idea: A linear reduction from Independent Set (IS).
[Figure: input graph with vertices v1, …, v5 and edges e1, …, e5]
G: v1v’1v2v’2v3v’3v4v’4v5v’5x1e1x’1x2e2x’2x3e3x’3x4e4x’4x5e5x’5
H’: Y_{N+M-1} Y_{N+M-3} … Y_1 Y_{N+M} Y_{N+M-2} … Y_2 =
x4x’4x2x’2v5e4e5v’5v3e1v’3v1e1e2v’1x5x’5x3x’3x1x’1v4e3e5v’4v2e2e3e4v’2
Y_i = v_i A_i v’_i, if i ≤ N; Y_{N+i} = x_i x’_i, if i ≤ M
H: x4x’4x2x’2v5e5v’5v3v’3v1e1e2v’1x5x’5x3x’3x1x’1v4v’4v2e3e4v’2 corresponds to the optimal independent set {v3,v4}.
The input graph has an IS of size K iff enbs(G,H’)=K.
N=5 vertices, M=5 edges
N+M is even
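The construction on this slide can be replayed in code. The Python sketch below rests on two assumptions that the slide does not state explicitly: A_i is read as the list of edges incident to v_i in increasing index order (as the blocks such as v1e1e2v’1 suggest), and the edge list of the example graph is read off from those blocks. Genes are represented as string tokens with an ASCII apostrophe, and all function names are illustrative.

```python
from itertools import product


def build_instance(num_vertices, edges):
    """Build (G, H') as gene lists; edges[j-1] = (u, v) plays the role of edge e_j."""
    n, m = num_vertices, len(edges)
    assert (n + m) % 2 == 0, "the block order on the slide assumes N+M is even"
    g = []
    for i in range(1, n + 1):
        g += [f"v{i}", f"v'{i}"]
    for j in range(1, m + 1):
        g += [f"x{j}", f"e{j}", f"x'{j}"]

    def block(k):
        # Y_k: a vertex block v_k A_k v'_k for k <= N, otherwise x_j x'_j with j = k - N.
        if k <= n:
            incident = [f"e{j}" for j, e in enumerate(edges, 1) if k in e]
            return [f"v{k}", *incident, f"v'{k}"]
        return [f"x{k - n}", f"x'{k - n}"]

    # Odd-indexed blocks in decreasing order, then even-indexed blocks in decreasing order.
    order = list(range(n + m - 1, 0, -2)) + list(range(n + m, 0, -2))
    h_prime = [gene for k in order for gene in block(k)]
    return g, h_prime


def exemplars(genome):
    """Yield every exemplar obtained by keeping exactly one copy of each gene."""
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    for kept in product(*positions.values()):
        yield [genome[i] for i in sorted(kept)]


def nbs(g, h):
    """Number of ordered adjacencies shared by two exemplar gene lists."""
    pairs = lambda s: set(zip(s, s[1:]))
    return len(pairs(g) & pairs(h))


def enbs(g_prime, h_prime):
    """Exhaustive search; exponential in the number of duplicated genes."""
    return max(nbs(g, h) for g in exemplars(g_prime) for h in exemplars(h_prime))


# Edge list inferred from the vertex blocks of H' on the slide (an assumption).
EDGES = [(1, 3), (1, 2), (2, 4), (2, 5), (4, 5)]
G, H_PRIME = build_instance(5, EDGES)
print("".join(H_PRIME))
print(enbs(G, H_PRIME))  # 2, the size of the independent set {v3, v4}
```

In this instance the only adjacencies that G and an exemplar of H’ can share are pairs v_i v’_i, and such a pair survives only when every edge copy inside the block of v_i is deleted, which is why the shared adjacencies correspond to an independent set.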
Positive Results
Our motivation comes from Nguyen, Tay and Zhang [2005], who observed that for certain bacterial genome pairs (Baphi-Wigg, Pmult-Hinft, Ecoli-Styphi, Xaxo-Xcamp and Ypes), repeated genes are usually pegged, e.g.,
…xyx…aba…
Definition:
occ(g,G’) is the number of occurrences of g in G’.
span(g,G’) is the maximum distance between two copies of g in G’.
totalocc(c,G’) = Σ_{g in G’, span(g,G’) ≥ c} occ(g,G’)
Example. G’=abcdaebd
span(a,G’)=4, span(b,G’)=5, span(d,G’)=4,
totalocc(4,G’)=6
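The three measures can be computed directly from their definitions. This is a minimal Python sketch, assuming genomes are unsigned strings and assuming that a gene with a single copy has span 0 (the slides do not say how span is defined in that case).

```python
def occ(g, genome):
    """Number of occurrences of gene g in the genome."""
    return genome.count(g)


def span(g, genome):
    """Maximum distance between two copies of g; 0 if g occurs at most once (assumption)."""
    positions = [i for i, x in enumerate(genome) if x == g]
    return positions[-1] - positions[0] if positions else 0


def totalocc(c, genome):
    """Total number of occurrences of the genes whose span is at least c."""
    return sum(occ(g, genome) for g in set(genome) if span(g, genome) >= c)


G_PRIME = "abcdaebd"
print(span("a", G_PRIME), span("b", G_PRIME), span("d", G_PRIME))  # 4 5 4
print(totalocc(4, G_PRIME))  # 6
```

With c=1, `totalocc` counts every copy of every repeated gene, which is how the quantity is used for G’ in Theorem 2.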
Theorem 2. Let G’ and H’ be two genomes with t = totalocc(1,G’) + totalocc(c,H’), for a constant c. Then enbs(G’,H’) can be computed in O(3^{⌊t/3⌋} n^{c+2+ε}) time.
Idea 1: Given an exemplar genome G and another genome H” satisfying span(g,H”) ≤ c for every g in H”, we can use divide and conquer to compute enbs(G,H”) in O(n^{c+2+ε}) time.
Roughly speaking, write H” = H1H2H3 with |H2| = c, then enumerate all solutions on H2 and recurse.
T(n) ≤ 2^{c+1}[2T(n/2+c)] + O(n) = O(n^{c+2+ε})
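A rough sketch of why this recurrence yields a polynomial of the stated degree, ignoring rounding and assuming a constant-size base case. The substitution S(m) = T(m+2c) is a device of this sketch and is not taken from the slides.

```latex
\begin{align*}
T(n) &\le 2^{c+1}\bigl[\,2\,T(n/2+c)\,\bigr] + O(n) \;=\; 2^{c+2}\,T(n/2+c) + O(n).\\[2pt]
\text{With } S(m) &:= T(m+2c):\quad
S(m) \;\le\; 2^{c+2}\,T\!\bigl(\tfrac{m+2c}{2}+c\bigr) + O(m)
      \;=\; 2^{c+2}\,S(m/2) + O(m).\\[2pt]
\text{Unrolling } \log_2 m \text{ levels: }\quad
S(m) &= O\!\bigl((2^{c+2})^{\log_2 m}\bigr) \;=\; O(m^{c+2}) \;\subseteq\; O(m^{c+2+\varepsilon}).
\end{align*}
```

Since c is a constant, a bound on S(m) translates into the same bound on T(n), which is consistent with the O(n^{c+2+ε}) stated above.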
Idea 2: Treating t as a constant, we enumerate all possibilities for deleting duplicated genes in G’ (to obtain G) and for deleting genes with span greater than c in H’ (to obtain H”). By Lemma 6, there are at most 4·3^{⌊t/3⌋} such combinations. Therefore, the total running time is 4·3^{⌊t/3⌋} · O(n^{c+2+ε}) = O(3^{⌊t/3⌋} n^{c+2+ε}).
Theorem 3. Let G’ and H’ be two genomes with a total of t genes g satisfying shift(g,G’,H’) > c, for some constant c. Then enbs(G’,H’) can be computed in O(3^{⌊t/3⌋} n^{2c+1+ε}) time.
Example. G’=abcadef
H’=bcedefad
shift(a,G’,H’) = 6