ReCombinatorics: Phylogenetic Networks with Recombination
CPM, June 18, 2008
Pisa, Italy
Two recent results and Two Open Questions
What is population genomics?
• The Human genome “sequence” is done.
• Now we want to sequence many individuals in a population to correlate similarities and differences in their sequences with genetic traits (e.g. disease or disease susceptibility).
• Presently, we can’t sequence large numbers of individuals, but we can sample the sequences at SNP sites.
SNP Data
• A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).
• SNP maps have been compiled with a density of about 1 site per 1000 bases.
• SNP data is what is mostly collected in populations - it is much cheaper to collect than full sequence data, and focuses on variation in the population, which is what is of interest.
Haplotype Map Project: HAPMAP
• NIH-led project ($100M) to find common SNP haplotypes (“SNP sequences”) in the Human population.
• Association mapping: HAPMAP is used to try to associate genetically influenced diseases with specific SNP haplotypes, to either find causal haplotypes, or to find the region near causal mutations.
• The key to the logic of Association mapping is historical recombination in populations. Nature has done the experiments, now we try to make sense of the results.
The Perfect Phylogeny Model for SNP sequences
[Figure: a perfect phylogeny on sites 1 2 3 4 5. The root is the ancestral sequence 00000, site mutations label the edges, and the extant sequences are at the leaves.]
The tree derives the set M: 10100, 10000, 01011, 01010, 00010.
Only one mutation per site is allowed.
When can a set of sequences be derived on a perfect phylogeny?
Classic necessary and sufficient condition (NASC): Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs:
0,0 and 0,1 and 1,0 and 1,1
This is the 4-Gamete Test.
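The 4-gamete test can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names are hypothetical, sequences are assumed to be equal-length strings over 0 and 1, and sites are numbered from 1 as on the slides. The input is the set M derived by the tree above, followed by the same set with the sequence 10101 added.

```python
from itertools import combinations


def gametes(rows, i, j):
    """Return the set of (site i, site j) value pairs seen in the rows (0-based sites)."""
    return {(r[i], r[j]) for r in rows}


def incompatible_pairs(rows):
    """List the 1-based site pairs that contain all four gametes 00, 01, 10, 11."""
    m = len(rows[0])
    return [(i + 1, j + 1)
            for i, j in combinations(range(m), 2)
            if len(gametes(rows, i, j)) == 4]


def passes_four_gamete_test(rows):
    """True when no pair of sites fails the 4-gamete test."""
    return not incompatible_pairs(rows)


# The set M derived by the perfect phylogeny on the slide.
M = ["10100", "10000", "01011", "01010", "00010"]
print(passes_four_gamete_test(M))

# Adding 10101 creates incompatible pairs; the pair (4, 5) is among them.
print(incompatible_pairs(M + ["10101"]))
```

The check compares every pair of columns, so it takes time quadratic in the number of sites times the number of rows, which is fine for an illustration.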
A richer model
[Figure: the same tree on sites 1 2 3 4 5 with root 00000 and leaves 10100, 10000, 01011, 01010, 00010; the sequence 10101 is added to M.]
Pair 4, 5 fails the four-gamete test. The sites 4, 5 are incompatible.
Real sequence histories often involve recombination.
Sequence Recombination
[Figure: P = 10100 and S = 01011 recombine at recombination point 5 to give 10101.]
The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix).
Single crossover recombination: a recombination of P and S at recombination point 5.
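The single-crossover operation on this slide can be written as a short Python sketch. The function name and the 1-based convention for the recombination point are assumptions made for illustration; they follow the slide's wording that sites before the point come from P and sites from the point onward come from S.

```python
def recombine(prefix_parent, suffix_parent, point):
    """Single-crossover recombinant of two equal-length sequences.

    Sites 1..point-1 are copied from prefix_parent (P) and sites point..m
    from suffix_parent (S); sites are numbered from 1.
    """
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have the same length")
    if not 1 <= point <= len(prefix_parent):
        raise ValueError("recombination point must be a valid site number")
    return prefix_parent[:point - 1] + suffix_parent[point - 1:]


# P = 10100, S = 01011, recombination point 5, as on the slide.
print(recombine("10100", "01011", 5))  # 10101
```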
Network with Recombination: ARG
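[Figure: the previous tree (root 00000, sites 1 2 3 4 5, leaves 10100, 10000, 01011, 01010, 00010) extended with one recombination node. The node has recombination point 5 and parents labeled P and S, and it produces the new sequence 10101.]
The previous tree with one recombination event now derives all the sequences.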
A Min ARG for Kreitman’s data
ARG created by SHRUB
An illustration of why we are interested in recombination:
Association Mapping of Complex Diseases Using ARGs
Association Mapping
• A major strategy being practiced to find genes influencing disease from haplotypes of a subset of SNPs.
– Disease mutations: unobserved.
• A simple example to explain association mapping and why ARGs are useful, assuming the true ARG is known.
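[Figure: a row of SNPs with haplotype 0 1 0 0 1 and an unobserved disease mutation site among them.]
[Figure: the true ARG on sites 1 2 3 4 5, with root 00000, site mutations on the edges, and two recombination nodes whose parents are labeled P and S. Node sequences: 10010, 01100, 00101, 01101, 00100, 00010. Leaves: a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]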
Very Simplistic Mapping of the Unobserved Mutation of Mendelian Diseases with ARGs
Assumption (for now): A sequence is diseased iff it carries the single disease mutation.
Where is the disease mutation?
[Figure: the ARG over sites 1 2 3 4 5 with the diseased sequences marked.]
What part of 01100 do d, e, f inherit?
The single disease mutation occurs between sites 2 and 3!
Mapping Disease Gene with Inferred ARGs
• “..the best information that we could possibly get about association is to know the full coalescent genealogy…” – Zollner and Pritchard, 2005
• But we do not know the true ARG!
• Goal: infer ARGs from SNP data for association mapping
– Not easy, and often approximated (e.g. Zollner and Pritchard)
– Improved results for the inference: Y. Wu (RECOMB 2007)
Results on Reconstructing the Evolution of SNP Sequences
• Part I: Clean mathematical and algorithmic results: Galled-Trees, near-uniqueness, graph-theory lower bound, and the Decomposition theorem, Forest Theorem and the History Lower bound.
• Part II: Practical computation of Lower and Upper bounds on the number of recombinations needed. Construction of (optimal) phylogenetic networks; uniform sampling; haplotyping with ARGs; LD mapping …
• Part III: Varied Biological Applications
• Part IV: Extension to Gene Conversion
• Part V: The Minimum Mosaic Model of Recombination (CPM 2007)
This talk will discuss two topics in Part I
Minimizing Recombinations in unconstrained networks
• Problem: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations used to generate M, allowing only one mutation per site. This has biological meaning in appropriate contexts.
• The general minimization problem is NP-hard.
• We can solve this problem in poly-time for the special case of Galled-Trees, to be defined.
The Decomposition Theorem
Since the minimization problem is NP-hard, we want to break up a problem into subproblems that can be solved separately and then combined.
Incompatible Sites
A pair of sites (columns) of M that fail the 4-gametes test are said to be incompatible.
A site that is not in such a pair is compatible.
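THE MAIN TOOL: We represent the pairwise incompatibilities in an incompatibility graph.
M (rows a-g, sites 1-5):
     1 2 3 4 5
a    0 0 0 1 0
b    1 0 0 1 0
c    0 0 1 0 0
d    1 0 1 0 0
e    0 1 1 0 0
f    0 1 1 0 1
g    0 0 1 0 1
Incompatibility Graph G(M): one node per site. Two nodes are connected iff the pair of sites is incompatible, i.e., fails the 4-gamete test.
[Figure: G(M) on nodes 1, 2, 3, 4, 5; sites 1, 3, 4 form one connected component and sites 2, 5 form another.]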
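As a rough sketch of how G(M) can be built, the following Python code applies the 4-gamete test to every pair of columns of the matrix M from this slide. The function name and the dictionary-of-sets representation are illustrative choices, not part of the talk.

```python
from itertools import combinations


def incompatibility_graph(rows):
    """Return G(M) as {site: set of incompatible sites}, with 1-based sites."""
    m = len(rows[0])
    graph = {site: set() for site in range(1, m + 1)}
    for i, j in combinations(range(m), 2):
        gametes = {(r[i], r[j]) for r in rows}
        if len(gametes) == 4:  # 00, 01, 10 and 11 all occur
            graph[i + 1].add(j + 1)
            graph[j + 1].add(i + 1)
    return graph


# Rows a-g of M from the slide.
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
G = incompatibility_graph(M)
for site in sorted(G):
    print(site, sorted(G[site]))
```

On this input the edges are 1-3, 1-4 and 2-5, which agrees with the two components {1, 3, 4} and {2, 5} used later in the talk.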
The connected components of G(M) are very informative
For example we have the Theorem:
The number of non-trivial connected components is a lower bound on the number of recombinations needed in any network. We will see that the non-trivial connected components are the key to the finest possible decomposition, and have other essential uses.
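The connected-component lower bound stated in the theorem can be computed directly. The sketch below is self-contained and uses hypothetical function names; it takes "non-trivial" to mean a component with more than one site, and it uses the seven-row matrix M with rows a-g from this part of the talk.

```python
from itertools import combinations


def incompatible_edges(rows):
    """1-based site pairs that fail the 4-gamete test."""
    m = len(rows[0])
    return [(i + 1, j + 1)
            for i, j in combinations(range(m), 2)
            if len({(r[i], r[j]) for r in rows}) == 4]


def connected_components(num_sites, edges):
    """Connected components of the graph on sites 1..num_sites, found by DFS."""
    neighbors = {s: set() for s in range(1, num_sites + 1)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, components = set(), []
    for start in neighbors:
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in neighbors[u] - seen:
                seen.add(v)
                stack.append(v)
        components.append(sorted(comp))
    return components


def component_lower_bound(rows):
    """Number of components of G(M) with more than one site."""
    comps = connected_components(len(rows[0]), incompatible_edges(rows))
    return sum(1 for c in comps if len(c) > 1)


M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
print(connected_components(5, incompatible_edges(M)))
print(component_lower_bound(M))
```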
Recombination Cycles
• In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet.
• The cycle specified by those two paths is called a “recombination cycle”.
A maximal set of intersecting cycles forms a Blob
[Figure: an ARG with root 00000 and node sequences 10010, 01100, 00101, 01101, 00100, 00010, whose intersecting recombination cycles form one blob.]
If directions on the edges are removed, a blob is a bi-connected component of the network.
Blobbed Trees
• Contracting each blob in a network results in a directed, rooted tree, otherwise one of the “blobs” was not maximal. Simple, but key insight.
• So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree.
The blobs are the non-tree-like parts of the network.
A blob that is just a single cycle is called a “gall”, and a network where all blobs are galls is called a “Galled-Tree”.
[Figure: a tree of blobs, with the caption “Ugly tangled network inside the blob.”]
Every network is a tree of blobs.
A network where every blob is a single cycle is a Galled-Tree.
A Simple Observation
In any network N for M, all sites from the same non-trivial connected component of G(M) must appear together in a single blob in N.
The Decomposition Theorem
Theorem: For any set of sequences M, there is a phylogenetic network that derives M, where each blob contains all and only the sites in one non-trivial connected component of G(M). The compatible sites can always be put on edges outside of any blob. This “fully-decomposed” network is the finest decomposition possible.
Example: Network for input M with one blob
[Figure: a network for M with one blob, on sites 1 2 3 4 5, with root 00000 and leaves a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]
[Figure: the fully-decomposed network for M, with the same leaves a-g, shown next to the incompatibility graph, whose components are {1, 3, 4} and {2, 5}.]
Moreover, the backbone tree, the partition of sites into blobs, and the sequences exported from any blob are all invariant features of the fully-decomposed networks for M, and can be determined in polynomial time.
So, we can find a network for M by solving the (rooted) recombination minimization problem for each connected component of G(M) separately, and then connecting those subnetworks in an invariant way.
The resulting network will be a network with the fewest recombination nodes over all fully-decomposed networks for M.
Algorithmically
• Finding the tree part of the blobbed-tree is easy.
• Determining the sequences labeling the exterior nodes on any blob is easy.
• Determining a “good” structure inside a blob B is the problem of generating the sequences of the exterior nodes of B.
• It is easy to test whether the exterior sequences on B can be generated with only a single recombination. The original galled-tree problem is now just the problem of testing whether one single-crossover recombination is sufficient for each blob.
• That can be solved by successively removing each exterior sequence and testing if the remaining sequences can be generated on a perfect phylogeny of the correct form.
Proof Ideas
Let C be a connected component of G(M). Define M[C] as the sequences in M restricted to the sites in C.
M, with the two components C1 = {1, 3, 4} and C2 = {2, 5} of G(M):
     1 2 3 4 5
a    0 0 0 1 0
b    1 0 0 1 0
c    0 0 1 0 0
d    1 0 1 0 0
e    0 1 1 0 0
f    0 1 1 0 1
g    0 0 1 0 1
M[C1] and M[C2] (labeled B1 and B2 in the figure):
     1 3 4     2 5
a    0 0 1     0 0
b    1 0 1     0 0
c    0 1 0     0 0
d    1 1 0     0 0
e    0 1 0     1 0
f    0 1 0     1 1
g    0 1 0     0 1
Now for each connected component C in G(M), call each distinct sequence in M[C] a supercharacter, and let W be the indicator matrix for the supercharacters. So W indicates which rows of M contain which particular supercharacters.
W (columns 1-4 are the supercharacters from C1, columns 5-8 those from C2):
     1 2 3 4 5 6 7 8
a    1 0 0 0 1 0 0 0
b    0 1 0 0 1 0 0 0
c    0 0 1 0 1 0 0 0
d    0 0 0 1 1 0 0 0
e    0 0 1 0 0 1 0 0
f    0 0 1 0 0 0 1 0
g    0 0 1 0 0 0 0 1
Supercharacters contained in rows a-g: 1 2 3 4 3 3 3 from C1 and 5 5 5 5 6 7 8 from C2.
Proof Ideas
Lemma: No pair of supercharacters are incompatible.
So by the NASC for a Perfect Phylogeny, there is a unique perfect phylogeny T for W.
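The construction of W and the Lemma can be checked on the running example with a short Python sketch. The function names are hypothetical, and the components C1 = {1, 3, 4} and C2 = {2, 5} are supplied by hand rather than computed; supercharacters are numbered in order of first appearance within each component, which reproduces the column numbering on the slide.

```python
from itertools import combinations


def restrict(rows, sites):
    """M[C]: each row restricted to the given 1-based sites."""
    return ["".join(r[s - 1] for s in sites) for r in rows]


def supercharacter_matrix(rows, components):
    """Return (columns, W): columns lists (component index, restricted sequence),
    and W[r][k] is 1 iff row r contains supercharacter k."""
    restricted = [restrict(rows, sites) for sites in components]
    columns = []
    for c, seqs in enumerate(restricted):
        for seq in dict.fromkeys(seqs):  # distinct sequences, first-seen order
            columns.append((c, seq))
    W = [[1 if restricted[c][r] == seq else 0 for c, seq in columns]
         for r in range(len(rows))]
    return columns, W


def incompatible_columns(matrix):
    """0-based column pairs of a 0/1 matrix that contain all four gametes."""
    width = len(matrix[0])
    return [(i, j) for i, j in combinations(range(width), 2)
            if len({(row[i], row[j]) for row in matrix}) == 4]


M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
columns, W = supercharacter_matrix(M, [[1, 3, 4], [2, 5]])
for row in W:
    print(*row)
print(incompatible_columns(W))  # the Lemma says this list is empty
```

Running the 4-gamete test on the columns of W is only a spot check of the Lemma on this one dataset, not a proof.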
Proof Ideas
For each connected component C of G(M), all supercharacters that originate from C label edges in T that are incident with one single node v[C] in T. So, if we expand each node v[C] to be a network that generates the supercharacters from C (the sequences in M[C]), and connect each network correctly to the edges in T, the resulting network is a fully-decomposed blobbed-tree that generates M.
However …
While fully-decomposed networks always exist, they do not necessarily minimize the number of recombination nodes, over all possible networks.
That is, sometimes it pays to put sites from different connected components together on the same blob.
But we can prove several useful sufficient conditions for when there is a fully-decomposed network that minimizes the number of recombinations, over all possible networks.
The deepest result:
Theorem: Let N be a phylogenetic network for input M, let L be the set of sequences that label the nodes of N, and let G(L) be the incompatibility graph for L. If G(L) and G(M) have the same number of connected components, then there is a fully-decomposed network for M with the same number of recombinations as in N.
JCB December 2007
Sufficient Conditions
Corollary
A fully-decomposed network exists that minimizes the number of recombinations, unless every optimal network uses some recombination node(s) labeled by sequence(s) not in M, and the addition of those sequences to M creates an incompatibility between sites in different components of G(M).
[Figure: a network on sites 1 2 3 4 5 6 with root 000000 and recombination nodes whose parents are labeled p and s. Node labels: 001000, 011010, 010010, 100001, 100101, 000100, 100010. Sequences in M are in black. Sequence 100010 is not in M.]
G(M) has two components. Each requires two recombinations, but this combined network needs only three.
G(L) has one component. The addition of sequence 100010 reduces the number of components from 2 to 1.
[Figure: G(M) for the original data, on nodes 1 2 3 4 5 6.] Two components, so two blobs; each blob requires two recombinations, by the HK lower bound theorem, so a fully-decomposed network needs at least four recombinations.
[Figure: G(L) on nodes 1 2 3 4 5 6, created from the original data and the addition of the new interior sequence 100010.] G(L) has only one connected component, compared to two components for G(M).
A Practical Sufficient Condition
If M can be derived on a network N in which every edge contains at most one site, and every node is labeled with a sequence in M, then there is a fully-decomposed network for M which minimizes the number of recombinations over all possible networks for M.
Another Practical Sufficient Condition
If M can be derived on a network N where the number of recombinations equals the (poly-computable) Haplotype Lower Bound, then there is a fully-decomposed network for M which minimizes the number of recombinations over all possible networks.
Theorem: For any K, there is a dataset where the best fully-decomposed network uses K recombinations more than optimal.
In that construction, the ratio of the number of recombinations in the best fully-decomposed network to the optimal is constant as K grows.
Open Question: Construct examples that show that the ratio can be arbitrarily large.
A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem
Yufeng Wu and Dan Gusfield
UC Davis
COCOON’07 July 16, 2007
History Bound (Myers & Griffiths 2003)
M = 000, 100, 010, 011, 111
Iterate the following operations:
1. Remove a column with a single 0 or a single 1
2. Remove a duplicate row
3. Remove any row
History bound: the minimum number of type-3 operations needed to reduce the matrix to empty.
[Figure: an example reduction of M. One type-3 operation leaves 000, 100, 010, 011; the figure then shows 00, 10, 01, 01, then 00, 10, 01, then Empty.]
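The three operations above suggest a direct, exponential-time way to compute the history bound on small inputs. The sketch below is an illustration under stated assumptions, not the algorithm from the talk or from Bafna and Bansal: it applies type-1 and type-2 operations greedily until neither applies (assuming, as is standard, that they never hurt), treats a column with no 0s or no 1s as removable too, counts a matrix with at most one row as empty, and then tries every row for the type-3 removal, with memoization. All function names are hypothetical.

```python
from functools import lru_cache


def reduce_clean(rows):
    """Apply type-1 (column) and type-2 (duplicate row) operations to a fixpoint."""
    rows = list(dict.fromkeys(rows))  # type 2: drop duplicate rows
    changed = True
    while changed and rows and rows[0]:
        changed = False
        width = len(rows[0])
        keep = []
        for j in range(width):
            ones = sum(r[j] == "1" for r in rows)
            zeros = len(rows) - ones
            # Assumption: columns with at most one 0 or at most one 1 are removable.
            if ones >= 2 and zeros >= 2:
                keep.append(j)
        if len(keep) < width:  # type 1: drop the removable columns
            rows = ["".join(r[j] for j in keep) for r in rows]
            changed = True
        deduped = list(dict.fromkeys(rows))
        if len(deduped) < len(rows):  # type 2 again after columns change
            rows = deduped
            changed = True
    return tuple(sorted(rows))


@lru_cache(maxsize=None)
def _history_bound(rows):
    rows = reduce_clean(rows)
    if len(rows) <= 1:  # nothing left that needs a type-3 operation
        return 0
    # Type 3: try removing each remaining row and keep the cheapest choice.
    return 1 + min(_history_bound(rows[:i] + rows[i + 1:])
                   for i in range(len(rows)))


def history_bound(rows):
    """Minimum number of type-3 operations needed to empty the matrix."""
    return _history_bound(tuple(rows))


M = ["000", "100", "010", "011", "111"]
print(history_bound(M))
```

On the five-row matrix M from the slide, no type-1 or type-2 operation applies at the start, and removing the row 111 lets the rest collapse, so the sketch reports one type-3 operation, as in the figure.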
Graphical interpretation of history bound (HistB)
• Each operation in the history computation corresponds to an operation that deconstructs the optimal, but unknown ARG.
• We deconstruct the optimal ARG by removing tree parts as long as possible; then remove an exposed recombination node; repeat.
• Removing an exposed recombination node in the ARG corresponds to a single type-3 operation. So when deconstructing the optimal ARG, the number of recombination nodes = number of type-3 operations.
• Since the optimal ARG is unknown, the history bound is the minimum number of type-3 operations needed to make the matrix empty.
Operations on M correspond to operations on the optimal ARG
[Figure sequence: the matrix and the ARG (one recombination node with parents p and s) are reduced in parallel.]
M (sites 1 2 3 4 5): a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 00101, g: 00101
Type-2 operation: leaves a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 00101
Type-1 operations: leave sites 1 3 4, with a: 001, b: 101, c: 010, d: 110, e: 010, f: 010
Type-2 operations: leave a: 001, b: 101, c: 010, d: 110
Type-3 operation: leaves a: 001, b: 101, c: 010
Then three more Type-1 operations fully reduce M and the ARG.
History bound
• Initially required trying all n! permutations of the rows to choose the type-3 operations.
• The bound can be computed by DP in O(2^n) time (Bafna, Bansal).
• On datasets where it can be computed, the history bound is observed to be higher than (or equal to) all studied lower bounds (about ten of them).
• There is no static definition for what the history bound is -- it is only defined by the algorithms that compute it! The work in this part of the talk comes out of an attempt to find a simple static definition.
Why a static definition matters
• We want a definition of what is being computed, independent of how it is computed, so that we can reason about it and find alternative ways to compute or approximate it.
• For example, with no static definition of the history bound, we don’t know how to formulate an integer linear program to compute it.
Intro. to Forest Bound: Decompose an Optimal ARG to A Forest of Trees, removing recombination edges
An ARG with three recombinations
After removing recombination edges, four trees result.
The number of trees is precisely the number of recombinations plus one
Idea behind the Forest Bound (FB)
Each tree created in this way contains at most one occurrence of any site, and each site occurs in at most one of the trees. So the trees form a forest of related perfect phylogenies.
Forest Bound
Given a set of sequences M, partition M into the fewest subsets so that each subset of sequences can be derived on a tree, where each site occurs at most once in the forest of trees. The number of trees, minus one, is a valid lower bound on Rmin.
Comparing the Forest Bound (FB) to:
• History Bound (HistB)
• Optimal Haplotype Bound (OhapB): The currently best lower bound that can be computed in practice for biological data.
• Theorem: On any data, OhapB <= FB <= HistB. On some data, OhapB < FB < HistB.
Thus the FB is the highest lower bound with a static definition.
Computing the Forest Bound is NP-Hard
• Optimal haplotype bound is quite good, but NP-hard to compute.
• If the forest bound were efficiently computable, we would not need to use the optimal haplotype bound at all.
• Unfortunately, the forest bound is NP-hard to compute.
• Reduction from Exact-cover-by-3 sets.
Integer Programming Formulation for the Forest Bound
• For sequences with m sites, consider the hypercube of all 2^m possible sequences.
• Minimizing F is equivalent to reducing the number of Steiner nodes in the forests.
• We also need to ensure the edge linking two nodes in a tree is only labeled with columns that do not appear in other trees.
• Can easily incorporate missing data in the input.
• The IP formulation has exponential size, but is practical when the number of columns is relatively small.
Empirical Results
• On randomly generated datasets with 15 rows and 7 columns, FB > OhapB on 10% of the data. On more biologically meaningful data (generated with the simulation program ms), however, OhapB = FB more often.
• On datasets generated by ms with missing entries, FB more often outperforms an approximate optimal Rh bound:
– 30 rows, 7 columns and 30% missing entries: FB was strictly larger in 8% of the data.
– When the level of missing entries is lower, the approx. OhapB matches the FB more often.
Open Problem
Find a static definition of the history bound, one that can be translated into an objective function independent of any algorithm; one that can be solved by ILP, for example.
Papers and software are at:
wwwcsif.cs.ucdavis.edu/~gusfield
Thank you.