ReCombinatorics: Phylogenetic Networks with Recombination

CPM, June 18, 2008

Pisa, Italy

Two recent results and Two Open Questions

What is population genomics?

• The Human genome “sequence” is done.

• Now we want to sequence many individuals in a population to correlate similarities and differences in their sequences with genetic traits (e.g. disease or disease susceptibility).

• Presently, we can’t sequence large numbers of individuals, but we can sample the sequences at SNP sites.

SNP Data

• A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).

• SNP maps have been compiled with a density of about 1 site per 1000 nucleotides.

• SNP data is what is mostly collected in populations - it is much cheaper to collect than full sequence data, and focuses on variation in the population, which is what is of interest.

Haplotype Map Project: HAPMAP

• NIH-led project ($100M) to find common SNP haplotypes (“SNP sequences”) in the Human population.

• Association mapping: HAPMAP is used to try to associate genetically influenced diseases with specific SNP haplotypes, to either find causal haplotypes, or to find the region near causal mutations.

• The key to the logic of Association mapping is historical recombination in populations. Nature has done the experiments, now we try to make sense of the results.

The Perfect Phylogeny Model for SNP sequences

[Figure: a perfect phylogeny over sites 1-5. The root is the ancestral sequence 00000, each edge is labeled with the site that mutates on it (sites 1, 2, 3, 4, 5), and the extant sequences are at the leaves.]

The tree derives the set M (sites 12345):
10100
10000
01011
01010
00010

Only one mutation per site is allowed.

When can a set of sequences be derived on a perfect phylogeny?

Classic NASC (necessary and sufficient condition): Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs:

0,0 and 0,1 and 1,0 and 1,1

This is the 4-Gamete Test.
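The 4-gamete test can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names are hypothetical, sequences are assumed to be equal-length strings of '0' and '1', and column indices are 0-based.

```python
from itertools import combinations

def gametes(matrix, i, j):
    """Set of (row[i], row[j]) pairs seen in columns i and j (0-based)."""
    return {(row[i], row[j]) for row in matrix}

def incompatible_pairs(matrix):
    """All 0-based column pairs that contain all four gametes."""
    n_sites = len(matrix[0])
    return [(i, j) for i, j in combinations(range(n_sites), 2)
            if len(gametes(matrix, i, j)) == 4]

def passes_four_gamete_test(matrix):
    """True iff no pair of columns contains 0,0 and 0,1 and 1,0 and 1,1."""
    return not incompatible_pairs(matrix)

# The five sequences derived by the perfect phylogeny on the slide.
M = ["10100", "10000", "01011", "01010", "00010"]
print(passes_four_gamete_test(M))  # True

# Adding 10101 breaks the test; the result includes (3, 4), i.e. sites 4 and 5.
print(incompatible_pairs(M + ["10101"]))
```

The check is quadratic in the number of sites and linear in the number of sequences, which is why it is a convenient first filter before any network construction.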

A richer model

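[Figure: the perfect phylogeny from the previous slide, with root 00000, site mutations 1-5 on the edges, and leaves 10100, 10000, 01011, 01010, 00010.]

M (sites 12345):
10100
10000
01011
01010
00010
10101 (added)

Pair 4, 5 fails the four-gamete test. The sites 4, 5 are incompatible.

Real sequence histories often involve recombination.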

Sequence Recombination

P = 10100, S = 01011, recombination point 5, recombinant = 10101

The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix).

A recombination of P and S at recombination point 5.

Single crossover recombination
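As a small sketch of the single-crossover operation described above (the function name and the 1-based convention for the recombination point are assumptions chosen to match the slide):

```python
def recombine(prefix_parent, suffix_parent, point):
    """Single-crossover recombination.

    Sites before `point` (1-based) are copied from the prefix parent P;
    sites from `point` onward are copied from the suffix parent S.
    """
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have the same number of sites")
    if not 1 <= point <= len(prefix_parent):
        raise ValueError("recombination point out of range")
    return prefix_parent[:point - 1] + suffix_parent[point - 1:]

# The example from the slide: P = 10100, S = 01011, point 5.
print(recombine("10100", "01011", 5))  # 10101
```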

Network with Recombination: ARG

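[Figure: the previous tree, rooted at 00000 with site mutations 1-5 on its edges, extended by one recombination node: P = 10100 and S = 01011 recombine at point 5 to give 10101.]

M (sites 12345):
10100
10000
01011
01010
00010
10101 (new)

The previous tree with one recombination event now derives all the sequences.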

A Min ARG for Kreitman’s data

ARG created by SHRUB

An illustration of why we are interested in recombination:

Association Mapping of Complex Diseases Using ARGs

Association Mapping

• A major strategy being practiced to find genes influencing disease from haplotypes of a subset of SNPs.
– Disease mutations: unobserved.

• A simple example to explain association mapping and why ARGs are useful, assuming the true ARG is known.

[Figure: a haplotype of SNPs, 0 1 0 0 1, with the disease mutation site marked among the SNP sites.]

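Very Simplistic Mapping of the Unobserved Mutation of Mendelian Diseases with ARGs

[Figure: an ARG rooted at 00000 with mutations at sites 1-5 on its edges and two recombination nodes, with crossover points 3 and 4 (P marks the prefix parent, S the suffix parent). Interior node labels include 00010, 10010, 00100, 01100, 00101, 01101. Leaves: a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101. Some leaves are marked “Diseased”.]

Assumption (for now): A sequence is diseased iff it carries the single disease mutation.

Where is the disease mutation? (sites 1 2 3 4 5)

What part of 01100 do d, e, f inherit?

The single disease mutation occurs between sites 2 and 3!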

Mapping Disease Gene with Inferred ARGs

• “..the best information that we could possibly get about association is to know the full coalescent genealogy…” – Zollner and Pritchard, 2005

• But we do not know the true ARG!

• Goal: infer ARGs from SNP data for association mapping
– Not easy and often approximation (e.g. Zollner and Pritchard)
– Improved results for the inference: Y. Wu (RECOMB 2007)

Results on Reconstructing the Evolution of SNP Sequences

• Part I: Clean mathematical and algorithmic results: Galled-Trees, near-uniqueness, graph-theory lower bound, and the Decomposition theorem, Forest Theorem and the History Lower bound.

• Part II: Practical computation of Lower and Upper bounds on the number of recombinations needed. Construction of (optimal) phylogenetic networks; uniform sampling; haplotyping with ARGs; LD mapping …

• Part III: Varied Biological Applications

• Part IV: Extension to Gene Conversion

• Part V: The Minimum Mosaic Model of Recombination (CPM 2007)

This talk will discuss two topics in Part I

Minimizing Recombinations in unconstrained networks

• Problem: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations used to generate M, allowing only one mutation per site. This has biological meaning in appropriate contexts.

• The general minimization problem is NP-hard.

• We can solve this problem in poly-time for the special case of Galled-Trees, to be defined.

The Decomposition Theorem

Since the minimization problem is NP-hard, we want to break up a problem into subproblems that can be solved separately and combined.

Incompatible Sites

A pair of sites (columns) of M that fail the 4-gametes test are said to be incompatible.

A site that is not in such a pair is compatible.

Incompatibility Graph G(M)

M:
   1 2 3 4 5
a  0 0 0 1 0
b  1 0 0 1 0
c  0 0 1 0 0
d  1 0 1 0 0
e  0 1 1 0 0
f  0 1 1 0 1
g  0 0 1 0 1

[Figure: G(M), with one node per site. Sites 1, 3, 4 form one connected component and sites 2, 5 form another.]

Two nodes are connected iff the pair of sites is incompatible, i.e., fails the 4-gamete test.

THE MAIN TOOL: We represent the pairwise incompatibilities in an incompatibility graph.

The connected components of G(M) are very informative

For example we have the Theorem:

The number of non-trivial connected components is a lower-bound on the number of recombinations needed in any network. We will see that the non-trivial connected components are the key to the finest possible decomposition, and have other essential uses.
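A minimal sketch of how G(M) and its non-trivial connected components could be computed for the 7-sequence example M. The function names are hypothetical, sites are numbered from 1 as on the slides, and sequences are assumed to be binary strings.

```python
from itertools import combinations

def incompatibility_graph(matrix):
    """Adjacency sets over 1-based sites; an edge joins two sites
    exactly when that pair of columns contains all four gametes."""
    n_sites = len(matrix[0])
    adj = {s: set() for s in range(1, n_sites + 1)}
    for i, j in combinations(range(n_sites), 2):
        if len({(row[i], row[j]) for row in matrix}) == 4:
            adj[i + 1].add(j + 1)
            adj[j + 1].add(i + 1)
    return adj

def connected_components(adj):
    """Connected components of the graph, each as a sorted list of sites."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # iterative depth-first search
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components

# Rows a..g of the example matrix M.
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
G = incompatibility_graph(M)
nontrivial = [c for c in connected_components(G) if len(c) > 1]
print(nontrivial)       # [[1, 3, 4], [2, 5]]
print(len(nontrivial))  # 2, the connected-component lower bound for this M
```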

Recombination Cycles

• In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet.

• The cycle specified by those two paths is called a “recombination cycle”.

A maximal set of intersecting cycles forms a Blob

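[Figure: a network rooted at 00000 with mutations at sites 1-5 on its edges and two recombination nodes (crossover points 3 and 4, parents marked P and S). Node labels include 10010, 01100, 00101, 01101, 00100, 00010.]

If directions on the edges are removed, a blob is a bi-connected component of the network.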

Blobbed Trees

• Contracting each blob in a network results in a directed, rooted tree, otherwise one of the “blobs” was not maximal. Simple, but key insight.

• So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree.

The blobs are the non-tree-like parts of the network.

A blob that is just a single cycle is called a “gall”, and a network where all blobs are galls is called a “Galled-Tree”.

Ugly tangled network inside the blob.

Every network is a tree of blobs.

A network where every blob is a single cycle is a Galled-Tree.

A Simple Observation

In any network N for M, all sites from the same non-trivial connected component of G(M) must appear together in a single blob in N.

The Decomposition Theorem

Theorem: For any set of sequences M, there is a phylogenetic network that derives M, where each blob contains all and only the sites in one non-trivial connected component of G(M). The compatible sites can always be put on edges outside of any blob. This “fully-decomposed” network is the finest decomposition possible.

Example: Network for input M with one blob

[Figure, left: the network for M with one blob, rooted at 00000, with leaves a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]

[Figure, right: the fully-decomposed network for M, with the same leaves a-g and one blob for each non-trivial connected component of the incompatibility graph ({1, 3, 4} and {2, 5}).]

Incompatibility Graph

The fully-decomposed network for M

Moreover, the backbone tree and the partition of sites into blobs, and the sequences exported from any blob are all invariant features of the fully-decomposed networks for M, and can be determined in polynomial-time.

So, we can find a network for M by solving the (rooted) recombination minimization problem for each connected component of G(M) separately, and then connect those subnetworks in an invariant way.

The resulting network will be a network with the fewest recombination nodes over all fully-decomposed networks for M.

Algorithmically

• Finding the tree part of the blobbed-tree is easy.

• Determining the sequences labeling the exterior nodes on any blob is easy.

• Determining a “good” structure inside a blob B is the problem of generating the sequences of the exterior nodes of B.

• It is easy to test whether the exterior sequences on B can be generated with only a single recombination. The original galled-tree problem is now just the problem of testing whether one single-crossover recombination is sufficient for each blob.

• That can be solved by successively removing each exterior sequence and testing if the remaining sequences can be generated on a perfect phylogeny of the correct form.

Proof Ideas

Let C be a connected component of G(M). Define M[C] as the sequences in M restricted to the sites in C.

M:
   1 2 3 4 5
a  0 0 0 1 0
b  1 0 0 1 0
c  0 0 1 0 0
d  1 0 1 0 0
e  0 1 1 0 0
f  0 1 1 0 1
g  0 0 1 0 1

Incompatibility graph G(M): connected components C1 = {1, 3, 4} and C2 = {2, 5} (labeled B1 and B2 in the figure).

M[C1]:
   1 3 4
a  0 0 1
b  1 0 1
c  0 1 0
d  1 1 0
e  0 1 0
f  0 1 0
g  0 1 0

M[C2]:
   2 5
a  0 0
b  0 0
c  0 0
d  0 0
e  1 0
f  1 1
g  0 1

Now for each connected component C in G(M), call each distinct sequence in M[C] a supercharacter, and let W be the indicator matrix for the supercharacters. So W indicates which rows of M contain which particular supercharacters.

Supercharacters carried by each row (1-4 from C1, 5-8 from C2): a: 1, 5; b: 2, 5; c: 3, 5; d: 4, 5; e: 3, 6; f: 3, 7; g: 3, 8.

W:
   1 2 3 4 5 6 7 8
a  1 0 0 0 1 0 0 0
b  0 1 0 0 1 0 0 0
c  0 0 1 0 1 0 0 0
d  0 0 0 1 1 0 0 0
e  0 0 1 0 0 1 0 0
f  0 0 1 0 0 0 1 0
g  0 0 1 0 0 0 0 1
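The construction of M[C] and W can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, the components are supplied by hand rather than computed, and supercharacters are numbered in order of first appearance, which happens to match the numbering on the slide.

```python
def restrict(matrix, sites):
    """M[C]: each row of M restricted to the given 1-based sites."""
    return ["".join(row[s - 1] for s in sites) for row in matrix]

def supercharacter_matrix(matrix, components):
    """Return (labels, W). Each label is (component, distinct sequence);
    W[r][k] is 1 iff row r of M carries supercharacter k."""
    labels, columns = [], []
    for comp in components:
        restricted = restrict(matrix, comp)
        # dict.fromkeys keeps distinct sequences in first-appearance order
        for seq in dict.fromkeys(restricted):
            labels.append((tuple(comp), seq))
            columns.append([1 if r == seq else 0 for r in restricted])
    W = [[col[r] for col in columns] for r in range(len(matrix))]
    return labels, W

# Rows a..g of M and the two components of G(M) from the slides.
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
labels, W = supercharacter_matrix(M, [[1, 3, 4], [2, 5]])
for name, row in zip("abcdefg", W):
    print(name, row)
```

Each row of W has exactly one 1 per connected component, since every row of M carries exactly one supercharacter from each component.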

Proof Ideas

Lemma: No pair of supercharacters is incompatible.

So by the NASC for a Perfect Phylogeny, there is a unique perfect phylogeny T for W.

Proof Ideas

For each connected component C of G(M), all supercharacters that originate from C label edges in T that are incident with one single node v[C] in T. So, if we expand each node v[C] to be a network that generates the supercharacters from C (the sequences in M[C]), and connect each network correctly to the edges in T, the resulting network is a fully-decomposed blobbed-tree that generates M.

However …

While fully-decomposed networks always exist, they do not necessarily minimize the number of recombination nodes, over all possible networks.

That is, sometimes it pays to put sites from different connected components together on the same blob.

But we can prove several useful sufficient conditions for when there is a fully-decomposed network that minimizes the number of recombinations, over all possible networks.

The deepest result:

Theorem: Let N be a phylogenetic network for input M, let L be the set of sequences that label the nodes of N, and let G(L) be the incompatibility graph for L. If G(L) and G(M) have the same number of connected components, then there is a fully-decomposed network for M with the same number of recombinations as in N.

JCB December 2007

Sufficient Conditions

Corollary

A fully-decomposed network exists that minimizes the number of recombinations, unless every optimal network uses some recombination node(s) labeled by sequence(s) not in M, and the addition of those sequences to M creates an incompatibility between sites in different components of G(M).

[Figure: a network rooted at 000000 over sites 1-6, with recombination nodes whose parents are marked p and s. Node labels visible in the figure include 001000, 011010, 010010, 100001, 100101, 000100 and 100010.]

Sequences in M are in black. Sequence 100010 is not in M.

G(M) has two components. Each requires two recombinations, but this combined network needs only three.

G(L) has one component. The addition of sequence 100010 reduces the number of components from 2 to 1.

[Figure: G(M) for the original data, on sites 1-6.] Two components, so two blobs; each blob requires two recombinations, by the HK lower bound theorem, so a fully-decomposed network needs at least four recombinations.

[Figure: G(L), on sites 1-6.] G(L) is created from the original data and the addition of the new interior sequence 100010. G(L) has only one connected component, compared to two components for G(M).

A Practical Sufficient Condition

If M can be derived on a network N in which every edge contains at most one site, and every node is labeled with a sequence in M, then there is a fully-decomposed network for M which minimizes the number of recombinations over all possible networks for M.

Another Practical Sufficient Condition

If M can be derived on a network N where the number of recombinations equals the (poly-computable) Haplotype Lower Bound, then there is a fully-decomposed network for M which minimizes the number of recombinations over all possible networks.

Theorem: For any K, there is a dataset where the best fully-decomposed network uses K recombinations more than optimal.

In that construction, the ratio of the number of recombinations in the best fully-decomposed network to the optimal is constant as K grows.

Open Question: Construct examples that show that the ratio can be arbitrarily large.

A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem

Yufeng Wu and Dan Gusfield

UC Davis

COCOON’07 July 16, 2007

History Bound (Myers & Griffiths 2003)

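M:
000
100
010
011
111

Iterate the following operations:
1. Remove a column with a single 0 or 1
2. Remove a duplicate row
3. Remove any row

History bound: the minimum number of type-3 operations needed to reduce the matrix to empty.

Example reduction of M, using one type-3 operation:
000, 100, 010, 011, 111
-> type 3 (remove row 111): 000, 100, 010, 011
-> type 1 (remove column 3): 00, 10, 01, 01
-> type 2 (remove a duplicate row): 00, 10, 01
-> Empty.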
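A brute-force sketch of the history bound computation for small matrices. Several details here are assumptions rather than statements from the slides: type-1 and type-2 operations are applied greedily until neither applies, a column counts as removable when either state appears at most once, the matrix counts as "empty" once at most one row remains, and all function names are hypothetical. The recursion memoizes on the reduced set of rows, so its cost grows exponentially with the number of rows.

```python
from functools import lru_cache

def clean(rows):
    """Apply type-1 (column) and type-2 (duplicate row) operations
    until neither applies; return the rows as a sorted tuple."""
    rows = list(dict.fromkeys(rows))              # type 2: drop duplicates
    while rows and rows[0]:
        n = len(rows)
        # keep only columns where both states appear at least twice
        keep = [j for j in range(len(rows[0]))
                if 2 <= sum(r[j] == "1" for r in rows) <= n - 2]
        if len(keep) == len(rows[0]):
            break                                 # no type-1 operation applies
        rows = list(dict.fromkeys("".join(r[j] for j in keep) for r in rows))
    return tuple(sorted(rows))

@lru_cache(maxsize=None)
def _min_type3(rows):
    """Minimum number of type-3 operations for an already cleaned matrix."""
    if len(rows) <= 1:
        return 0
    return 1 + min(_min_type3(clean(rows[:i] + rows[i + 1:]))
                   for i in range(len(rows)))

def history_bound(matrix):
    return _min_type3(clean(matrix))

# The 5-row example from the slide: one type-3 operation suffices.
print(history_bound(["000", "100", "010", "011", "111"]))  # 1
```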

Graphical interpretation of history bound (HistB)

• Each operation in the history computation corresponds to an operation that deconstructs the optimal, but unknown ARG.

• We deconstruct the optimal ARG by removing tree parts as long as possible; then remove an exposed recombination node; repeat.

• Removing an exposed recombination node in the ARG corresponds to a single type-3 operation. So when deconstructing the optimal ARG, the number of recombination nodes = number of type-3 operations.

• Since the optimal ARG is unknown, the history bound is the minimum number of type-3 operations needed to make the matrix empty.

Operations on M correspond to operations on the optimal ARG

[Figure: at each step, an ARG with mutations at sites 1-5 on its edges and one recombination node (crossover point 2, parents marked p and s) is shown next to the current matrix M.]

M (sites 12345):
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 00101
g: 00101

Type-2 operation (duplicate row g removed), sites 12345:
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 00101

Type-1 operations (columns 2 and 5 removed), sites 134:
a: 001
b: 101
c: 010
d: 110
e: 010
f: 010

Type-2 operations (duplicate rows e and f removed), sites 134:
a: 001
b: 101
c: 010
d: 110

Type-3 operation (row d removed), sites 134:
a: 001
b: 101
c: 010

Then three more Type-1 operations fully reduce M and the ARG.

History bound

• Initially required trying all n! permutations of the rows to choose the type-3 operations.

• The bound can be computed by DP in O(2^n) time (Bafna, Bansal).

• On datasets where it can be computed, the history bound is observed to be higher than (or equal to) all studied lower bounds (about ten of them).

• There is no static definition for what the history bound is -- it is only defined by the algorithms that compute it! The work in this part of the talk comes out of an attempt to find a simple static definition.

Why a static definition matters

• We want a definition of what is being computed, independent of how it is computed, so that we can reason about it and find alternative ways to compute or approximate it.

• For example, with no static definition of the history bound, we don’t know how to formulate an integer linear program to compute it.

Intro. to Forest Bound: Decompose an Optimal ARG to A Forest of Trees, removing recombination edges

An ARG with three recombinations

After removing recombination edges, four trees result.

The number of trees is precisely the number of recombinations plus one

Idea behind the Forest Bound (FB)

Each tree created in this way contains at most one occurrence of any site, and each site occurs in at most one of the trees. So the trees form a forest of related perfect phylogenies.

Forest Bound

Given a set of sequences M, partition M into the fewest subsets so that each subset of sequences can be derived on a tree, where each site occurs at most once in the forest of trees. The number of trees, minus one, is a valid lower bound on Rmin.

Comparing the Forest Bound (FB) to:

• History Bound (HistB)

• Optimal Haplotype Bound (OhapB): The currently best lower bound that can be computed in practice for biological data.

• Theorem: On any data, OhapB <= FB <= HistB. On some data, OhapB < FB < HistB. Thus the FB is the highest lower bound with a static definition.

Computing the Forest Bound is NP-Hard

• Optimal haplotype bound is quite good, but NP-hard to compute.

• If the forest bound were efficiently computable, we would not need to use the optimal haplotype bound at all.

• Unfortunately, the forest bound is NP-hard to compute.

• Reduction from Exact-cover-by-3 sets.

Integer Programming Formulation for the Forest Bound

• For sequences with m sites, consider the hypercube of all possible 2^m sequences.

• Minimizing F is equivalent to reducing the number of Steiner nodes in the forests.

• We also need to ensure the edge linking two nodes in a tree is only labeled with columns that do not appear in other trees.

• Can easily incorporate the missing data in the input.

• The IP formulation has exponential size, but practical when the number of columns is relatively small.

Empirical Results

• On randomly generated datasets with 15 rows and 7 columns, FB > OhapB on 10% of the data. On more biologically meaningful data (generated with the simulation program ms), however, OhapB = FB more often.

• On datasets generated by ms with missing entries, FB more often outperforms an approximate optimal Rh bound:

– 30 rows and 7 columns and 30% missing entries: FB was strictly larger in 8% of the data.

– When the level of missing entries is lower, the approx. OhapB matches the FB more often.

Open Problem

Find a static definition of the history bound, one that can be translated into an objective function independent of any algorithm; one that can be solved by ILP, for example.

Papers and software are at:

wwwcsif.cs.ucdavis.edu/~gusfield

Thank you.
