Understanding sets of trees CS 394C September 10, 2009

Jan 19, 2016


Molly Hawkins
Page 1: Understanding sets of trees CS 394C September 10, 2009.

Understanding sets of trees

CS 394C

September 10, 2009

Page 2: Understanding sets of trees CS 394C September 10, 2009.

Basic challenge

• Phylogenetic analyses are sometimes based upon a single marker, but often based upon many markers

• Each marker can be analyzed separately, or the entire set can be combined into one “super-matrix”

• Each matrix (each dataset) can result in many trees (almost no matter how you analyze the matrix)

What to do with huge numbers of trees?

Page 3: Understanding sets of trees CS 394C September 10, 2009.

What to do?

• How to estimate evolutionary history from many trees

• How to efficiently store large sets of trees

• How to enable efficient queries of the set of trees

Page 5: Understanding sets of trees CS 394C September 10, 2009.

First, a few questions:

• Why are gene trees different from the species tree?

• Why are estimated gene trees different from the true gene tree?

• Under what conditions is the true evolutionary history not a tree? (i.e., what is “reticulation”?)

Page 6: Understanding sets of trees CS 394C September 10, 2009.

Reticulation

• Evolutionary histories can be reticulate (meaning non-treelike):
  – Horizontal Gene Transfer (HGT)
  – Hybrid speciation
  – Recombination

• Most phylogeny estimation methods produce trees.

• Good resource about reticulate phylogenies: book chapter by Luay Nakhleh (see 394C webpage for the link)

Page 7: Understanding sets of trees CS 394C September 10, 2009.

• We will assume that all evolutionary histories are treelike for the remainder of today’s presentation.

• Later in the course we’ll discuss reticulate evolution…

Page 8: Understanding sets of trees CS 394C September 10, 2009.

Estimated Gene Trees can differ from Species Trees

• Biological reasons:
  – Deep coalescent events (alleles)
  – Gene duplication and loss (gene families)

• Computational reasons:
  – Insufficient time
  – Poor methods (e.g., UPGMA)
  – Poor models (e.g., ML using Jukes-Cantor)

• Data issues:
  – Insufficient data (meaning not enough sites)
  – Poor alignments

Page 9: Understanding sets of trees CS 394C September 10, 2009.

Examples of problems

When true gene trees can differ from species tree:

• Given a collection of gene trees, find a species tree that minimizes the number of “deep coalescent” events

When true gene trees should equal the species tree:

• Given a collection of gene trees, find a species tree that minimizes the total distance to the gene trees
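
The "total distance" objective can be sketched as follows. The slide does not name a distance, so this sketch assumes the Robinson-Foulds distance on rooted trees (the number of clusters found in exactly one of the two trees); trees are written as nested tuples of leaf names, and the function names are illustrative only.

```python
def clusters(tree):
    """Non-trivial clusters (leaf sets below internal nodes, root excluded)
    of a rooted tree given as nested tuples of leaf names."""
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    found.discard(leaves(tree))  # drop the cluster containing every leaf
    return found


def rf_distance(t1, t2):
    """Assumed distance: clusters present in exactly one of the two trees."""
    return len(clusters(t1) ^ clusters(t2))


def total_distance(candidate, gene_trees):
    """Objective value of a candidate species tree."""
    return sum(rf_distance(candidate, g) for g in gene_trees)


gene_trees = [
    (("a", "b"), ("c", "d")),
    (("a", "b"), ("c", "d")),
    (("a", "c"), ("b", "d")),
]
print(total_distance(gene_trees[0], gene_trees))  # 4
print(total_distance(gene_trees[2], gene_trees))  # 8
```

This only scores a given candidate; searching over all candidate species trees is the hard part of the problem.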

Page 10: Understanding sets of trees CS 394C September 10, 2009.

When gene trees can differ from species tree

Software/Algorithms for deep-coalescent (see PhyloNet from Nakhleh’s webpage at Rice)

GLASS (Roch and Mossel) - distance-based

MDC (Than and Nakhleh) - parsimony

STEM (Kubatko) - ML

BEST (Liu et al.) - Bayesian

BUCKy (Ané et al.) - Bayesian

Software/Algorithms for duplication-loss

NOTUNG (Durand)

Duptree (Bansal et al.)

Hallett and Lagergren - algorithms/complexity

Page 11: Understanding sets of trees CS 394C September 10, 2009.

When gene trees should equal the species tree

• The problem here is that estimated gene trees can differ from the true gene trees.

• Although the problem is “simple”, it is still interesting -- computationally and mathematically.

• Plus, we can still make novel contributions.

Page 12: Understanding sets of trees CS 394C September 10, 2009.

The very simplest problem

Easiest case:

• One species tree, true gene trees will agree with the species tree

• Estimated trees are on the full set of taxa

Approaches:

• Consensus methods: return a tree on the entire set S of taxa summarizing the input trees

• Agreement methods: return a tree on a subset of the taxa on which the trees agree

• Clustering, then consensus/agreement

Page 13: Understanding sets of trees CS 394C September 10, 2009.

Consensus methods

• These are the most common ways of analyzing datasets of trees

• Examples:
  – Strict consensus
  – Majority consensus
  – Greedy consensus (aka “extended majority”)
  – Others less frequently used include: Gordon’s, Adams, the Strict Consensus Supertree, Local Consensus methods, and more.

• See the survey paper by David Bryant for some of these
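
A minimal sketch of the strict and majority consensus, assuming rooted trees on the same taxon set, each represented simply as a set of clusters (frozensets of leaf names). The representation and function names are assumptions made for illustration. The strict consensus keeps clusters found in every input tree; the majority consensus keeps clusters found in more than half of them.

```python
from collections import Counter


def cluster_counts(trees):
    """How many input trees contain each cluster."""
    return Counter(cluster for tree in trees for cluster in tree)


def strict_consensus(trees):
    """Clusters present in every input tree."""
    counts = cluster_counts(trees)
    return {c for c, n in counts.items() if n == len(trees)}


def majority_consensus(trees):
    """Clusters present in strictly more than half of the input trees."""
    counts = cluster_counts(trees)
    return {c for c, n in counts.items() if 2 * n > len(trees)}


ab, abc, abd = frozenset("ab"), frozenset("abc"), frozenset("abd")
trees = [{ab, abc}, {ab, abc}, {ab, abd}]  # three trees on taxa a, b, c, d
print(strict_consensus(trees) == {ab})         # True
print(majority_consensus(trees) == {ab, abc})  # True
```

Both functions return cluster sets rather than drawn trees; the clusters kept by either rule are pairwise compatible, so they always define a tree.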

Page 14: Understanding sets of trees CS 394C September 10, 2009.

Simplest problems, cont.

• “Agreement” methods return trees on subsets of S, on which the trees are the same (or compatible)
  – MAST: maximum agreement subtree (used in practice, sometimes)
  – MCST: maximum compatible subtree (Ganapathy et al., not used in practice)

• The difference between these is how polytomies are handled

Page 15: Understanding sets of trees CS 394C September 10, 2009.

Soft vs. hard polytomies

• Polytomy: node of high degree (greater than three for an unrooted tree)

• Polytomies arise in estimations when consensus methods are used

• Polytomies also arise when contracting short branches in estimated trees

• Polytomies can be “hard” (representing true radiations) or “soft” (representing lack of information)

Page 16: Understanding sets of trees CS 394C September 10, 2009.

Compatible source trees

• Estimated trees can be “compatible” when we interpret polytomies as “soft”

• “Compatible” means that there is a tree which is a common refinement.

• Example: 123|456, 12|3456, 1235|46.

• We can compute the compatibility tree (when it exists) in O(nk) time, where n=|S| and there are k source trees
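
The example bipartitions can be checked with a short sketch. It relies on the standard fact that a set of bipartitions of the same taxon set fits on one tree exactly when every pair is compatible, and two bipartitions A|B and C|D are compatible when at least one of A∩C, A∩D, B∩C, B∩D is empty. This is a simple pairwise test, not the O(nk) algorithm mentioned above; single-character taxon names and the helper names are assumptions.

```python
from itertools import combinations


def parse(split):
    """Turn a string such as '123|456' into a pair of taxon sets."""
    left, right = split.split("|")
    return frozenset(left), frozenset(right)


def pair_compatible(s1, s2):
    """True if at least one of the four side intersections is empty."""
    return any(not (x & y) for x in s1 for y in s2)


def all_compatible(splits):
    """True if every pair of bipartitions is compatible."""
    return all(pair_compatible(s, t) for s, t in combinations(splits, 2))


example = [parse(s) for s in ("123|456", "12|3456", "1235|46")]
print(all_compatible(example))  # True

conflict = [parse(s) for s in ("12|3456", "13|2456")]
print(all_compatible(conflict))  # False
```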

Page 17: Understanding sets of trees CS 394C September 10, 2009.

Computational complexity

• Most consensus methods (which return a tree on the entire set S of taxa) are polynomial time.

• Most “agreement methods” (which return a tree on the largest subset of the taxa on which the source trees “agree”) are based upon NP-hard problems. Some (e.g., MAST) have fixed-parameter polynomial time solutions.

Page 18: Understanding sets of trees CS 394C September 10, 2009.

Supertree problems

• Realistic complexity: not all the source trees are on the same set of taxa.

• Obvious problems:
  – Find the tree on which all the source trees agree (if it exists).
  – Find the tree on which a maximum number of the source trees agree.

• Both are NP-hard.

Page 19: Understanding sets of trees CS 394C September 10, 2009.

Quartet compatibility

• Simple case: all the source trees are on four taxa.

• We ask: does there exist a tree which agrees with all the source trees?

• NP-hard!

Page 20: Understanding sets of trees CS 394C September 10, 2009.

Quartet tree amalgamation

• Given a collection of quartet trees, find a tree which agrees with a maximum number of these quartet trees

• NP-hard, since compatibility is NP-hard

• Hard to approximate, but PTAS if you have a tree on every quartet of taxa (Jiang et al.)
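
The objective being maximized can be sketched as a scoring function. A tree agrees with a quartet tree ab|cd when some edge of the tree separates {a, b} from {c, d}. In this sketch the candidate tree is given by its bipartitions (pairs of taxon sets) and a quartet as a pair of pairs; both representations and the function names are assumptions for illustration.

```python
def agrees(splits, quartet):
    """True if some bipartition of the tree separates the two quartet pairs."""
    (a, b), (c, d) = quartet
    for left, right in splits:
        if {a, b} <= left and {c, d} <= right:
            return True
        if {a, b} <= right and {c, d} <= left:
            return True
    return False


def quartet_score(splits, quartets):
    """Number of input quartet trees the candidate tree agrees with."""
    return sum(agrees(splits, q) for q in quartets)


# Candidate tree with internal edges 12|3456, 123|456, 1235|46
tree = [
    (frozenset("12"), frozenset("3456")),
    (frozenset("123"), frozenset("456")),
    (frozenset("1235"), frozenset("46")),
]
quartets = [
    (("1", "2"), ("4", "6")),  # separated by 12|3456
    (("1", "3"), ("2", "4")),  # no edge separates 13 from 24
    (("3", "5"), ("4", "6")),  # separated by 1235|46
]
print(quartet_score(tree, quartets))  # 2
```

Scoring one candidate is easy; the NP-hardness lies in finding the tree with the best score.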

Page 21: Understanding sets of trees CS 394C September 10, 2009.

Quartet amalgamation algorithms

• Quartet Puzzling (Strimmer and von Haeseler)

• Q* (Berry et al.)

• Quartet Cleaning (Berry et al.)

• Weight Optimization (Ranwez and Gascuel)

• Quartets MaxCut (Snir and Rao)

See also the paper by St. John et al. (linked on the CS 394C webpage), which evaluates early quartet methods

Page 22: Understanding sets of trees CS 394C September 10, 2009.

What about rooted trees?

Given a set of rooted source trees, we ask:

• Is there a tree that is compatible with all the rooted source trees?

Page 23: Understanding sets of trees CS 394C September 10, 2009.

Rooted tree compatibility

• Aho, Sagiv, Szymanski, and Ullman: polynomial time, recursive algorithm:
  – If n=1, return the singleton tree.
  – If n>1, then compute an equivalence relation on the set of taxa as follows.
    • For each rooted triple ((a,b),c) in the set, put a and b in the same equivalence class.
    • Compute transitive closure.
  – If only one equivalence class, reject (set is incompatible). Otherwise, recurse on each equivalence class (using only the triples whose taxa all lie in that class), and return the tree obtained by making all recursively computed trees sibling subtrees.
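
The recursion above can be sketched in Python. This is a minimal sketch under stated assumptions: the input is already a list of rooted triples, each written `(a, b, c)` for ((a,b),c); the result is a nested tuple, or `None` when the triples are incompatible; and the function name `build` and the union-find helper are illustrative choices, not taken from the slides.

```python
def build(taxa, triples):
    """Return a rooted tree (nested tuples) consistent with all triples,
    or None if no such tree exists."""
    taxa = set(taxa)
    if len(taxa) == 1:
        return next(iter(taxa))  # singleton tree

    # Only triples whose three taxa all lie in the current set matter.
    relevant = [t for t in triples if set(t) <= taxa]

    # Union-find gives the transitive closure of "a and b are together".
    parent = {x: x for x in taxa}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _ in relevant:
        parent[find(a)] = find(b)

    classes = {}
    for x in taxa:
        classes.setdefault(find(x), set()).add(x)

    if len(classes) == 1:
        return None  # one equivalence class: the triples are incompatible

    subtrees = []
    for block in classes.values():
        subtree = build(block, relevant)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)  # recursively built trees become siblings


print(build("abcd", [("a", "b", "c"), ("c", "d", "a")]))  # a tree grouping a,b and c,d
print(build("abc", [("a", "b", "c"), ("a", "c", "b")]))   # None
```

The order of children in the returned tuples depends on set iteration order, so the first call may print the two cherries in either order.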

Page 24: Understanding sets of trees CS 394C September 10, 2009.

Subtree compatibility

• If source trees are rooted, then compatibility can be tested in polynomial time. Optimization problems are NP-hard, however.

• If source trees are unrooted, then compatibility is NP-hard. And so optimization problems are also NP-hard.

Page 25: Understanding sets of trees CS 394C September 10, 2009.

Supertree problems, in practice

• In practice, the most frequently used supertree method is MRP, for “Matrix Representation with Parsimony”.

• There are, however, many other supertree methods!

Page 26: Understanding sets of trees CS 394C September 10, 2009.

Many Supertree Methods

• MRP: Matrix Representation with Parsimony (most commonly used)
• weighted MRP
• Min-Cut
• Modified Min-Cut
• Semi-strict Supertree
• MRF
• MRD
• QILI
• SDM
• Q-imputation
• PhySIC
• Majority-Rule Supertrees
• Maximum Likelihood Supertrees
• and many more ...

Page 27: Understanding sets of trees CS 394C September 10, 2009.

MRP

• Idea: take every source tree, and replace it with a matrix of 0, 1, ?.

• Concatenate the matrices.

• Apply Maximum Parsimony.

If all the source trees are compatible, then an exact solution to MRP will return the compatibility trees.
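
The encoding step can be sketched as follows, assuming the usual convention: each internal edge (here, each cluster of a rooted source tree) becomes one column, with 1 for taxa inside the cluster, 0 for other taxa of that source tree, and ? for taxa missing from it. The input format, a list of `(taxa, clusters)` pairs, and the function name are assumptions for illustration; the Maximum Parsimony search itself is not shown.

```python
def mrp_matrix(source_trees):
    """Concatenated MRP matrix as a dict: taxon -> string over '0', '1', '?'."""
    all_taxa = sorted(set().union(*(taxa for taxa, _ in source_trees)))
    rows = {t: [] for t in all_taxa}
    for taxa, clusters in source_trees:
        for cluster in clusters:  # one matrix column per cluster
            for t in all_taxa:
                if t not in taxa:
                    rows[t].append("?")  # taxon absent from this source tree
                elif t in cluster:
                    rows[t].append("1")
                else:
                    rows[t].append("0")
    return {t: "".join(cells) for t, cells in rows.items()}


source_trees = [
    ({"a", "b", "c"}, [{"a", "b"}]),  # tree ((a,b),c)
    ({"b", "c", "d"}, [{"c", "d"}]),  # tree (b,(c,d))
]
for taxon, row in mrp_matrix(source_trees).items():
    print(taxon, row)
# a 1?
# b 10
# c 01
# d ?1
```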

Page 28: Understanding sets of trees CS 394C September 10, 2009.

Homework, due 9/15

• Read two papers (linked on the webpage):
  – St. John et al., about quartet-based methods
  – Moret et al., about sequence-length requirements

• Pick one, write a summary, and include questions

Page 29: Understanding sets of trees CS 394C September 10, 2009.

Question!

• How do you feel about occasionally having class on some Monday or Friday, so we can have guest lectures?