Large-Scale Phylogenetic Analysis
Tandy Warnow
Associate Professor, Department of Computer Sciences
Graduate Program in Evolution and Ecology
Co-Director, The Center for Computational Biology and Bioinformatics
The University of Texas at Austin
Hence NJ is statistically consistent for many models of evolution.
But what about performance on finite sequence lengths?
We focus on performance on finite sequence lengths
Absolute fast convergence vs. exponential convergence
General Markov (GM) Model
• A GM model tree is a pair $(T, \mathcal{M})$ where
– $T$ is a rooted binary tree.
– $\mathcal{M} = \{M(e) : e \in E(T)\}$, and $M(e)$ is a stochastic substitution matrix with $\det(M(e)) \neq 0, \pm 1$.
– The sequence at the root of $T$ is drawn from a uniform distribution.
– The rates of evolution across the sites can be drawn from a fixed distribution.
• GM contains models like the Jukes-Cantor (JC) and Kimura 2-Parameter (K2P) models.
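As a concrete instance of an edge matrix in the GM model, the sketch below builds a Jukes-Cantor substitution matrix for one edge and checks the two conditions above: every row is a probability distribution, and the determinant is none of 0, 1, -1. The function names and the parameter `p` (the probability of changing to each specific other nucleotide) are illustrative assumptions, not notation from the slides.

```python
def jukes_cantor_matrix(p):
    """4x4 Jukes-Cantor matrix: off-diagonal entries p, diagonal entries 1 - 3p."""
    if not 0.0 <= p <= 0.25:
        raise ValueError("p must lie in [0, 1/4]")
    return [[1.0 - 3.0 * p if i == j else p for j in range(4)] for i in range(4)]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def is_valid_gm_edge_matrix(matrix, tol=1e-9):
    """Check stochasticity and det(M) not in {0, 1, -1}."""
    rows_ok = all(abs(sum(row) - 1.0) < tol and min(row) >= 0.0 for row in matrix)
    d = determinant(matrix)
    det_ok = all(abs(d - bad) > tol for bad in (0.0, 1.0, -1.0))
    return rows_ok and det_ok
```

For the Jukes-Cantor matrix the determinant is $(1 - 4p)^3$, so `p = 0` (no change, determinant 1) and `p = 0.25` (fully randomized, determinant 0) are exactly the degenerate cases the determinant condition excludes.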
Absolute Fast Convergence
• Let $f, g > 0$. Define $\lambda(e) = -\log|\det(M(e))|$. We parameterize the GM model:
$GM_{f,g} = \{(T, \mathcal{M}) \in GM : \forall e \in E(T),\ f \le \lambda(e) \le g\}$
• A phylogenetic reconstruction method $\Phi$ is absolute fast-converging (AFC) for the GM model if for all positive $f, g, \varepsilon$ there is a polynomial $p$ such that for all $(T, \mathcal{M}) \in GM_{f,g}$, on a set $S$ of $n$ sequences of length at least $p(n)$ generated on $T$, we have
$\Pr[\Phi(S) = T] > 1 - \varepsilon$
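To make the edge-weight bounds concrete, the sketch below evaluates the edge weight $-\log|\det(M(e))|$ for Jukes-Cantor edges and tests whether every edge falls between two bounds `f` and `g`. It assumes the Jukes-Cantor special case, where $\det(M(e)) = (1 - 4p)^3$ for per-nucleotide change probability `p`; the function names and the list-of-probabilities representation of a model tree are hypothetical.

```python
import math


def jc_edge_weight(p):
    """Edge weight -log|det M(e)| for a Jukes-Cantor edge, using det = (1 - 4p)^3."""
    if not 0.0 < p < 0.25:
        raise ValueError("need 0 < p < 1/4 so that 0 < det M(e) < 1")
    return -3.0 * math.log(1.0 - 4.0 * p)


def in_gm_fg(edge_probs, f, g):
    """True if f <= weight(e) <= g holds for every edge of the model tree."""
    return all(f <= jc_edge_weight(p) <= g for p in edge_probs)
```

A tree with one very short edge (for example `p = 0.001`) fails the lower bound `f`, which is the situation in which more sequence data is needed to resolve that edge.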
Theoretical Comparison of Early AFC Methods to NJ
• Theorem 1 [Warnow et al. 2001]: DCMNJ+SQS is absolute fast converging for the GM model.
• Theorem 2 [Csűrös 2001]: HGT+FP is absolute fast converging for the GM model.
• Theorem 3 [Atteson 1999]: NJ is exponentially converging for the GM model (but is not known to be AFC).
DCM-Boosting [Warnow et al. 2001]
• DCM+SQS is a two-phase procedure which reduces the sequence length requirement of methods.
[Diagram: exponentially converging method → DCM → SQS → absolute fast converging method]
• DCMNJ+SQS is the result of DCM-boosting NJ.
Experimental Comparison of Early AFC Methods to NJ
Measured Distance vs. Actual Number of Events
[Figure panels: Breakpoint Distance; Inversion Distance]
120 genes, inversion-only evolution
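For readers unfamiliar with the breakpoint distance, the following is a minimal sketch of how it can be computed for two signed gene orders. It assumes circular genomes with the same gene content and no duplicated genes; the function names are illustrative and are not taken from any of the software cited in these slides.

```python
def adjacencies(genome):
    """Set of adjacencies of a circular signed gene order, in canonical form.

    The adjacency (a, b) read in one direction is the same as (-b, -a) read in
    the other, so each adjacency is stored as the smaller of the two tuples.
    """
    n = len(genome)
    result = set()
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        result.add(min((a, b), (-b, -a)))
    return result


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2."""
    if sorted(map(abs, g1)) != sorted(map(abs, g2)):
        raise ValueError("genomes must contain the same genes")
    return len(adjacencies(g1) - adjacencies(g2))
```

A single inversion changes at most two adjacencies, which is why the measured breakpoint distance underestimates the number of events once several events overlap.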
Generalized Nadeau-Taylor Model
• Three types of events:
– Inversions
– Transpositions
– Inverted Transpositions
• Events of the same type are equiprobable
• Probabilities of the three types have a fixed ratio: Inv : Trp : Inv.Trp = $(1-\alpha-\beta) : \alpha : \beta$
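The model can be simulated with a short sketch like the one below. It assumes the three event types occur with probabilities $(1-\alpha-\beta)$, $\alpha$, and $\beta$, treats the genome as a signed linear list for simplicity, and picks segment endpoints uniformly so that events of the same type are equally likely. All function and parameter names are hypothetical.

```python
import random


def inversion(genome, i, j):
    """Reverse genome[i:j] and flip the sign of every gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k, inverted=False):
    """Move segment genome[i:j] past genome[j:k]; optionally invert it as well."""
    moved = genome[i:j]
    if inverted:
        moved = [-g for g in reversed(moved)]
    return genome[:i] + genome[j:k] + moved + genome[k:]


def random_event(genome, alpha, beta, rng):
    """Apply one random event: inversion, transposition, or inverted transposition."""
    n = len(genome)
    u = rng.random()
    if u < 1.0 - alpha - beta:
        i, j = sorted(rng.sample(range(n + 1), 2))
        return inversion(genome, i, j)
    i, j, k = sorted(rng.sample(range(n + 1), 3))
    return transposition(genome, i, j, k, inverted=(u >= 1.0 - beta))


def evolve(genome, num_events, alpha, beta, seed=None):
    """Apply num_events random events to a copy of the genome."""
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("alpha and beta must be non-negative with alpha + beta <= 1")
    rng = random.Random(seed)
    genome = list(genome)
    for _ in range(num_events):
        genome = random_event(genome, alpha, beta, rng)
    return genome
```

Setting `alpha = beta = 0` gives inversion-only evolution, and `alpha = beta = 1/3` makes the three event types equiprobable.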
Estimating True Evolutionary Distances for Genomes
Given fixed probabilities for each type of event, we estimate the expected breakpoint distance after k random events:
• Approx-IEBP [Wang, Warnow 2001]
– Polynomial-time closed-form approximation to the expected breakpoint distance
– Proven error bound
• Exact-IEBP [Wang 2001]
– Exact, recursive solution for the expected breakpoint distance
– Polynomial-time but slower than Approx-IEBP
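The idea behind these estimators, relating an observed breakpoint distance to the number of events expected to produce it, can be illustrated by brute-force simulation. The sketch below is not Approx-IEBP or Exact-IEBP, whose formulas are not reproduced in these slides; it is a Monte Carlo stand-in that assumes inversion-only evolution on a signed linear genome, and every function name in it is hypothetical.

```python
import random


def breakpoints(genome):
    """Breakpoints of a signed linear gene order relative to the identity 1..n.

    The genome is framed by 0 and n + 1; an adjacency (a, b) is conserved
    exactly when b == a + 1.
    """
    framed = [0] + list(genome) + [len(genome) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


def random_inversion(genome, rng):
    """Invert a uniformly chosen non-empty segment."""
    i, j = sorted(rng.sample(range(len(genome) + 1), 2))
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def mean_breakpoints(n, k, trials=200, seed=0):
    """Simulated mean breakpoint distance after k random inversions on n genes."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        genome = list(range(1, n + 1))
        for _ in range(k):
            genome = random_inversion(genome, rng)
        total += breakpoints(genome)
    return total / trials


def estimate_events(observed, n, max_k, trials=200, seed=0):
    """Number of events k whose simulated mean distance is closest to `observed`."""
    return min(range(max_k + 1),
               key=lambda k: abs(mean_breakpoints(n, k, trials, seed) - observed))
```

A closed-form or recursive expression for the expected distance replaces the inner simulation loop, which is what makes the estimators above practical.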
Estimating True Evolutionary Distances for Genomes (cont.)
Estimating the expected inversion distance:
• EDE [Moret, Wang, Warnow, Wyman 2001]
– Closed-form formula based upon an empirical estimation of the expected inversion distance after k random events (based upon 120 genes and inversion-only evolution, but robust to errors in the model).
– Polynomial time, fastest of the three.
Goodness of fit for Approx-IEBP
• 120 genes
• Inversion-only evolution (similar performance under other models)
• EDE and Exact-IEBP have similar performance
Absolute Difference
• 120 genes
• Inversion-only evolution (similar relative performance under other models)
Accuracy of Neighbor Joining Using Distance Estimators
• 120 genes
• Inversion-only evolution
• 10, 20, 40, 80, and 160 genomes
• Similar relative performance under other models
Accuracy of Neighbor Joining Using Distance Estimators
• 120 genes
• All three event types equiprobable
• 10, 20, 40, 80, and 160 genomes
• Similar relative performance under other models
Summary of Genomic Distance Estimators
• Statistically based estimation of genomic distances improves NJ analyses
• Our IEBP estimators assume knowledge of the probabilities of each type of event, but are robust to model violations
• NJ(EDE) outperforms NJ using the other estimators, under all models studied
• Accuracy is very good, except when very close to saturation
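Since every result above feeds a distance matrix into NJ, a compact sketch of neighbor joining may help show where an estimator such as EDE or IEBP plugs in: it only changes the entries of `dist`. This is the textbook formulation of the algorithm, not the implementation used in these experiments; the function signature and the `internal0`, `internal1`, ... node names are assumptions made for illustration.

```python
from itertools import combinations


def neighbor_joining(taxa, dist):
    """Return the NJ tree as a list of (node, node, edge_length) tuples.

    taxa: list of leaf names; dist: nested dict with dist[a][b] == dist[b][a].
    """
    d = {a: dict(dist[a]) for a in taxa}  # working copy of the distances
    nodes = list(taxa)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[a] is the total distance from a to all other active nodes.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Join the pair minimizing Q(i, j) = (n - 2) d(i, j) - r(i) - r(j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        u = f"internal{next_id}"
        next_id += 1
        len_i = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        edges += [(i, u, len_i), (j, u, len_j)]
        # Distances from the new internal node to every remaining node.
        d[u] = {}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges
```

When the input distances are exactly additive on a tree, this procedure returns that tree with its edge lengths; the accuracy results above concern how well it does when the distances are only estimates.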
Maximum Parsimony on Rearranged Genomes (MPRG)
• The leaves are rearranged genomes.
• Find the tree that minimizes the total number of rearrangement events
[Figure: leaf genomes A, B, C, D; a tree on these leaves with internal nodes E and F and edge lengths 3, 6, 2, 3, 4; total length = 18]
GRAPPA [Bader et al., PSB’01]
(Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms)
Reimplementation of BPAnalysis [Blanchette et al. 1997] for the Breakpoint Phylogeny problem.
• Uses algorithm engineering to improve performance.
• Improves the algorithm by reducing the number of tree length evaluations (evaluating the length of a fixed tree is NP-hard).
Campanulaceae
Analysis of Campanulaceae
Using GRAPPA v1.1 on the 512-processor Los Lobos Supercluster machine:
2 minutes = 100 million-fold speedup (200,000-fold speedup per processor)
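The per-processor figure follows from dividing the overall speedup by the number of processors, with the result rounded:

```latex
\frac{10^{8}}{512} \approx 1.95 \times 10^{5} \approx 200{,}000
```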
Consensus of 216 MP Trees
Strict consensus of 216 trees; 6 out of 10 internal edges recovered.
[Figure: strict consensus tree on the taxa Trachelium, Campanula, Adenophora, Symphandra, Legousia, Asyneuma, Triodanus, Wahlenbergia, Merciera, Codonopsis, Cyananthus, Platycodon, and Tobacco]
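A strict consensus keeps only the groupings present in every input tree. The sketch below shows one way to compute it, assuming rooted trees written as nested tuples of leaf names (for instance, rooted at an outgroup); the representation and function names are illustrative choices, and the example taxa in the usage note are placeholders rather than the Campanulaceae trees.

```python
def clades(tree):
    """Return (leaf set, set of non-trivial clades) for a nested-tuple rooted tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):  # a leaf
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = visit(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return all_leaves, found


def strict_consensus_clades(trees):
    """Clades that appear in every input tree."""
    leaf_sets, clade_sets = zip(*(clades(t) for t in trees))
    if len(set(leaf_sets)) != 1:
        raise ValueError("all trees must share one leaf set")
    return set.intersection(*clade_sets)
```

For example, the trees `((("A", "B"), "C"), ("D", "E"))` and `((("A", "B"), "D"), ("C", "E"))` share only the clade {A, B}. A fully resolved unrooted tree on 13 taxa has 13 - 3 = 10 internal edges, which is the denominator in the "6 out of 10" figure above.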
Future Work
• New focus on Rare Genomic Changes
– New data
– New models
– New methods
• New techniques for large scale analyses
– Divide-and-conquer methods
– Non-tree models
– Visualization of large trees and large sets of trees
Acknowledgements
• Funding: The David and Lucile Packard Foundation, The National Science Foundation, and Paul Angello
• Collaborators:
– Robert Jansen (U. Texas)
– Bernard Moret, David Bader, Mi-Yan (U. New Mexico)
– Daniel Huson (Celera)
– Katherine St. John (CUNY)
– Linda Raubeson (Central Washington U.)
– Luay Nakhleh, Usman Roshan, Jerry Sun, Li-San Wang, Stacia Wyman (Phylolab, U. Texas)
Phylolab, U. Texas
Please visit us at http://www.cs.utexas.edu/users/phylo/