A Separate Analysis Approach to the Reconstruction of Phylogenetic Networks Luay Nakhleh Department of Computer Sciences UT Austin
Jan 13, 2016
A Separate Analysis Approach to the Reconstruction of Phylogenetic Networks
Luay NakhlehDepartment of Computer Sciences
UT Austin
Who’s Involved
– UT CS: Tandy Warnow, Luay Nakhleh– UT BIO: Randy Linder – UNM CS: Bernard Moret
Why Networks?
• Lateral gene transfer (LGT)– Ochman estimated that 755 of 4,288 ORF’s in
E.coli were from at least 234 LGT events
• Hybridization– Estimates that as many as 30% of all plant
lineages are the products of hybridization– Fish– Some frogs
Phylogenetic Networks
• Rooted, directed, acyclic graphs that actually model the evolutionary process
• “tree” nodes and “network” nodes
• Time constraints
Separate Analysis
• Analyze individual genes separately
• Reconcile the resulting phylogenies
• As opposed to combined analysis in which the datasets are combined (via concatenation) and the combined dataset is then analyzed
SPR Distances Among Gene Trees
A B C D E
A B C D E A B C D E
SPR Distance 1
Maddison’s Method
Given two gene datasets
• Construct two gene trees T1 and T2
• If SPR(T1,T2)=0– Return a tree
• If SPR(T1,T2)=1– Return a network with one reticulation event
Open problem: extend to reconstructing a network with m reticulation events
Challenges
(1) Computational
– Computing SPR distances is of unknown computational complexity (probably hard)
Solving the Computational Challenge
• Galled-networks: reticulation events are independent
• For two gene trees T1 and T2 on n leaves we can– Decide whether SPR(T1,T2)=m in O(mn)
time, and – Construct network N from T1 and T2 in O(mn)
time
Challenges
(2) Systematic
– Obtaining the correct gene trees in practice is very hard (due to missing data, inaccuracy of tree reconstruction methods, wrong assumptions, etc.)
Solving the Systematic Challenge: Our Method SpNet
Given the sequences of two genes I & II on a set of species
• Run MP or ML on gene I and obtain a set U1 of trees, represented by its consensus tree t1
• Run MP or ML on gene II and obtain a set U2 of trees, represented by its consensus tree t2
• Find binary trees T1 and T2, that refine t1 and t2, respectively, and such that SPR(T1,T2)=1
• Build network N from T1 and T2
SpNet: Running Time
• We have a linear-time algorithm for the single hybrid case (implementation and experimental results are available as well)
• We are working on the general case of arbitrary number of reticulation events
Experimental Study
• Generated random networks on 10 and 20 taxa, with 0, 1, and 2 hybrids
• Evolved sequences under the GTR+Gamma model of evolution with invariant sites
• We studies the topological accuracy based on the splits defined by the model and inferred network
Evaluation Criteria
• Detection Quality– How often did the method infer the correct
number of hybrids in the model phylogeny?
• Reconstruction Quality– What is the topological accuracy of the
inferred phylogeny?
Methods
• SpNet(i): Our method where we contract i edges
• NNet: The method of Bryant and Moulton
• NJ
Detection Quality of SpNetModel Phylogeny: 20-taxon Tree
Detection Quality of SpNetModel Phylogeny: 20-taxon 1-hybrid network
Detection Quality of SpNetModel Phylogeny: 20-taxon 2-hybrid network
Reconstruction QualityModel Phylogeny: 20-taxon tree
Reconstruction QualityModel Phylogeny: 20-taxon 1-hybrid network
Reconstruction QualityModel Phylogeny: 20-taxon 1-hybrid network
Conclusions
• Considering a set of “good” trees rather than a single optimal tree is advantageous in network reconstruction
• Separate analysis approaches outperform combined analysis approaches
Ongoing research
• Using other techniques for obtaining unresolved trees (e.g., Bayesian analyses, bootstrapping, etc.)
• Detection vs. reconstruction – visualization and clustering techniques may also be useful (collaboration with St John)
• Refining unresolved networks
• DCM-like network reconstruction