Cyberinfrastructure for Phylogenetic Research: Simulation, Modeling, and Benchmarks
U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson; Yifeng Zheng, Steve Fisher, Sheng Guo, Lisan Wang, Shirley Cohen
U Texas: David Hillis, Lauren Meyers; Eric Miller, Tracy Heath, Derrick Zwickl
NC State: Spencer Muse; Errol Strain
Yale: Paul Turner
and Bernard Moret, Tandy Warnow, Robert Jensen, Randy Linder
Cyberinfrastructure for Phylogenetic Research
Simulation, Modeling, and Benchmarks
U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson
Yifeng Zheng, Steve Fisher, Sheng Guo, Lisan Wang, Shirley Cohen
U Texas: David Hillis, Lauren Meyers
Goal: Develop validated datasets of sufficient complexity and scale to realistically benchmark latest tree algorithms
Problems
• Large-scale simulation is computationally demanding and difficult to reproduce independently
• The model parameter space explodes in combinatorial complexity with increase in model complexity
• Large-scale algorithm test experimental design is extremely difficult to manage
• Branching structure specification is critical but the standard options are limited for very large trees
• Credible simulation model acceptable to the community is difficult to establish
Simulation Design
• Pre-generate a very large dataset (>10^6 positions) over a very large complex tree (>10^6 taxa) using a suite of complex models of evolution
• Store the data in a database
• Retrieve subsets of the data by various sampling schemes
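The retrieval step can be pictured as drawing rows (taxa) and columns (positions) from the stored character matrix. The following is a minimal sketch, assuming the dataset fits in memory as a dict from taxon name to sequence string. The function name `sample_subset` and the uniform random scheme are illustrative assumptions, not the project's database interface.

```python
import random

def sample_subset(alignment, n_taxa, n_sites, seed=None):
    """Return a sub-alignment with n_taxa random taxa and n_sites random columns.

    alignment: dict mapping taxon name -> sequence (all of equal length).
    Sampling is uniform and without replacement (one possible scheme).
    """
    rng = random.Random(seed)
    taxa = rng.sample(sorted(alignment), n_taxa)
    length = len(next(iter(alignment.values())))
    # Keep the chosen columns in their original order.
    sites = sorted(rng.sample(range(length), n_sites))
    return {t: "".join(alignment[t][i] for i in sites) for t in taxa}

# Toy stand-in for the pre-generated dataset.
full = {
    "t1": "ACGTACGTAC",
    "t2": "ACGTTCGTAC",
    "t3": "ACGAACGTAA",
    "t4": "TCGTACGTAC",
}
subset = sample_subset(full, n_taxa=2, n_sites=5, seed=1)
```

Other schemes (for example, sampling whole clades or contiguous blocks of sites) would replace only the two `rng.sample` calls.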
Query Management
• Load queries from the database
• Save queries to the database
• Import queries from a text file
• Export queries to a text file
• Create local queries (i.e., not stored in the database)
• Delete queries from local session and database
• Access query objects through the command line
• Manipulate query objects within Jython scripts
Simulation
• Tree Topology Simulation: Generate the temporal branching structure of populations/species
• Character Simulation: Generate the evolution of sequences/morphology/etc. over the tree generated above
Tree Topology Simulation (Tracy Heath and David Hillis, UT Austin)
Standard Approach:
• Simulate a homogeneous branching process (e.g., pure-birth model)
• Sub-sample from a large homogeneous branching process
Problems:
• Larger trees are self-similar to smaller trees
• Most biologists don’t think trees in simulations “look” like “real” trees
Tree Topology Simulation
• Modified code from Phyl-O-Gen, a tree simulation program (Rambaut)
• Birth-death process
• After a speciation event, the rates of each daughter lineage are mutated
• The new rate is obtained by multiplying the parent rate by a gamma-distributed multiplier centered on 1
• The new rate is accepted in proportion to a prior distribution on birth and death rates
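The rate-mutation scheme above can be sketched as a small event-driven birth-death simulation. This is an illustration under stated assumptions, not the modified Phyl-O-Gen code: the names `propose_rate` and `simulate_tree` are hypothetical, the gamma shape of 20 is arbitrary, and the "prior" is taken to be an unnormalized log-normal-shaped curve around the starting rate, scaled so its peak is 1 and used directly as an acceptance probability.

```python
import math
import random

def propose_rate(parent_rate, base_rate, rng, shape=20.0, prior_sd=1.0):
    """Mutate a rate with a gamma multiplier of mean 1, then accept or reject.

    The acceptance probability is an assumed prior curve centered on
    base_rate (peak value 1). On rejection the parent rate is inherited.
    Rates must be strictly positive.
    """
    candidate = parent_rate * rng.gammavariate(shape, 1.0 / shape)
    z = math.log(candidate / base_rate) / prior_sd
    accept_prob = math.exp(-0.5 * z * z)
    return candidate if rng.random() < accept_prob else parent_rate

def simulate_tree(n_tips, birth=1.0, death=0.2, seed=None, max_restarts=1000):
    """Grow a birth-death tree until n_tips lineages are alive.

    Returns (parent, alive, t): parent maps node id -> parent id,
    alive maps extant lineage id -> (birth, death), t is elapsed time.
    Restarts from scratch whenever the whole process goes extinct.
    """
    rng = random.Random(seed)
    for _attempt in range(max_restarts):
        parent = {0: None}
        alive = {0: (birth, death)}
        next_id, t = 1, 0.0
        while 0 < len(alive) < n_tips:
            ids = list(alive)
            weights = [alive[i][0] + alive[i][1] for i in ids]
            t += rng.expovariate(sum(weights))        # time to next event
            node = rng.choices(ids, weights=weights)[0]
            b, d = alive.pop(node)
            if rng.random() < b / (b + d):            # speciation
                for _child in range(2):
                    parent[next_id] = node
                    alive[next_id] = (propose_rate(b, birth, rng),
                                      propose_rate(d, death, rng))
                    next_id += 1
            # otherwise the lineage goes extinct and stays removed
        if alive:
            return parent, alive, t
    raise RuntimeError("every attempt went extinct")

parent, alive, t = simulate_tree(n_tips=50, seed=3)
```

Because daughter lineages inherit mutated rates, fast-diversifying clades tend to produce more fast-diversifying descendants, which is what makes the resulting trees less balanced than constant-rate trees.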
Tree Shape
[Figure: a balanced tree and an imbalanced tree]
[Figure: example trees labeled with shape statistics N and σ²: N = 2.67, σ² = 0.22; N = 2.83, σ² = 0.81; N = 3.17, σ² = 1.47; N = 3.33, σ² = 2.22. Expectation under the equal rates Markov (ERM) model: E(N) = 2.9]
[Figure: trees with each node labeled by its imbalance I (values 0, 0.5, and 1); weighted mean imbalance (I). Expectation under the equal rates Markov (ERM) model: I = 0.5]
Tree Topology Simulation
• Simulated trees were compared with published phylogenies using measures of tree shape.
  – 200 trees of 10000 taxa under constant rates (standard model)
  – 200 trees of 10000 taxa under variable rates (our model)
• 433 trees were collected from various sources and sorted based on the method used to estimate the phylogeny and the proportion of the ingroup sampled.
• Weighted mean imbalance (I) was used to compare the simulated trees with published trees
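The slides do not spell out how the weighted mean imbalance is computed. The sketch below assumes the commonly used node imbalance I = (B - m)/(M - m), where B is the size of the larger daughter clade of a node with S tips, m = ceil(S/2) and M = S - 1, together with the weights w = 1 for odd S, (S - 1)/S for even S with I > 0, and 2(S - 1)/S for even S with I = 0. Under these assumptions the expected value is 0.5 for ERM trees. Trees are represented as nested 2-tuples, and the function names are hypothetical.

```python
import math

def clade_sizes(tree, out):
    """Return the tip count of tree; record (size, larger daughter size) per node."""
    if not isinstance(tree, tuple):
        return 1
    left, right = (clade_sizes(child, out) for child in tree)
    out.append((left + right, max(left, right)))
    return left + right

def weighted_mean_imbalance(tree):
    """Weighted mean of node imbalance over all nodes with at least 4 tips."""
    nodes = []
    clade_sizes(tree, nodes)
    num = den = 0.0
    for size, big in nodes:
        if size < 4:
            continue  # nodes with 2 or 3 tips can only split one way
        m = math.ceil(size / 2)   # smallest possible larger-daughter size
        M = size - 1              # largest possible larger-daughter size
        imb = (big - m) / (M - m)
        if size % 2 == 1:
            w = 1.0
        elif imb > 0:
            w = (size - 1) / size
        else:
            w = 2 * (size - 1) / size
        num += w * imb
        den += w
    return num / den if den else float("nan")

caterpillar = ("a", ("b", ("c", ("d", ("e", "f")))))
balanced = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
i_caterpillar = weighted_mean_imbalance(caterpillar)
i_balanced = weighted_mean_imbalance(balanced)
```

The recursion is fine for small examples. A tree of 10000 taxa would need an iterative traversal to avoid Python's recursion limit.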
Comparing Trees
[Figure, shown across six slides: weighted mean imbalance (I), axis range 0.4 to 1, plotted against ln(node size), axis range 0 to 9]
Million-taxon Trees
Three trees ranging from simple to complex were simulated
• Equal rates tree
• Variable rates tree
• Variable rates tree with mass extinctions
Multi-layered simulations for character evolution
Estimate statistical parameters for real molecules (e.g., rbcL) using HyPhy, extend the model family to include more discrete rate distributions and positional dependencies, and finally generate a very large tree of 10^6~10^7 taxa using the key molecule models as its basis.
• rbcL model family estimated under codon-specific model (Muse)
• rRNA gene model (including secondary structure; Hillis and Gutell)
[Figure: rbcL site classes, including invariable sites, each with its own parameter set. The parameter symbols were lost in extraction; the surviving values are 2.1, 1.3, (0.1,..,0.2); 1.1, 1.7, (0.3,..,0.2); 0.8, 0.5, (0.1,..,0.5)]
Simulation of complex evolutionary processes
Reflect more complex dynamics
• Heterogeneous rates:
  – lineage and site specific mutation rates
  – genomic context dependent rates
• Phenotypic effects
  – Selection
  – Population interaction
RNA and its secondary structure as a model system for genotype-phenotype evolution
Micro-Macro simulation model (Meyers, Kim)
Generate a population of molecules incorporating a fitness model and speciation process based on RNA folding. Fitness from (1) similarity to known 16S RNA (~67k seqs); (2) similarity to known 16S structure (~200 crystal structures); (3) folding stability
Experimental viral evolution (Turner; non-ITR funding for empirical work)
Use the RNA bacteriophage phi-6 system to generate an experimental phylogeny (~64-taxon tree with host switching and horizontal transfer)
Individual-based simulation (E. Miller and L. Ancel)
[Figure labels: More fit; Different adaptive peaks]
Strategy for macro-evolution
[Diagram labels: mutation, fixation]
Compute probability of fixation of different mutation types using Kimura’s derivations. Draw waiting time for each event from an exponential process
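The strategy above can be sketched with Kimura's diffusion approximation for the fixation probability of a new mutant. The sketch assumes a diploid population with effective size equal to census size N, an initial frequency of 1/(2N), and genic selection with coefficient s. The function names and the example mutation classes are illustrative only.

```python
import math
import random

def fixation_probability(s, N):
    """Fixation probability (1 - exp(-2s)) / (1 - exp(-4Ns)) of a new mutant.

    Falls back to the neutral value 1/(2N) when s is effectively zero and
    to 0 for mutations so deleterious that exp(-4Ns) would overflow.
    """
    if abs(s) < 1e-12:
        return 1.0 / (2 * N)
    if -4.0 * N * s > 700.0:
        return 0.0
    return math.expm1(-2.0 * s) / math.expm1(-4.0 * N * s)

def next_fixation(mutation_types, N, rng):
    """Draw the waiting time to the next fixation and the type that fixes.

    mutation_types: dict name -> (mutation rate per generation, s).
    Each type fixes at rate 2N * mu * P_fix; the waiting time is
    exponential with the summed rate.
    """
    names = list(mutation_types)
    rates = [2 * N * mu * fixation_probability(s, N)
             for mu, s in (mutation_types[n] for n in names)]
    wait = rng.expovariate(sum(rates))
    return wait, rng.choices(names, weights=rates)[0]

# Hypothetical mutation classes: (mutation rate, selection coefficient).
types = {
    "advantageous": (1e-8, 0.01),
    "neutral": (1e-6, 0.0),
    "deleterious": (1e-5, -0.01),
}
wait, fixed = next_fixation(types, N=1000, rng=random.Random(7))
```

Jumping from fixation to fixation in this way avoids simulating every individual in every generation, which is what makes macro-evolutionary time scales tractable.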
Mutations in RNA
?Advantageous
Neutral
Deleterious
Folding energy based fitness model
Assumption: Thermodynamically more stable structure is more fit.
[Figure: two RNA secondary structures labeled -491.07 J/mol and -636.71 J/mol]
2.2 A Free Energy Based Schema
[Diagram: molecules M0 (E0), M1 (E1), M2 (E2), M3 (E3), ..., Mi (Ei), ..., Mn (En)]
• For each ancestral RNA molecule, enumerate all its mutants.
• Compute Ei, the free energy of an RNA molecule Mi
• RNAeval from the Vienna RNA package computes Ei for all possible single mutants of an RNA molecule in 5~6 minutes using one CPU (2 GHz).
• Draw a new descendant molecule according to the convolution of mutation probability and fixation probability from free energy calculations.
• Enumerate the energy (= fitness) landscape around the ancestor and find the minimum (most fit)
• In the descendant, assume that the energy differential to the local minimum is the same as in the ancestor. Sample a new mutation and accept or reject it as a conditional event vis-à-vis the local minimum
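The full-enumeration version of these steps can be sketched as follows. The energy function here is a toy stand-in, not RNAeval: it scores -1 for each complementary pair between a position and its mirror position. The mutation probability is assumed uniform over single mutants, and the fixation term reuses Kimura's formula with a selection coefficient proportional to the energy difference. All names and parameter values are hypothetical.

```python
import math
import random

BASES = "ACGU"
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def toy_energy(seq):
    """Crude hairpin score standing in for a real folding-energy call."""
    n = len(seq)
    return -float(sum((seq[i], seq[n - 1 - i]) in PAIRS for i in range(n // 2)))

def single_mutants(seq):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, old in enumerate(seq):
        for new in BASES:
            if new != old:
                yield seq[:i] + new + seq[i + 1:]

def fixation_weight(delta_e, N=100, scale=0.01):
    """Kimura-style fixation probability with s = -scale * delta_e.

    Lower energy is treated as more fit; N and scale are illustrative.
    """
    s = -scale * delta_e
    if abs(s) < 1e-12:
        return 1.0 / (2 * N)
    return math.expm1(-2.0 * s) / math.expm1(-4.0 * N * s)

def draw_descendant(ancestor, rng, energy=toy_energy):
    """Pick one single mutant with probability proportional to its fixation weight."""
    e0 = energy(ancestor)
    mutants = list(single_mutants(ancestor))
    # Mutation probability is uniform here, so only the fixation term varies.
    weights = [fixation_weight(energy(m) - e0) for m in mutants]
    return rng.choices(mutants, weights=weights)[0]

ancestor = "GGGAAAUCCC"
descendant = draw_descendant(ancestor, random.Random(11))
```

Enumerating all 3L single mutants at every step is the expensive part. The shortcut in the last bullet avoids it by reusing the ancestor's energy differential and testing one sampled mutation at a time.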
New RNA macro simulator
• Can simulate folding-energy dependent evolution efficiently (estimate 30 days for 1 million taxa on 20 CPU 2 GHz cluster)
• Produces secondary structure changes and records history of changes
• Produces indel events and produces alignment history; will output files with indels and the correct alignment
• Parameterized with empirical data statistics (Hillis, Gutell)
Alignment
Top is homologous alignment. Bottom is ClustalW alignment.
First sequence is root RNA, others are randomly chosen leaf RNAs
Statistics from 100 Eukaryote ssRNA
• Rate heterogeneity: 0 (no gamma dist.), gamma=1
• Sample 660 sites from each dataset without replacement
• Call PAUP Hsearch with default settings and time limit = 6 hrs
• Report best parsimony score at each second
• Let B(t) be the best parsimony score at time t; let B(0) be the score of the starting tree
• B is monotonically non-increasing
• Assume we run the heuristic search for 6 hrs. The NPSE is defined as [equation not preserved in this transcript]
Simulation Database
• Scaling and Code Hardening
• Developed New Extensions to Indexing
• Developed and tested with a test suite
Experimental Evolution
• Generated 16 new lineages evolved for 350 generations under heterogeneous conditions
• 50% of whole genomes sequenced