The deep phylogeny problem Using simple models to estimate trees from sparse data sets with faintly relevant signals.

The deep phylogeny problem

Using simple models to estimate trees from sparse data sets with

faintly relevant signals.

Long history of interest in the relationships among

major groups of animals.

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.QuickTime™ and a

TIFF (Uncompressed) decompressorare needed to see this picture.


are needed to see this picture.













Ernst Haeckel 1834-1919

Bateson, W., 1886, The ancestry of the Chordata: Quarterly Journal of Microscopical Science, v. 26, p. 535-571.





Cope, E. D., 1887, The Origin of the Fittest: New York, Appleton & Company.

Strong resurgent interest in late 20th Century with the advent of Molecular Phylogenetics

Most early analyses were based on 18S rRNA.

First influential paper: 1988

Molecular phylogeny of the animal kingdom.

Field et al. Science 239: 748-753

Early enthusiasm suggested 18s sequence comparisons were going to solve all of our problems.

Limitations of Metazoan 18S rRNA Sequence Data: Implications for Reconstructing a Phylogeny of the Animal Kingdom and Inferring the Reality of the Cambrian Explosion.

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.Abouheif, Zardoya & Meyer. 1998

But within 10 years:

At about the same time (mid 1990’s) there was an emerging interest in estimating animal phylogenies from whole MtDNA genome sequences.

The choice was appealing:

But even with large amounts of data some quite controversial groupings emerged - and different mitochondrial genes would often suggest conflicting relationships.

Obviously, it was claimed, we just didn’t have enough data…

•Large amount of sequence (16-18kb).•Reasonably easy to collect (no introns)•Mode of inheritance was well understood.•Almost no problems associated with paralogous comparisons.

In 1998, I published a study with Wes Brown in which we explored the phylogenetic signal in the mitochondrial genome of a group of vertebrates whose phylogenetic relationships were “uncontroversial.”

sea urchin1

sea urchin2

lancelet

carp

trout

frogchicken

fruit fly

nematode 2

mosquito

snail

nematode 1

lamprey

opossum

mouse

rat

cow

blue whale

fin-back whale

Obtained

carp

trout

frog

opossum

mouse

rat

cow

blue whale

fin-back whale

chicken

sea urchin1

sea urchin2

lancelet

nematode 2

nematode 1

fruit fly

mosquito

snail

lampreyQuickTime™ and a

TIFF (LZW) decompressorare needed to see this picture.

The unexpected placement of Lancelet outside (vertebrates + echinoderms) and the grouping of (frog+ chicken+fishes) results from parsimony analyses with strong bootstrap support at all levels of analysis (nucleotides, transversions and amino acids).

Likelihood analysis of the nucleotide data under the 16 canonical models (JC, K2P, HKY, GTR + I +) all failed to yield the expected tree, placing cephalochordates outside (vertebrates+ echinoderms)

1 = 174879.9709

2 = 174903.3324

p = 0.3642

1 = 175160.793

2 = 175175.196

p = 0.5739

1 = 180109.353

2 = 180030.772

p = 0.0089

1 = 181340.654

2 = 181204.7859

p< 0.0001

I+G

1 = 174980.1973

2 = 175003.3906

p = 0.3796

1 = 175238.975

2 = 175252.093

p = 0.6165

1 = 180223.3956

2 = 180146.4203

p = 0.012

1 = 181487.388

2 = 18134.7965

p< 0.0001

G

1 = 179573.3547

2 = 179525.1419

p = 0.1265

1 = 180474.2233

2 = 180403.5445

p = 0.31

1 = 183988.9626

2 = 183811.3409

p< 0.0001

1 = 184834.253

2 = 184611.3826

p< 0.0001

I

1 = 184936.447

2 = 184809.982

p= 0.0020

1 = 186023.7146

2 = 185874.1456

p= 0.0004

1 = 189228.2482

2 = 188966.6047

p< 0.0001

1 = 190018.6565

2 = 189706.5707

p< 0.0001

Equal rates

GTR HKY 85 Kimura 2PJukes Cantor

Naylor and Brown 1998•Expected Tree = 1•MPT = 2

Assuming the results to be misleading, we evaluated which kind of sites might be responsible for the misleading patterns by testing the fit of different classes of characters to the expected tree.

Naylor and Brown 1998

Expected tree

We were able to retrieve the expected tree only when we restricted our analyses to the subset of nucleotide sites modally coding for the amino acids P, C, N, M and Q.

Hydrophobic residues I, L and V were found to be especially misleading.

We concluded (in 1998) that simply sequencing large amounts of sequence wasn’t enough to ensure an accurate estimate of phylogeny. We argued that it was more important to tailor models to accommodate structural and functional constraints.

(NB. At that time we were not able to conduct amino acid likelihood analyses due to computational constraints)

Naylor and Brown 1998

Dave Swofford has implemented AA likelihood models into PAUP. We applied the MtREV model in PAUP* to the Naylor and Brown (1998) data set to see if it yielded a different tree than that seen at the nucleotide level.

More recently (2007)

MTREV + F +

Yields expected tree -with strong support

fruit fly

mosquito

snail

sea urchin 1

sea urchin 2

lancelet

lamprey

carp

trout

frog

chicken

opossum

mouse

rat

cow

blue whale

fin-back whale

nematode 1

nematode 2

100

100

70

100

100

100

100

100

97

100

77

100

100100

97

100

100

Yields tree wherein lanceletis sister to Vertebrata - but frogstill groups w/fishes

MTREV + F

100

100

100

100

100

100

89

100

100

94

100

82

100

100

100

100

100

fruit fly

mosquito

snail

sea urchin 1

sea urchin 2

lancelet

lamprey

frog

carp

trout

chicken

opossum

mouse

rat

cow

blue whale

fin-back whale

nematode 2

nematode 1

That we get strong support at all of the nodes for an incorrect topology (frog+fishes) when we do not include underscores that bootstrap support reflects the sampling variance of the signal induced from the interaction between data and model. This need not be a reflection of phylogenetic accuracy.

For the inference to be accurate, the model must be unbiased wrt the substitution process that gave rise to the data.

Results corroborate prior suspicions that modelling the substitution process appropriately is critically important

Can think of this as a landscape for a given combination of taxa and sequences. As sequence length increases the topography of the peaks and valleys remains roughly the same but becomes exaggerated - resulting in a more decisive landscape (little sampling variance). As model parameters are changed, the underlying pattern of peaks and valleys of the landscape will shift to a different configuration of optima.

Incr

easi

ng s

eque

nce

leng

th

Especially when applied to long sequences and a sparse sample of highly divergent taxa. In such cases there is little help from the data to estimate the pattern of changes. Most of the estimate comes from the model.

Take home message:

The details of the model are important.

Feb 2006. Delsuc et al using a “phylogenomic” approach assembled a data set of 146 EST derived genes for 38 composite taxa representing metazoan diversity (Fungi [2], Choanoflagelata[3], Cnidaria[4], Protostomia [15], Echinodermata[1] , Cephalochordata[1], Tunicata [4] and Vertebrata [8].

“Tunicates and not cephalochordates are the closest living relatives of vertebrates” Delsuc et al 2006 (and cephalochordates form a clade with echinoderms)

QuickTime™ and aTIFF (LZW) decompressor


They used MP, ML (WAG+F+), a Bayesian covarion model, partitioned likelihood (for each of the 146 genes)

ML methods placed Tunicates as sister to Vertebrates and Amphioxus (Branchiostoma) in a clade with echinoderms

Delsuc et al 2006



Delsuc et al. showed that alternative topologies for the relationships among cephalochordates, echinoderms, tunicates and vertebrates had poorer fits to the data Under WAG+F+

“A definitive conclusion will only be achieved through the phylogenetic analysis of more genes combined with an increased taxon sampling including the enigmatic Xenoturbellidans, the hemichordates and a greater diversity of echinoderms”

However they cautioned:QuickTime™ and a

TIFF (LZW) decompressorare needed to see this picture.

November 2006. Seemingly following the advice of Delsuc et al. 2006, Bourlat et al. added EST sequences for Xenoturbella, a hemichordate and a starfish to the data set of Delsuc et al. and augmented it for a total of 170 genes. (>35,000AA sites)

“Deuterostome Phylogeny reveals monophyletic chordates and the new phylum Xenoturbellida”

Bourlat et al 2006



They were able to reproduce Delsuc et al’s tree when they removed Xenoturbella, hemichordate and starfish. This lead Bourlat et al. to conclude that Delsuc et al’s inference was an artifact of sparse taxon-sampling / model mis-specification. (They used a concatenated analysis WAG+F+)



Without Xenoturbella, hemichordate and starfishcf Delsuc et al.



With Xenoturbella, hemichordate and starfish

Bourlat et al 2006

So….. What’s going on?

The fact that the data are so sensitive to taxon sampling suggests the models are inadequate. If a model describes the process well inferences should not vary as taxa are added or deleted.

Clemens Lakner reanalyzed the data set: Used a partitioned Bayesian AA model under WAG + with independent rates for each of the 170 gene partitions. Same result as Bourlat et al.

So…. if it’s a model problem, it’s not one that can be fixed with a simple rate multiplier tailored to each gene.

Poor Models?

Orthology can be a problem with ESTs because putative othologs in different taxa are ultimately identified by sequence similarity, not phylogenetic analysis.

Typically orthologs are identified by bi-directional Blast hits.

Non-orthologous gene comparisons?

However there are situations in which pairs of strings meeting this criterion for “orthology” will not be true orthologs (rapid evolution of an ortholog in one species can render it more dissimilar to its true ortholog in another species than it is to a paralog in that same species)

XXXXXX

XXXXXX

A B

orthologous

XXXXXX

XXXXXX

A B

non-orthologous

We filtered the resulting topologies into those that were consistent with 3 positive controls: Monophyletic: (1) vertebrates, (2) insects (3) echinoderms.

Only 16 of the 170 genes met the criteria.(?!!)

We contrasted the signal in the original set of 170 trees with that of the filtered set of 16 genes meeting the +ve control criteria using consensus networks implemented in Splits Trees 4. (Huson and Bryant, 2006)

In order to evaluate paralogy as a possible source of error, we computed MP bootstrap trees for each of the 170 genes in the Bourlat et al. data set.



Network consensus of 170 bootstrap parsimony trees



Network consensus of 16 trees that meeting +ve control criteria

RESULTS

Interesting…

tunicates

But amino acid likelihood of 16 gene subset yields tree with Cephalochordates + Echinoderms and other strange groupings.



Something is awry.

Back to first principals…

Apparently no “quick fix” for these issues

Both multiple alignment and protein structural energetics suggest that AAs are restricted in what they can change to over the course of evolution.

What are the observed patterns of change in molecules?



alignment



energetics



But current models average AA frequency over entire alignment.

00.01

0.020.03

0.040.050.06

0.070.08

0.090.1

A C D E F G H I K L M N P Q R S T V W Y

20 stationary equilibrium frequencies (avg. from alignment)

Rate Matrix180 pairwise relative rates(JTT, WAG, MtREV)

X = Q



00.01

0.020.03

0.040.050.06

0.070.08

0.090.1


20 stationary probabilitiesequilibrium frequencies averaged over alignment

Poor description of reality(for this site).


0.010.020.030.040.050.060.070.080.09

0

0.1

Site specific vector of 20 probabilities

Better

Consider this site

Not possible to have a separate model tailored to each site (too many parameters) - but possible to assign sites to “categories” with comparable evolutionary freedom to vary.

Can have a model tailored to each category and implement a “mixture” of models .

Lartillot (2007) proposed such a mixture model to allow categories of sites associated with different biochemical roles to have different AA equilibrium frequencies. (He has implemented this in his Phylobayes software)

Lartillot CAT (mixture) Model

AA Equil.Freq.profiles

Categories (models) 1 2 3 ….. KA C D E F G H I K L M N P Q R S T V W Y

0.010.020.030.040.050.060.070.080.09

0

0.1

Site specific vector of 20 probailities

Yields a mixture of distributions that better capture the allowable state-space



1)QuickTime™ and a

TIFF (LZW) decompressorare needed to see this picture.2)

3)

Multiply each distribution by rate matrix (WAG, JTT, MtREV etc) From Lartillot 2007

CAT model can ameliorate model mis-specification for some data sets.

We applied it to the Bourlat et al. data set.

Resulted in inferences that still show sensitivity to taxon sampling, suggesting model is not adequate.

•Get taxon-sampling dependent inferences for WAG, and CAT.

•Suggests models are inadequate.

•What else might be going on?

What’s going on?





Alpha -helical bundle (rhodopsin) Beta-barrel (porins)

We know amino acid sequences code for structures.

We know that structures show limited variation among lineagesBUT they do show a little.



Superimposed backbones of 28 Hurudinin structes (PDB_ID 4H1R)

Rate variation across cytB (courtesy Jun Inoue)



We also know that patterns of substitution vary across both sites and taxa.

Rate variation among lineages based on whole MtDNA (Courtesy Jun Inoue)



With consequences for phylogenetic branch lengths.

Sites showing differences in freedom to vary between primates and fishes

It is possible (likely?) that minor conformational changes in some non-critical parts of structures affect the local freedom to vary of sites in lineage specific ways?







Primates

Fishes

Potential (practical) strategies:

(1) Ensure that input data meet some minimum criteria that ensures orthology.

(2) Minimize among lineage heterogeneity by excluding genes and/or sites that exhibit non-stationary dynamics. (Housekeeping genes deeply embedded in the genetic architecture with similar constraints across taxa)

(3) Optimize parameters on a (structurally informed) gene-by-gene basis to accommodate context dependent evolutionary change.

If true, such changes in freedom to vary over a tree would require that amino acid frequencies of mixture models should be allowed to change over the tree. (mixture model : covarion hybrid)

OUTLOOK

Collecting more and more ESTs about which we know little does not look promising (to me).

SUMMARY

• As data sets include more characters, sampling variance decreases and we are no longer shielded from the effects of model mis-specification

• Accurate estimates are likely to come from a better appreciation of the transformational tendencies associated with individual sites. (Biochemically motivated process models)

• Until then we will have to prop up our inadequate models with thoughtful taxon sampling.

• Phylogenomics as currently practised is close to the worst case scenario (Long sequences, Ambiguous orthology, Divergent taxa, Sparse taxon-sampling).

Acknowledgements:

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Clemens Lakner

Mark Holder

Interesting aside..

Lartillot Brinkmann & Phillippe (2007) published a paper advocating use of the CAT model. Results they present are at odds with their paper the previous year (Delsuc et al. 2006) but consistent with classical vertebrate phylogeny they had overturned in 2006.



Posterior consensus CAT+F+

classical vertebratePhylogeny!



Traditional phylogeny based on morphology and embryology(after Hyman)



New molecule-based phylogeny (18s)

Summarized by Adoutte et al 2000.

ESTs are fragments of expressed genes cloned from a cDNA library. They are produced by single-pass sequencing from one end of a cDNA clone.

They are generally of poor quality. Many are short (<200bp). But bioinformatic pipelines have been constructed to sort and filter them.

EST fragments deemed usable are “blasted” against reference data bases. Sequence similarity is used to ascertain “identity” and by transitivity “function” of sequences

EST projects are underway for several organisms. Milions of bases pour in to data bases every day, providing potentially useful comparative data.

Many phylogenetic researchers have seized the opportunity to assemble data sets of what they consider to be orthologous ESTs in different taxa.

What are ESTs anyway?

The deep phylogeny problem Using simple models to estimate trees from sparse data sets with faintly relevant signals.

Documents

nucleotide data

different tree

expected tree

large amounts of data

phylogenetic relationships

strong support

amino acid likelihood

aa likelihood models