The deep phylogeny problem Using simple models to estimate trees from sparse data sets with faintly relevant signals.

Jan 20, 2016


Bertram Dixon
Page 1

The deep phylogeny problem

Using simple models to estimate trees from sparse data sets with faintly relevant signals.

Page 2

Long history of interest in the relationships among major groups of animals.

Page 3

Ernst Haeckel 1834-1919

Page 4

Bateson, W., 1886, The ancestry of the Chordata: Quarterly Journal of Microscopical Science, v. 26, p. 535-571.

Cope, E. D., 1887, The Origin of the Fittest: New York, Appleton & Company.

Page 5

Strong resurgent interest in the late 20th century with the advent of molecular phylogenetics.

Most early analyses were based on 18S rRNA.

First influential paper: 1988

Molecular phylogeny of the animal kingdom.

Field et al. Science 239: 748-753

Page 6

Early enthusiasm suggested 18S sequence comparisons were going to solve all of our problems.

But within 10 years:

“Limitations of Metazoan 18S rRNA Sequence Data: Implications for Reconstructing a Phylogeny of the Animal Kingdom and Inferring the Reality of the Cambrian Explosion.” Abouheif, Zardoya & Meyer, 1998

Page 7

At about the same time (mid-1990s) there was an emerging interest in estimating animal phylogenies from whole mtDNA genome sequences.

The choice was appealing:

• Large amount of sequence (16-18 kb).
• Reasonably easy to collect (no introns).
• Mode of inheritance was well understood.
• Almost no problems associated with paralogous comparisons.

But even with large amounts of data some quite controversial groupings emerged, and different mitochondrial genes would often suggest conflicting relationships.

Obviously, it was claimed, we just didn’t have enough data…

Page 8

In 1998, I published a study with Wes Brown in which we explored the phylogenetic signal in the mitochondrial genome of a group of vertebrates whose phylogenetic relationships were “uncontroversial.”

Figure: two tree diagrams (one labelled “Obtained”) for the same 19 taxa: sea urchin 1, sea urchin 2, lancelet, lamprey, carp, trout, frog, chicken, opossum, mouse, rat, cow, blue whale, fin-back whale, fruit fly, mosquito, snail, nematode 1, nematode 2.

Page 9

The unexpected placement of lancelet outside (vertebrates + echinoderms) and the grouping of (frog + chicken + fishes) result from parsimony analyses, with strong bootstrap support at all levels of analysis (nucleotides, transversions and amino acids).

Page 10

Likelihood analyses of the nucleotide data under the 16 canonical models (JC, K2P, HKY and GTR, each with equal rates, +I, +Γ or +I+Γ) all failed to yield the expected tree, placing cephalochordates outside (vertebrates + echinoderms).

Scores for tree 1 (expected tree) and tree 2 (MPT) under each model, with p-values (Naylor and Brown 1998):

Rates        Tree  GTR          HKY85        Kimura 2P    Jukes-Cantor
I+Γ          1     174879.9709  175160.793   180109.353   181340.654
             2     174903.3324  175175.196   180030.772   181204.7859
             p     0.3642       0.5739       0.0089       < 0.0001
Γ            1     174980.1973  175238.975   180223.3956  181487.388
             2     175003.3906  175252.093   180146.4203  18134.7965
             p     0.3796       0.6165       0.012        < 0.0001
I            1     179573.3547  180474.2233  183988.9626  184834.253
             2     179525.1419  180403.5445  183811.3409  184611.3826
             p     0.1265       0.31         < 0.0001     < 0.0001
Equal rates  1     184936.447   186023.7146  189228.2482  190018.6565
             2     184809.982   185874.1456  188966.6047  189706.5707
             p     0.0020       0.0004       < 0.0001     < 0.0001
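The pattern in the scores on this slide is easier to see as differences between the two trees. The sketch below simply subtracts the tree 2 score from the tree 1 score for each model. It assumes the scores are negative log-likelihoods (so a lower score is a better fit), which the slide does not state explicitly. The names `SCORES`, `SUSPECT_CELLS` and `score_differences` are made up for illustration, and the one cell whose printed value looks truncated is skipped rather than guessed at.

```python
# Scores as printed on the slide: (tree 1, tree 2) for each rate treatment and model.
# Tree 1 = expected tree, tree 2 = MPT. "G" stands for the gamma rate treatment.
SCORES = {
    "I+G": {"GTR": (174879.9709, 174903.3324), "HKY85": (175160.793, 175175.196),
            "K2P": (180109.353, 180030.772), "JC": (181340.654, 181204.7859)},
    "G": {"GTR": (174980.1973, 175003.3906), "HKY85": (175238.975, 175252.093),
          "K2P": (180223.3956, 180146.4203), "JC": (181487.388, 18134.7965)},
    "I": {"GTR": (179573.3547, 179525.1419), "HKY85": (180474.2233, 180403.5445),
          "K2P": (183988.9626, 183811.3409), "JC": (184834.253, 184611.3826)},
    "Equal": {"GTR": (184936.447, 184809.982), "HKY85": (186023.7146, 185874.1456),
              "K2P": (189228.2482, 188966.6047), "JC": (190018.6565, 189706.5707)},
}

# Assumption: the tree 2 value for JC under G appears to have lost a digit in the
# source, so its difference would be meaningless. It is excluded, not corrected.
SUSPECT_CELLS = frozenset({("G", "JC")})


def score_differences(scores, skip=frozenset()):
    """Return {(rates, model): tree 1 score minus tree 2 score} for cells not in skip."""
    return {
        (rates, model): round(tree1 - tree2, 4)
        for rates, row in scores.items()
        for model, (tree1, tree2) in row.items()
        if (rates, model) not in skip
    }


if __name__ == "__main__":
    for (rates, model), diff in sorted(score_differences(SCORES, SUSPECT_CELLS).items()):
        lower = "tree 1" if diff < 0 else "tree 2"
        print(f"{rates:6s} {model:6s} difference {diff:10.4f}  lower score: {lower}")
```

Under the stated assumption, only the four richest combinations (GTR and HKY85 with G or I+G) give the expected tree the lower score, and in those cells the printed p-values are not significant.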

Page 11

Assuming the results to be misleading, we evaluated which kind of sites might be responsible for the misleading patterns by testing the fit of different classes of characters to the expected tree.

Naylor and Brown 1998

Expected tree

Page 12

We were able to retrieve the expected tree only when we restricted our analyses to the subset of nucleotide sites modally coding for the amino acids P, C, N, M and Q.

Hydrophobic residues I, L and V were found to be especially misleading.
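Selecting sites by the amino acid they modally code for can be sketched as below. This is an illustration under assumptions, not the procedure of Naylor and Brown (1998): it works on amino-acid alignment columns (column i would correspond to nucleotide positions 3i to 3i+2 of the coding sequence), ties for the modal residue are broken by first occurrence, and the function names and toy alignment are hypothetical.

```python
from collections import Counter

# Residues named on the slide as giving the expected tree.
KEEP = frozenset("PCNMQ")


def modal_residue(column):
    """Most common residue in one alignment column, ignoring gaps and unknowns."""
    counts = Counter(r for r in column if r not in "-?X")
    return counts.most_common(1)[0][0] if counts else None


def columns_by_modal_residue(alignment, keep=KEEP):
    """Indices of alignment columns whose modal residue is in `keep`.

    `alignment` maps taxon name to an aligned amino-acid string of equal length.
    """
    columns = zip(*alignment.values())
    return [i for i, column in enumerate(columns) if modal_residue(column) in keep]


if __name__ == "__main__":
    # Made-up toy alignment, five columns wide.
    toy = {
        "taxonA": "PLIMQ",
        "taxonB": "PLVMQ",
        "taxonC": "PIVMN",
        "taxonD": "ALVCQ",
    }
    print(columns_by_modal_residue(toy))
```

In the toy alignment the columns dominated by L and V are dropped, which mirrors the slide's observation that hydrophobic residues were the most misleading.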

We concluded (in 1998) that simply collecting large amounts of sequence wasn’t enough to ensure an accurate estimate of phylogeny. We argued that it was more important to tailor models to accommodate structural and functional constraints.

(NB. At that time we were not able to conduct amino acid likelihood analyses due to computational constraints.)

Naylor and Brown 1998

Page 13

More recently (2007)

Dave Swofford has implemented AA likelihood models in PAUP*. We applied the MtREV model in PAUP* to the Naylor and Brown (1998) data set to see whether it yielded a different tree from that seen at the nucleotide level.

MtREV + F + Γ: yields the expected tree, with strong support.

Figure: bootstrap tree under MtREV + F + Γ. Tip order: fruit fly, mosquito, snail, sea urchin 1, sea urchin 2, lancelet, lamprey, carp, trout, frog, chicken, opossum, mouse, rat, cow, blue whale, fin-back whale, nematode 1, nematode 2. Node support values as printed: 100, 100, 70, 100, 100, 100, 100, 100, 97, 100, 77, 100, 100, 100, 97, 100, 100.

MtREV + F: yields a tree wherein lancelet is sister to Vertebrata, but frog still groups with fishes.

Figure: bootstrap tree under MtREV + F. Tip order: fruit fly, mosquito, snail, sea urchin 1, sea urchin 2, lancelet, lamprey, frog, carp, trout, chicken, opossum, mouse, rat, cow, blue whale, fin-back whale, nematode 2, nematode 1. Node support values as printed: 100, 100, 100, 100, 100, 100, 89, 100, 100, 94, 100, 82, 100, 100, 100, 100, 100.

Page 14

That we get strong support at all of the nodes for an incorrect topology (frog + fishes) when we do not include Γ underscores that bootstrap support reflects the sampling variance of the signal induced by the interaction between data and model. This need not be a reflection of phylogenetic accuracy.

For the inference to be accurate, the model must be unbiased with respect to the substitution process that gave rise to the data.

Results corroborate prior suspicions that modelling the substitution process appropriately is critically important.
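Why bootstrap support measures sampling variance rather than accuracy can be shown with a deliberately crude stand-in for tree inference. In the sketch below each site carries a "vote" (+1 if, under the chosen model, it favours grouping X, -1 if it favours the alternative), and the "inferred" grouping is whichever side has more votes. The voting scheme, the function name and the toy numbers are all assumptions made for illustration; a real analysis would re-estimate a tree for every pseudoreplicate.

```python
import random


def bootstrap_support(site_votes, n_reps=200, seed=0):
    """Fraction of bootstrap pseudoreplicates in which grouping X wins.

    site_votes: list of +1 / -1 / 0 values, one per site, describing which
    grouping each site favours under the model in use (a toy abstraction).
    Sites are resampled with replacement, as in the nonparametric bootstrap.
    """
    rng = random.Random(seed)
    n_sites = len(site_votes)
    wins = 0
    for _ in range(n_reps):
        resample = [site_votes[rng.randrange(n_sites)] for _ in range(n_sites)]
        if sum(resample) > 0:
            wins += 1
    return wins / n_reps


if __name__ == "__main__":
    # Same 60:40 split of votes, short versus long alignment (made-up numbers).
    short = [1] * 60 + [-1] * 40
    long_ = [1] * 1200 + [-1] * 800
    print("support with 100 sites: ", bootstrap_support(short))
    print("support with 2000 sites:", bootstrap_support(long_))
```

Nothing in the calculation asks whether grouping X is true. If the model causes most sites to favour the wrong grouping, the same procedure reports equally high support for it.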

Page 15

One can think of this as a landscape for a given combination of taxa and sequences. As sequence length increases, the topography of the peaks and valleys remains roughly the same but becomes exaggerated, resulting in a more decisive landscape (little sampling variance). As model parameters are changed, the underlying pattern of peaks and valleys shifts to a different configuration of optima.
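The effect of sequence length on decisiveness can be put in numbers with a simple normal approximation. Suppose, as an assumption made only for this sketch, that under a given model each site contributes a score difference between two topologies with mean `mean_per_site` and standard deviation `sd_per_site`, independently across sites. The summed difference over n sites then has mean n times the per-site mean and standard deviation sqrt(n) times the per-site value, so the probability that the sum favours the model's preferred topology depends on sqrt(n).

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_support(mean_per_site, sd_per_site, n_sites):
    """Probability that the summed per-site difference is positive.

    Uses z = mean * sqrt(n) / sd, the normal approximation for a sum of
    n independent per-site differences (an assumption of this sketch).
    """
    z = mean_per_site * math.sqrt(n_sites) / sd_per_site
    return normal_cdf(z)


if __name__ == "__main__":
    # A faint per-site preference (hypothetical values) becomes decisive with length.
    for n in (100, 10_000, 1_000_000):
        print(n, round(expected_support(0.01, 1.0, n), 4))
```

Lengthening the alignment only sharpens whatever preference the model already has. Changing the model changes the sign or size of the per-site mean, which is the shift in the landscape described above.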

Figure axis: Increasing sequence length

Page 16

Take home message:

The details of the model are important, especially when applied to long sequences and a sparse sample of highly divergent taxa. In such cases there is little help from the data to estimate the pattern of changes. Most of the estimate comes from the model.

Page 17

Feb 2006. Delsuc et al., using a “phylogenomic” approach, assembled a data set of 146 EST-derived genes for 38 composite taxa representing metazoan diversity (Fungi [2], Choanoflagellata [3], Cnidaria [4], Protostomia [15], Echinodermata [1], Cephalochordata [1], Tunicata [4] and Vertebrata [8]).

“Tunicates and not cephalochordates are the closest living relatives of vertebrates” Delsuc et al. 2006 (and cephalochordates form a clade with echinoderms)

Page 18

They used MP, ML (WAG+F+Γ), a Bayesian covarion model, and partitioned likelihood (for each of the 146 genes).

ML methods placed tunicates as sister to vertebrates and amphioxus (Branchiostoma) in a clade with echinoderms.

Delsuc et al. 2006

Page 19

Delsuc et al. showed that alternative topologies for the relationships among cephalochordates, echinoderms, tunicates and vertebrates had poorer fits to the data under WAG+F+Γ.

However, they cautioned:

“A definitive conclusion will only be achieved through the phylogenetic analysis of more genes combined with an increased taxon sampling including the enigmatic Xenoturbellidans, the hemichordates and a greater diversity of echinoderms”

Page 20

November 2006. Seemingly following the advice of Delsuc et al. 2006, Bourlat et al. added EST sequences for Xenoturbella, a hemichordate and a starfish to the data set of Delsuc et al. and augmented it to a total of 170 genes (>35,000 AA sites).

“Deuterostome Phylogeny reveals monophyletic chordates and the new phylum Xenoturbellida”

Bourlat et al. 2006

Page 21

They were able to reproduce Delsuc et al.’s tree when they removed Xenoturbella, hemichordate and starfish. This led Bourlat et al. to conclude that Delsuc et al.’s inference was an artifact of sparse taxon sampling / model mis-specification. (They used a concatenated analysis under WAG+F+Γ.)

Figure: Without Xenoturbella, hemichordate and starfish (cf. Delsuc et al.)

Figure: With Xenoturbella, hemichordate and starfish

Bourlat et al. 2006

Page 22

So….. What’s going on?

Page 23

Poor Models?

The fact that the inferences are so sensitive to taxon sampling suggests the models are inadequate. If a model describes the process well, inferences should not vary as taxa are added or deleted.

Clemens Lakner reanalyzed the data set using a partitioned Bayesian AA model under WAG + Γ with independent rates for each of the 170 gene partitions. Same result as Bourlat et al.

So… if it’s a model problem, it’s not one that can be fixed with a simple rate multiplier tailored to each gene.
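The stability criterion (inferences should not change as taxa are added or deleted) can be checked mechanically: prune the extra taxa from the tree inferred with full sampling and compare the result with the tree inferred from the reduced sampling. The sketch below does this for rooted trees written as nested tuples of taxon names. The representation, the function names and the two toy topologies (loosely modelled on the groupings described for Delsuc et al. and Bourlat et al.) are assumptions for illustration only.

```python
def prune(tree, drop):
    """Remove the taxa in `drop` from a nested-tuple tree, collapsing single-child nodes."""
    if isinstance(tree, str):
        return None if tree in drop else tree
    kept = [p for p in (prune(child, drop) for child in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def canonical(tree):
    """Child-order-independent form, so ((a, b), c) compares equal to (c, (b, a))."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def same_after_pruning(full_tree, reduced_tree, dropped):
    """True if pruning `dropped` taxa from full_tree gives reduced_tree's topology."""
    return canonical(prune(full_tree, set(dropped))) == canonical(reduced_tree)


if __name__ == "__main__":
    # Hypothetical topologies: chordates monophyletic with denser sampling,
    # lancelet grouping with the echinoderm under sparse sampling.
    full = ((("vertebrate", "tunicate"), "lancelet"),
            ((("sea urchin", "starfish"), "hemichordate"), "Xenoturbella"))
    sparse = (("vertebrate", "tunicate"), ("lancelet", "sea urchin"))
    extra = ["starfish", "hemichordate", "Xenoturbella"]
    print("stable to taxon sampling:", same_after_pruning(full, sparse, extra))
```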

Page 24

Non-orthologous gene comparisons?

Orthology can be a problem with ESTs because putative orthologs in different taxa are ultimately identified by sequence similarity, not phylogenetic analysis.

Typically orthologs are identified by bi-directional BLAST hits.

However, there are situations in which pairs of sequences meeting this criterion for “orthology” will not be true orthologs: rapid evolution of an ortholog in one species can render it more dissimilar to its true ortholog in another species than it is to a paralog in that same species.
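The bi-directional best-hit criterion, and the way it can be fooled, can be sketched as follows. The similarity scores here are made-up numbers standing in for BLAST scores, and the gene names are hypothetical: a1 and b1 are taken to be true orthologs, as are a2 and b2, with b1 having evolved rapidly.

```python
def best_hit(scores, query):
    """Highest-scoring subject for `query`, or None if it has no hits."""
    hits = scores.get(query, {})
    return max(hits, key=hits.get) if hits else None


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (gene in A, gene in B) that are each other's best hit."""
    pairs = []
    for gene_a in a_vs_b:
        gene_b = best_hit(a_vs_b, gene_a)
        if gene_b is not None and best_hit(b_vs_a, gene_b) == gene_a:
            pairs.append((gene_a, gene_b))
    return sorted(pairs)


if __name__ == "__main__":
    # Made-up similarity scores. Because b1 has diverged, a1 scores higher
    # against the paralog b2 than against its true ortholog b1.
    a_vs_b = {"a1": {"b1": 40, "b2": 60}, "a2": {"b1": 30, "b2": 55}}
    b_vs_a = {"b1": {"a1": 40, "a2": 30}, "b2": {"a1": 60, "a2": 55}}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

With these scores the only reciprocal best-hit pair is (a1, b2), a paralogous comparison that nonetheless satisfies the similarity criterion.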

Figure: two schematic comparisons of gene copies between taxa A and B, one labelled “orthologous” and the other “non-orthologous”.

Page 25

In order to evaluate paralogy as a possible source of error, we computed MP bootstrap trees for each of the 170 genes in the Bourlat et al. data set.

We filtered the resulting topologies into those that were consistent with 3 positive controls: monophyletic (1) vertebrates, (2) insects and (3) echinoderms.

Only 16 of the 170 genes met the criteria (?!!).

We contrasted the signal in the original set of 170 trees with that of the filtered set of 16 genes meeting the +ve control criteria, using consensus networks implemented in SplitsTree4 (Huson and Bryant, 2006).
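The positive-control filter amounts to a monophyly test applied to each gene tree. A minimal sketch is given below, assuming each gene tree is rooted (for example on an outgroup) and written as nested tuples of taxon names. The representation, the function names, the rule that a control with fewer than two sampled members passes trivially, and the toy trees are all assumptions for illustration, not the authors' pipeline.

```python
def leaves(tree):
    """Set of taxon names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the leaf set of every subtree of a rooted tree."""
    yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if the sampled members of `group` form exactly one clade of `tree`."""
    present = frozenset(group) & leaves(tree)
    return len(present) <= 1 or present in set(clades(tree))


def passes_controls(tree, controls):
    """True if every control group is monophyletic in `tree`."""
    return all(is_monophyletic(tree, group) for group in controls.values())


if __name__ == "__main__":
    # Hypothetical control groups and gene trees.
    controls = {
        "vertebrates": {"mouse", "chicken"},
        "insects": {"fruit fly", "mosquito"},
        "echinoderms": {"sea urchin", "starfish"},
    }
    gene_trees = {
        "gene_ok": ((("mouse", "chicken"), ("sea urchin", "starfish")), ("fruit fly", "mosquito")),
        "gene_bad": ((("mouse", "sea urchin"), ("chicken", "starfish")), ("fruit fly", "mosquito")),
    }
    kept = [name for name, tree in gene_trees.items() if passes_controls(tree, controls)]
    print(kept)
```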

Page 26

RESULTS

Figure: Network consensus of 170 bootstrap parsimony trees.

Figure: Network consensus of the 16 trees meeting the +ve control criteria. Annotations on the figure: “Interesting…”, “tunicates”.

Page 27

But amino acid likelihood analysis of the 16-gene subset yields a tree with Cephalochordates + Echinoderms and other strange groupings.

Page 28

Something is awry.

Apparently no “quick fix” for these issues.

Back to first principles…

Page 29

What are the observed patterns of change in molecules?

Both multiple alignment and protein structural energetics suggest that AAs are restricted in what they can change to over the course of evolution.

Figures: one panel labelled “alignment”, one labelled “energetics”.

Page 30

But current models average AA frequency over the entire alignment.

Figure: bar chart of the 20 stationary equilibrium frequencies (averaged from the alignment) for amino acids A C D E F G H I K L M N P Q R S T V W Y; vertical axis from 0 to 0.1.

20 stationary equilibrium frequencies (avg. from alignment) × rate matrix of 190 pairwise relative rates (JTT, WAG, MtREV) = Q
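How a single set of equilibrium frequencies and a table of pairwise relative rates combine into one rate matrix Q can be sketched as below. This follows the usual reversible-model construction (off-diagonal entry = relative rate for the pair times the frequency of the target state, diagonal set so each row sums to zero, and the whole matrix rescaled to one expected substitution per unit time). A three-state toy alphabet with made-up values is used for readability; for 20 amino acids the number of distinct pairs is 20 × 19 / 2 = 190. The function name is an assumption, not part of any package.

```python
def build_rate_matrix(exchangeabilities, frequencies):
    """Build a normalised reversible rate matrix Q.

    exchangeabilities: symmetric n x n list of pairwise relative rates.
    frequencies: list of n equilibrium frequencies summing to 1.
    """
    n = len(frequencies)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchangeabilities[i][j] * frequencies[j]
        # Diagonal makes the row sum to zero (it is still 0.0 when summed here).
        q[i][i] = -sum(q[i])
    # Rescale so that the expected substitution rate at equilibrium is 1.
    rate = -sum(frequencies[i] * q[i][i] for i in range(n))
    return [[value / rate for value in row] for row in q]


if __name__ == "__main__":
    # Made-up three-state example.
    s = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
    pi = [0.5, 0.3, 0.2]
    for row in build_rate_matrix(s, pi):
        print([round(value, 4) for value in row])
    print("pairs for 20 states:", 20 * 19 // 2)
```

Because one frequency vector is shared by every site, every site is assumed to move towards the same amino-acid composition, which is the limitation the slide points to.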

Page 31

Consider this site

Figure: a single alignment site compared with two bar charts over amino acids A C D E F G H I K L M N P Q R S T V W Y (vertical axes from 0 to 0.1).

20 stationary probabilities (equilibrium frequencies averaged over the alignment): poor description of reality (for this site).

Site-specific vector of 20 probabilities: better.

Page 32

It is not possible to have a separate model tailored to each site (too many parameters), but it is possible to assign sites to “categories” with comparable evolutionary freedom to vary.

One can have a model tailored to each category and implement a “mixture” of models.

Lartillot (2007) proposed such a mixture model to allow categories of sites associated with different biochemical roles to have different AA equilibrium frequencies. (He has implemented this in his PhyloBayes software.)
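The idea of assigning sites to categories with different equilibrium profiles can be illustrated with a heavily simplified calculation. The sketch below computes, for one alignment column, the posterior probability of each category given fixed profiles and mixture weights, treating the residues in the column as independent draws from the profile. That independence assumption ignores the tree and the rate matrix entirely, so this is not the CAT model as implemented in PhyloBayes, only an illustration of the mixture idea. The four-letter alphabet, the two profiles and the weights are made up.

```python
import math


def category_posteriors(residues, profiles, weights):
    """Posterior probability of each category for one site.

    residues: string of residues observed at the site (one per taxon).
    profiles: list of dicts mapping residue to equilibrium frequency (> 0).
    weights: prior mixture weight of each category.
    Residues are treated as independent draws (a simplifying assumption).
    """
    log_scores = [
        math.log(weight) + sum(math.log(profile[r]) for r in residues)
        for profile, weight in zip(profiles, weights)
    ]
    top = max(log_scores)  # subtract the maximum for numerical stability
    unnormalised = [math.exp(score - top) for score in log_scores]
    total = sum(unnormalised)
    return [value / total for value in unnormalised]


if __name__ == "__main__":
    # Hypothetical profiles over a reduced alphabet.
    hydrophobic = {"I": 0.40, "L": 0.30, "V": 0.28, "K": 0.02}
    charged = {"I": 0.05, "L": 0.05, "V": 0.05, "K": 0.85}
    for site in ("ILLV", "KKKK"):
        post = category_posteriors(site, [hydrophobic, charged], [0.5, 0.5])
        print(site, [round(p, 4) for p in post])
```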

Page 33

Lartillot CAT (mixture) Model

Categories (models) 1, 2, 3, …, K, each with its own AA equilibrium frequency profile.

Figure: bar chart labelled “Site specific vector of 20 probabilities” over amino acids A C D E F G H I K L M N P Q R S T V W Y (vertical axis from 0 to 0.1), with three example profiles numbered 1) to 3). From Lartillot 2007.

Yields a mixture of distributions that better captures the allowable state-space.

Multiply each distribution by a rate matrix (WAG, JTT, MtREV etc.).

Page 34

CAT model can ameliorate model mis-specification for some data sets.

We applied it to the Bourlat et al. data set.

Resulted in inferences that still show sensitivity to taxon sampling, suggesting model is not adequate.

Page 35

What’s going on?

• Get taxon-sampling dependent inferences for WAG and CAT.

• Suggests models are inadequate.

• What else might be going on?

Page 36

We know amino acid sequences code for structures.

Figures: Alpha-helical bundle (rhodopsin); Beta-barrel (porins).

Page 37

We know that structures show limited variation among lineages, BUT they do show a little.

Figure: Superimposed backbones of 28 Hurudinin structures (PDB_ID 4H1R)

Page 38

We also know that patterns of substitution vary across both sites and taxa.

Figure: Rate variation across cytB (courtesy Jun Inoue)

Page 39

Figure: Rate variation among lineages based on whole mtDNA (courtesy Jun Inoue)

With consequences for phylogenetic branch lengths.

Page 40

It is possible (likely?) that minor conformational changes in some non-critical parts of structures affect the local freedom to vary of sites in lineage-specific ways.

Figure: Sites showing differences in freedom to vary between primates and fishes (panels: Primates, Fishes).

Page 41

OUTLOOK

If true, such changes in freedom to vary over a tree would require that the amino acid frequencies of mixture models be allowed to change over the tree (mixture model : covarion hybrid).

Potential (practical) strategies:

(1) Ensure that input data meet some minimum criteria that ensure orthology.

(2) Minimize among-lineage heterogeneity by excluding genes and/or sites that exhibit non-stationary dynamics. (Housekeeping genes deeply embedded in the genetic architecture with similar constraints across taxa.)

(3) Optimize parameters on a (structurally informed) gene-by-gene basis to accommodate context-dependent evolutionary change.
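One coarse way to screen genes for the among-lineage compositional heterogeneity mentioned in strategy (2) is a chi-square statistic on the taxon-by-residue table of counts. The sketch below is an assumption-laden illustration rather than a recommended test: it treats sites as independent and ignores the shared history of the sequences, so it is best read as a ranking device for flagging genes. The function name and toy alignments are hypothetical.

```python
from collections import Counter


def composition_chi_square(alignment, alphabet):
    """Chi-square statistic and degrees of freedom for composition homogeneity.

    alignment: dict mapping taxon name to sequence string.
    alphabet: iterable of residues to count (gaps and others are ignored).
    """
    alphabet = list(alphabet)
    counts = {taxon: Counter(r for r in seq if r in alphabet)
              for taxon, seq in alignment.items()}
    column_totals = Counter()
    for taxon_counts in counts.values():
        column_totals.update(taxon_counts)
    grand_total = sum(column_totals.values())

    chi2 = 0.0
    for taxon_counts in counts.values():
        row_total = sum(taxon_counts.values())
        for residue in alphabet:
            expected = row_total * column_totals[residue] / grand_total
            if expected > 0:
                chi2 += (taxon_counts[residue] - expected) ** 2 / expected
    observed_residues = sum(1 for r in alphabet if column_totals[r] > 0)
    dof = (len(counts) - 1) * (observed_residues - 1)
    return chi2, dof


if __name__ == "__main__":
    # Made-up two-residue examples: same composition versus opposite composition.
    print(composition_chi_square({"a": "AAGG", "b": "GGAA"}, "AG"))
    print(composition_chi_square({"a": "AAAA", "b": "GGGG"}, "AG"))
```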

Page 42

Collecting more and more ESTs about which we know little does not look promising (to me).

Page 43

SUMMARY

• As data sets include more characters, sampling variance decreases and we are no longer shielded from the effects of model mis-specification

• Accurate estimates are likely to come from a better appreciation of the transformational tendencies associated with individual sites. (Biochemically motivated process models)

• Until then we will have to prop up our inadequate models with thoughtful taxon sampling.

• Phylogenomics as currently practised is close to the worst case scenario (Long sequences, Ambiguous orthology, Divergent taxa, Sparse taxon-sampling).

Page 44

Acknowledgements:

Clemens Lakner

Mark Holder

Page 45

Interesting aside..

Page 46

Lartillot, Brinkmann & Philippe (2007) published a paper advocating use of the CAT model. The results they present are at odds with their paper of the previous year (Delsuc et al. 2006) but consistent with the classical vertebrate phylogeny they had overturned in 2006.

Figure: Posterior consensus under CAT+F+Γ. Classical vertebrate phylogeny!

Page 47

Figure: Traditional phylogeny based on morphology and embryology (after Hyman).

Figure: New molecule-based phylogeny (18S).

Summarized by Adoutte et al. 2000.

Page 48

What are ESTs anyway?

ESTs are fragments of expressed genes cloned from a cDNA library. They are produced by single-pass sequencing from one end of a cDNA clone.

They are generally of poor quality. Many are short (<200 bp). But bioinformatic pipelines have been constructed to sort and filter them.

EST fragments deemed usable are “blasted” against reference databases. Sequence similarity is used to ascertain “identity” and, by transitivity, “function” of sequences.

EST projects are underway for several organisms. Millions of bases pour into databases every day, providing potentially useful comparative data.

Many phylogenetic researchers have seized the opportunity to assemble data sets of what they consider to be orthologous ESTs in different taxa.