
Phylogenetic analyses

Roman Biek
Institute of Biodiversity, Animal Health & Comparative Medicine

EEID Evolution workshop 2012


What is phylogenetics?

•  Reconstructing the ancestral relationships among taxa in the form of genealogical trees
•  Taxa can be species, individuals or particular genes
•  Tree is only an estimate => “truth” usually unknown

Intro

A simple four-taxon example

Who is our closest relative?


The overall aim

ATTTCTCTG
ATTTCCTTA
ATGTCCTTA
ATGTCCTTA
ATGTCCTCA

Analysis and Interpretation

Measure variation at the molecular level

Develop models that fit the observed patterns

Infer process from patterns

Non-taxonomic questions

Molecular clocks, e.g. “How long ago did two groups split?”

Selection, e.g. “Which sites have undergone adaptive change?”

Ancestral state change, e.g. “What is the movement rate between populations A and B?”

Demographic reconstruction, e.g. “How has population size changed through time?”

Outline

Basic terminology and concepts

Estimating phylogenies:
•  alignment
•  substitution models
•  methods for tree building
•  quantifying uncertainty

The parts of a tree

Trees are like mobiles: branches can rotate freely around any node without changing the relationships shown

Tree not always strictly bifurcating

Different ways to depict a tree

Topology only
Topology + branch lengths

Shape of the tree

Inferring character state change

Monophyly vs Non-Monophyly

Monophyletic group: all members derived from one ancestor AND all descendants of that ancestor included

Non-monophyletic group: does not include all descendants

Rooted vs unrooted trees

Multiple options for placing the root

[Figure: unrooted tree of taxa A, B, C, D with candidate root positions 1-5, and a rooted tree with tips B, A, C, D]

Rooting most commonly done using outgroup: taxon or taxa that fall just outside the group of interest

Number of possible trees rises quickly!

Taxa   Unrooted trees   Rooted trees
3      1                3
4      3                15
5      15               105
6      105              945
7      945              10,395
8      10,395           135,135
9      135,135          2,027,025
10     2,027,025        34,459,425
20     2.22E+20         8.20E+21
30     8.69E+36         4.95E+38
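The counts in this table follow from a standard closed form for strictly bifurcating trees: (2n-5)!! unrooted and (2n-3)!! rooted trees for n taxa, where !! is the double factorial. A minimal sketch (the function names are illustrative, not from the slides):

```python
def double_factorial(n: int) -> int:
    """Product n * (n-2) * (n-4) * ... down to 1 (returns 1 for n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def n_unrooted(taxa: int) -> int:
    """Number of strictly bifurcating unrooted trees for taxa >= 3."""
    return double_factorial(2 * taxa - 5)


def n_rooted(taxa: int) -> int:
    """Number of strictly bifurcating rooted trees for taxa >= 2."""
    return double_factorial(2 * taxa - 3)


for taxa in (3, 4, 5, 6, 7, 8, 9, 10, 20, 30):
    print(f"{taxa:>4}  {n_unrooted(taxa):>45,}  {n_rooted(taxa):>48,}")
```

Each rooted count equals the unrooted count for one more taxon, because placing a root is equivalent to attaching one extra tip.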


The basic steps of phylogenetic analysis

1)  Collect homologous sequences
2)  Conduct multiple alignment
3)  Fit an appropriate substitution model
4)  Estimate tree(s) under that model
5)  Test the reliability of the estimated tree(s)
6)  Interpret and apply the phylogenetic tree
7)  Potentially repeat steps 4-6 using different tree building methods and/or additional data

Homology requirement

Are sequences correctly aligned so that each nucleotide position has its own unbroken history?
Only an issue because sequences may contain insertions and deletions (indels)

Need an algorithm that can determine the least costly alignment

Unaligned:
ATGCGTCTTCCACAGA
ATGCATCGTTCCACAAA

Aligned:
ATGCGTC-TTCCACAGA
ATGCATCGTTCCACAAA
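One classic way to find a least costly pairwise alignment is Needleman-Wunsch dynamic programming. The sketch below is a simplified pairwise version, not the multiple-alignment method used by any particular program, and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = needleman_wunsch(
    "ATGCGTCTTCCACAGA", "ATGCATCGTTCCACAAA"
)
print(aligned_a)
print(aligned_b)
print("score:", total)
```

With these scores, a single gap that restores 14 matching columns is cheaper than any alignment with more gaps, which is the sense in which the alignment is "least costly".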

Models of substitution

How to measure distance between two sequences?
Easiest measure would be the number (or proportion) of different sites
=> Problem of multiple ‘hits’ at the same site

Empirical mtDNA data from bovine mammals

Jukes - Cantor model

All nucleotides undergo changes at the same rate
Nucleotide frequencies are the same: qA = qC = qG = qT = ¼

     A   T   C   G
A    -   α   α   α
T    α   -   α   α
C    α   α   -   α
G    α   α   α   -
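Under the Jukes-Cantor model, the observed proportion of differing sites p can be corrected for multiple hits with the standard formula d = -3/4 ln(1 - 4p/3). A minimal sketch (function names are illustrative; sequences are assumed to be aligned and gap-free):

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)


def jc69_distance(p):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated under JC69")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ATGCAGGTA", "ATGCTGCTA")
print(f"observed p = {p:.3f}, JC69 distance = {jc69_distance(p):.3f}")
```

The corrected distance is always at least as large as p, and the gap widens as sequences diverge, which reflects the growing number of hidden substitutions.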

Kimura 2-parameter model

Transitions (α) (purine to purine or pyrimidine to pyrimidine substitutions) are more common than transversions (β)

     A   T   C   G
A    -   β   β   α
T    β   -   α   β
C    β   α   -   β
G    α   β   β   -

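[Figure: transitions (α) occur within the pyrimidines (C, T) and within the purines (A, G); transversions (β) occur between the two groups]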
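The Kimura 2-parameter distance treats the two kinds of change separately. With P the observed proportion of transitions and Q the proportion of transversions, the standard formula is d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)). A minimal sketch, assuming aligned, gap-free sequences over A, C, G, T (function names are illustrative):

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq_a, seq_b):
    """Return (transitions, transversions) between two aligned sequences."""
    transitions = transversions = 0
    for x, y in zip(seq_a, seq_b):
        if x == y:
            continue
        # A change within purines or within pyrimidines is a transition.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance in expected substitutions per site."""
    n_sites = len(seq_a)
    ts, tv = count_differences(seq_a, seq_b)
    p, q = ts / n_sites, tv / n_sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


print(count_differences("AAAA", "GAAC"))
print(round(k2p_distance("ATGCAGGTA", "ATGCTGCTA"), 4))
```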


Variation among sites

Some sites undergo changes more frequently than others

Can be expressed using a gamma distribution


Finding a substitution model


Choosing the right model

jModeltest
Available from: http://darwin.uvigo.es/software/jmodeltest.html
Fits up to 88 candidate models to your sequence data
•  model selection based on AIC
•  model averaging
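AIC-based selection can be illustrated in a few lines: AIC = 2k - 2 lnL, where k is the number of free parameters and lnL the maximized log-likelihood, and Akaike weights give each model's relative support (the basis for model averaging). The log-likelihoods and parameter counts below are hypothetical numbers made up for this sketch, not output from jModeltest.

```python
import math

# Hypothetical fits: model name -> (maximized log-likelihood, free parameters)
candidates = {
    "JC": (-1250.0, 0),
    "K2P": (-1230.0, 1),
    "K2P+G": (-1228.5, 2),
}


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)

# Akaike weights: relative support for each model within the candidate set.
deltas = {name: score - scores[best] for name, score in scores.items()}
raw = {name: math.exp(-0.5 * delta) for name, delta in deltas.items()}
total = sum(raw.values())
weights = {name: value / total for name, value in raw.items()}

for name in candidates:
    print(f"{name:<6} AIC = {scores[name]:8.1f}  weight = {weights[name]:.3f}")
print("best model:", best)
```

An extra parameter is only worth keeping if it improves lnL by more than one unit, which is why the richest model does not win automatically.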


Estimating phylogenies

General approaches for building trees

•  Distance based methods

•  Maximum parsimony

•  Maximum likelihood

•  Bayesian methods

Involves two processes:
•  Estimation of the topology
•  Estimation of the branch lengths

Optimality criterion
•  How well do the data fit a particular tree topology?
•  Is used to compare and rank different trees
•  Allows searching for the best tree (under the given criterion)

Distance-based methods

SpA ATGCAGGTA
SpB ATGCTGCTA
SpC ATGCAGCTC
SpD TAGCAGGAC

       SpA          SpB          SpC          SpD
SpA    -
SpB    2/9 = 0.22   -
SpC    0.22         0.22         -
SpD    0.44         0.67         4/9 = 0.44   -
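The pairwise distances for this four-sequence example are plain proportions of differing sites (no substitution model applied). A minimal sketch that recomputes them:

```python
from itertools import combinations

alignment = {
    "SpA": "ATGCAGGTA",
    "SpB": "ATGCTGCTA",
    "SpC": "ATGCAGCTC",
    "SpD": "TAGCAGGAC",
}


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)


# Distance for every unordered pair of taxa.
distances = {
    (a, b): p_distance(alignment[a], alignment[b])
    for a, b in combinations(alignment, 2)
}

for (a, b), d in distances.items():
    n_diff = round(d * len(alignment[a]))
    print(f"{a}-{b}: {n_diff}/9 = {d:.2f}")
```

A tree-building rule such as neighbour joining would take this matrix as its only input, which is why information about individual character changes is lost.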

Distance-based methods

Basic procedure
•  Calculate pairwise distances among all sequences (according to some substitution model)
•  Use distances to build tree (according to some rule, e.g. “neighbour joining” method)

Important features
•  Very quick way to generate tree, even for large data sets
•  Usually no attempt to evaluate alternative trees
•  Information about character state change is lost

Maximum likelihood

Basic procedure
•  Optimality criterion: likelihood score
•  Maximize the probability of the sequences, given a tree and its branch lengths and an evolutionary model and its parameters

Important features
•  Allows full use of evolutionary models
•  Relies heavily on model chosen => can be misleading if there is much variation in the substitution process among lineages
•  Computationally much more demanding
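The likelihood idea can be shown on the smallest possible case: two sequences joined by a single branch of length d under the Jukes-Cantor model. The sketch below evaluates the log-likelihood on a grid of branch lengths and keeps the best one. Real programs search over topologies and many parameters at once; the grid search and the function names here are illustrative assumptions.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by branch length d."""
    decay = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * decay   # site ends in the same nucleotide
    p_other = 0.25 - 0.25 * decay  # site ends in one specific other nucleotide
    # Each site also carries the equilibrium frequency 1/4 of its start state.
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_other))


def ml_branch_length(n_sites, n_diff, step=0.001, upper=2.0):
    """Grid search for the branch length that maximizes the likelihood."""
    grid = [step * i for i in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))


# Two sequences of 9 sites that differ at 2 of them.
d_hat = ml_branch_length(n_sites=9, n_diff=2)
print(f"ML branch length: {d_hat:.3f} substitutions/site")
```

For this two-sequence case the maximum coincides with the Jukes-Cantor corrected distance, -3/4 ln(1 - 4p/3) with p = 2/9.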

Bayesian phylogenetics

Basic procedure
•  Objective: determine the posterior distribution of trees given the sequence data
•  Based on this distribution, ‘best’ tree can be identified

Important features
•  Allows full use of evolutionary models
•  Need to include priors
•  Posterior probabilities are approximated through Markov Chain Monte Carlo methods that sample from the posterior
•  Clade probabilities provide measure of uncertainty

Bayes’ rule in statistics
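In the phylogenetic setting, Bayes' rule can be written as follows. The notation is an assumption made here for illustration: \tau is the tree with its branch lengths, \theta the substitution model parameters, and D the sequence alignment.

```latex
P(\tau, \theta \mid D) \;=\; \frac{P(D \mid \tau, \theta)\, P(\tau, \theta)}{P(D)}
```

The numerator is the likelihood times the prior. The denominator sums over all trees and parameter values, which is why it is usually avoided by sampling with Markov Chain Monte Carlo rather than computed directly.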

Bayesian vs. ML parameter estimation

[Figure: likelihood or posterior probability (y-axis) plotted against some parameter, e.g. transition/transversion ratio (x-axis)]

Holder and Lewis 2003, Nature Reviews Genetics

How well supported is a grouping?

Non-parametric bootstrap
•  Sample from the original data to create ‘new’ data sets
•  Count how often a particular clade appears in the resampled data

Values > 70% considered strong support

Bootstrapping

•  “New” data sets of the same size are generated from the original data by sampling columns with replacement
•  Trees are built from these new data sets
•  The frequency with which a node appears across replicate trees is taken as a measure of confidence for that node

ORIGINAL
123456789
ATGCAGGTA
ATGCTGCTA
ATGCAGCTC
TAGCAGGAC

REPLICATE 1
516446789
AAGCCGGTA
TAGCCGCTA
AAGCCGCTC
ATGCCGGAC
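The resampling step can be sketched as follows: draw column indices with replacement, then rebuild every sequence from the same drawn columns so that sites stay intact across taxa. The first call applies the column order 5,1,6,4,4,6,7,8,9 from the example (written 0-based in the code); the second draws a random replicate. Function names are illustrative.

```python
import random

original = ["ATGCAGGTA", "ATGCTGCTA", "ATGCAGCTC", "TAGCAGGAC"]


def resample_columns(alignment, columns):
    """Build a new alignment from the given 0-based column indices."""
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_replicate(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return columns, resample_columns(alignment, columns)


# Column order 5,1,6,4,4,6,7,8,9 (1-based) expressed as 0-based indices.
fixed = resample_columns(original, [4, 0, 5, 3, 3, 5, 6, 7, 8])
for seq in fixed:
    print(seq)

# A random replicate; the seed only makes the run repeatable.
columns, replicate = bootstrap_replicate(original, random.Random(1))
print([c + 1 for c in columns], replicate)
```

In a full analysis a tree would be built from each replicate, and the share of replicate trees containing a clade reported as its bootstrap support.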


How well supported is a grouping?

Posterior probabilities
•  Count the frequency of a clade within the posterior distribution of trees

Less conservative: values > 95% considered strong support

Approaches               Commonly used software
Distance based methods   MEGA, Geneious, Paup*, R
Maximum parsimony        MEGA, Geneious, Paup*
Maximum likelihood       MEGA, Geneious, Paup*, R, PhyML
Bayesian methods         Geneious, MrBayes, BEAST

Program names in bold can also have capabilities for sequence viewing and alignment.

Holder and Lewis 2003, Nature Reviews Genetics

Further resources

Molecular Evolution Workshop, Woods Hole http://workshop.molecularevolution.org/


Evolutionary change vs genome size

Gago et al 2009, Science

Molecular clocks

Mutations that are selectively neutral should accumulate at a roughly constant rate over time

•  Creates expectation that genetic and temporal divergence are correlated => molecular clock

Example: Mycobacterium bovis
Clock rate: 0.21 genome^-1 year^-1, or 4.75 x 10^-8 site^-1 year^-1

Molecular clocks

•  Clocks traditionally calibrated using fossil data
•  For measurably evolving pathogens, it is possible to estimate the evolutionary rate from dated tips (sequences with known sampling dates)

Raccoon rabies in eastern North America

Raccoon rabies cases 2001, CDC

Raccoon (Procyon lotor)

Viral clocks: raccoon rabies

[Map: eastern North America from North Carolina to Maine; scale bar 500 km]

Rabies invasion history 1977-1999

Sampling scheme

Estimating the evolutionary rate

Consider viruses sampled at different points in time:

Tree has two scales:
1)  Time in years
2)  Expected number of substitutions per site
⇒  Estimate evolutionary rate µ, which gives a linear relationship between the two scales

µ = 5 x 10^-4 subst/site/yr, or 5% per hundred years
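A simple way to see the linear relationship is a root-to-tip regression: plot each sample's genetic distance from the root against its sampling year and read the rate off the slope. This is a simpler technique than the Bayesian clock models discussed here, and the data points below are hypothetical values generated to lie exactly on a line with µ = 5 x 10^-4.

```python
def least_squares(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical dated tips: sampling year and root-to-tip distance (subst/site).
years = [1982, 1987, 1992, 1997, 2002]
distances = [5e-4 * (year - 1973) for year in years]

rate, intercept = least_squares(years, distances)
root_year = -intercept / rate  # where the fitted line reaches zero divergence

print(f"rate = {rate:.1e} subst/site/yr")
print(f"divergence over 100 years = {rate * 100:.0%}")
print(f"estimated root date = {root_year:.0f}")
```

The slope is the rate µ, and the x-intercept gives an estimate of the date of the root.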

Bayesian tree estimated under a molecular clock model (Drummond et al 2002, Genetics)

Phylogenies with time scales

Phylogeny reflects spatial organization


Genealogies and the coalescent

Coalescent theory (Kingman 1982) relates tree shape to population history

[Figure: example genealogies for a declining and a growing population; Emerson (2001, TREE)]
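The link between tree shape and population size can be made concrete with the constant-size case of Kingman's coalescent: while k lineages remain, the expected waiting time to the next coalescence is 2N / (k(k-1)) generations for a haploid population of size N. The sketch below assumes a constant N, which is a simplification; growing or declining populations stretch or compress these intervals.

```python
def expected_coalescent_times(n_lineages, pop_size):
    """Expected waiting time (generations) while k lineages remain, k = n..2."""
    return {
        k: 2 * pop_size / (k * (k - 1))
        for k in range(n_lineages, 1, -1)
    }


def expected_tmrca(n_lineages, pop_size):
    """Expected time to the most recent common ancestor of the sample."""
    return sum(expected_coalescent_times(n_lineages, pop_size).values())


for size in (1000, 10000):
    intervals = expected_coalescent_times(5, size)
    print(f"N = {size}: intervals = {intervals}")
    print(f"N = {size}: expected TMRCA = {expected_tmrca(5, size):.0f} generations")
```

Because every interval scales with N, branch lengths on a time-scaled genealogy carry information about population size, which is what coalescent-based demographic reconstruction exploits.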

Inferring the number of infected raccoons

[Diagram: genealogy and rate of evolution; coalescent-based estimate of population size; population size through time]

Biek et al 2007, PNAS

Close correspondence to observed data



Global spread of H1N1

Ancestral state changes

Ancestral state reconstruction is used to infer character state changes across a phylogenetic tree

⇒  State change may refer to any kind of trait, e.g.
•  movement event
•  host switch
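One of the simplest reconstruction methods is Fitch parsimony, which finds the minimum number of state changes needed to explain the states observed at the tips. It is shown here only as an illustration and is not necessarily the method behind the analyses in these slides. The tree, tip names, and locations below are hypothetical.

```python
def fitch(tree, tip_states):
    """Fitch parsimony on a nested-tuple binary tree.

    Returns (possible states at this node, minimum changes below it).
    """
    if isinstance(tree, str):  # a tip: its state is observed
        return {tip_states[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, tip_states)
    right_states, right_changes = fitch(right, tip_states)
    shared = left_states & right_states
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_changes + right_changes
    # Children disagree: one change must occur on a branch below this node.
    return left_states | right_states, left_changes + right_changes + 1


# Hypothetical virus tree with the sampling location of each tip.
tree = ((("v1", "v2"), "v3"), ("v4", "v5"))
location = {"v1": "north", "v2": "north", "v3": "north",
            "v4": "south", "v5": "south"}

root_states, n_changes = fitch(tree, location)
print("possible root states:", sorted(root_states))
print("minimum number of movement events:", n_changes)
```

Counting such changes over many sampled trees, rather than one, is what allows uncertainty in the tree to be carried through to the movement estimate.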


Cougars (n=353) sampled along Rocky Mountains (USA/ Canada)

WCS/TKRuth

Landscape genetics of host and virus

Bayesian genetic clustering approach: program GENELAND (Guillot et al. 2005, Genetics)

Applied to cougar microsatellite data: indicates two populations

How frequently do cougar viruses move across this boundary?

Viral movement rate lower than expected?

Rate estimation repeated for 75 alternative assignments


Minimal viral movement across boundary

[Figure panels: mean rate only; variability across sampled trees]





Summary

•  Molecular clocks predict that genetic divergence increases regularly with time

•  In rapidly evolving pathogens, it is possible to estimate the rate of change from dated samples => allows phylogenies to be calibrated

•  Can be combined with coalescent techniques to reconstruct population history through time

•  Ancestral state reconstruction can reveal movement among discrete states (e.g. geographic locations)
