Phylogenetic analyses
Roman Biek
Institute of Biodiversity, Animal Health & Comparative Medicine
[email protected]
EEID Evolution workshop 2012

What is phylogenetics?
Reconstructing the ancestral relationships among taxa in the form of genealogical trees
Taxa can be species, individuals or particular genes
Tree is only an estimate => “truth” usually unknown

Intro
A simple four taxa example

Who is our closest relative?

The overall aim
ATTTCTCTG
ATTTCCTTA
ATGTCCTTA
ATGTCCTTA
ATGTCCTCA
Analysis and Interpretation
Measure variation at the molecular level
Develop models that fit the observed patterns
Infer process from patterns

Non-taxonomic questions
Molecular clocks e.g. “How long ago since two groups split?”
Selection e.g. “Which sites have undergone adaptive change?”
Ancestral state change e.g. “Movement rate between population A and B?”
Demographic reconstruction e.g. “How has population size changed through time?”

Outline
Basic terminology and concepts
Estimating phylogenies:
• alignment
• substitution models
• methods for tree building
• quantifying uncertainty
The basic steps of phylogenetic analysis
1) Collect homologous sequences
2) Conduct multiple alignment
3) Fit an appropriate substitution model
4) Estimate tree(s) under that model
5) Test the reliability of the estimated tree(s)
6) Interpret and apply the phylogenetic tree
7) Potentially repeat steps 4-6 using different tree building methods and/or additional data
Homology requirement
Are sequences correctly aligned so that each nucleotide position has its own unbroken history?
Only an issue because sequences may contain insertions and deletions (indels)
Need algorithm that can determine the least costly alignment
Before alignment:
A T G C G T C T T C C A C A G A
A T G C A T C G T T C C A C A A A
After alignment:
A T G C G T C - T T C C A C A G A
A T G C A T C G T T C C A C A A A
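One way to make "least costly alignment" concrete is the classic Needleman-Wunsch dynamic programme for two sequences. The sketch below is illustrative only: the scoring values (match +1, mismatch -1, gap -2) are assumptions, not values from the slides, and real multiple alignment tools use more elaborate scoring and heuristics.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # substitution score for pairing a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# The two sequences from the slide.
aligned_a, aligned_b, best_score = needleman_wunsch(
    "ATGCGTCTTCCACAGA", "ATGCATCGTTCCACAAA"
)
print(aligned_a)
print(aligned_b)
```

Under these assumed scores the single gap is placed opposite the extra G, which matches the aligned pair shown on the slide.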
Models of substitution
How to measure distance between two sequences?
Easiest measure would be number (or proportion) of different sites => Problem of multiple ‘hits’ at the same site
[Figure: empirical mtDNA data from bovine mammals]
Jukes-Cantor model
All nucleotides undergo changes at the same rate
Nucleotide frequencies are the same: qA = qC = qG = qT = ¼
    A  T  C  G
A   -  α  α  α
T   α  -  α  α
C   α  α  -  α
G   α  α  α  -
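The multiple-hits problem can be illustrated with the standard Jukes-Cantor correction, d = -3/4 ln(1 - 4p/3), where p is the observed proportion of differing sites. This is a minimal sketch: the function names are made up for illustration, and gaps or ambiguous bases are not handled.

```python
import math


def p_distance(seq1, seq2):
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return diffs / len(seq1)


def jukes_cantor_distance(p):
    """Expected substitutions per site under the Jukes-Cantor model."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated, distance undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# First two sequences from the 'overall aim' slide.
p = p_distance("ATTTCTCTG", "ATTTCCTTA")
d = jukes_cantor_distance(p)
print(f"p = {p:.3f}, corrected d = {d:.3f}")
```

The corrected distance is always at least as large as p, because some sites will have changed more than once.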
Kimura 2-parameter model
Transitions (α) (purine to purine or pyrimidine to pyrimidine substitutions) are more common than transversions (β)
    A  T  C  G
A   -  β  β  α
T   β  -  α  β
C   β  α  -  β
G   α  β  β  -
[Figure: transitions (α) occur within pyrimidines (C ↔ T) and within purines (A ↔ G); transversions (β) occur between a purine and a pyrimidine]
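A hedged sketch of the corresponding distance: with P the proportion of sites showing a transition and Q the proportion showing a transversion, the standard Kimura 2-parameter distance is d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q). The two example sequences are hypothetical and were chosen only to contain both kinds of change.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Return (P, Q, d) for two aligned sequences under Kimura's 2-parameter model."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    n = len(seq1)
    P = transitions / n
    Q = transversions / n
    d = -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)
    return P, Q, d


# Hypothetical sequences: two transitions and one transversion over ten sites.
P, Q, d = k2p_distance("ACGTACGTAC", "ATGTACGCAA")
print(f"P = {P:.2f}, Q = {Q:.2f}, d = {d:.4f}")
```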
Variation among sites
Some sites undergo changes more frequently than others
Can be expressed using a gamma distribution
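As an illustration of gamma-distributed rate variation, the sketch below draws one relative rate per site from a gamma distribution with shape alpha and mean 1 (a common parameterisation, assumed here rather than taken from the slides). Small alpha gives many nearly invariant sites and a few fast ones; large alpha gives nearly equal rates.

```python
import random


def sample_site_rates(n_sites, alpha, seed=0):
    """Draw relative substitution rates with mean 1 from Gamma(shape=alpha, scale=1/alpha)."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def fraction_slow(rates, threshold=0.1):
    """Fraction of sites evolving at less than `threshold` times the mean rate."""
    return sum(r < threshold for r in rates) / len(rates)


strong_variation = sample_site_rates(100_000, alpha=0.5)
weak_variation = sample_site_rates(100_000, alpha=5.0)
print(fraction_slow(strong_variation), fraction_slow(weak_variation))
```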
Finding a substitution model
Choosing the right model
jModelTest
Available from: http://darwin.uvigo.es/software/jmodeltest.html
• Fits up to 88 candidate models to your sequence data
• Model selection based on AIC
• Model averaging
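To show what AIC-based selection involves, here is a small sketch using AIC = 2k - 2 lnL and Akaike weights. It is not jModelTest's code, and the log-likelihoods and parameter counts below are invented for illustration.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def akaike_weights(aic_by_model):
    """Relative support for each model, normalised to sum to 1."""
    best = min(aic_by_model.values())
    rel = {m: math.exp(-0.5 * (a - best)) for m, a in aic_by_model.items()}
    total = sum(rel.values())
    return {m: r / total for m, r in rel.items()}


# Hypothetical fits: (log-likelihood, number of free substitution parameters).
# Branch lengths are shared by all models on a fixed tree, so they are left out.
fits = {
    "JC": (-1250.0, 0),
    "K80": (-1230.0, 1),
    "HKY": (-1228.5, 4),
    "GTR+G": (-1226.0, 9),
}
aic_by_model = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
best_model = min(aic_by_model, key=aic_by_model.get)
weights = akaike_weights(aic_by_model)
print(best_model, {m: round(w, 3) for m, w in weights.items()})
```

The weights are what model averaging would use to combine parameter estimates across candidate models.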
Estimating phylogenies
General approaches for building trees
• Distance based methods
• Maximum parsimony
• Maximum likelihood
• Bayesian methods
Estimating phylogenies
Involves two processes:
• Estimation of the topology
• Estimation of the branch lengths
Optimality criterion
• How well do the data fit a particular tree topology?
• Is used to compare and rank different trees
• Allows searching for the best tree (under the given criterion)
Distance based methods
Basic procedure
• Calculate pairwise distances among all sequences (according to some substitution model)
• Use distances to build tree (according to some rule e.g. “neighbour joining” method)
Important features
• Very quick way to generate tree, even for large data sets
• Usually no attempt to evaluate alternative trees
• Information about character state change is lost
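The neighbour joining rule can be sketched in a few dozen lines. This is a minimal illustration, assuming a complete symmetric distance matrix for at least three taxa; the taxon names and distances are hypothetical (they come from a made-up four-taxon tree with branch lengths A:1, B:2, C:4, D:5 and an internal branch of 3).

```python
def make_matrix(pairs):
    """Turn {(a, b): distance} into a symmetric nested dict."""
    d = {}
    for (a, b), v in pairs.items():
        d.setdefault(a, {})[b] = v
        d.setdefault(b, {})[a] = v
    return d


def neighbor_joining(dist):
    """Return a Newick string for the unrooted neighbour joining tree."""
    d = {a: dict(row) for a, row in dist.items()}
    nodes = list(d)
    while len(nodes) > 3:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from a and b to their new parent node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"({a}:{len_a:.3f},{b}:{len_b:.3f})"
        # Distances from the new node to every remaining node.
        d[new] = {}
        for c in nodes:
            if c != a and c != b:
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c != a and c != b] + [new]
    # Three nodes left: join them at a central trifurcation.
    x, y, z = nodes
    lx = 0.5 * (d[x][y] + d[x][z] - d[y][z])
    ly = 0.5 * (d[x][y] + d[y][z] - d[x][z])
    lz = 0.5 * (d[x][z] + d[y][z] - d[x][y])
    return f"({x}:{lx:.3f},{y}:{ly:.3f},{z}:{lz:.3f});"


# Hypothetical additive distances for four taxa.
distances = make_matrix({
    ("A", "B"): 3, ("A", "C"): 8, ("A", "D"): 9,
    ("B", "C"): 9, ("B", "D"): 10, ("C", "D"): 9,
})
tree = neighbor_joining(distances)
print(tree)
```

Because these distances fit a tree exactly, the method recovers the branch lengths they were generated from; with real data it returns a single tree with no evaluation of alternatives.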
Maximum likelihood
Basic procedure
• Optimality criterion: likelihood score
• Maximize the probability of the sequences, given a tree and its branch lengths and an evolutionary model and its parameters
Important features
• Allows full use of evolutionary models
• Relies heavily on model chosen => can be misleading if there is much variation in the substitution process among lineages
• Computationally much more demanding
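The smallest possible case of maximum likelihood is a "tree" of two sequences joined by one branch. The sketch below assumes the Jukes-Cantor model and finds the branch length that maximises the likelihood by a simple grid search; real programs optimise topology, all branch lengths and model parameters together. Function names and the grid settings are illustrative choices.

```python
import math


def jc_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of two aligned sequences separated by distance d under Jukes-Cantor."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # The factor 0.25 is the equilibrium frequency of the base in the first sequence.
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


def ml_distance(seq1, seq2, step=0.001, d_max=3.0):
    """Grid-search estimate of the branch length with the highest likelihood."""
    n_diff = sum(a != b for a, b in zip(seq1, seq2))
    n_same = len(seq1) - n_diff
    grid = [step * k for k in range(1, int(d_max / step) + 1)]
    return max(grid, key=lambda d: jc_log_likelihood(d, n_same, n_diff))


d_hat = ml_distance("ATTTCTCTG", "ATTTCCTTA")
print(f"ML branch length: {d_hat:.3f}")
```

For this two-sequence case the maximum coincides (up to grid resolution) with the Jukes-Cantor corrected distance.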
Bayesian phylogenetics
Basic procedure
• Objective: determine the posterior distribution of trees given the sequence data
• Based on this distribution, ‘best’ tree can be identified
Important features
• Allows full use of evolutionary models
• Need to include priors
• Posterior probabilities are approximated through Markov Chain Monte Carlo methods that sample from the posterior
• Clade probabilities provide measure of uncertainty
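How MCMC samples from a posterior can be shown on a single parameter. The sketch below runs a Metropolis sampler for the distance between two sequences under Jukes-Cantor. Everything specific is an assumption made for illustration: the data summary (900 identical and 100 differing sites), the exponential prior with rate 1, the proposal step and the chain length. Sampling tree topologies works on the same accept/reject principle but needs tree-rearrangement proposals.

```python
import math
import random


def log_posterior(d, n_same, n_diff, prior_rate=1.0):
    """Unnormalised log posterior: JC log-likelihood plus exponential log prior."""
    if d <= 0:
        return -math.inf
    e = math.exp(-4.0 * d / 3.0)
    p_diff = 0.25 - 0.25 * e
    if p_diff <= 0:
        return -math.inf
    log_lik = n_same * math.log(0.25 + 0.75 * e) + n_diff * math.log(p_diff)
    return log_lik - prior_rate * d


def metropolis(n_same, n_diff, n_iter=20_000, burn_in=2_000, step=0.02, seed=1):
    """Return posterior samples of d after discarding the burn-in."""
    rng = random.Random(seed)
    d = 0.5  # arbitrary starting value
    current = log_posterior(d, n_same, n_diff)
    samples = []
    for i in range(n_iter):
        proposal = d + rng.gauss(0.0, step)  # symmetric random-walk proposal
        candidate = log_posterior(proposal, n_same, n_diff)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            d, current = proposal, candidate
        if i >= burn_in:
            samples.append(d)
    return samples


samples = metropolis(n_same=900, n_diff=100)
posterior_mean = sum(samples) / len(samples)
print(f"posterior mean of d: {posterior_mean:.3f}")
```

The spread of the samples, not just their mean, is the result: it is the sampler's estimate of the uncertainty in d.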
Bayes’ rule in statistics
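The equation on this slide did not survive transcription. As a reminder of the general form, Bayes' rule applied to phylogenetics can be written as below; the symbols are chosen here for illustration (τ for the tree, θ for branch lengths and model parameters, D for the sequence data) and may differ from those on the original slide.

```latex
P(\tau, \theta \mid D) = \frac{P(D \mid \tau, \theta)\, P(\tau, \theta)}{P(D)}
```

The left-hand side is the posterior, P(D | τ, θ) is the likelihood, P(τ, θ) is the prior, and P(D) is a normalising constant that MCMC avoids computing directly.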
Bayesian vs. ML parameter estimation
[Figure: likelihood or posterior probability (y-axis) plotted against some parameter, e.g. transition/transversion ratio (x-axis). From Holder and Lewis 2003, Nature Reviews Genetics]
How well supported is a grouping?
Non-parametric bootstrap
• Sample from the original data to create ‘new’ data sets
• Count how often a particular clade appears in the resampled data
• Values > 70% are generally considered strong support
Bootstrapping
• “New” data sets of the same size are generated from the original data by sampling columns with replacement
• Trees are built from these new data sets
• The frequency with which a node appears across replicate trees is taken as a measure of confidence for that node
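The resampling step can be sketched as follows. Column sampling is as described above; the tree-building step is replaced by a deliberately simple stand-in (a user-supplied test of whether a grouping is present), so `has_grouping`, the taxon names s1..s3 and the example criterion are all hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the rows in step."""
    names = list(alignment)
    n_cols = len(alignment[names[0]])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(alignment[name][c] for c in cols) for name in names}


def bootstrap_support(alignment, has_grouping, n_reps=1000, seed=0):
    """Percentage of bootstrap replicates in which `has_grouping` holds."""
    rng = random.Random(seed)
    hits = sum(has_grouping(bootstrap_alignment(alignment, rng)) for _ in range(n_reps))
    return 100.0 * hits / n_reps


def n_differences(seq1, seq2):
    return sum(a != b for a, b in zip(seq1, seq2))


def s2_groups_with_s3(rep):
    # Stand-in for 'clade present in the replicate tree':
    # s2 is more similar to s3 than to s1.
    return n_differences(rep["s2"], rep["s3"]) < n_differences(rep["s2"], rep["s1"])


# First three sequences from the 'overall aim' slide, with hypothetical names.
alignment = {"s1": "ATTTCTCTG", "s2": "ATTTCCTTA", "s3": "ATGTCCTTA"}
support = bootstrap_support(alignment, s2_groups_with_s3)
print(f"bootstrap support: {support:.1f}%")
```

In a real analysis each replicate would be passed to a tree-building method and the grouping checked on the resulting tree.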