Bioinformatics: Phylogenetic inference
WHY DO WE CARE?
Rita Castilho

What for? Molecular systematics, evolution, ecology, forensics, medicine (virus evolution, vaccines, drug development)

Uses of phylogenies: Systematics
• Similar organisms are grouped together
• Clades share common evolutionary history
• Phylogenetic classification names clades

Source: Inoue, J.G., Miya, M., Tsukamoto, K., Nishida, M. 2003. Basal actinopterygian relationships: a mitogenomic perspective on the phylogeny of the “ancient fish”. Molecular Phylogenetics and Evolution, 26: 110-120. Pryer et al. 2001
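For the three sequences AGC (1), AAC (2) and ACC (3), the observed differences are:
Sequences 1 and 2 differ at 1 out of 3 positions = 1/3
Sequences 1 and 3 differ at 1 out of 3 positions = 1/3
Sequences 2 and 3 differ at 1 out of 3 positions = 1/3
In the Jukes-Cantor (JC) model the number of nucleotide substitutions per site is given by:
$d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}P\right)$
where P is the proportion of nucleotides that are different (the observed differences above) in the two sequences and ln is the natural log function. Applying this to the observed differences above (P = 1/3) gives the same JC distance for every pair:
$d = -\frac{3}{4}\ln\left(1 - \frac{4}{9}\right) = 0.441$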
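As a minimal sketch, the observed proportion of differences and the standard Jukes-Cantor correction, d = -3/4 ln(1 - 4/3 P), can be computed for the three example sequences AGC, AAC and ACC. The function names `p_distance` and `jc_distance` are illustrative choices, not part of the slides.

```python
import math

def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jc_distance(p):
    """Jukes-Cantor distance for an observed proportion p (requires p < 3/4)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

seqs = {1: "AGC", 2: "AAC", 3: "ACC"}
for i, j in [(1, 2), (1, 3), (2, 3)]:
    p = p_distance(seqs[i], seqs[j])
    print(i, j, round(p, 3), round(jc_distance(p), 3))
```

Each pair differs at one of three sites, so every pair gets the same corrected distance.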
Kimura's Two Parameter model (K2P) incorporates the observation that the rate of transitions per site (a) may differ from the rate of transversions (b), giving a total rate of substitutions per site of (a + 2b) (there are three possible substitutions: one transition and two transversions). The transition:transversion ratio a/b is often represented by the letter kappa (κ).
K80 model (Kimura, 1980) or Kimura 2P
In the K2P model the number of nucleotide substitutions per site is given by:
$d = \frac{1}{2}\ln\left[\frac{1}{1 - 2P - Q}\right] + \frac{1}{4}\ln\left[\frac{1}{1 - 2Q}\right]$
where P and Q are the proportional differences between the two sequences due to transitions and transversions, respectively.
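The K2P formula, d = 1/2 ln[1/(1 - 2P - Q)] + 1/4 ln[1/(1 - 2Q)], can be sketched in a few lines. This is an illustrative implementation that assumes aligned sequences containing only A, C, G and T; the function name `k2p_distance` is hypothetical.

```python
import math

PURINES = {"A", "G"}  # C and T are the pyrimidines

def k2p_distance(a, b):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    n = len(a)
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        # A change within purines or within pyrimidines is a transition.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    P = transitions / n    # proportion of sites differing by a transition
    Q = transversions / n  # proportion of sites differing by a transversion
    return 0.5 * math.log(1 / (1 - 2 * P - Q)) + 0.25 * math.log(1 / (1 - 2 * Q))

for a, b in [("AGC", "AAC"), ("AGC", "ACC"), ("AAC", "ACC")]:
    print(a, b, round(k2p_distance(a, b), 3))
```

A single transition (G to A) yields a larger K2P distance than a single transversion (G to C, or A to C) in this example.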
Sequences 1 and 2 (AGC, AAC) differ by one transition
Sequences 1 and 3 (AGC, ACC) differ by one transversion
Sequences 2 and 3 (AAC, ACC) differ by one transversion

Observed differences
     1      2      3
1    -
2    0.333  -
3    0.333  0.333  -

Jukes-Cantor model
     1      2      3
1    -
2    0.441  -
3    0.441  0.441  -

Kimura 2P
     1      2      3
1    -
2    0.549  -
3    0.477  0.477  -
Note how applying different models gives different distances.
Estimating Genetic Differences
[Figure: differences between sequences (y axis, 0 to 1.5) plotted against time (x axis, 0 to 75), with curves for expected differences and observed differences. Marked values: observed 0.333; JC: 0.441; K2P: 0.477-0.549.]
Molecular Clock
The molecular clock hypothesis proposes that, for any given protein, the rate of molecular evolution is approximately constant over time in all lineages.
Reading trees
“Root”: common ancestor of organisms in the phylogeny (ancestral node or root of the tree)
Internal branch: common ancestor of a subset of species in the tree (branches or lineages)
“Node”: point of divergence of two species (internal nodes or divergence points; they represent hypothetical ancestors of the taxa)
“Leaf”: terminal branch leading to a species (terminal nodes)
Clade: group of species descended from a common ancestor
[Figure: example tree with taxa A, B, C and D.]
Phylogenetic inference resolves the association order of lineages
[Figure: three trees of taxa A to E: a star phylogeny (no resolution), a partially resolved phylogeny and a fully resolved phylogeny. Labels: polytomy, bifurcation.]
Phylogenies = Evolutionary relationships
((A,(B,C)),(D,E)) = phylogeny
[Figure: tree with Taxon A, Taxon B, Taxon C, Taxon E and Taxon D.]
B - C closer, sister clade A
A - B - C, sister clade D - E
This dimension can:
• be proportional to genetic distance (differences) = phylogram or additive trees;
• be proportional to time = ultrametric trees;
• have no scale whatsoever.
If there were a temporal or genetic scale, then taxa D and E would be the most closely related and would have diverged most recently.
Mobiles
All of these rearrangements show the same evolutionary relationships between the taxa
[Figure: several drawings of the same tree of taxa A, B, C and D with the branches arranged in different orders.]
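One way to check that two drawings carry the same relationships is to compare the sets of clades they contain. The sketch below is only an illustration: it writes the phylogeny ((A,(B,C)),(D,E)) as nested Python tuples, and the function name `clades` and the rearranged example trees are assumptions made for this example.

```python
def clades(tree):
    """Return the set of clades (frozensets of taxon names) in a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):        # a terminal node (leaf)
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)               # every internal node defines one clade
        return members

    leaves(tree)
    return found

original = (("A", ("B", "C")), ("D", "E"))
rotated = (("E", "D"), (("C", "B"), "A"))    # branches swapped around each node
different = (("B", ("A", "C")), ("D", "E"))  # A and B exchanged: a new relationship

print(clades(original) == clades(rotated))
print(clades(original) == clades(different))
```

Swapping the order of branches around a node leaves the clade set unchanged, whereas moving a taxon to a different position changes it.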
[Figure: the three possible unrooted trees for the four taxa A, B, C and D, labeled Tree 1, Tree 2 and Tree 3.]
Phylogenetic tree building (or inference) methods are aimed at discovering which of the possible unrooted trees is "correct".
We would like this to be the “true” biological tree — that is, one that accurately represents the evolutionary history of the taxa. However, we must settle for discovering the computationally correct or optimal tree for the phylogenetic method of choice.
The number of unrooted trees increases in a greater than exponential manner with number of taxa
An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees
[Figure: the unrooted tree 1 of taxa A, B, C and D, with its five possible root positions numbered 1, 2, 3, 4 and 5, and the corresponding rooted trees 1 to 5.]
These trees show five different evolutionary relationships among the taxa!
Each unrooted tree theoretically can be rooted anywhere along any of its branches
[Figure: unrooted trees with four taxa (A to D), five taxa (A to E) and six taxa (A to F).]
Taxa   Unrooted trees   × roots   Rooted trees
3      1                3         3
4      3                5         15
5      15               7         105
6      105              9         945
7      945              11        10 395
8      10 395           13        135 135
9      135 135          15        2 027 025
30     8.69 × 10^36     57        4.95 × 10^38
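The counts in this table follow the standard double-factorial formulas for bifurcating trees: (2n - 5)!! unrooted trees and (2n - 3)!! rooted trees for n taxa, where the rooted count equals the unrooted count multiplied by the 2n - 3 branches on which a root can be placed. The sketch below is illustrative; the function names are assumptions.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 taxa."""
    return double_factorial(2 * n - 5)

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 taxa."""
    return double_factorial(2 * n - 3)

for n in (3, 4, 5, 6, 7, 8, 9, 10, 20, 30):
    print(n, unrooted_trees(n), 2 * n - 3, rooted_trees(n))
```

Python integers have arbitrary precision, so the exact counts can be printed even for 30 taxa.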
For 10 sequences there are more than 34 million rooted trees.
For 20 sequences there are 8,200,794,532,637,891,559,000 trees.
In a recent study of 135 human mtDNA sequences there were potentially 2.113 × 10^267 trees.
This number is larger than the number of particles known in the universe!!
Mid-point rooting Outgroup rooting
[Figure: the same tree of species A to K shown three times, highlighting Grouping 1, Grouping 2 and Grouping 3.]
Monophyletic. In this tree, grouping 1, consisting of the seven species B–H, is a monophyletic group, or clade. A monophyletic group is made up of an ancestral species (species B in this case) and all of its descendant species. Only monophyletic groups qualify as legitimate taxa derived from cladistics.
Paraphyletic. Grouping 2 does not meet the cladistic criterion: It is paraphyletic, which means that it consists of an ancestor (A in this case) and some, but not all, of that ancestor’s descendants. (Grouping 2 includes the descendants I, J, and K, but excludes B–H, which also descended from A.)
Polyphyletic. Grouping 3 also fails the cladistic test. It is polyphyletic, which means that it lacks the common ancestor (A) of the species in the group. Furthermore, a valid taxon that includes the extant species G, H, J, and K would necessarily also contain D and E, which are also descended from A.
Phylogenetic Methods
Distance:
• Tree based on pairwise distances between sequences
• Evolutionary models applied to pairwise distances to account for multiple substitutions per site and rate heterogeneity among sites
Parsimony:
• Minimize the number of substitutions
• Assumes sites are independent
• Assumes 1 substitution per site
Maximum Likelihood:
• Maximize probability of sequences given tree
• Evolutionary models applied to each position to account for multiple substitutions per site and rate heterogeneity among sites
• Gives single tree with highest likelihood
• Assumes sites are independent
Bayesian:
• Maximize posterior probability of tree given sequences
• Evolutionary models applied to each position to account for multiple substitutions per site and rate heterogeneity among sites
• Integrates over all trees
• Assumes sites are independent
Molecular phylogenetic tree building methods are mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available today, each having strengths and weaknesses. Most can be classified as follows:
Main methods of molecular phylogeny: UPGMA
Fifth round, sixth round (distance matrices from the last clustering rounds):

         A,B,C   D,E   F
A,B,C    -
D,E      6       -
F        8       8     -

               (A,B,C)(D,E)
(A,B,C)(D,E)   -
F              8

Note that this method identifies the root of the tree
UPGMA fails when rates of evolution are not constant
A tree in which the evolutionary rates are not equal
From http://www.icp.ucl.ac.be/~opperd/private/upgma.html
     A    B    C    D    E
B    5
C    4    7
D    7    10   7
E    6    9    6    5
F    8    11   8    9    8

Distance method: NJ
Main methods of molecular phylogeny: NJ
The neighbor-joining method of Saitou and Nei (1987) is especially useful for making a tree having a large number of taxa.
Begin by placing all the taxa in a star-like structure.
Making trees using neighbor-joining
Shortest pairs are chosen to be neighbors and then joined in distance matrix as one OTU.
Tree-building methods: Neighbor-joining
Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.
Define the distance from X to Y by
$d_{XY} = \frac{1}{2}\left(d_{1Y} + d_{2Y} - d_{12}\right)$
The neighbor joining method joins, at each step, the two closest sub-trees that are not already joined. It is based on the minimum evolution principle. One of the important concepts in the NJ method is neighbors, which are defined as two taxa that are connected by a single node in an unrooted tree.
[Figure: neighbors A and B connected through Node 1.]
     A    B    C    D    E
B    5
C    4    7
D    7    10   7
E    6    9    6    5
F    8    11   8    9    8

[Figure: star tree with OTUs A, B, C, D, E and F.]
We have in total 6 OTUs (N = 6).
Step 1: We calculate the net divergence r(i) for each OTU from all other OTUs.
Step 3: Now we choose as neighbors those two OTUs for which Mij is the smallest. These are A and B, and D and E. Let's take A and B as neighbors and form a new node called U (joining A and B).
Step 5: Now N is N - 1 = 5, and the entire procedure is repeated starting at step 1.
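The first round of this procedure can be sketched on the six-OTU distance matrix of the example. Steps 2 and 4 are not spelled out in the text, so the sketch assumes the standard Saitou and Nei formulas: Mij = dij - (ri + rj)/(N - 2) for step 2 and, for step 4, branch lengths S(A,U) = dAB/2 + (rA - rB)/(2(N - 2)) and S(B,U) = dAB - S(A,U), with new distances dkU = (dAk + dBk - dAB)/2. Variable names are illustrative.

```python
from itertools import combinations

names = ["A", "B", "C", "D", "E", "F"]
lower = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
         "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}

# Build a symmetric lookup from the lower-triangular matrix.
d = {}
for row, values in lower.items():
    for col, value in zip(names, values):
        d[row, col] = d[col, row] = value

N = len(names)

# Step 1: net divergence r(i) of each OTU from all other OTUs.
r = {i: sum(d[i, k] for k in names if k != i) for i in names}

# Step 2 (assumed formula): rate-corrected matrix M.
M = {(i, j): d[i, j] - (r[i] + r[j]) / (N - 2) for i, j in combinations(names, 2)}

# Step 3: pairs with the smallest M are candidate neighbors.
best = min(M.values())
candidates = [pair for pair, m in M.items() if m == best]
i, j = candidates[0]

# Step 4 (assumed formulas): branch lengths to the new node U and distances from U.
s_iu = d[i, j] / 2 + (r[i] - r[j]) / (2 * (N - 2))
s_ju = d[i, j] - s_iu
d_u = {k: (d[i, k] + d[j, k] - d[i, j]) / 2 for k in names if k not in (i, j)}

print(candidates, s_iu, s_ju, d_u)
```

Under these assumptions the tied candidates are (A, B) and (D, E), the branches from A and B to U have lengths 1 and 4, and the distances from U to C, D, E and F are 3, 6, 5 and 7.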
     U    C    D    E
C    3
D    6    7
E    5    6    5
F    7    8    9    8

[Figure: star tree after joining A and B at the new node U, with branch lengths 1 (A) and 4 (B).]
Comparison of UPGMA and NJ
[Figure: the UPGMA tree (rooted, with the ROOT labeled) and the NJ tree for OTUs A to F, drawn with branch lengths.]
Neighbor Joining finds the shortest (minimum evolution) tree by finding neighbors that minimize the total length of the tree. Shortest pairs are chosen to be neighbors and then joined in the distance matrix as one OTU.
The algorithm does not assume that the molecular clock is constant for the sequences in the tree. If there are unequal substitution rates, the tree is more accurate than UPGMA.
Distance Methods: evolutionary distances (number of substitutions) are computed for all pairs of taxa.
UPGMA: unweighted pair group method with arithmetic means.
It assumes an equal rate of substitutions (therefore the tree is always rooted, as the taxon that has accumulated more substitutions is evidently older). If the substitution rates are different among taxa, then the tree may be wrong.
It is a sequential clustering algorithm: pairs of taxa are clustered in order of decreasing similarity.
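A compact sketch of UPGMA-style sequential clustering, applied to the six-OTU matrix with unequal rates used in the NJ example, shows how the two methods can disagree. This is an illustrative implementation, not the one used to draw the slides: the distance between two clusters is taken as the arithmetic mean of all pairwise distances between their members, and the function name `upgma_merge_order` is an assumption.

```python
from itertools import combinations

def upgma_merge_order(names, dist):
    """Return the merges as (cluster 1, cluster 2, average distance), in order."""
    clusters = [frozenset([n]) for n in names]

    def average(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((sorted(c1), sorted(c2), average(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges

names = "ABCDEF"
lower = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
         "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
dist = {frozenset((row, names[k])): value
        for row, values in lower.items() for k, value in enumerate(values)}

merges = upgma_merge_order(names, dist)
for step in merges:
    print(step)
```

On this matrix the first pair clustered is A with C (distance 4), whereas neighbor joining picks A and B as neighbors, which illustrates why UPGMA can give a wrong tree when rates differ among lineages.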
Parsimony
Main methods of molecular phylogeny: Maximum parsimony
William of Ockham (or Occam) was a 14th-century English logician and Franciscan friar whose name is given to the principle that, when trying to choose between multiple competing theories, the simplest one is probably the best. This principle is known as Ockham's razor.
Parsimony: • Minimize the number of substitutions • Assumes sites are independent • Assumes <1 substitution per site
Species: 1 2 3 4
Data:    A G A G
[Figure: the data mapped onto alternative trees. When the internal nodes are reconstructed as A (Tree 1) or as G (Tree 2), 2 changes are required; written together, the internal nodes are (A or G), with 2 changes. On the alternative tree the internal nodes are (A) and (G), with (A or G) at the root, and only 1 change is required: this tree is more parsimonious.]
Fitch (1971) Systematic Zoology 20:406-416
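The change counts for the data A, G, A, G can be reproduced with Fitch's (1971) set-based counting. The sketch below is illustrative and rests on an assumption about the topologies, which are not recoverable from the text: one tree pairs species 1 with 2 and 3 with 4, the other pairs 1 with 3 and 2 with 4. Trees are written as nested tuples and the function name `fitch` is hypothetical.

```python
def fitch(tree, states):
    """Return (possible states at this node, minimum number of changes below it)."""
    if isinstance(tree, str):              # terminal node: its observed state
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:                             # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

states = {"1": "A", "2": "G", "3": "A", "4": "G"}
tree_12_34 = (("1", "2"), ("3", "4"))      # assumed topology needing 2 changes
tree_13_24 = (("1", "3"), ("2", "4"))      # assumed topology needing 1 change

for tree in (tree_12_34, tree_13_24):
    root_states, changes = fitch(tree, states)
    print(tree, sorted(root_states), changes)
```

With these topologies the first tree has (A or G) at every internal node and needs 2 changes, while the second has internal nodes (A) and (G) and needs 1 change.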
Parsimony methods
Optimality criterion: The ‘most-parsimonious’ tree is the one that requires the fewest number of evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.
Advantages: • Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’). • Can be used on molecular and non-molecular (e.g., morphological) data. • Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy) • Can be used for character (can infer the exact substitutions) and rate analysis • Can be used to infer the sequences of the extinct (hypothetical) ancestors
Disadvantages: • Not based on statistical properties • Can be fooled by high levels of homoplasy (‘same’ events)
• Parsimony seeks solutions that minimize the amount of change required to explain the data (underestimates superimposed changes)
• ML attempts to estimate the actual amount of change (by specifying the evolutionary model that will account for the data with the highest likelihood)
• Methods that incorporate models of evolutionary change can make more efficient use of the data
Main methods of molecular phylogeny: Maximum likelihood
Maximum likelihood (ML) methods
Optimality criterion: ML methods evaluate phylogenetic hypotheses in terms of the probability that a proposed model of the evolutionary process and the proposed unrooted tree would give rise to the observed data. The tree found to have the highest ML value is considered to be the preferred tree.
Advantages: • Are based on explicit model of evolution. • Usually the most ‘consistent’ of the methods available. • Can be used for character (can infer the exact substitutions) and rate analysis. • Can be used to infer the sequences of the extinct (hypothetical) ancestors. • Can help account for branch-length effects.
Disadvantages: • Are based on explicit model of evolution. • Are not as simple and intuitive as many other methods. • Are computationally very intense (limits number of taxa and length of sequence). • Slooooow!!! • Violations of the assumed model can lead to incorrect trees.
Idea from Gavin Naylor
Three dice: 6 faces, 8 faces and 12 faces.
Roll the dice. How many points? 14 POINTS.
For a result of 14 points we need two dice. Which pair of dice most probably originates that result?
This is equivalent to asking: which tree is most likely to have yielded these sequences?
How many ways of obtaining the score “14” are there for each pair?
6 + 8 faces: 1 way (6 + 8)
6 + 12 faces: 5 ways (2 + 12, 3 + 11, 4 + 10, 5 + 9, 6 + 8)
8 + 12 faces: 7 ways (2 + 12, 3 + 11, 4 + 10, 5 + 9, 6 + 8, 7 + 7, 8 + 6)
Probability of each combination?
6 + 8 faces: 1/6 x 1/8 = 1/48
6 + 12 faces: 1/6 x 1/12 = 1/72
8 + 12 faces: 1/8 x 1/12 = 1/96
Now multiply ways of obtaining the score “14” by the probability of any single outcome to get the likelihood.
6 + 8 faces: 1/48 x 1 = 0.0208
6 + 12 faces: 1/72 x 5 = 0.0694
8 + 12 faces: 1/96 x 7 = 0.0729
Notice that none of the likelihoods are very “likely”, but (8+12) is more likely than the other two
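The dice reasoning can be checked by brute force: for each pair of dice, count the outcomes that sum to 14 and multiply by the probability of any single outcome. This is a small illustrative sketch; the variable names are assumptions.

```python
from fractions import Fraction
from itertools import combinations

dice = [6, 8, 12]   # number of faces on each die
target = 14

likelihood = {}
for a, b in combinations(dice, 2):
    # Number of (x, y) outcomes that add up to the target score.
    ways = sum(1 for x in range(1, a + 1) for y in range(1, b + 1) if x + y == target)
    # Every single outcome of the pair has probability 1/a * 1/b.
    likelihood[a, b] = ways * Fraction(1, a) * Fraction(1, b)

for pair, value in likelihood.items():
    print(pair, value, round(float(value), 4))

print(max(likelihood, key=likelihood.get))
```

Using `Fraction` keeps the values exact, so the likelihoods 1/48, 5/72 and 7/96 can be compared without rounding error.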
HOW DO WE CONVERT THIS RATIONALE TO MAXIMUM LIKELIHOOD ESTIMATIONS?
1. Calculate likelihood for each site on a specific tree.
The likelihood of one position of the alignment, in this case position 5, is equal to the sum of the likelihoods of all possible reconstructions at nodes 5 and 6.
The likelihood of the tree is the product of the individual likelihoods of all sites in the alignment.
The log-likelihood of the tree is the sum of the logarithms of the likelihoods of each site.
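The relationship between the product of per-site likelihoods and the sum of their logarithms can be illustrated numerically. The per-site values below are hypothetical numbers chosen only for the example; they do not come from any alignment in the slides.

```python
import math

# Hypothetical per-site likelihoods for a five-site alignment on one tree.
site_likelihoods = [0.012, 0.0031, 0.045, 0.0009, 0.027]

# Likelihood of the tree: product of the per-site likelihoods.
tree_likelihood = math.prod(site_likelihoods)

# Log-likelihood of the tree: sum of the per-site log-likelihoods.
log_likelihood = sum(math.log(x) for x in site_likelihoods)

print(tree_likelihood)
print(log_likelihood)
print(math.isclose(math.log(tree_likelihood), log_likelihood))
```

Working with the sum of logarithms avoids numerical underflow, because the product of many small per-site likelihoods quickly becomes too small to represent as a floating-point number.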