Intro to Phylogenetic Trees
Lecture 6
Sections 7.1, 7.2, in Durbin et al. Chapter 17 in Gusfield.
Slides by Shlomo Moran and by Ydo Wexler. Modifications by Benny Chor.

Evolution: The Tree of Life (Source: Alberts et al.)

Tree of life: a better picture
We start with distance based methods, considering the following question:
Given a set of species (leaves in a supposed tree), and distances between them – construct a phylogeny which best “fits” the distances.
Exact solution: Additive sets
Given a set M of L objects with an L×L distance matrix:
• d(i,i) = 0, and for i ≠ j, d(i,j) > 0.
• d(i,j) = d(j,i).
• For all i,j,k it holds that d(i,k) ≤ d(i,j) + d(j,k).
Can we construct a weighted tree which realizes these distances?
Additive Distances (cont)
We say that the set of distances M over L objects is additive if there is a tree T, L of whose nodes correspond to the L objects, with positive weights on the edges, such that for all i,j: d(i,j) = dT(i,j), the length of the path from i to j in T.
Note: Sometimes the tree is required to be binary, and then the edge weights are required to be just non-negative.
Distances for three objects are always additive:
For L=3, there is always a (unique) tree with one internal node (by simple linear algebra):
d(i,j) = a + b
d(i,k) = a + c
d(j,k) = b + c
[Figure: a star tree with internal node m and leaves i, j, k at distances a, b, c from m.]
Thus c = d(k,m) = (1/2)[d(i,k) + d(j,k) − d(i,j)] ≥ 0.
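The three-object case can be sketched in a few lines of Python. This is only an illustration: the function name `three_leaf_tree` and the sample distances are assumptions, not part of the slides. It solves the three linear equations for the edge lengths a, b, c.

```python
def three_leaf_tree(d_ij, d_ik, d_jk):
    """Edge lengths (a, b, c) from leaves i, j, k to the internal node m.

    Hypothetical helper: solves d(i,j)=a+b, d(i,k)=a+c, d(j,k)=b+c.
    """
    a = (d_ij + d_ik - d_jk) / 2
    b = (d_ij + d_jk - d_ik) / 2
    c = (d_ik + d_jk - d_ij) / 2
    return a, b, c


# Illustrative distances (made up for this sketch).
a, b, c = three_leaf_tree(d_ij=5, d_ik=6, d_jk=7)
print(a, b, c)  # 2.0 3.0 4.0
```

The triangle inequality is what guarantees that none of the three values is negative.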
How about four objects?
Not all distance matrices with 4 objects are additive, even if they satisfy the triangle inequality. E.g., no tree realizes these distances:
     i  j  k  l
i    0  2  2  2
j       0  2  2
k          0  3
l             0
The Four Points Condition
Theorem: A set M of distances is additive iff any subset of four objects can be labeled i,j,k,l so that:
d(i,k) + d(j,l) = d(i,l) + d(j,k) ≥ d(i,j) + d(k,l)
We call (i,j),(k,l) the “split” of {i,j,k,l}.
Proof: By inspecting the figure, additivity ⇒ 4 points condition.
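A minimal sketch of testing the condition on one quartet, assuming the distances are stored in a nested list `d` indexed by object number. The function name `four_point_split` and the tolerance `tol` are illustrative choices, not from the slides. Of the three pairwise sums, the two largest must be equal; the pairing with the smallest sum is the split.

```python
def four_point_split(d, i, j, k, l, tol=1e-9):
    """Return the split of {i,j,k,l} if the 4 points condition holds, else None."""
    sums = [
        (d[i][j] + d[k][l], ((i, j), (k, l))),
        (d[i][k] + d[j][l], ((i, k), (j, l))),
        (d[i][l] + d[j][k], ((i, l), (j, k))),
    ]
    sums.sort(key=lambda t: t[0])
    _smallest, middle, largest = sums
    if abs(largest[0] - middle[0]) > tol:
        return None  # the two largest sums differ: no tree realizes this quartet
    return sums[0][1]


# The 4-object example with d(k,l)=3 and all other distances 2 (order i, j, k, l).
bad = [[0, 2, 2, 2], [2, 0, 2, 2], [2, 2, 0, 3], [2, 2, 3, 0]]
print(four_point_split(bad, 0, 1, 2, 3))  # None

# An illustrative additive quartet: sums are 9, 11, 11.
good = [[0, 2, 7, 4], [2, 0, 7, 4], [7, 7, 0, 7], [4, 4, 7, 0]]
print(four_point_split(good, 0, 1, 2, 3))  # ((0, 1), (2, 3))
```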
4P Condition ⇒ Additivity:
Induction on the number of objects, L. For L ≤ 3 the condition is empty and a tree exists.
Consider L=4. Denote B = d(i,k) + d(j,l) = d(i,l) + d(j,k) ≥ d(i,j) + d(k,l) = A.
Let y = (B – A)/2 ≥ 0 (length of internal edge).
Then the tree should look as follows:
[Figure: leaves i and j attached to internal node m by edges of length a and b; leaves k and l attached to internal node n by edges of length c and f; internal edge (m,n) of length y.]
We want to find the distances a, b, c and f. Again, an instance of linear algebra.
Tree construction for L=4
[Figure: the quartet tree with leaves i, j, k, l, internal nodes m, n, and edge lengths a, b, c, f, y.]
Construct the tree by the given distances as follows:
1. Construct a tree for {i,j,k}, with internal vertex m.
2. Add vertex n, d(m,n) = y.
3. Add edge (n,l), c + f = d(k,l).
Remains to prove:
d(i,l) = dT(i,l)
d(j,l) = dT(j,l)
Proof for L=4
[Figure: the same quartet tree, with edge lengths a, b, c, f, y.]
By the 4 points condition and the definition of y:
d(i,l) = d(i,j) + d(k,l) + 2y − d(k,j) = a + y + f = dT(i,l)
(the middle equality holds since d(i,j), d(k,l) and d(k,j) are realized by the tree).
d(j,l) = dT(j,l) is proved similarly.
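The L=4 construction and the final check can be sketched as follows. The function name `quartet_tree` and the sample distances are assumptions made for illustration; the labeling is assumed to already satisfy the 4 points condition with split (i,j),(k,l).

```python
def quartet_tree(d_ij, d_ik, d_il, d_jk, d_jl, d_kl):
    """Edge lengths a, b, c, f, y of the quartet tree (hypothetical helper)."""
    A = d_ij + d_kl
    B = d_ik + d_jl
    if B != d_il + d_jk or B < A:
        raise ValueError("labels do not satisfy the 4 points condition")
    y = (B - A) / 2                    # internal edge (m, n)
    a = (d_ij + d_ik - d_jk) / 2       # edge (i, m), from the tree on {i, j, k}
    b = (d_ij + d_jk - d_ik) / 2       # edge (j, m)
    m_to_k = (d_ik + d_jk - d_ij) / 2  # path from m to k; n sits on it
    c = m_to_k - y                     # edge (n, k)
    f = d_kl - c                       # edge (n, l), since c + f = d(k,l)
    return {"a": a, "b": b, "c": c, "f": f, "y": y}


# Illustrative additive distances (not from the slides).
t = quartet_tree(d_ij=2, d_ik=7, d_il=4, d_jk=7, d_jl=4, d_kl=7)
print(t)  # {'a': 1.0, 'b': 1.0, 'c': 5.0, 'f': 2.0, 'y': 1.0}
print(t["a"] + t["y"] + t["f"])  # 4.0, which equals d(i,l)
```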
Splits Approach to Proof: Intuition
[Figure: a quartet i, j, k, l.]
Suppose the 4 points condition holds with strict inequality, >, for every four leaves.
This defines a (2,2) partition of every quartet. We can use the 4 points condition to show that all quartets are consistent.
This in turn is used to construct the tree (homework assignment).
Finally, show that the tree distances agree with the original distances, using linear algebra.
Linear Algebraic Approach: Induction
• Remove the L-th object from the set.
• By induction, there is a tree, T’, for {1,2,…,L-1}.
• For each pair of labeled nodes (i,j) in T’, let aij, bij, cij be defined by the following figure:
[Figure: nodes i and j joined through the point mij, at distances aij and bij from it; object L attached to mij by an edge of length cij.]
cij = (1/2)[d(i,L) + d(j,L) − d(i,j)]
Induction step:
• Pick i and j that minimize cij.
• T is constructed by adding L (and possibly mij) to T’, as in the figure. Then d(i,L) = dT(i,L) and d(j,L) = dT(j,L).
• Remains to prove: For each k ≠ i,j: d(k,L) = dT(k,L).
[Figure: T’ with the path from i to j, the point mij on it, and L attached to mij by an edge of length cij.]
Induction step (cont.)
• Let k ≠ i,j be an arbitrary node in T’, and let n be the branching point of k in the path from i to j.
• By the minimality of cij, (i,j),(k,L) is not a split of {i,j,k,L}.
• Assume WLOG that (i,L),(j,k) is a split of {i,j,k,L}.
[Figure: as before, with k attached to the path from i to j at the branching point n.]
Induction step (end)
Since (i,L),(j,k) is a split, by the 4 points condition
d(L,k) = d(i,k) + d(L,j) − d(i,j).
d(i,k) = dT(i,k) and d(i,j) = dT(i,j) by induction, and d(L,j) = dT(L,j) by the construction.
Hence d(L,k) = dT(L,k). QED
From Additive Distance to a Tree
By following the proof, the four point condition can be used to construct a tree from a distance matrix, or to decide that there is no such tree (namely that the distance is not additive).
But this algorithm will go over all quartets, resulting in O(L^4) many steps for L species (too sllllllllllllow).
The most popular method for constructing trees for additive sets uses the neighbor joining approach.
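For comparison, the quartet-by-quartet additivity test mentioned above can be sketched as a brute-force loop. The function name `is_additive`, the tolerance `tol`, and the two sample matrices are illustrative assumptions. The loop visits every 4-subset, which is the source of the O(L^4) cost.

```python
from itertools import combinations


def is_additive(d, tol=1e-9):
    """Check the 4 points condition on every quartet of a symmetric matrix d."""
    for i, j, k, l in combinations(range(len(d)), 4):
        s = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if abs(s[2] - s[1]) > tol:  # the two largest sums must coincide
            return False
    return True


# Illustrative matrices (assumed for this sketch).
additive = [
    [0, 2, 7, 4, 7],
    [2, 0, 7, 4, 7],
    [7, 7, 0, 7, 6],
    [4, 4, 7, 0, 7],
    [7, 7, 6, 7, 0],
]
not_additive = [[0, 2, 2, 2], [2, 0, 2, 2], [2, 2, 0, 3], [2, 2, 3, 0]]
print(is_additive(additive), is_additive(not_additive))  # True False
```

With fewer than four objects the loop body never runs, matching the remark that the condition is empty for L ≤ 3.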
Constructing additive trees: The neighbor joining problem
• Let i, j be sisters (neighboring leaves) in a tree, let k be their father, and let m be any other vertex.
• Using the equation
d(k,m) = [d(i,m) + d(j,m) − d(i,j)] / 2
we can compute the distances from k to all other leaves.
This suggests the following method to construct a tree from an additive distance matrix:
1. Find sisters i,j in the tree.
2. Replace i,j by their father, k, and recursively construct a tree T for the smaller set.
3. Add i,j as children of k in T.
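Step 2 of the method, replacing two sisters by their father, can be sketched as below. The function name `replace_sisters_by_father`, the dict-of-dicts layout and the sample data are assumptions made for illustration.

```python
def replace_sisters_by_father(d, i, j, k):
    """Return a new distance table in which sisters i, j are replaced by father k.

    d is a dict of dicts with symmetric distances (assumed layout).
    Uses d(k,m) = [d(i,m) + d(j,m) - d(i,j)] / 2 for every other node m.
    """
    others = [m for m in d if m not in (i, j)]
    new_d = {m: {n: d[m][n] for n in others} for m in others}
    new_d[k] = {k: 0.0}
    for m in others:
        dist = (d[i][m] + d[j][m] - d[i][j]) / 2
        new_d[k][m] = dist
        new_d[m][k] = dist
    return new_d


# Illustrative additive matrix on leaves A..E, in which A and B are sisters.
names = "ABCDE"
rows = [
    [0, 2, 7, 4, 7],
    [2, 0, 7, 4, 7],
    [7, 7, 0, 7, 6],
    [4, 4, 7, 0, 7],
    [7, 7, 6, 7, 0],
]
d = {a: {b: rows[x][y] for y, b in enumerate(names)} for x, a in enumerate(names)}
smaller = replace_sisters_by_father(d, "A", "B", "AB")
print(smaller["AB"])  # {'AB': 0.0, 'C': 6.0, 'D': 3.0, 'E': 6.0}
```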
Neighbor Finding
How can we find from distances alone a pair of sisters (neighboring leaves)? Closest nodes are not necessarily neighboring leaves.
[Figure: a tree with leaves A, B, C, D in which the closest pair of leaves are not neighbors.]
Next, we show a way to find neighbors from distances.
Neighbor Finding: Saitou & Nei method
For a leaf i, let r_i = Σ d(i,m), where the sum is over all leaves m.
Definition: Let i, j be two leaves (out of L leaves in T). Then their divergence is D(i,j) = d(i,j) − (r_i + r_j)/(L − 2).
Theorem (Saitou & Nei): Assume d is additive, with all tree edge weights positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are sister taxa in the tree.
[Figure: leaves i, j, k, l and m, with subtrees T1 and T2.]
The proof is rather involved, and will be skipped (no tears pls).
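While the proof is skipped, the criterion itself is easy to compute. Here is a minimal sketch, assuming a full symmetric matrix given as a nested list; the function name `min_divergence_pair` and the sample matrix are illustrative, not from the slides.

```python
from itertools import combinations


def min_divergence_pair(d):
    """Return the pair (i, j) minimizing D(i,j) = d(i,j) - (r_i + r_j)/(L - 2)."""
    L = len(d)
    r = [sum(row) for row in d]  # r_i = sum of distances from leaf i to all leaves

    def divergence(pair):
        i, j = pair
        return d[i][j] - (r[i] + r[j]) / (L - 2)

    return min(combinations(range(L), 2), key=divergence)


# Illustrative additive matrix on leaves A..E (indices 0..4).
d = [
    [0, 2, 7, 4, 7],
    [2, 0, 7, 4, 7],
    [7, 7, 0, 7, 6],
    [4, 4, 7, 0, 7],
    [7, 7, 6, 7, 0],
]
print(min_divergence_pair(d))  # (2, 4), i.e. leaves C and E
```

In this example the minimum divergence is attained by C and E (distance 6) rather than by the closest pair A and B (distance 2); both pairs happen to be sisters in the underlying tree.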
A simpler neighbor finding method:
Select an arbitrary (fixed) node r.
• For each pair of labeled nodes (i,j) let C(i,j) be defined by the following expression (also see figure):
C(i,j) = [d(i,r) + d(j,r) − d(i,j)] / 2
[Figure: C(i,j) is the length of the path from r to the point where the paths to i and j branch.]
Claim: Let i, j be such that C(i,j) is maximized. Then i and j are neighboring leaves.
Neighbor Joining Algorithm
• Set M to contain all leaves, and select a root r. |M| = L.
• If L = 2, return a tree of two vertices.
Iteration:
• Choose i,j such that C(i,j) is maximal.
• Create a new vertex k, and update distances:
d(i,k) = [d(i,j) + d(i,r) − d(j,r)] / 2
d(j,k) = d(i,j) − d(i,k)
for each other node m, d(k,m) = (1/2)[d(i,m) + d(j,m) − d(i,j)]
• Remove i,j, and add k to M.
• Recursively construct a tree on the smaller set.
• When done, add i,j as children of k, at distances d(i,k) and d(j,k).
[Figure: sisters i and j with father k, and another node m.]
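The algorithm above can be sketched in Python as follows. This is a naive version under stated assumptions: the input is a dict of dicts, the function name `neighbor_joining` and the internal node names `n0`, `n1`, … are invented for the sketch, and the recursion is unrolled into a loop that records edges as it goes (which yields the same tree).

```python
from itertools import combinations


def neighbor_joining(d, root):
    """Build a tree from an additive distance table d, using the fixed node root.

    Returns a list of edges (father, child, length); the last edge joins the
    root to the final remaining vertex.
    """
    dist = {a: dict(row) for a, row in d.items()}  # work on a copy
    active = list(d)
    edges = []
    new_id = 0
    while len(active) > 2:
        candidates = [x for x in active if x != root]
        # C(i,j) = [d(i,r) + d(j,r) - d(i,j)] / 2; choose the maximal pair.
        i, j = max(
            combinations(candidates, 2),
            key=lambda p: (dist[p[0]][root] + dist[p[1]][root] - dist[p[0]][p[1]]) / 2,
        )
        k = f"n{new_id}"  # hypothetical name for the new internal vertex
        new_id += 1
        d_ik = (dist[i][j] + dist[i][root] - dist[j][root]) / 2
        d_jk = dist[i][j] - d_ik
        dist[k] = {k: 0.0}
        for m in active:
            if m not in (i, j):
                dist[k][m] = dist[m][k] = (dist[i][m] + dist[j][m] - dist[i][j]) / 2
        edges += [(k, i, d_ik), (k, j, d_jk)]
        active = [x for x in active if x not in (i, j)] + [k]
    a, b = active
    edges.append((a, b, dist[a][b]))
    return edges


# Illustrative additive matrix on leaves A..E, with E chosen as the root r.
names = "ABCDE"
rows = [
    [0, 2, 7, 4, 7],
    [2, 0, 7, 4, 7],
    [7, 7, 0, 7, 6],
    [4, 4, 7, 0, 7],
    [7, 7, 6, 7, 0],
]
d = {a: {b: rows[x][y] for y, b in enumerate(names)} for x, a in enumerate(names)}
edges = neighbor_joining(d, root="E")
for edge in edges:
    print(edge)
```

On this input the first pair joined is A and B, and the recovered edge lengths are 1, 1, 1, 2, 2, 3, 3.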
Complexity of Neighbor Joining Algorithm
Naive Implementation:
Initialization: Θ(L^2) to compute the C(i,j)’s.
Each Iteration:
• O(L) to update the values C(i,k) for the new node k.
• O(L^2) to find the maximal C(i,j).
Total of O(L^3).
Complexity of Neighbor Joining Algorithm
Using a Heap to store the C(i,j)’s:
Initialization: Θ(L^2) to compute and heapify the C(i,j)’s.
Each Iteration:
• O(1) to find the maximal C(i,j).
• O(L log L) to delete {C(m,i), C(m,j)} and add C(m,k) for all vertices m.
Total of O(L^2 log L).
(implementation details are omitted)
Reconstructing Trees from Additive Matrices
     A  B  C  D  E
A    0  2  7  4  7
B       0  7  4  7
C          0  7  6
D             0  7
E                0
[Figure: the corresponding additive tree on leaves A, B, C, D, E, with edge weights 1, 1, 1, 2, 2, 3, 3.]
Given a distance matrix constituting an additive metric, the topology of the corresponding additive tree is unique.
Q: Do we have to test additivity before running NJ?
A: This would be bad news, as this takes O(L^4) time!
Reconstructing Trees from Additive Matrices
Q: Do we have to test additivity before running NJ?
A: By Saitou-Nei, if the matrix is additive, NJ will construct the correct tree. The algorithm does not need to be aware of this: it need not know anything about the matrix!
NJ Algorithm: Example
r_i = Σ_{j=1..n} d(i,j)
• Identify i,j as neighbours if their divergence is minimal.
After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances: Dist[Spinach, MonHum]
If we happen to consider genes 1A, 2B, and 3A of species 1,2,3, we get a wrong tree that does not represent the …
From distance based reconstruction, we now move to character based methods.
In this approach, trees are constructed by comparing the characters of the corresponding species. Characters may be morphological (teeth structures, hip joint) or molecular (homologous DNA sequences). The most popular approaches are maximum parsimony (MP) and maximum likelihood (ML).
In both methods, we will assume independence of characters (no interactions). Each method has a well defined objective function. The goal is to find the tree or trees that optimize (maximize or minimize) the respective function.
S(k,a) = the minimum score of the subtree rooted at k when k has character a.
Evaluating Parsimony Scores
Dynamic programming on a given tree
Initialization:
• For each leaf i set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞
Iteration:
• If k is a node with children i and j, then S(k,a) = min_x (S(i,x) + c(a,x)) + min_y (S(j,y) + c(a,y)), where c(a,x) is the cost of changing character a to x
Termination:
• The cost of the tree is min_x S(r,x), where r is the root
Comment:
To reconstruct an optimal assignment, we need to keep in each node k and for each character a the two characters x, y that bring about the minimum when k has character a.
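A minimal sketch of this dynamic program for a single character on a fixed binary tree. The tree encoding (a dict from internal node to its two children), the function name `parsimony_score`, the unit cost function and the sample data are all assumptions for illustration. The sketch returns only the score table; recovering an optimal assignment would additionally require storing the minimizing x, y at each node, as the comment above notes.

```python
import math


def parsimony_score(tree, root, leaf_state, states, cost):
    """Return (minimum cost of the tree, table S) with S[k][a] as defined above.

    tree: dict mapping each internal node to its two children (assumed encoding).
    leaf_state: dict mapping each leaf to its observed character.
    cost(a, x): cost of changing character a to x.
    """
    S = {}

    def visit(k):
        if k not in tree:  # leaf: 0 for the observed character, infinity otherwise
            S[k] = {a: (0 if leaf_state[k] == a else math.inf) for a in states}
            return
        i, j = tree[k]
        visit(i)
        visit(j)
        S[k] = {
            a: min(S[i][x] + cost(a, x) for x in states)
            + min(S[j][y] + cost(a, y) for y in states)
            for a in states
        }

    visit(root)
    return min(S[root].values()), S


# Illustrative data: four leaves, unit substitution cost.
tree = {"r": ("u", "v"), "u": ("s1", "s2"), "v": ("s3", "s4")}
leaf_state = {"s1": "A", "s2": "A", "s3": "C", "s4": "A"}
score, S = parsimony_score(
    tree, "r", leaf_state, states="ACGT", cost=lambda a, x: 0 if a == x else 1
)
print(score)  # 1
```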
Cost of Evaluating Parsimony for binary trees
If there are n nodes, m characters, and k possible values for each character, then complexity is O(nmk^2).
Of course, we still need to search over possible trees and find the best one. One usually resorts to heuristic search techniques.
2. Perfect Phylogeny
Data on species is given by a Character State Matrix. Cell (p,i) has value j iff character i of object (species) p has state j. Goal: construct an evolutionary tree for the species.