Intro to Phylogenetic Trees
Lecture 6

Sections 7.1, 7.2 in Durbin et al.
Chapter 17 in Gusfield

Slides by Shlomo Moran and by Ydo Wexler. Modifications by Benny Chor.

Evolution

The Tree of Life

Source: Alberts et al.

Tree of life - a better picture

Primate evolution

Historical Note

• Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria).
• Since then, the focus has been on objective criteria for constructing phylogenetic trees.
  - Thousands of articles in the last decades.
• Important for many aspects of biology:
  - Classification
  - Understanding biological mechanisms

Morphological vs. Molecular

• Classical phylogenetic analysis uses morphological features: number of legs, lengths of legs, etc.
• Modern biological methods allow the use of molecular features:
  - Gene sequences
  - Protein sequences
• Analysis is based on homologous sequences (e.g., globins) in different species.

Morphological topology

[Figure: tree topology based on morphological features]

From sequences to a phylogenetic tree

Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG

Mitochondrial topology

[Figure: tree over the following taxa]

Donkey, Horse, Indian rhino, White rhino, Grey seal, Harbor seal, Dog, Cat, Blue whale, Fin whale, Sperm whale, Hippopotamus, Sheep, Cow, Alpaca, Pig, Little red flying fox, Ryukyu flying fox, Horseshoe bat, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat, Asiatic shrew, Long-clawed shrew, Mole, Small Madagascar hedgehog, Aardvark, Elephant, Armadillo, Rabbit, Pika, Tree shrew, Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon, White-fronted capuchin, Slow loris, Squirrel, Dormouse, Cane-rat, Guinea pig, Mouse, Rat, Vole, Hedgehog, Gymnure, Bandicoot, Wallaroo, Opossum, Platypus

Nuclear topology

[Figure: tree over the following taxa]

Round Eared Bat, Flying Fox, Hedgehog, Mole, Pangolin, Whale, Hippo, Cow, Pig, Cat, Dog, Horse, Rhino, Rat, Capybara, Rabbit, Flying Lemur, Tree Shrew, Human, Galago, Sloth, Hyrax, Dugong, Elephant, Aardvark, Elephant Shrew, Opossum, Kangaroo

Theory of Evolution

• Basic idea:
  - Speciation events lead to the creation of different species.
  - Speciation is caused by physical separation into groups where different genetic variants become dominant.
• Any two species share a (possibly distant) common ancestor.

Phylogenetic trees

• Leaves - current-day species
• Nodes - hypothetical most recent common ancestors
• Edge lengths - “time” from one speciation to the next

[Figure: example tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]

Types of Trees

A natural model to consider is that of rooted trees.

[Figure: rooted tree whose root is labeled "Common Ancestor"]

Types of trees

An unrooted tree represents the same phylogeny without the root node.

Depending on the model, data from current-day species does not distinguish between different placements of the root.

[Figure: panels Tree a, Tree b, Tree c]

Positioning Roots in Unrooted Trees

• We can estimate the position of the root by introducing an outgroup:
  - a set of species that are definitely distant from all the species of interest

[Figure: tree with the outgroup Falcon and the proposed root]

Type of Reconstruction

• Distance-based
  - Input is a matrix of distances between species.
  - A distance can be the fraction of residues on which two sequences disagree, or an alignment score between them, or …
• Character-based
  - Examine all characters (AAs or DNA bases).
  - Do not "summarize" sequences or pairs of sequences by a single number.
  - Major methods: Parsimony; Likelihood.
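As an illustration of the first kind of distance mentioned above (the fraction of residues on which two sequences disagree), here is a minimal Python sketch. The four aligned fragments are the ones shown on the "From sequences to a phylogenetic tree" slide; the function names `p_distance` and `distance_matrix` are hypothetical, and real analyses would normally use longer alignments and a corrected distance.

```python
from itertools import combinations

# Aligned fragments copied from the "From sequences to a phylogenetic tree" slide.
SEQS = {
    "Rat":     "QEPGGLVVPPTDA",
    "Rabbit":  "QEPGGMVVPPTDA",
    "Gorilla": "QEPGGLVVPPTDA",
    "Cat":     "REPGGLVVPPTEG",
}

def p_distance(s, t):
    """Fraction of aligned positions at which s and t differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t)) / len(s)

def distance_matrix(seqs):
    """Symmetric dictionary {(x, y): distance} over all pairs of species."""
    d = {(x, x): 0.0 for x in seqs}
    for x, y in combinations(seqs, 2):
        d[x, y] = d[y, x] = p_distance(seqs[x], seqs[y])
    return d

PDIST = distance_matrix(SEQS)
for x, y in combinations(SEQS, 2):
    print(f"{x:8s} {y:8s} {PDIST[x, y]:.3f}")
```

On such a short fragment Rat and Gorilla come out identical, which shows why a single short stretch is rarely enough to separate species.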

Two Approaches to Tree Construction

• Distance-based methods
• Character-based methods

We start with distance-based methods, considering the following question: given a set of species (leaves in a supposed tree) and distances between them, construct a phylogeny which best “fits” the distances.

Exact solution: Additive sets

Given a set M of L objects with an L×L distance matrix such that:
• d(i,i) = 0, and for i ≠ j, d(i,j) > 0
• d(i,j) = d(j,i)
• For all i, j, k it holds that d(i,k) ≤ d(i,j) + d(j,k)

Can we construct a weighted tree which realizes these distances?

Additive Distances (cont)

We say that the set of distances M over L objects is additive if there is a tree T with positive edge weights, L of whose nodes correspond to the L objects, such that for all i, j: d(i,j) = d_T(i,j), the length of the path from i to j in T.

Note: Sometimes the tree is required to be binary, and then the edge weights are only required to be non-negative.

Distances for three objects are always additive

For L=3, there is always a (unique) tree with one internal node m (by simple linear algebra). Writing a, b, c for the lengths of the edges from i, j, k to m:

$$d(i,j) = a + b, \qquad d(i,k) = a + c, \qquad d(j,k) = b + c$$

Thus

$$c = d(k,m) = \tfrac{1}{2}\,[d(i,k) + d(j,k) - d(i,j)] \ge 0$$
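The linear system for three objects can be solved in closed form. The sketch below does that in Python; the function name `three_point_tree` and the sample distances 7, 8, 9 are made up for illustration.

```python
def three_point_tree(d_ij, d_ik, d_jk):
    """Return (a, b, c): lengths of the edges from i, j, k to the internal node m."""
    a = (d_ij + d_ik - d_jk) / 2
    b = (d_ij + d_jk - d_ik) / 2
    c = (d_ik + d_jk - d_ij) / 2
    return a, b, c

# Hypothetical distances, chosen only to exercise the formulas.
a, b, c = three_point_tree(d_ij=7.0, d_ik=8.0, d_jk=9.0)
print(a, b, c)
```

The triangle inequality is exactly what guarantees that none of the three edge lengths is negative.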

How about four objects?

Not all distance matrices with 4 objects are additive, even if they satisfy the triangle inequality. E.g., no tree realizes these distances:

     i  j  k  l
i    0  2  2  2
j       0  2  2
k          0  3
l             0

The Four Points Condition

Theorem: A set M of distances is additive iff any subset of four objects can be labeled i,j,k,l so that:

d(i,k) + d(j,l) = d(i,l) + d(k,j) ≥ d(i,j) + d(k,l)

[Figure: quartet tree with leaves i, j, k, l]

Proof: By inspecting the figure, additivity ⇒ 4 points condition...

We call (i,j),(k,l) the “split” of {i,j,k,l}.
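The condition can be tested directly by brute force over all quartets. The following Python sketch does so, assuming the input is already a symmetric matrix satisfying the metric properties; `is_additive` and the tolerance `tol` are hypothetical names. The first matrix is the four-object counterexample from the previous slide. The second is the five-object matrix that appears later in these slides with its tree, read here as a full symmetric table.

```python
from itertools import combinations

def is_additive(d, tol=1e-9):
    """True iff every quartet satisfies the four points condition.

    Of the three pairwise sums of a quartet, the two largest must be equal.
    """
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

NOT_ADDITIVE = [  # objects i, j, k, l
    [0, 2, 2, 2],
    [2, 0, 2, 2],
    [2, 2, 0, 3],
    [2, 2, 3, 0],
]
ADDITIVE = [      # objects A, B, C, D, E
    [0, 2, 7, 4, 7],
    [2, 0, 7, 4, 7],
    [7, 7, 0, 7, 6],
    [4, 4, 7, 0, 7],
    [7, 7, 6, 7, 0],
]
print(is_additive(NOT_ADDITIVE), is_additive(ADDITIVE))
```

The loop visits every quartet, so the check takes on the order of L^4 steps.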

4P Condition ⇒ Additivity

Induction on the number of objects, L. For L ≤ 3 the condition is empty and a tree exists. Consider L=4. Denote

B = d(i,k) + d(j,l) = d(i,l) + d(j,k) ≥ d(i,j) + d(k,l) = A

Let y = (B – A)/2 ≥ 0 (length of the internal edge).

Then the tree should look as follows:

[Figure: leaves i and j attached to internal node m by edges a and b; leaves k and l attached to internal node n by edges c and f; internal edge (m,n) of length y]

We want to find the distances a, b, c and f. Again, an instance of linear algebra.

Tree construction for L=4

Construct the tree by the given distances as follows:
1. Construct a tree for {i,j,k}, with internal vertex m.
2. Add vertex n on the path from m to k, with d(m,n) = y.
3. Add edge (n,l), with c + f = d(k,l).

Remains to prove:
d(i,l) = d_T(i,l)
d(j,l) = d_T(j,l)

Proof for L=4

By the 4 points condition and the definition of y:

d(i,l) = d(i,j) + d(k,l) + 2y - d(k,j) = a + y + f = d_T(i,l)

(the middle equality holds since d(i,j), d(k,l) and d(k,j) are realized by the tree).

d(j,l) = d_T(j,l) is proved similarly.
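The L=4 construction can be followed numerically. The sketch below is a hypothetical helper (`quartet_tree`) that assumes the quartet has already been labeled so that (i,j),(k,l) is the split. The sample numbers are the A, B, C, D distances used later in the sisters identification example, with i=A, j=C, k=B, l=D.

```python
def quartet_tree(d_ij, d_ik, d_il, d_jk, d_jl, d_kl):
    """Edge lengths (a, b, c, f, y) of the quartet tree with split (i,j),(k,l).

    a, b, c, f are the edges at leaves i, j, k, l; y is the internal edge (m, n).
    """
    A = d_ij + d_kl
    B = d_ik + d_jl
    y = (B - A) / 2                   # internal edge length
    a = (d_ij + d_ik - d_jk) / 2      # step 1: three-point tree on {i, j, k}
    b = d_ij - a
    c = d_ik - a - y                  # step 2: n lies on the path from m to k
    f = d_kl - c                      # step 3: c + f = d(k, l)
    return a, b, c, f, y

# i = A, j = C, k = B, l = D in the later example.
a, b, c, f, y = quartet_tree(d_ij=25, d_ik=15, d_il=34, d_jk=30, d_jl=49, d_kl=31)
print(a, b, c, f, y)
print(a + y + f, b + y + f)           # the two distances that remained to be proved
```

The argument `d_il` is not needed to build the tree. The last line recomputes d(i,l) and d(j,l) from the tree so they can be compared with the inputs 34 and 49.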

Splits Approach to Proof: Intuition

Suppose the 4 points condition holds with strict inequality, >, for every four leaves.

This defines a (2,2) partition of every quartet. One can use the 4 points condition to show that all quartets are consistent.

[Figure: quartet with leaves i, j, k, l]

This in turn is used to construct the tree (homework assignment).

Finally, show that the tree distances agree with the original distances using linear algebra.

Linear Algebraic Approach: Induction

• Remove the L-th object from the set.
• By induction, there is a tree, T’, for {1,2,…,L-1}.
• For each pair of labeled nodes (i,j) in T’, let a_ij, b_ij, c_ij be defined by the following figure:

[Figure: i and j joined through the point m_ij by edges a_ij and b_ij; L attached to m_ij by an edge of length c_ij]

$$c_{ij} = \tfrac{1}{2}\,[d(i,L) + d(j,L) - d(i,j)]$$

Induction step:

• Pick i and j that minimize c_ij.
• T is constructed by adding L (and possibly m_ij) to T’, as in the figure. Then d(i,L) = d_T(i,L) and d(j,L) = d_T(j,L).
• Remains to prove: for each k ≠ i,j: d(k,L) = d_T(k,L).

Induction step (cont.)

• Let k ≠ i,j be an arbitrary node in T’, and let n be the branching point of k in the path from i to j.
• By the minimality of c_ij, (i,j),(k,L) is not a split of {i,j,k,L}.
• Assume WLOG that (i,L),(j,k) is a split of {i,j,k,L}.

Induction step (end)

Since (i,L),(j,k) is a split, by the 4 points condition

d(L,k) = d(i,k) + d(L,j) - d(i,j)

d(i,k) = d_T(i,k) and d(i,j) = d_T(i,j) by induction, and d(L,j) = d_T(L,j) by the construction.

Hence d(L,k) = d_T(L,k). QED

From Additive Distance to a Tree

By following the proof, the four point condition can be used to construct a tree from a distance matrix, or to decide that there is no such tree (namely, that the distance is not additive).

But this algorithm goes over all quartets, resulting in O(L^4) steps for L species (too slow).

The most popular method for constructing trees for additive sets uses the neighbor joining approach.

Constructing additive trees: The neighbor joining problem

• Let i, j be sisters (neighboring leaves) in a tree, let k be their father, and let m be any other vertex.
• Using the equation

$$d(k,m) = \tfrac{1}{2}\,[d(i,m) + d(j,m) - d(i,j)]$$

  we can compute the distances from k to all other leaves.

This suggests the following method to construct a tree from an additive distance matrix:
1. Find sisters i,j in the tree.
2. Replace i,j by their father, k, and recursively construct a tree T for the smaller set.
3. Add i,j as children of k in T.

Neighbor Finding

How can we find from distances alone a pair of sisters (neighboring leaves)? Closest nodes are not necessarily neighboring leaves.

[Figure: tree with leaves A, B, C, D]

Next, we show a way to find neighbors from distances.

Neighbor Finding: Saitou & Nei method

For a leaf i, let

$$r_i = \sum_{m \text{ is a leaf}} d(i,m)$$

Definition: Let i, j be two leaves (out of L leaves in T). Then their divergence is

$$D(i,j) = d(i,j) - \frac{r_i + r_j}{L-2}$$

Theorem (Saitou & Nei): Assume d is additive, with all tree edge weights positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are sister taxa in the tree.

[Figure: leaves i, j, k, l, internal node m, and subtrees T1, T2]

The proof is rather involved, and will be skipped (no tears pls).

A simpler neighbor finding method:

Select an arbitrary (fixed) node r.
• For each pair of labeled nodes (i,j), let C(i,j) be defined by the following expression (also see figure):

$$C(i,j) = \tfrac{1}{2}\,[d(i,r) + d(j,r) - d(i,j)]$$

[Figure: nodes i, j and r; C(i,j) marked on the path from r]

Claim: Let i, j be such that C(i,j) is maximized. Then i and j are neighboring leaves.

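Sisters Identification: Example

[Figure: tree with leaves A, B, C, D and edge weights 5, 4, 6, 20, 25]

$$C(i,j) = \tfrac{1}{2}\,[d(i,r) + d(j,r) - d(i,j)]$$

Select arbitrarily r=A.
C(B,C) = (15+25-30)/2 = 5
C(B,D) = (15+34-31)/2 = 8
C(C,D) = (25+34-49)/2 = 5

C(B,D) is maximal, so B and D are neighboring leaves.

Claim: Let i, j be such that C(i,j) is maximized. Then i and j are neighboring leaves.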

Neighbor Joining Algorithm

• Set M to contain all leaves, and select a root r. |M| = L.
• If L = 2, return a tree of two vertices.
Iteration:
• Choose i,j such that C(i,j) is maximal.
• Create a new vertex k, and update distances:

$$d(i,k) = \tfrac{1}{2}\,[d(i,j) + d(i,r) - d(j,r)]$$
$$d(j,k) = d(i,j) - d(i,k)$$
$$\text{for each other node } m: \quad d(k,m) = \tfrac{1}{2}\,[d(i,m) + d(j,m) - d(i,j)]$$

• Remove i,j, and add k to M.
• Recursively construct a tree on the smaller set.
• When done, add i,j as children of k, at distances d(i,k) and d(j,k).

[Figure: sisters i, j with father k, and another node m]
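The steps of this algorithm can be sketched in Python as follows. This is an iterative sketch of the fixed-root variant described on this slide, not a reference implementation: the function name `nj_fixed_root`, the internal node names `k0`, `k1`, ... and the edge-list output format are assumptions made for illustration. The input is the A, B, C, D example from the sisters identification slide, and it assumes the distances are additive.

```python
from itertools import combinations

def nj_fixed_root(leaves, dist, root):
    """Return a list of weighted edges (parent, child, length)."""
    d = dict(dist)                      # frozenset({x, y}) -> distance

    def dd(x, y):
        return d[frozenset((x, y))]

    nodes, edges, count = list(leaves), [], 0
    while len(nodes) > 2:
        candidates = [x for x in nodes if x != root]
        # Choose i, j with maximal C(i,j) = [d(i,r) + d(j,r) - d(i,j)] / 2.
        i, j = max(combinations(candidates, 2),
                   key=lambda p: (dd(p[0], root) + dd(p[1], root) - dd(*p)) / 2)
        k = f"k{count}"
        count += 1
        d_ik = (dd(i, j) + dd(i, root) - dd(j, root)) / 2
        d_jk = dd(i, j) - d_ik
        for m in nodes:                 # distances from the new vertex k
            if m not in (i, j):
                d[frozenset((k, m))] = (dd(i, m) + dd(j, m) - dd(i, j)) / 2
        edges += [(k, i, d_ik), (k, j, d_jk)]
        nodes = [x for x in nodes if x not in (i, j)] + [k]
    edges.append((nodes[0], nodes[1], dd(nodes[0], nodes[1])))
    return edges

DIST = {
    frozenset("AB"): 15, frozenset("AC"): 25, frozenset("AD"): 34,
    frozenset("BC"): 30, frozenset("BD"): 31, frozenset("CD"): 49,
}
EDGES = nj_fixed_root("ABCD", DIST, root="A")
for parent, child, length in EDGES:
    print(parent, child, length)
```

The loop replaces the recursion on the slide: each pass shrinks the node set by one and records the two edges that the recursive version would add on the way back.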

Complexity of Neighbor Joining Algorithm

Naive implementation:
Initialization: Θ(L^2) to compute the C(i,j)’s.
Each iteration:
• O(L) to update C(i,k) for the new node k and every remaining node i.
• O(L^2) to find the maximal C(i,j).
Total of O(L^3).

Complexity of Neighbor Joining Algorithm (cont.)

Using a heap to store the C(i,j)’s:
Initialization: Θ(L^2) to compute and heapify the C(i,j)’s.
Each iteration:
• O(1) to find the maximal C(i,j).
• O(L log L) to delete {C(m,i), C(m,j)} and add C(m,k) for all vertices m.
Total of O(L^2 log L).
(Implementation details are omitted.)

Reconstructing Trees from Additive Matrices

     A  B  C  D  E
A    0  2  7  4  7
B       0  7  4  7
C          0  7  6
D             0  7
E                0

[Figure: the corresponding tree with leaves A, B, C, D, E and edge weights 1, 1, 1, 2, 2, 3, 3]

Given a distance matrix constituting an additive metric, the topology of the corresponding additive tree is unique.

Q: Do we have to test additivity before running NJ?

A: This would be bad news, as this takes O(L^4) time!

Reconstructing Trees from Additive Matrices (cont.)

Q: Do we have to test additivity before running NJ?

A: By Saitou-Nei, if the matrix is additive, NJ will construct the correct tree. The algorithm does not need to know in advance whether the matrix is additive.
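One selection step of the Saitou & Nei criterion can be tried on the five-object matrix above. The Python sketch below assumes the matrix is read as the symmetric table over A, B, C, D, E; `divergences` is a hypothetical helper that follows the definition D(i,j) = d(i,j) - (r_i + r_j)/(L-2), and it covers only the choice of the first pair, not the full algorithm.

```python
from itertools import combinations

LEAVES = ["A", "B", "C", "D", "E"]
MATRIX = [
    [0, 2, 7, 4, 7],
    [2, 0, 7, 4, 7],
    [7, 7, 0, 7, 6],
    [4, 4, 7, 0, 7],
    [7, 7, 6, 7, 0],
]

def divergences(d):
    """Divergence D(i,j) for every pair of leaf indices i < j."""
    n = len(d)
    r = [sum(row) for row in d]        # r_i = sum of distances from leaf i
    return {(i, j): d[i][j] - (r[i] + r[j]) / (n - 2)
            for i, j in combinations(range(n), 2)}

DIV = divergences(MATRIX)
best = min(DIV, key=DIV.get)
best_pair = (LEAVES[best[0]], LEAVES[best[1]])
print(best_pair, DIV[best])
```

The pair with minimal divergence is not the closest pair here: A and B are at distance 2, while the selected pair is at distance 6, and both pairs are sisters in the tree with the edge weights listed above.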

NJ Algorithm: Example

Let r_i be the sum of distances from i to every other node:

$$r_i = \sum_{j=1}^{n} d(i,j)$$

Here, we use the divergence,

$$D(i,j) = d(i,j) - \frac{r_i + r_j}{L-2}$$

• Identify i,j as neighbours if their divergence is minimal.
• Combine i,j into a new node u.
• Update the distance matrix.
• If only 3 nodes are left, finish.

[Figure: example tree with nodes i, j, k, l, m, n and edge weights 0.1, 0.1, 0.1, 0.4, 0.4]

Distance Matrix

     A  B  C  D
A    0  2  3  6
B    2  0  3  5
C    3  3  0  6
D    5  6  6  0

r_A = 11, r_B = 10, r_C = 11, r_D = 17

D(A,B) = −8.5
D(A,C) = −8
D(A,D) = −8
D(B,C) = −7.5
D(B,D) = −8.5
D(C,D) = −8

[Figure: A and B joined into a new node U]

Distance Matrix

     U    C    D
U    0    3    5.5
C    3    0    6
D    5.5  6    0

r_U = 8.5, r_C = 9, r_D = 11.5

X(U,C) = −5.75
X(U,D) = −4.5
X(C,D) = −4.25

[Figure: U (joining A and B) and C joined into a new node Y]

Distance Matrix

     Y    D
Y    0    5.6
D    5.6  0

[Figure: final tree with leaves A, B, C, D and internal nodes U, Y, Z]

Reconstructing Trees from non Additive Matrices

Almost Additive Matrix

A distance matrix d’ is almost additive with respect to a tree T, with additive distances d and edge lengths l(e), if for all i, j:

$$|d_{i,j} - d'_{i,j}| < \frac{\min_e \{ l(e) \}}{2}$$

Atteson: If d’ is almost additive with respect to a tree T, then the output of NJ is a tree T’ with the same topology as T.

Distance Matrix

Unrooted Tree - NJ

[Figure: unrooted NJ tree with the root position marked]

Output - NJ

Branch length is proportional to distance.

N-J Method produces an Unrooted, Additive tree

Neighbor-Joining Method: An Example

What is required for the Neighbour joining method? A distance matrix.

0. Distance Matrix

PAM        Spinach   Rice    Mosquito  Monkey  Human
Spinach      0.0     84.9    105.6      90.8    86.3
Rice        84.9      0.0    117.8     122.4   122.6
Mosquito   105.6    117.8      0.0      84.7    80.8
Monkey      90.8    122.4     84.7       0.0     3.3
Human       86.3    122.6     80.8       3.3     0.0

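1. First Step

The smallest distance in the matrix is 3.3 (Monkey, Human), so Monkey and Human are joined first into the subtree Mon-Hum.

[Figure: Monkey and Human joined into Mon-Hum; Spinach, Mosquito and Rice not yet joined]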

2. Calculation of New Distances

After we have joined two species in a subtree, we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:

Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = (90.8 + 86.3)/2 = 88.55

[Figure: Spinach and the Mon-Hum subtree joining Monkey and Human]

3. Next Cycle

PAM        Spinach   Rice    Mosquito  MonHum
Spinach      0.0     84.9    105.6      88.6
Rice        84.9      0.0    117.8     122.5
Mosquito   105.6    117.8      0.0      82.8
MonHum      88.6    122.5     82.8       0.0

[Figure: Mosquito joined with Mon-Hum into Mos-(Mon-Hum)]

4. Penultimate Cycle

PAM         Spinach   Rice    MosMonHum
Spinach       0.0     84.9     97.1
Rice         84.9      0.0    120.2
MosMonHum    97.1    120.2      0.0

[Figure: Spinach and Rice joined into Spin-Rice]

5. Last Joining

PAM         SpinRice  MosMonHum
SpinRice      0.0      108.7
MosMonHum   108.7        0.0

[Figure: Spin-Rice joined with Mos-(Mon-Hum) into (Spin-Rice)-(Mos-(Mon-Hum))]
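The five cycles above can be replayed with a short Python sketch. It implements exactly what this example describes: join the closest pair, then set the distance to the new subtree to the simple average. That is an average-linkage style of clustering and differs from the divergence-based selection defined earlier. The function name `average_join` and the bracketed subtree names are hypothetical, and because the slides round to one decimal at every cycle, the last digit of later distances can differ slightly (108.6125 here versus 108.7).

```python
def average_join(names, dist):
    """Repeatedly join the closest pair; new distances are simple averages."""
    d = {frozenset(pair): value for pair, value in dist.items()}
    nodes, joins = list(names), []
    while len(nodes) > 1:
        pairs = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
        x, y = min(pairs, key=lambda p: d[frozenset(p)])
        joins.append((x, y, d[frozenset((x, y))]))
        new = f"({x}-{y})"
        for m in nodes:                 # average the distances to the two joined nodes
            if m not in (x, y):
                d[frozenset((new, m))] = (d[frozenset((x, m))] + d[frozenset((y, m))]) / 2
        nodes = [m for m in nodes if m not in (x, y)] + [new]
    return joins

SPECIES = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
PAM = {
    ("Spinach", "Rice"): 84.9, ("Spinach", "Mosquito"): 105.6,
    ("Spinach", "Monkey"): 90.8, ("Spinach", "Human"): 86.3,
    ("Rice", "Mosquito"): 117.8, ("Rice", "Monkey"): 122.4,
    ("Rice", "Human"): 122.6, ("Mosquito", "Monkey"): 84.7,
    ("Mosquito", "Human"): 80.8, ("Monkey", "Human"): 3.3,
}
JOINS = average_join(SPECIES, PAM)
for x, y, dxy in JOINS:
    print(f"join {x} + {y} at distance {dxy:.4f}")
```

The join order matches the slides: Monkey with Human, then Mosquito, then Spinach with Rice, then the two remaining subtrees.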

The result: Unrooted Neighbor-Joining Tree

[Figure: unrooted tree with leaves Human, Monkey, Mosquito, Rice, Spinach]

Dangers of Paralogs

[Figure: gene tree with a gene duplication followed by speciation events; leaves 1A, 2A, 3A, 3B, 2B, 1B]

If we happen to consider genes 1A, 2B, and 3A of species 1,2,3, we get a wrong tree that does not represent the phylogeny of the species.

Distance Based Reconstruction: We now move to character based methods

Character-based methods for constructing phylogenies

In this approach, trees are constructed by comparing the characters of the corresponding species. Characters may be morphological (teeth structures, hip joint) or molecular (homologous DNA sequences). The most popular approaches are maximum parsimony (MP) and maximum likelihood (ML).

In both methods, we will assume independence of characters (no interactions). Each method has a well-defined objective function. The goal is to find the tree or trees that optimize (maximize or minimize) the respective function.

1. Maximum Parsimony

[Figure: example tree whose edges are labeled with substitution counts 2, 1, 1]

Here, total #substitutions = 4

Example Continued

There are many trees possible. For example:

[Figure: left tree, edges labeled 1, 1, 1] Total #substitutions = 3

[Figure: right tree, edges labeled 1, 1, 2] Total #substitutions = 4

The left tree is preferred over the right tree.

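Example With One Letter

• Suppose we have five species, such that three have ‘C’ and two ‘T’ at a specified position.
• Minimal tree has one evolutionary change:

[Figure: tree with leaves C, C, C, T, T and internal nodes labeled C or T; a single change between T and C is marked on one edge]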

Extension to Many Letters

• What is the parsimony score of

A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA

Weighted Parsimony Scores

Each change from a to b is weighted by a cost c(a,b). The unweighted parsimony score is the special case c(a,a) = 0 and c(a,b) = 1 for a ≠ b.

Evaluating Weighted Parsimony Scores

Each position is independent and computed by itself. Use dynamic programming on a given tree.
• If k is a node with children i and j, then

S(k,a) = min_x (S(i,x) + c(a,x)) + min_y (S(j,y) + c(a,y))

where S(k,a) is the minimum score of the subtree rooted at k when k has character a.

[Figure: node k with its children i and j, annotated with S(k,a), S(i,x) and S(j,y)]

Evaluating Parsimony Scores

Dynamic programming on a given tree.

Initialization:
• For each leaf i, set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞.

Iteration:
• If k is a node with children i and j, then S(k,a) = min_x (S(i,x) + c(a,x)) + min_y (S(j,y) + c(a,y)).

Termination:
• The cost of the tree is min_x S(r,x), where r is the root.

Comment:

To reconstruct an optimal assignment, we need to keep in each node k and for each character a the two characters x, y that bring about the minimum when k has character a.
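The recurrence translates almost line by line into code. The Python sketch below evaluates one position on a given rooted binary tree. The nested-tuple tree encoding, the names `sankoff`, `unit_cost` and `tree_cost`, and the two sample trees are all assumptions made for illustration. The leaf states follow the one-letter example (three species with C, two with T), and the unit cost corresponds to unweighted parsimony.

```python
import math

def sankoff(tree, leaf_state, alphabet, cost):
    """Return {a: S(k, a)} for the root k of `tree`.

    A tree is either a leaf name (str) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):          # initialization at a leaf
        return {a: 0 if leaf_state[tree] == a else math.inf for a in alphabet}
    left, right = (sankoff(child, leaf_state, alphabet, cost) for child in tree)
    return {a: min(left[x] + cost(a, x) for x in alphabet)
               + min(right[y] + cost(a, y) for y in alphabet)
            for a in alphabet}

def unit_cost(a, b):
    """Unweighted parsimony: every substitution costs 1."""
    return 0 if a == b else 1

STATES = {"s1": "C", "s2": "C", "s3": "C", "s4": "T", "s5": "T"}
GROUPED = ((("s1", "s2"), "s3"), ("s4", "s5"))   # the two T species are sisters
MIXED = (("s1", "s4"), ("s2", ("s3", "s5")))     # the two T species are separated

def tree_cost(tree):
    """Termination step: minimum over the root's characters."""
    return min(sankoff(tree, STATES, "ACGT", unit_cost).values())

print(tree_cost(GROUPED), tree_cost(MIXED))
```

For several positions, the per-position costs are simply added, since each position is evaluated independently.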

Cost of Evaluating Parsimony for binary trees

If there are n nodes, m characters, and k possible values for each character, then the complexity is O(nmk^2).

Of course, we still need to search over possible trees and find the best one. One usually resorts to heuristic search techniques.

2. Perfect Phylogeny

Data on species is given by a Character State Matrix. Cell (p,i) has value j iff character i of object (species) p has state j.
Goal: constructing an evolution tree for the species.

          Character
Object   c1  c2  c3  c4  c5
A         1   1   2   0   0
B         2   0   1   2   1
C         3   2   3   3   1
D         0   3   4   1   0
E         1   1   0   0   1

Motivation: Evolution Tree