Building phylogenetic trees Jurgen Mourik & Richard Vogelaars Utrecht University

Jan 01, 2016

Page 1: Building phylogenetic trees

Building phylogenetic trees

Jurgen Mourik &

Richard VogelaarsUtrecht University

Page 2: Building phylogenetic trees

Overview

• Background

• Making a tree from pairwise distances;

• Parsimony;
– Break;

• Assessing the trees: the bootstrap;

• Simultaneous alignment and phylogeny;

• Application: Phylip

Page 3: Building phylogenetic trees

Background

• Phylogenetic tree: diagram showing evolutionary lineages of species/genes

• Trees are used:
– To understand the lineage of various species
– To understand how various functions evolved
– To inform multiple alignments

Page 4: Building phylogenetic trees

Phylogenetic tree approaches

• Distance:
– UPGMA
– Neighbour-joining

• Parsimony:
– Traditional parsimony
– Weighted parsimony

Page 5: Building phylogenetic trees

Making a tree from pairwise distances

• Given a set of sequences you want to build a tree.

• Compute the distances dij between each pair i, j of the sequences.

• There are many different distance measures.

• For clusters of sequences, one choice is the average distance between pairs of sequences, one taken from each cluster.

Page 6: Building phylogenetic trees

UPGMA

• Unweighted Pair Group Method using arithmetic Averages.

• It works by clustering the sequences, at each stage combining two clusters and at the same time creating a new node in a tree, using a distance measure.

Page 7: Building phylogenetic trees

Distance between points

• |Ci| and |Cj| denote the number of sequences in clusters i and j.

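$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$

[Figure: single-sequence clusters i, j and l, with the distances 2, 3 and 4 marked between them.]

$$d_{il} = \frac{1}{1 \cdot 1}\,(d_{il}) = 4$$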

Page 8: Building phylogenetic trees

Distance between clusters

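• Let Ck be the union of clusters Ci and Cj; then

$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$

• where Cl is any other cluster.

[Figure: clusters i and j merged into k, at distances 4 and 3 from cluster l.]

$$d_{kl} = \frac{4 \cdot 1 + 3 \cdot 1}{1 + 1} = \frac{7}{2} = 3.5$$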

Page 9: Building phylogenetic trees

Building the tree: UPGMA

Initialisation:

Assign each sequence i to its own cluster Ci.

Define one leaf of T for each sequence, and place it at height zero.

Iteration:

Determine the two clusters i, j for which dij is minimal.

Define a new cluster k by $C_k = C_i \cup C_j$, and define dkl for all l.

Define a node k with daughter nodes i and j, and place it at height dij/2.

Add k to the current clusters and remove i and j.

Termination:

When only two clusters i, j remain, place the root at height dij/2.
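The initialisation, iteration and termination steps above can be sketched in Python roughly as follows. This is a minimal illustration rather than a reference implementation: the function name `upgma`, the nested-tuple tree encoding and the input format (a dict keyed by pairs of names) are assumptions made for this example.

```python
from itertools import combinations


def upgma(names, dist):
    """Sketch of UPGMA.

    names: list of leaf labels.
    dist:  dict mapping (a, b) pairs of labels to distances (assumed format).
    Returns (root, height), where root is a nested tuple of labels and
    height maps every node (leaf or internal) to its height in the tree.
    """
    size = {n: 1 for n in names}        # |C_i| for every current cluster
    height = {n: 0.0 for n in names}    # leaves are placed at height zero
    d = {frozenset(p): float(v) for p, v in dist.items()}

    while len(size) > 1:
        # Determine the two clusters i, j for which d_ij is minimal.
        i, j = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        dij = d[frozenset((i, j))]
        k = (i, j)                      # new cluster C_k = C_i U C_j
        # d_kl is the size-weighted average of d_il and d_jl.
        for l in size:
            if l != i and l != j:
                d[frozenset((k, l))] = (
                    d[frozenset((i, l))] * size[i]
                    + d[frozenset((j, l))] * size[j]
                ) / (size[i] + size[j])
        size[k] = size[i] + size[j]
        height[k] = dij / 2             # node k sits at height d_ij / 2
        del size[i], size[j]

    (root,) = size                      # the last merge is the root
    return root, height


# Three single-sequence clusters with d_ij = 2, d_il = 4, d_jl = 3.
example = {("i", "j"): 2, ("i", "l"): 4, ("j", "l"): 3}
root, height = upgma(["i", "j", "l"], example)
print(root, height[root])
```

Merging the last two clusters inside the same loop places the root at dij/2, which is the termination rule stated above.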

Page 10: Building phylogenetic trees

UPGMA: Initialisation

Page 11: Building phylogenetic trees

UPGMA: Iteration 1

Page 12: Building phylogenetic trees

UPGMA: Iteration 2

Page 13: Building phylogenetic trees

UPGMA: Iteration 3

Page 14: Building phylogenetic trees

UPGMA: Termination

Page 15: Building phylogenetic trees

Properties of UPGMA

• Molecular clock & ultrametric property of distances

• Additivity

Page 16: Building phylogenetic trees

Properties of UPGMA: Molecular clock & ultrametric

• The molecular clock assumption: divergence of sequences is assumed to occur at the same rate at all points in the tree.

• If this holds, the distance data are said to be ultrametric.
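One way to test a distance table for the ultrametric property is the standard three-point condition: for every triple of sequences, the two largest of the three pairwise distances must be equal. The slide does not state this condition; it is brought in here as an assumption, and the function name and input format are illustrative only.

```python
from itertools import combinations


def is_ultrametric(names, dist, tol=1e-9):
    """Check the three-point condition on every triple of names.

    dist maps frozenset({a, b}) to the distance between a and b
    (an input format assumed for this sketch).
    """
    for a, b, c in combinations(names, 3):
        # Sort the three pairwise distances; the two largest must agree.
        _, middle, largest = sorted(
            (dist[frozenset((a, b))],
             dist[frozenset((a, c))],
             dist[frozenset((b, c))])
        )
        if abs(largest - middle) > tol:
            return False
    return True


clock_like = {frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 4}
not_clock_like = {frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 3}
print(is_ultrametric("ABC", clock_like), is_ultrametric("ABC", not_clock_like))
```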

Page 17: Building phylogenetic trees

Properties of UPGMA: Additivity

• Given a tree, its edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them.

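[Figure: leaves i and j joined at an internal node k, which is connected to a further node m.]

$$d_{im} = d_{ik} + d_{km}$$

$$d_{jm} = d_{jk} + d_{km}$$

$$d_{ij} = d_{ik} + d_{jk}$$

$$d_{km} = \tfrac{1}{2}\,(d_{im} + d_{jm} - d_{ij})$$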

Page 18: Building phylogenetic trees

Neighbour-joining

• N-j constructs a tree by iteratively joining subtrees (like UPGMA).

• Produces an unrooted tree.

• Does not make the molecular clock assumption, so the distances need not be ultrametric.

Page 19: Building phylogenetic trees

Distances in Neighbour-joining

• Given a new internal node k, the distance to another node m is given by:

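$$d_{km} = \tfrac{1}{2}\,(d_{im} + d_{jm} - d_{ij})$$

$$d_{ik} = \tfrac{1}{2}\,(d_{ij} + d_{im} - d_{jm})$$

$$d_{jk} = d_{ij} - d_{ik}$$

[Figure: leaves i and j joined at the new node k, with another node m.]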

Page 20: Building phylogenetic trees

Distances in Neighbour-joining

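• Generalizing this so that the distances to all other leaves are taken into account:

$$d_{ik} = \tfrac{1}{2}\,(d_{ij} + r_i - r_j)$$

• where

$$r_i = \frac{1}{|L| - 2} \sum_{m \in L} d_{im}$$

• and |L| denotes the size of the set L of leaves.

[Figure: leaves i and j joined at the new node k, with another node m.]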

Page 21: Building phylogenetic trees

Building the tree: Neighbour-joining

Initialisation:

Define T to be the set of leaf nodes, one for each given sequence, and put L = T.

Iteration:

Pick a pair i, j in L for which $D_{ij}$, defined by $D_{ij} = d_{ij} - (r_i + r_j)$, is minimal, where $r_i = \frac{1}{|L| - 2} \sum_{m \in L} d_{im}$.

Define a new node k and set $d_{km} = \tfrac{1}{2}(d_{im} + d_{jm} - d_{ij})$, for all m in L.

Add k to T with edges of lengths $d_{ik} = \tfrac{1}{2}(d_{ij} + r_i - r_j)$ and $d_{jk} = d_{ij} - d_{ik}$, joining k to i and j, respectively.

Remove i and j from L and add k.

Termination:

When L consists of two leaves i and j, add the remaining edge between i and j, with length dij.
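The neighbour-joining steps can be sketched in Python along the following lines. The function name, the edge-list output and the input format are assumptions for this example, and the four-leaf distance table is a made-up additive data set, not one from the slides.

```python
from itertools import combinations


def neighbour_joining(names, dist):
    """Sketch of neighbour-joining.

    dist maps (a, b) pairs of leaf names to distances (assumed format).
    Returns a list of edges (node, node, length) of the unrooted tree;
    internal nodes are named by the tuple of the two nodes they join.
    """
    d = {frozenset(p): float(v) for p, v in dist.items()}
    L = list(names)
    edges = []
    while len(L) > 2:
        n = len(L)
        # r_i = 1/(|L|-2) * sum over m in L of d_im
        r = {i: sum(d[frozenset((i, m))] for m in L if m != i) / (n - 2)
             for i in L}
        # Pick the pair minimising D_ij = d_ij - (r_i + r_j).
        i, j = min(combinations(L, 2),
                   key=lambda p: d[frozenset(p)] - (r[p[0]] + r[p[1]]))
        dij = d[frozenset((i, j))]
        k = (i, j)
        for m in L:
            if m != i and m != j:
                d[frozenset((k, m))] = 0.5 * (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - dij)
        dik = 0.5 * (dij + r[i] - r[j])
        edges.append((k, i, dik))
        edges.append((k, j, dij - dik))
        L = [m for m in L if m != i and m != j] + [k]
    i, j = L
    edges.append((i, j, d[frozenset((i, j))]))   # the remaining edge
    return edges


# Hypothetical additive distances from a tree with edge lengths
# A:2, B:3, C:4, D:2 and an internal edge of length 1.
example = {("A", "B"): 5, ("A", "C"): 7, ("A", "D"): 5,
           ("B", "C"): 8, ("B", "D"): 6, ("C", "D"): 6}
for edge in neighbour_joining(["A", "B", "C", "D"], example):
    print(edge)
```

Because the example distances are additive, the recovered edge lengths are the ones used to build the table.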

Page 22: Building phylogenetic trees

Rooting trees

• Finding a root in an unrooted tree is sometimes accomplished by using an outgroup:
– A species known to be more distantly related to the remaining species than they are to each other

• The point where the outgroup joins the rest of the tree is the best candidate for the root position.

[Figure: unrooted tree with nodes i, j, k, l and m plus an outgroup; the candidate root is marked where the outgroup joins the tree.]

Page 23: Building phylogenetic trees

Comments on distance based methods

• If the given data is ultrametric (and these distances represent real distances), then UPGMA will identify the correct tree.

• If the data is additive (and these distances represent real distances), then Neighbour-joining will identify the correct tree.

• Otherwise, the methods may not recover the correct tree, but they may still be reasonable heuristics.

Page 24: Building phylogenetic trees

Phylogenetic tree approaches

• Distance:
– UPGMA
– Neighbour-joining

• Parsimony:
– Traditional parsimony
– Weighted parsimony

Page 25: Building phylogenetic trees

Parsimony

• Most widely used tree building algorithm(?).

• Finds the tree that explains the data with a minimal number of changes.

• Instead of building a tree, it assigns a cost to a given tree.

• Two components of the parsimony algorithm can be distinguished:
– The computation of a cost for a given tree;
– A search through all trees, to find the overall minimum of this cost.

Page 26: Building phylogenetic trees

Parsimony example

• Given the following sequences: AAG,AAA,GGA,AGA.

• Several trees could explain the phylogeny

Page 27: Building phylogenetic trees

Traditional Parsimony

• Count the number of substitutions

• At each node keep:
– a list of minimal cost residues
– the current cost

• Post-order traversal of the tree

Page 28: Building phylogenetic trees

Traditional Parsimony

Initialisation:

Set current cost C = 0 and k = 2n-1, the number of the root node.

Recursion: To obtain the set Rk:

If k is a leaf node:

Set $R_k = x^k_u$.

If k is not a leaf node:

Compute Ri, Rj for the daughters i, j of k, and set $R_k = R_i \cap R_j$ if this intersection is not empty, or else set $R_k = R_i \cup R_j$ and increment C.

Termination:

Minimal cost of tree = C.
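A compact sketch of this recursion, applied column by column, might look as follows. The leaf sequences are the ones from the parsimony example (AAG, AAA, GGA, AGA); the nested-tuple tree encoding, the function names and the three candidate topologies are assumptions made for the illustration.

```python
def site_cost(tree, u):
    """Post-order pass for alignment column u.

    tree is either a leaf sequence (str) or a pair (left, right).
    Returns (R, C): the set of minimal-cost residues at this node and
    the number of substitutions counted in the subtree below it.
    """
    if isinstance(tree, str):                 # leaf: R_k = {x_u}
        return {tree[u]}, 0
    left, right = tree
    r_i, c_i = site_cost(left, u)
    r_j, c_j = site_cost(right, u)
    common = r_i & r_j
    if common:                                # non-empty intersection
        return common, c_i + c_j
    return r_i | r_j, c_i + c_j + 1           # union, increment the cost


def parsimony_cost(tree, length):
    """Sum the per-column costs over all columns of the alignment."""
    return sum(site_cost(tree, u)[1] for u in range(length))


# Three possible topologies for the four example sequences.
candidates = [
    (("AAG", "AAA"), ("GGA", "AGA")),
    (("AAG", "GGA"), ("AAA", "AGA")),
    (("AAG", "AGA"), ("AAA", "GGA")),
]
for tree in candidates:
    print(tree, parsimony_cost(tree, 3))
```

Under these assumptions the first topology needs 3 substitutions and the other two need 4, so it would be preferred by traditional parsimony.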

Page 29: Building phylogenetic trees

Weighted Parsimony

• Extension of the traditional parsimony.

• Adds a cost function S(a,b) for each substitution of a by b.

• Post-order traversal of the tree

• Aim is now to minimize the cost.

Page 30: Building phylogenetic trees

Weighted Parsimony

Initialisation:

Set k = 2n-1, the number of the root node.

Recursion: Compute Sk(a) for all a as follows:

If k is a leaf node:

Set $S_k(a) = 0$ for $a = x^k_u$, $S_k(a) = \infty$ otherwise.

If k is not a leaf node:

Compute Si(a), Sj(a) for all a at the daughters i, j and define

$$S_k(a) = \min_b \big(S_i(b) + S(a,b)\big) + \min_b \big(S_j(b) + S(a,b)\big)$$

Termination:

Minimal cost of tree = $\min_a S_{2n-1}(a)$.
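The weighted recursion can be sketched in the same style. The nested-tuple tree encoding, the function names, the alphabet "ACGT" and the two cost functions below are assumptions chosen for the illustration; with a unit cost function the result should coincide with the traditional parsimony count.

```python
import math


def site_costs(tree, u, alphabet, S):
    """Return {a: S_k(a)} for the node at the top of `tree`, column u."""
    if isinstance(tree, str):
        # Leaf: cost 0 for the observed residue, infinity otherwise.
        return {a: (0 if a == tree[u] else math.inf) for a in alphabet}
    left, right = tree
    s_i = site_costs(left, u, alphabet, S)
    s_j = site_costs(right, u, alphabet, S)
    return {
        a: min(s_i[b] + S(a, b) for b in alphabet)
        + min(s_j[b] + S(a, b) for b in alphabet)
        for a in alphabet
    }


def weighted_parsimony_cost(tree, length, alphabet, S):
    """Minimal cost of the tree: sum over columns of min_a S_root(a)."""
    return sum(min(site_costs(tree, u, alphabet, S).values())
               for u in range(length))


def unit_cost(a, b):
    return 0 if a == b else 1


def double_cost(a, b):
    return 0 if a == b else 2


tree = (("AAG", "AAA"), ("GGA", "AGA"))
print(weighted_parsimony_cost(tree, 3, "ACGT", unit_cost))
print(weighted_parsimony_cost(tree, 3, "ACGT", double_cost))
```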

Page 31: Building phylogenetic trees

Break

• Questions so far?

• After the break:
– Assessing the trees: the bootstrap;
– Simultaneous alignment and phylogeny;
– Application: Phylip

Page 32: Building phylogenetic trees

Branch and bound

• Parsimony itself cannot build a tree!

• With simple enumeration methods the number of trees becomes very large very fast.

• How to build the trees?
– Stochastically
– Branch and bound
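To give a feeling for how fast the number of trees grows, the sketch below uses the standard counting result that n labelled leaves admit (2n-5)!! unrooted binary trees and (2n-3)!! rooted ones. These formulas are not on the slide; they are brought in here as a known fact, and the function names are illustrative.

```python
def num_unrooted_trees(n):
    """Number of unrooted binary trees on n >= 3 labelled leaves."""
    count = 1
    for k in range(3, 2 * n - 4, 2):    # 3 * 5 * ... * (2n - 5)
        count *= k
    return count


def num_rooted_trees(n):
    """Each of the 2n - 3 edges of an unrooted tree can hold the root."""
    return num_unrooted_trees(n) * (2 * n - 3)


for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Already for ten sequences there are over two million unrooted trees, which is why exhaustive enumeration is replaced by stochastic search or branch and bound.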

Page 33: Building phylogenetic trees

Branch and bound

• B&B uses the parsimony algorithm.

• It is guaranteed to find the overall best tree.

• It systematically builds trees by increasing the number of leaves.

• It abandons a particular avenue of tree building whenever the current incomplete tree (T*) has cost(T*) > cost(Tmin), where Tmin is the cheapest complete tree found so far; adding further leaves to T* cannot lower its cost.

Page 34: Building phylogenetic trees

The Bootstrap

• A measure of how much a tree should be trusted.

• Use the bootstrap as a method of assessing the significance of some phylogenetic feature.

Page 35: Building phylogenetic trees

The Bootstrap (2)

• The bootstrap works as follows:
– Given a dataset consisting of an alignment of sequences.
– Generate an artificial dataset of the same size as the original dataset by picking columns from the alignment at random with replacement.
– Apply the tree building algorithm to this artificial dataset.
– Repeat the selection and tree building procedure n times.
– The frequency with which a chosen phylogenetic feature appears is taken to be a measure of the confidence we can have in this feature.
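The resampling step can be sketched as below. The alignment is assumed to be a dict from sequence name to an aligned string; `build_tree` and `feature` are hypothetical callables standing in for whichever tree building algorithm and phylogenetic feature are being assessed.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement.

    alignment: dict name -> aligned sequence, all of equal length.
    rng:       a random.Random instance.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def bootstrap_support(alignment, build_tree, feature, n, seed=0):
    """Fraction of n bootstrap replicates whose tree shows the feature.

    build_tree(alignment) -> tree and feature(tree) -> bool are
    placeholders supplied by the caller (hypothetical interfaces).
    """
    rng = random.Random(seed)
    hits = sum(bool(feature(build_tree(bootstrap_alignment(alignment, rng))))
               for _ in range(n))
    return hits / n


toy = {"Alpha": "ACCGCC", "Beta": "AAGGUC", "Gamma": "CAUUUC"}
print(bootstrap_alignment(toy, random.Random(1)))
```

Every sequence is resampled with the same column indices, so each artificial dataset is still a valid alignment of the same size.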

Page 36: Building phylogenetic trees

Simultaneous alignment and phylogeny

• Simultaneously aligning sequences and finding a plausible phylogeny:
– Sankoff & Cedergren’s gap-substitution algorithm;
– Hein’s affine cost algorithm.

• Both find an optimal alignment given a tree.

Page 37: Building phylogenetic trees

Sankoff & Cedergren’s gap-substitution algorithm

• Guaranteed to find ancestral sequences, and alignments of them and the leaf sequences.

• It uses a character-substitution model of gaps.

• Together, these sequences and alignments minimize a tree-based parsimony-type cost.

• The algorithm is a combination of two known methods:
– Dynamic programming method (Chapter 6);
– Weighted parsimony algorithm.

Page 38: Building phylogenetic trees

Hein’s affine cost algorithm

• It uses affine gap penalties.

• Faster than the Sankoff & Cedergren algorithm.

• The aim is to find sequences z at a given node, aligned to both of the sequences x and y at the daughter nodes, satisfying:

$$S(x,z) + S(z,y) = S(x,y)$$

• where S is the total cost for a given alignment of two sequences (a mismatch costs 1, a match costs 0).

Page 39: Building phylogenetic trees

Hein’s affine cost algorithm

• Compared to equation (2.16) (alignment with affine gap scores), the algorithm here searches for the minimal cost path.

• The affine gap cost for a gap of length k is d+(k-1)e, where e<=d.

$$V^M(i,j) = \min \begin{cases} V^M(i-1,j-1) + S(x_i,y_j) \\ V^X(i-1,j-1) + S(x_i,y_j) \\ V^Y(i-1,j-1) + S(x_i,y_j) \end{cases}$$

$$V^X(i,j) = \min \begin{cases} V^M(i-1,j) + d \\ V^X(i-1,j) + e \end{cases}$$

$$V^Y(i,j) = \min \begin{cases} V^M(i,j-1) + d \\ V^Y(i,j-1) + e \end{cases}$$
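A minimal Python sketch of this three-matrix recursion is given below. The function name and the boundary handling (V^M(0,0) = 0, everything else initialised to infinity) are assumptions of this example; the defaults d = 2, e = 1 and a mismatch cost of 1 follow the values used in the talk.

```python
import math


def affine_cost(x, y, d=2, e=1, mismatch=1):
    """Minimal alignment cost S(x, y) with affine gap cost d + (k-1)e."""
    n, m = len(x), len(y)
    inf = math.inf
    VM = [[inf] * (m + 1) for _ in range(n + 1)]   # x_i aligned to y_j
    VX = [[inf] * (m + 1) for _ in range(n + 1)]   # x_i aligned to a gap
    VY = [[inf] * (m + 1) for _ in range(n + 1)]   # y_j aligned to a gap
    VM[0][0] = 0                                   # assumed boundary value
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                s = 0 if x[i - 1] == y[j - 1] else mismatch
                VM[i][j] = min(VM[i - 1][j - 1],
                               VX[i - 1][j - 1],
                               VY[i - 1][j - 1]) + s
            if i > 0:
                VX[i][j] = min(VM[i - 1][j] + d, VX[i - 1][j] + e)
            if j > 0:
                VY[i][j] = min(VM[i][j - 1] + d, VY[i][j - 1] + e)
    return min(VM[n][m], VX[n][m], VY[n][m])


x, y = "CAC", "CTCACA"
print("S(x,y) =", affine_cost(x, y))
for z in ("CAC", "CACACA", "CACAC"):
    print(z, affine_cost(x, z) + affine_cost(z, y))
```

With these parameters the sketch gives S(CAC, CTCACA) = 5, and the sums S(x,z) + S(z,y) come out as 5 for z = CAC, 5 for z = CACACA and 6 for z = CACAC.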

Page 40: Building phylogenetic trees

Dynamic programming matrix for two sequences

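[Figure: dynamic programming matrices $V^M$, $V^X$ and $V^Y$ for two sequences, indexed by i and j, filled in with d = 2 and e = 1.]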

Page 41: Building phylogenetic trees

Hein’s affine cost algorithm

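• Find the z for which $S(x,z) + S(z,y)$ is minimal, i.e. $S(x,z) + S(z,y) = S(x,y)$.

• From the matrix follow two alignments of CAC against CTCACA:
– C - - A C -
– C A C - - -

• CAC could be a possible z.

[Figure: tree with candidate ancestor CAC(?) above the leaves CAC and CTCACA.]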

Page 42: Building phylogenetic trees

Hein’s affine cost algorithm

[Figure: three candidate trees, each with the leaves CAC and CTCACA and a candidate ancestor z at the root: CAC(?), CACACA(?) and CACAC(?).]

Which z could serve best as ancestor?

Page 43: Building phylogenetic trees

Hein’s affine cost algorithm

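Candidate z = CAC:

$$S(\mathrm{CAC},\mathrm{CAC}) = 0, \qquad S(\mathrm{CAC},\mathrm{CTCACA}) = d + 2e + 1$$

Candidate z = CACACA:

$$S(\mathrm{CACACA},\mathrm{CAC}) = d + 2e, \qquad S(\mathrm{CACACA},\mathrm{CTCACA}) = 1$$

Candidate z = CACAC:

$$S(\mathrm{CACAC},\mathrm{CAC}) = d + e, \qquad S(\mathrm{CACAC},\mathrm{CTCACA}) = d + 1$$

Each candidate is compared against the cost between the two leaves:

$$S(\mathrm{CAC},\mathrm{CTCACA}) = d + 2e + 1$$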

Page 44: Building phylogenetic trees

Sequence graph

• Follow a path through the dynamic programming matrix.

• Derive a graph from this matrix.

• Whenever a cell is used by an optimal path a vertex is added to the graph.

Page 45: Building phylogenetic trees

Sequence graph

Graph 1

Page 46: Building phylogenetic trees

Sequence graph: line arrangement

Graph 1

Graph 2

Page 47: Building phylogenetic trees

Sequence graph: replacing the dummy edges

Graph 2

Graph 3

Page 48: Building phylogenetic trees

Dynamic programming matrix: TAC – Graph 3

Page 49: Building phylogenetic trees

Ancestors

• Possible ancestral sequences for the leaf sequences TAC, CAC and CTCACA given the tree shown.

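• Derived from the sequence graphs.

[Figure: the tree over the leaves TAC, CAC and CTCACA, with candidate ancestral sequences such as CAC shown at the internal nodes.]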

Page 50: Building phylogenetic trees

Limitations of Hein’s model

• Hein’s algorithm takes the minimal cost sequences at each node upward.

• This can fail to give the overall optimum.

• Suppose the cost for a gap of length k is 13+3(k-1).

• Suppose the cost for a mismatch is 4.

• Suppose the leaves are G and GTT.

Page 51: Building phylogenetic trees

Limitations of Hein’s model

• The eligible ancestors of G and GTT are G and GTT themselves, since each gives a total cost of 13+3=16.

• GT would not be eligible because its total cost is 2*13=26.

• Now suppose the ancestor of G and GTT is joined to a third leaf, GT.
– The total cost for the ineligible GT would then be lower than for either G or GTT.
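The arithmetic behind this counterexample can be checked with a few lines of Python. The sketch assumes that the third leaf GT attaches directly to the ancestor of G and GTT, and it exploits the fact that G, GT and GTT are prefixes of one another, so the cheapest alignment of any two of them is a single gap; the function names are illustrative.

```python
def gap_cost(k, d=13, e=3):
    """Cost of a gap of length k: 13 + 3(k-1), and 0 for no gap."""
    return 0 if k == 0 else d + e * (k - 1)


def pair_cost(a, b):
    """Alignment cost of two sequences where one is a prefix of the other.

    Only valid for this example: a single gap covers the length
    difference and no mismatches occur.
    """
    assert a.startswith(b) or b.startswith(a)
    return gap_cost(abs(len(a) - len(b)))


def total_cost(z, leaves):
    """Total cost of placing sequence z at the ancestor of the leaves."""
    return sum(pair_cost(z, leaf) for leaf in leaves)


for z in ("G", "GT", "GTT"):
    print(z, total_cost(z, ["G", "GTT"]), total_cost(z, ["G", "GTT", "GT"]))
```

For the two leaves alone, G and GTT cost 16 each and GT costs 26; once the third leaf GT is included, GT costs 26 while G and GTT cost 29 each.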

Page 52: Building phylogenetic trees

Application: PHYLIP (Phylogeny Inference Package)

• Many features, among them:
– Traditional (unrooted) parsimony
– Branch and bound to find all most parsimonious trees

Page 53: Building phylogenetic trees

Application: PHYLIP

• Test dataset:

Jurgen    AACGUGGCCAAAU
Alpha     ACCGCCGCCAAAU
Beta      AAGGUCGCCAAAC
Gamma     CAUUUCGUCACAA
Delta     GGUAUCUCGGCCU
Epsilon   GAAAUCUCGAUCC
Richard   GGGCUCUCGGCUC
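To feed such a dataset to PHYLIP it has to be written in PHYLIP's sequence input format: a first line with the number of species and the number of sites, then one line per species with the name padded to ten characters followed by the sequence. The helper below is a small sketch for producing that text; the function name is an assumption, and which PHYLIP program was used in the demo is not stated in the slides.

```python
DATASET = {
    "Jurgen": "AACGUGGCCAAAU",
    "Alpha": "ACCGCCGCCAAAU",
    "Beta": "AAGGUCGCCAAAC",
    "Gamma": "CAUUUCGUCACAA",
    "Delta": "GGUAUCUCGGCCU",
    "Epsilon": "GAAAUCUCGAUCC",
    "Richard": "GGGCUCUCGGCUC",
}


def phylip_infile(alignment):
    """Format an alignment as PHYLIP sequential input text.

    alignment: dict name -> aligned sequence, all of equal length.
    Names are truncated or blank-padded to exactly ten characters.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same length")
    lines = [f"{len(names):>5} {length:>5}"]
    for name in names:
        lines.append(f"{name[:10]:<10}{alignment[name]}")
    return "\n".join(lines) + "\n"


print(phylip_infile(DATASET), end="")
```

The resulting text can be saved to a file named `infile`, which is the name PHYLIP programs look for by default.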

Page 54: Building phylogenetic trees

Demo

Page 55: Building phylogenetic trees

Questions?