
Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Dec 21, 2015

Page 1: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Realistic evolutionary models

Marjolijn Elsinga & Lars Hemel

Page 2: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Realistic evolutionary models

Contents
• Models with different rates at different sites
• Models which allow gaps
• Evaluating different models
• Break
• Probabilistic interpretation of Parsimony
• Maximum Likelihood distances

Page 3: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Unrealistic assumptions

1 Same rate of evolution at each site in the substitution matrix

- In reality: the structure of proteins and the base pairing of RNA result in different rates

2 Ungapped alignments

- Discard useful information given by the pattern of deletions and insertions

Page 4: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Different rates in matrix

Maximum likelihood, sites are independent

Xj for j = 1…n

Page 5: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Different rates in matrix (2)

Introduce a site-dependent rate variable r_u

Page 6: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Different rates in matrix (3)

We don't know r_u, so we use a prior

Yang [1993] suggests a gamma distribution g(r, α, α), with mean = 1 and variance = 1/α
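For reference, the gamma prior named above can be written out explicitly. This is a sketch under one assumption that the slide does not state: that the rate r_u acts by scaling all edge lengths t at site u.

```latex
% Gamma density with shape alpha and rate alpha: mean 1, variance 1/alpha
g(r, \alpha, \alpha) = \frac{\alpha^{\alpha}\, r^{\alpha - 1}\, e^{-\alpha r}}{\Gamma(\alpha)}

% Assumption: r rescales the edge lengths, so the unknown rate is integrated out per site
P(x_u \mid T, t, \alpha) = \int_0^{\infty} P(x_u \mid T, r\,t)\; g(r, \alpha, \alpha)\; dr
```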

Page 7: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Problem

Number of terms grows exponentially with the number of sequences, so it is computationally slow

Solution: approximation
- Replace integral by a discrete sum
- Subdivide domain into m intervals
- Let r_k denote the mean of the gamma distribution in the kth interval

Page 8: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Solution

Yang [1993] found that m = 3 or 4 gives a good approximation

Only m times as much computation as for non-varying sites
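The discrete approximation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the m intervals are chosen to have equal probability 1/m (the slides only say "m intervals"), and the helper names `reg_lower_gamma`, `gamma_quantile`, `discrete_gamma_rates` and `site_likelihood` are hypothetical.

```python
import math

def reg_lower_gamma(s, x, max_terms=1000):
    """Regularized lower incomplete gamma function P(s, x), via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / s
    total = term
    for n in range(1, max_terms):
        term *= x / (s + n)
        total += term
        if term < 1e-16 * total:
            break
    return total * math.exp(s * math.log(x) - x - math.lgamma(s))

def gamma_quantile(p, alpha):
    """Rate r at which the CDF of g(r, alpha, alpha) equals p, found by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(alpha, alpha * hi) < p:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(alpha, alpha * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(alpha, m):
    """Mean rate r_k in each of m equal-probability intervals (assumed split)."""
    cuts = [0.0] + [gamma_quantile(k / m, alpha) for k in range(1, m)] + [math.inf]
    rates = []
    for a, b in zip(cuts[:-1], cuts[1:]):
        upper = 1.0 if math.isinf(b) else reg_lower_gamma(alpha + 1.0, alpha * b)
        lower = reg_lower_gamma(alpha + 1.0, alpha * a)
        # Integral of r * g(r) over [a, b], divided by the interval mass 1/m.
        rates.append(m * (upper - lower))
    return rates

def site_likelihood(site_lik_given_rate, rates):
    """Replace the integral over r by an average over the m rate categories."""
    return sum(site_lik_given_rate(r) for r in rates) / len(rates)
```

With this choice every category has weight 1/m, so a site likelihood costs m evaluations of the fixed-rate likelihood, which matches the "m times as much computation" remark.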

Page 9: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Evolutionary models with gaps (1)

Idea 1: introduce '_' as an extra character of the alphabet of K residues and replace the (K x K) matrix with a (K+1) x (K+1) matrix

Drawback: a gap that follows another gap cannot be given a lower cost, because gaps are now independent

Page 10: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Evolutionary models with gaps (2)

Idea 2: Allison, Wallace & Yee [1992] introduce delete and insertion states to ensure affine-type gaps

Drawback: computationally intractable

Page 11: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Evolutionary models with gaps (3)

Idea 3: Thorne, Kishino & Felsenstein [1992] use fragment substitution to get a degree of biological plausibility

Drawback: usable for only two sequences

Page 12: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Finally

Find a way to use affine-type gap penalties in a computationally reasonable way

Mitchison & Durbin [1995] made a tree HMM which uses a profile HMM architecture, and treats paths through the model as objects that undergo evolutionary change

Page 13: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Assumptions needed again

We will use an architecture much simpler than that of the profile HMM of Krogh et al. [1994]: it has only match and delete states

Match state: M_k

Delete state: D_k

(k = position in the model)

Page 14: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Tree HMM with gaps (1)

Sequence y is the ancestor of sequence x

Both sequences are aligned to the model, so both follow a prescribed path through the model

Page 15: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Tree HMM with gaps (2)

x emits residue x_i at M_k

y emits residue y_j at M_k

Probability of substitution y_j → x_i is P(x_i|y_j,t)

Page 16: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Tree HMM with gaps (3)

What if x takes a different path than y?

x: M_k → D_{k+1} (= MD)

y: M_k → M_{k+1} (= MM)

P(MD|MM, t)

Page 17: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Tree HMM with gaps (4)

x: D_{k+1} → M_{k+2} (= DM)

y: M_{k+1} → M_{k+2} (= MM)

We assume that the choice between DD and DM is controlled by a mutational process that operates independently from y

Page 18: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Substitution matrix

The probabilities of the transitions on the path of x are given by priors: D_{k+1} → M_{k+2} has probability q_DM

Page 19: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

How it works

At position k: q_{y_j} P(x_i|y_j,t)

Transition k → k+1: q_MM P(MD|MM,t)

Transition k+1 → k+2: q_MM q_DM
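Read together, the three contributions listed above would multiply into a single probability for this stretch of the model. The product form is an assumption about how the slide's terms combine; the slide itself only lists the factors.

```latex
% Assumed combination of the listed factors for positions k, k+1, k+2
q_{y_j}\, P(x_i \mid y_j, t)
\;\cdot\; q_{MM}\, P(MD \mid MM, t)
\;\cdot\; q_{MM}\, q_{DM}
```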

Page 20: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Another example

Page 21: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Evaluating models: evidence

Comparing models is difficult

Compare probabilities: P(D|M1) and P(D|M2), obtained by integrating over all parameters of each model

Parameters θ

Prior probabilities P(θ)

Page 22: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Comparing two models

A natural way to compare M1 and M2 is to compute the posterior probability of M1
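Written out, the evidence and the posterior take the following form. The model priors P(M1) and P(M2) are symbols introduced here for illustration, and the sketch assumes that M1 and M2 are the only two candidates.

```latex
% Evidence: integrate the likelihood over the parameters, weighted by their prior
P(D \mid M) = \int P(D \mid \theta, M)\, P(\theta \mid M)\, d\theta

% Posterior of M1 by Bayes' rule, assuming only M1 and M2 are considered
P(M_1 \mid D) = \frac{P(D \mid M_1)\, P(M_1)}{P(D \mid M_1)\, P(M_1) + P(D \mid M_2)\, P(M_2)}
```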

Page 23: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Parametric bootstrap

Let $\hat{P}_1(D)$ be the maximum likelihood of the data D for the model M1

Let $\hat{P}_2(D)$ be the maximum likelihood of the data D for the model M2

Let $\Delta = \log \hat{P}_2(D) - \log \hat{P}_1(D)$

Page 24: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Parametric bootstrap (2)

Simulate datasets D_i with the values of the parameters of M1 that gave the maximum likelihood for D

If Δ exceeds almost all values of Δ_i, then M2 captures aspects of the data that M1 does not mimic, and therefore M1 is rejected
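The procedure can be illustrated with a deliberately simple toy problem rather than a phylogenetic model. In this sketch (all names and the choice of models are assumptions made for illustration) the data are success counts in several groups, M1 gives all groups one shared success probability, M2 gives each group its own, and Δ is taken as the difference of the maximised log-likelihoods.

```python
import math
import random

def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def max_log_lik(groups):
    """Maximised log-likelihood when all (successes, trials) groups share one probability."""
    s = sum(successes for successes, _ in groups)
    n = sum(trials for _, trials in groups)
    p_hat = s / n
    return xlogy(s, p_hat) + xlogy(n - s, 1.0 - p_hat)

def delta(groups):
    """Log-likelihood gain of M2 (one probability per group) over M1 (one shared probability)."""
    log_p1 = max_log_lik(groups)
    log_p2 = sum(max_log_lik([g]) for g in groups)
    return log_p2 - log_p1

def parametric_bootstrap(groups, n_sim=200, seed=0):
    """Return the observed Delta and the fraction of simulated Delta_i that reach it."""
    rng = random.Random(seed)
    # Maximum likelihood parameter of M1 for the real data.
    p_hat = sum(s for s, _ in groups) / sum(n for _, n in groups)
    observed = delta(groups)
    simulated = []
    for _ in range(n_sim):
        fake = []
        for _, trials in groups:
            successes = sum(1 for _ in range(trials) if rng.random() < p_hat)
            fake.append((successes, trials))
        simulated.append(delta(fake))
    exceeded = sum(1 for d in simulated if d >= observed)
    return observed, exceeded / n_sim
```

A small returned fraction means that Δ exceeds almost all Δ_i, which is the situation in which M1 would be rejected.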

Page 25: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Break

Page 26: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Probabilistic interpretation of various models

Lars Hemel

Page 27: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Overview

• Review of last week's method: Parsimony
  – Assumptions, Properties
• Probabilistic interpretation of Parsimony
• Maximum Likelihood distances
  – Example: Neighbour joining
• More probabilistic interpretations
  – Sankoff & Cedergren
  – Hein's affine cost algorithm
• Conclusion / Questions?

Page 28: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Review

Parsimony = Finding a tree which can explain the observed sequences with a minimal number of substitutions

Page 29: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Parsimony

Remember the following assumptions:
– Sequences are aligned
– Alignments do not have gaps
– Each site is treated independently

Furthermore, many families of substitution matrices have:
– Multiplicativity: $S(t)\,S(s) = S(t+s)$
– Reversibility: $P(b \mid a, t)\, q_a = P(a \mid b, t)\, q_b$

Page 30: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Parsimony

Basic step: counting the minimal number of changes for one site

The final number of substitutions is obtained by summing over all the sites

Weighted parsimony uses different 'weights' for different substitutions

Page 31: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Probabilistic interpretation of parsimony

Given: a set of substitution probabilities P(b|a) in which we neglect the dependence on length t

Calculate substitution costs S(a,b) = -log P(b|a)

Felsenstein [1981] showed that, with these substitution costs, the minimal cost at site u for the whole tree T obtained by the weighted parsimony algorithm can be regarded as an approximation to the likelihood
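A minimal sketch of weighted parsimony for one site, with costs derived from substitution probabilities as above. The nested-tuple tree format, the function names and the restriction to rooted binary trees are assumptions made for this illustration.

```python
import math

def costs_from_probabilities(probs):
    """Turn substitution probabilities P(b|a) into costs S(a, b) = -log P(b|a)."""
    return {ab: -math.log(p) for ab, p in probs.items()}

def weighted_parsimony(tree, leaf_states, cost, alphabet):
    """Minimal total substitution cost of one site on a rooted binary tree.

    tree: nested 2-tuples with leaf names as strings, e.g. (("1", "2"), ("3", "4")).
    leaf_states: dict mapping leaf name to the observed residue.
    cost: dict mapping (a, b) to the cost S(a, b) of substituting a by b.
    """
    def node_costs(node):
        # Maps residue a to the minimal cost of the subtree if this node carries a.
        if isinstance(node, str):
            return {a: (0.0 if a == leaf_states[node] else math.inf) for a in alphabet}
        left, right = (node_costs(child) for child in node)
        return {
            a: min(cost[a, b] + left[b] for b in alphabet)
               + min(cost[a, b] + right[b] for b in alphabet)
            for a in alphabet
        }
    return min(node_costs(tree).values())
```

With unit costs (0 for a match, 1 for a mismatch) the function returns the ordinary parsimony count, so the same code covers both the unweighted and the weighted case.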

Page 32: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Probabilistic interpretation of parsimony

The performance of tree-building algorithms can be tested by generating sequences probabilistically from a known tree by sampling, and then seeing how often a given algorithm reconstructs the tree correctly

Sampling is done as follows:
– Pick a residue a at the root with probability $q_a$
– Accept a substitution to b along the edge down to node i with probability $P(b \mid a, t_i)$, and so on down the tree
– Sequences of length N are generated by N independent repetitions of this procedure
– Maximum likelihood should reconstruct the correct tree for large N
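The sampling steps above can be sketched as follows. This is an illustration under stated assumptions: a two-letter alphabet {A, B}, a symmetric substitution matrix in which each edge flips the residue with probability p, and a hypothetical tree format in which an internal node is a tuple of (child, p) pairs.

```python
import random

def sample_site(node, residue, rng, leaves):
    """Evolve one residue down the tree and record the residues that reach the leaves."""
    if isinstance(node, str):
        leaves[node] = residue
        return
    for child, p in node:
        # With probability p the residue is substituted along this edge.
        flipped = rng.random() < p
        if flipped:
            child_residue = "B" if residue == "A" else "A"
        else:
            child_residue = residue
        sample_site(child, child_residue, rng, leaves)

def sample_sequences(tree, n_sites, q_a=0.5, seed=0):
    """Generate leaf sequences of length n_sites by independent repetitions per site."""
    rng = random.Random(seed)
    columns = {}
    for _ in range(n_sites):
        root_residue = "A" if rng.random() < q_a else "B"
        leaves = {}
        sample_site(tree, root_residue, rng, leaves)
        for name, residue in leaves.items():
            columns.setdefault(name, []).append(residue)
    return {name: "".join(chars) for name, chars in columns.items()}

# Illustrative tree only: leaves "1" and "2" hang off the root, "3" and "4" off an inner node.
example_tree = (("1", 0.3), ("2", 0.1), ((("3", 0.3), ("4", 0.1)), 0.09))
example = sample_sequences(example_tree, n_sites=20)
```

Feeding such simulated sequences to a tree-building method many times, and counting how often the generating topology comes back, gives the kind of test described above.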

Page 33: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

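Probabilistic interpretation of parsimony

Suppose we have a tree T with leaves 1, 2, 3 and 4 and the following edge lengths: 0.3 on the edges to leaves 1 and 3, 0.1 on the edges to leaves 2 and 4, and 0.09 on the remaining (internal) edge

[Figure: tree T with leaves 1, 2, 3, 4 and its edge lengths]

The substitution matrix is

$\begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}$

with p = 0.3 for leaves 1 and 3 and p = 0.1 for leaves 2 and 4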

Page 34: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Probabilistic interpretation of parsimony

For n leaves there are (2n-5)!! unrooted trees

[Figure: the three unrooted trees T1, T2 and T3 for leaves 1, 2, 3, 4]
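The double factorial grows very quickly, which a short helper makes concrete. The function name is hypothetical; the formula is the one on the slide.

```python
def count_unrooted_trees(n):
    """Number of unrooted binary trees with n labelled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Product of the odd numbers 3, 5, ..., 2n-5 (the factor 1 is implicit).
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

print(count_unrooted_trees(4))   # the three trees T1, T2, T3
print(count_unrooted_trees(10))
```

For four leaves this gives the three topologies shown; for ten leaves it already exceeds two million.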

Page 35: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Probabilistic interpretation of parsimony

Parsimony can construct the wrong tree even for large N

Number of times each tree (T1, T2, T3) was reconstructed, for sequence length N:

Maximum likelihood
N      T1    T2    T3
20     419   339   242
100    638   204   158
500    904   61    35
2000   997   3     0

Parsimony
N      T1    T2    T3
20     396   378   224
100    405   515   79
500    404   594   2
2000   353   646   0

Page 36: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Probabilistic interpretation of parsimony

Suppose the following example: a tree with A, A, B, B at leaves 1, 2, 3 and 4

[Figure: tree with leaves A, A, B, B]

Page 37: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Probabilistic interpretation of parsimony

With parsimony, the number of substitutions is calculated

[Figure: two candidate trees for the leaves A, A, B, B; the left tree needs 2 substitutions, the right tree needs 1]

Parsimony constructs the right-hand tree (1 substitution) more often than the left-hand tree (2 substitutions)

Page 38: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Maximum Likelihood distances

Suppose tree T, edge lengths $t_1, \ldots, t_n$ and sampled sequences $x^i$ at the leaves

We'll try to compute the distance between $x^1$ and $x^3$

[Figure: tree T with nodes x^1 to x^8 and edge lengths t_1, t_3, t_4, t_6, t_7]

Page 39: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Maximum Likelihood distances

[Figure: tree T with nodes x^1 to x^8 and edge lengths t_1, t_3, t_4, t_6, t_7]

By multiplicativity:

$P(a^1 \mid a^8, t_1 + t_6) = \sum_{a^6} P(a^1 \mid a^6, t_1)\, P(a^6 \mid a^8, t_6)$

Page 40: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

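By reversibility and multiplicativity:

$\sum_{a^8} q_{a^8}\, P(a^1 \mid a^8, t_1 + t_6)\, P(a^3 \mid a^8, t_7 + t_3)$

$= \sum_{a^8} P(a^1 \mid a^8, t_1 + t_6)\, P(a^8 \mid a^3, t_3 + t_7)\, q_{a^3}$

$= P(a^1 \mid a^3, t_1 + t_6 + t_3 + t_7)\, q_{a^3}$

[Figure: the path between x^1 and x^3 in tree T, first drawn as two edges of lengths t_1 + t_6 and t_7 + t_3, then as a single edge of length t_1 + t_6 + t_7 + t_3]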

Page 41: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Maximum Likelihood distances

$P(x^i_u, x^j_u \mid T, t) = q_{x^j_u}\, P(x^i_u \mid x^j_u, t_{k_1} + \cdots + t_{k_r})$

$d^{ML}_{ij} = \arg\max_t \prod_u q_{x^j_u}\, P(x^i_u \mid x^j_u, t)$

$d^{ML}_{ij} = \arg\max_t \prod_u P(x^i_u \mid x^j_u, t)$

Page 42: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Maximum Likelihood distances

ML distances between leaf sequences are close to additive, given a large amount of data

$d^{ML}_{ij} \simeq t_{k_1} + \cdots + t_{k_r}$
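As a concrete illustration, the maximisation over t can be carried out numerically for one specific multiplicative, reversible model. The Jukes-Cantor model used here is an assumption (the slides do not fix a substitution model), and the function names are hypothetical.

```python
import math

def jc_log_likelihood(seq_i, seq_j, t):
    """Sum over sites u of log P(x_u^i | x_u^j, t) under the Jukes-Cantor model."""
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return sum(math.log(p_same if a == b else p_diff) for a, b in zip(seq_i, seq_j))

def ml_distance(seq_i, seq_j, lo=1e-6, hi=10.0, iters=200):
    """Value of t maximising the pairwise likelihood, found by ternary search.

    Ternary search is valid here because this log-likelihood is unimodal in t.
    """
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if jc_log_likelihood(seq_i, seq_j, m1) < jc_log_likelihood(seq_i, seq_j, m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

# Two toy sequences that differ at 10 of 100 sites.
d = ml_distance("A" * 90 + "C" * 10, "A" * 100)
```

For Jukes-Cantor the maximum also has the closed form -(3/4) log(1 - 4p/3), where p is the fraction of differing sites, which gives a convenient check on the numerical search.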

Page 43: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Example: Neighbour joining

[Figure: tree with nodes i, j, k and m]

$d_{im} = d_{ik} + d_{km}, \qquad d_{jm} = d_{jk} + d_{km}$

$d_{km} = \tfrac{1}{2}\,(d_{im} + d_{jm} - d_{ij})$
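One step of neighbour joining can be sketched as below. The distance update for the new node k is the formula above; the rule for choosing which pair to join (minimising (n-2) d_ij - r_i - r_j, with r_i the sum of distances from i) is the standard neighbour-joining criterion and is not shown on the slide. The dictionary layout and the function name are assumptions made for this illustration.

```python
from itertools import combinations

def neighbour_joining_step(d):
    """Join one pair of leaves and return it with the new node's distances.

    d: dict mapping frozenset((a, b)) to the distance between leaves a and b.
    Returns ((i, j), {m: d_km}) for the new node k that replaces i and j.
    """
    leaves = sorted({x for pair in d for x in pair})
    n = len(leaves)
    r = {a: sum(d[frozenset((a, b))] for b in leaves if b != a) for a in leaves}
    # Standard selection criterion (assumption: not taken from the slides).
    i, j = min(
        combinations(leaves, 2),
        key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
    )
    d_ij = d[frozenset((i, j))]
    # d_km = (d_im + d_jm - d_ij) / 2 for every remaining leaf m.
    new = {
        m: 0.5 * (d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij)
        for m in leaves
        if m not in (i, j)
    }
    return (i, j), new
```

On exactly additive distances the returned d_km values equal the true path lengths from the new internal node, which is why the method recovers the tree when the ML distances are close to additive.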

Page 44: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Example: Neighbour joining

• Use Maximum Likelihood distances
• Suppose we have a multiplicative reversible model
• Suppose we have plenty of data
• The underlying probabilistic model is correct
• Then Neighbour joining will construct any tree correctly.

Page 45: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Example: Neighbour joining

Neighbour joining using ML distances

It constructs the correct tree where Parsimony failed

N      T1    T2    T3
20     477   301   222
100    635   231   134
500    896   85    19
2000   997   5     0

Page 46: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

More probabilistic interpretations

Sankoff & Cedergren
– Simultaneously aligns sequences and finds their phylogeny, using a character substitution model
– Probabilistic when scores are interpreted as log probabilities and the procedure sums instead of maximizing

Allison, Wallace & Yee [1992]
– But, like the original S&C method, it is not practical for most problems.

Page 47: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

More probabilistic interpretations

Hein's affine cost algorithm
– Simultaneously aligns sequences and finds their phylogeny, using affine gap penalties
– Probabilistic when scores are interpreted as log probabilities and the procedure sums instead of maximizing
– But when using plus instead of max we have to include all the paths, which will cost $N^2$ at the first node above the leaf, $N^3$ at the next, and so on. So all the speed advantages are gone.

Page 48: Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.

Conclusion

• Probabilistic interpretations can be better
  – Compare ML with parsimony
• They can also be less useful, because of costs which get too high
  – Sankoff & Cedergren
• Neighbour joining constructs the correct tree if its assumptions hold
• So, the trick is to know your problem and to decide which method is the best

Questions?