
Multiple Alignment by profile HMM training and Phylogenetic Trees

Dec 30, 2015

Page 1: Multiple Alignment by profile HMM training and Phylogenetic Trees

Multiple Alignment by profile HMM training and Phylogenetic Trees

Elze de Groot & Anastasia Berdnikova

Page 2: Multiple Alignment by profile HMM training and Phylogenetic Trees

Topics

Multiple alignment with known HMM
HMM training from unaligned sequences
Avoiding local maxima
– Simulated annealing
– Noise injection
– Stochastic sampling traceback algorithm
Model surgery
Phylogenetic trees

Page 3: Multiple Alignment by profile HMM training and Phylogenetic Trees

Multiple alignment with known profile HMM

When a multiple alignment and the model built from it are known, a large number of other family members can be aligned to the model

Calculate the Viterbi alignment for every sequence

Residues in the same match state are aligned in columns

This is a difference between profile HMM alignment and traditional multiple alignment
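As a minimal sketch of how per-sequence Viterbi paths turn into alignment columns, the Python below assumes that each sequence already has a state path written as (kind, k) pairs. The function name `align_from_paths`, the path encoding, and the convention of printing inserts in lower case padded with '.' are illustrative assumptions, not taken from the slides.

```python
def align_from_paths(seqs, paths, n_match):
    """Build alignment rows from state paths of a profile HMM.

    Each path is a list of (kind, k): kind is "M", "I" or "D", k is the model
    position. Insert state k sits after match k (k = 0 is before match 1).
    """
    match_cols, inserts = [], []
    for seq, path in zip(seqs, paths):
        cols = ["-"] * n_match          # '-' marks a delete (skipped match)
        ins = [""] * (n_match + 1)      # residues emitted by each insert state
        pos = 0
        for kind, k in path:
            if kind == "M":
                cols[k - 1] = seq[pos].upper()
                pos += 1
            elif kind == "I":
                ins[k] += seq[pos].lower()
                pos += 1
            # "D" consumes no residue, so the column keeps its '-'
        assert pos == len(seq), "path must consume the whole sequence"
        match_cols.append(cols)
        inserts.append(ins)

    # Insert regions are not aligned; they are only padded to a common width.
    widths = [max(len(ins[k]) for ins in inserts) for k in range(n_match + 1)]
    rows = []
    for cols, ins in zip(match_cols, inserts):
        row = ins[0].ljust(widths[0], ".")
        for k in range(1, n_match + 1):
            row += cols[k - 1] + ins[k].ljust(widths[k], ".")
        rows.append(row)
    return rows


seqs = ["ACD", "AD", "AXCD"]
paths = [
    [("M", 1), ("M", 2), ("M", 3)],
    [("M", 1), ("D", 2), ("M", 3)],
    [("M", 1), ("I", 1), ("M", 2), ("M", 3)],
]
for row in align_from_paths(seqs, paths, n_match=3):
    print(row)
```

Residues that share a match state end up in the same column, while insert residues are only padded, which is the point made on the slide.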

Page 4: Multiple Alignment by profile HMM training and Phylogenetic Trees

Example

Model estimated from an alignment

Page 5: Multiple Alignment by profile HMM training and Phylogenetic Trees

Example continued

The most probable paths and alignment

Page 6: Multiple Alignment by profile HMM training and Phylogenetic Trees

Profile HMM training from unaligned sequences

Algorithm:

Page 7: Multiple Alignment by profile HMM training and Phylogenetic Trees

Initial Model

Choose length of model

- M is the number of match states

- set M to the average length of the training sequences

Choose initial models carefully

Randomness in choice of initial model
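A small sketch of this initialisation step, under stated assumptions: the model length M is the rounded mean sequence length, and each match state gets a random emission distribution drawn by normalising gamma variates (a uniform Dirichlet draw). The function name `initial_model` and the choice of random distribution are illustrative and not taken from the slides.

```python
import random


def initial_model(seqs, alphabet, seed=None):
    """Return (M, emissions) for a randomly initialised profile model."""
    rng = random.Random(seed)
    # M = average length of the training sequences, at least one match state.
    n_match = max(1, round(sum(len(s) for s in seqs) / len(seqs)))
    emissions = []
    for _ in range(n_match):
        # Normalised gamma(1, 1) draws give a uniform Dirichlet sample.
        weights = [rng.gammavariate(1.0, 1.0) for _ in alphabet]
        total = sum(weights)
        emissions.append({a: w / total for a, w in zip(alphabet, weights)})
    return n_match, emissions


n_match, emissions = initial_model(["ACGT", "ACGTA", "ACGTAC"], "ACGT", seed=1)
print(n_match)
```

Running training from several different seeds and keeping the best-scoring model is one simple way to use the randomness mentioned above.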

Page 8: Multiple Alignment by profile HMM training and Phylogenetic Trees

Parameter Estimation

Use forward and backward variables to re-estimate emission and transition probability parameters

Baum-Welch re-estimation can be replaced by the Viterbi alternative

Page 9: Multiple Alignment by profile HMM training and Phylogenetic Trees

Forward Algorithm

Page 10: Multiple Alignment by profile HMM training and Phylogenetic Trees

Backward algorithm

Page 11: Multiple Alignment by profile HMM training and Phylogenetic Trees

Baum-Welch re-estimation equations

Expected emission counts from sequence x

$$E_{M_k}(a) = \frac{1}{P(x)} \sum_{i \mid x_i = a} f_{M_k}(i)\, b_{M_k}(i)$$

$$E_{I_k}(a) = \frac{1}{P(x)} \sum_{i \mid x_i = a} f_{I_k}(i)\, b_{I_k}(i)$$

Page 12: Multiple Alignment by profile HMM training and Phylogenetic Trees

Baum-Welch re-estimation equations

Expected transition counts from sequence x

$$A_{X_k M_{k+1}} = \frac{1}{P(x)} \sum_i f_{X_k}(i)\, a_{X_k M_{k+1}}\, e_{M_{k+1}}(x_{i+1})\, b_{M_{k+1}}(i+1)$$

$$A_{X_k I_k} = \frac{1}{P(x)} \sum_i f_{X_k}(i)\, a_{X_k I_k}\, e_{I_k}(x_{i+1})\, b_{I_k}(i+1)$$

$$A_{X_k D_{k+1}} = \frac{1}{P(x)} \sum_i f_{X_k}(i)\, a_{X_k D_{k+1}}\, b_{D_{k+1}}(i)$$
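To make the expected-count idea concrete, here is a hedged sketch for a plain HMM with no silent or end states, which is a simplification of the profile HMM on the slides. The function names, the list-of-dicts parameter layout, and the toy two-state model are assumptions made for illustration only.

```python
def forward(x, init, a, e):
    """f[i][k] = P(x[0..i], state at position i is k)."""
    n = len(init)
    f = [[init[k] * e[k][x[0]] for k in range(n)]]
    for i in range(1, len(x)):
        prev = f[-1]
        f.append([e[l][x[i]] * sum(prev[k] * a[k][l] for k in range(n))
                  for l in range(n)])
    return f


def backward(x, a, e):
    """b[i][k] = P(x[i+1..] | state at position i is k); last row is 1."""
    n, length = len(a), len(x)
    b = [[1.0] * n for _ in range(length)]
    for i in range(length - 2, -1, -1):
        b[i] = [sum(a[k][l] * e[l][x[i + 1]] * b[i + 1][l] for l in range(n))
                for k in range(n)]
    return b


def expected_counts(x, init, a, e):
    """Expected emission counts E[k][sym] and transition counts A[k][l]."""
    n = len(init)
    f, b = forward(x, init, a, e), backward(x, a, e)
    px = sum(f[-1])  # P(x), the total probability of the sequence
    E = [{sym: 0.0 for sym in e[k]} for k in range(n)]
    for i, sym in enumerate(x):
        for k in range(n):
            E[k][sym] += f[i][k] * b[i][k] / px
    A = [[sum(f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l]
              for i in range(len(x) - 1)) / px
          for l in range(n)] for k in range(n)]
    return E, A, px


init = [0.5, 0.5]
a = [[0.9, 0.1], [0.2, 0.8]]
e = [{"A": 0.7, "C": 0.3}, {"A": 0.2, "C": 0.8}]
E, A, px = expected_counts("ACCA", init, a, e)
print(E, A, px)
```

A useful sanity check is that the emission counts sum to the sequence length and the transition counts sum to the length minus one, since every position is emitted by exactly one state.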

Page 13: Multiple Alignment by profile HMM training and Phylogenetic Trees

Avoiding local maxima

Baum-Welch is guaranteed to converge to a local maximum

There is no guarantee that it is anywhere near the global optimum or a biologically reasonable solution

Reason: models are long -> many ways to end up at a wrong solution

Page 14: Multiple Alignment by profile HMM training and Phylogenetic Trees

Avoiding local maxima

Use stochastic search algorithm

Commonly used: Simulated annealing

Page 15: Multiple Alignment by profile HMM training and Phylogenetic Trees

Simulated annealing

Some compounds only crystallise if they are slowly annealed from high to low temperature

Optimisation problem: minimise a function 'energy' E(x)

Maximising a function is the same as minimising its negative

Page 16: Multiple Alignment by profile HMM training and Phylogenetic Trees

Simulated annealing (2)

'Temperature' T

Probability of 'state' x is given by the Gibbs distribution

$$P(x) = \frac{1}{Z} \exp\left(-\frac{1}{T} E(x)\right)$$

Partition function:

$$Z = \int \exp\left(-\frac{1}{T} E(x)\right) dx$$

x is usually multidimensional, so it is impossible to calculate Z
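For a small discrete state space the Gibbs distribution can be computed directly, which gives a feel for the role of T. The sketch below is illustrative: the function name `gibbs` and the three example energies are assumptions, and the sum over states replaces the integral in the partition function.

```python
import math


def gibbs(energies, T):
    """Gibbs probabilities exp(-E/T) / Z over a finite list of states."""
    lowest = min(energies)
    # Shifting by the lowest energy avoids overflow; it cancels in the ratio.
    weights = [math.exp(-(E - lowest) / T) for E in energies]
    Z = sum(weights)
    return [w / Z for w in weights]


energies = [1.0, 2.0, 3.0]
for T in (0.01, 1.0, 1e6):
    print(T, gibbs(energies, T))
```

At very low T nearly all probability sits on the lowest-energy state, and at very high T the distribution is close to uniform.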

Page 17: Multiple Alignment by profile HMM training and Phylogenetic Trees

Simulated annealing (3)

T → 0: all configurations except those with the lowest energy have probability 0 (system is 'frozen')

T → ∞: all configurations have the same probability (system is 'molten')

As with crystallisation, the minimum can be found by sampling this distribution at a high temperature first and then at gradually decreasing temperatures
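The cooling idea can be sketched with a generic Metropolis-style annealer on a toy one-dimensional energy landscape. Everything here is an assumption made for illustration: the function name `anneal`, the geometric cooling schedule, the parameter defaults, and the landscape itself. It is not the HMM-specific procedure, which follows on later slides.

```python
import math
import random


def anneal(energy, neighbours, x0, t_start=5.0, t_end=0.01,
           cooling=0.95, steps_per_t=50, seed=0):
    """Return the lowest-energy state seen while cooling from t_start to t_end."""
    rng = random.Random(seed)
    x = best = x0
    T = t_start
    while T > t_end:
        for _ in range(steps_per_t):
            y = rng.choice(neighbours(x))
            dE = energy(y) - energy(x)
            # Always accept downhill moves; accept uphill with prob exp(-dE/T).
            if dE <= 0 or rng.random() < math.exp(-dE / T):
                x = y
                if energy(x) < energy(best):
                    best = x
        T *= cooling  # geometric cooling schedule (an assumption)
    return best


landscape = [3.0, 1.0, 4.0, 0.5, 5.0, 2.0, 6.0, 0.0, 7.0]


def energy(i):
    return landscape[i]


def neighbours(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]


best = anneal(energy, neighbours, x0=0)
print(best, landscape[best])
```

At high temperature the walker crosses the barriers between local minima freely; as T falls, uphill moves become rare and the search settles into a low-energy state.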

Page 18: Multiple Alignment by profile HMM training and Phylogenetic Trees

Simulated annealing for HMM

Natural energy function: the negative log of the likelihood, -log P(data|θ)

$$\mathrm{Prob}(\theta) = \frac{1}{Z} \exp\left(-\frac{1}{T}\left(-\log P(\mathrm{data} \mid \theta)\right)\right) = \frac{P(\mathrm{data} \mid \theta)^{1/T}}{\int P(\mathrm{data} \mid \theta')^{1/T}\, d\theta'}$$

Sampling from this distribution is non-trivial; the two methods I'm going to mention are approximations

Page 19: Multiple Alignment by profile HMM training and Phylogenetic Trees

Noise injection

Add noise to the counts estimated in the forward-backward procedure, and let the size of the noise decrease slowly

In Krogh et al. [1994] the noise was generated by a random walk in the initial model
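A hedged sketch of the general idea follows. It uses plain uniform noise with a geometric decay rather than the random-walk noise of Krogh et al., and the function names, the decay schedule, and the example counts are all assumptions for illustration.

```python
import random


def noisy_probabilities(counts, noise_scale, rng):
    """Add non-negative noise to expected counts, then normalise."""
    noisy = [c + rng.uniform(0.0, noise_scale) for c in counts]
    total = sum(noisy)
    if total == 0.0:
        # No information at all: fall back to a uniform distribution.
        return [1.0 / len(counts)] * len(counts)
    return [v / total for v in noisy]


def noise_schedule(initial_scale, decay, n_iterations):
    """Noise size per training iteration, shrinking geometrically."""
    return [initial_scale * decay ** it for it in range(n_iterations)]


rng = random.Random(0)
counts = [3.0, 1.0, 0.0]  # e.g. expected emission counts for one state
for scale in noise_schedule(1.0, 0.5, 4):
    print(scale, noisy_probabilities(counts, scale, rng))
```

Early iterations are perturbed strongly, which lets training escape poor local maxima; once the noise has decayed, the update approaches the ordinary re-estimate.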

Page 20: Multiple Alignment by profile HMM training and Phylogenetic Trees

Simulated annealing Viterbi estimation

If there are N sequences, there's an exact translation from the N paths $\pi^1, \ldots, \pi^N$ to the parameters of the model

Treat the paths as fundamental parameters in which to maximise the likelihood

Simulated annealing done in these variables instead of the model parameters

Page 21: Multiple Alignment by profile HMM training and Phylogenetic Trees

Simulated annealing Viterbi estimation

$$\mathrm{Prob}(\pi) = \frac{P(\pi, x \mid \theta)^{1/T}}{\sum_{\pi'} P(\pi', x \mid \theta)^{1/T}}$$

Denominator is Z, the partition function -> sum over all paths

Can be obtained by a modified forward algorithm using exponentiated transition and emission parameters

Page 22: Multiple Alignment by profile HMM training and Phylogenetic Trees

Simulated annealing Viterbi estimation

Exponentiated transition parameter
– $\hat{a}_{ij} = a_{ij}^{1/T}$

Exponentiated emission parameter
– $\hat{e}_j(x) = e_j(x)^{1/T}$

Used in place of the unmodified probability parameters in the forward algorithm

Z is the result of the forward algorithm
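The claim that the forward algorithm with exponentiated parameters yields Z can be checked numerically on a toy model. The sketch below assumes a plain HMM without an explicit End state and also exponentiates the initial-state probabilities; the function names and the two-state example are illustrative assumptions.

```python
import itertools


def partition_function(x, init, a, e, T):
    """Z = sum over paths of P(path, x)^(1/T), via a modified forward pass."""
    p = 1.0 / T
    n = len(init)
    f = [(init[k] * e[k][x[0]]) ** p for k in range(n)]
    for sym in x[1:]:
        f = [e[l][sym] ** p * sum(f[k] * a[k][l] ** p for k in range(n))
             for l in range(n)]
    return sum(f)


def brute_force_z(x, init, a, e, T):
    """Same quantity by enumerating every state path (tiny inputs only)."""
    n = len(init)
    total = 0.0
    for path in itertools.product(range(n), repeat=len(x)):
        prob = init[path[0]] * e[path[0]][x[0]]
        for i in range(1, len(x)):
            prob *= a[path[i - 1]][path[i]] * e[path[i]][x[i]]
        total += prob ** (1.0 / T)
    return total


init = [0.5, 0.5]
a = [[0.9, 0.1], [0.2, 0.8]]
e = [{"A": 0.7, "C": 0.3}, {"A": 0.2, "C": 0.8}]
print(partition_function("ACCA", init, a, e, T=2.0))
print(brute_force_z("ACCA", init, a, e, T=2.0))
```

The two numbers agree because raising a product of parameters to the power 1/T is the same as multiplying the individually exponentiated parameters; at T = 1 the result is simply P(x).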

Page 23: Multiple Alignment by profile HMM training and Phylogenetic Trees

Simulated annealing Viterbi estimation

Algorithm: Stochastic sampling traceback algorithm for HMMs

Initialisation: $\pi_{L+1}$ = End.

Recursion: for L+1 ≥ i ≥ 1,

$$\mathrm{Prob}(\pi_{i-1} \mid \pi_i) = \frac{f_{i-1,\pi_{i-1}}\, \hat{a}_{\pi_{i-1},\pi_i}}{\sum_k f_{i-1,k}\, \hat{a}_{k,\pi_i}}$$
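A minimal sketch of the sampling traceback for a plain HMM is given below. It assumes there is no explicit End state, so the last state is drawn in proportion to the final forward column; the function name `sample_path`, the parameter layout, and the toy model are illustrative assumptions rather than the authors' implementation.

```python
import random


def sample_path(x, init, a, e, T, rng):
    """Sample a state path with probability proportional to P(path, x)^(1/T)."""
    p = 1.0 / T
    n = len(init)
    # Forward matrix computed with exponentiated parameters.
    f = [[(init[k] * e[k][x[0]]) ** p for k in range(n)]]
    for sym in x[1:]:
        prev = f[-1]
        f.append([e[l][sym] ** p * sum(prev[k] * a[k][l] ** p for k in range(n))
                  for l in range(n)])

    # Assumption: without an End state, draw the last state from f[-1].
    state = rng.choices(range(n), weights=f[-1])[0]
    path = [state]
    # Traceback: Prob(prev = k | current) is proportional to f[i-1][k] * a_hat[k][current].
    for i in range(len(x) - 1, 0, -1):
        weights = [f[i - 1][k] * a[k][state] ** p for k in range(n)]
        state = rng.choices(range(n), weights=weights)[0]
        path.append(state)
    path.reverse()
    return path


init = [0.5, 0.5]
a = [[0.9, 0.1], [0.2, 0.8]]
e = [{"A": 0.7, "C": 0.3}, {"A": 0.2, "C": 0.8}]
rng = random.Random(0)
for T in (5.0, 1.0, 0.1):
    print(T, sample_path("ACCA", init, a, e, T, rng))
```

At high T the sampled paths vary widely, and as T is lowered the samples concentrate on the most probable path.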

Page 24: Multiple Alignment by profile HMM training and Phylogenetic Trees

Simulated annealing Viterbi vs Viterbi

Key difference:

Viterbi selects the most probable path for each sequence

Simulated annealing samples each path according to the likelihood of the path

Page 25: Multiple Alignment by profile HMM training and Phylogenetic Trees

Model Surgery

While training a model, two things can happen:

(a) some match states are redundant and should be absorbed into an insert state

(b) one or more insert states absorb too much sequence, in which case they should be expanded

Page 26: Multiple Alignment by profile HMM training and Phylogenetic Trees

Model Surgery

Monitor how much a certain transition is used by the training sequences

The usage of a match state is the sum of counts for all letters in the state

Page 27: Multiple Alignment by profile HMM training and Phylogenetic Trees

Model surgery

If a match state is used by fewer than ½ of the sequences -> delete its module

If more than ½ of the sequences use the transitions into an insert state, it is expanded into new modules
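The two thresholds can be written down as a small decision rule. In this sketch the function name `surgery_plan`, the argument layout (usage given as the number of sequences using each state), and the default cutoffs of ½ are assumptions for illustration; how many new modules an expanded insert state receives is left out.

```python
def surgery_plan(match_usage, insert_usage, n_seqs,
                 cut_match=0.5, cut_insert=0.5):
    """Return (match states to delete, insert states to expand).

    match_usage[k-1] is the number of sequences using match state k;
    insert_usage[k] is the number using insert state k (k = 0..M).
    """
    delete = [k for k, used in enumerate(match_usage, start=1)
              if used / n_seqs < cut_match]
    expand = [k for k, used in enumerate(insert_usage)
              if used / n_seqs > cut_insert]
    return delete, expand


delete, expand = surgery_plan(match_usage=[7, 3, 6],
                              insert_usage=[0, 5, 1, 0],
                              n_seqs=7)
print(delete, expand)
```

With seven sequences, match state 2 is used by only three of them and is marked for deletion, while insert state 1 is used by five and is marked for expansion.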

Page 28: Multiple Alignment by profile HMM training and Phylogenetic Trees

Model surgery – Example SAM

I tried a sequence in SAM with and without model surgery

Same 7 sequences as in example before

Parameters <cutinsert 0.25> <cutmatch 0.5> -> delete any match state used by fewer than half the sequences, and insert match states for any insert node used by more than one quarter of the sequences

Page 29: Multiple Alignment by profile HMM training and Phylogenetic Trees

Model surgery – Example SAM

Without model surgery
>seq1
FPHFD.....L...S.....-HGSAQ
>seq2
FESFG.....D...LstpdaVMGNPK
>seq3
FDRFKhlkteA...E.....MKASED
>seq4
FTQFA.....G...Kdles.IKGTAP
>seq5
FPKFK.....G...LttadqLKKSAD
>seq6
FSFLK.....GtseV.....PQNNPE
>seq7
FGFSG.....A...-.....--SDPG

With model surgery
>seq1
FPHF.DLS-..-..--HGSAQ
>seq2
FESF.GDLStpD..AVMGNPK
>seq3
FDRF.KHLK..TeaEMKASED
>seq4
FTQFaGKDL..E..SIKGTAP
>seq5
FPKF.KGLTtaD..QLKKSAD
>seq6
FSFL.KGTS..E..VPQNNPE
>seq7
FGFS.G---..-..--ASDPG

Page 30: Multiple Alignment by profile HMM training and Phylogenetic Trees

Building phylogenetic trees

Page 31: Multiple Alignment by profile HMM training and Phylogenetic Trees

Overview

The tree of life – description
Background on trees

Page 32: Multiple Alignment by profile HMM training and Phylogenetic Trees

Multiple alignment and trees

Alignment of sequences should take account of their evolutionary relationship. [Sankoff, Morel & Cedergren, 1973]

Several progressive alignment algorithms use a ‘guide tree’ (to guide the clustering process).

We begin to build trees.

Page 33: Multiple Alignment by profile HMM training and Phylogenetic Trees

The tree of life

The similarity of molecular mechanisms of the organisms that have been studied strongly suggests that all organisms on Earth had a common ancestor. Thus any set of species is related, and this relationship is called a phylogeny.

Usually the relationship can be represented by a phylogenetic tree.

Page 34: Multiple Alignment by profile HMM training and Phylogenetic Trees


Zuckerkandl & Pauling’s paper [1962] showed that molecular sequences provide sets of characters that can carry a large amount of information.

An assumption: the sequences we want to analyse phylogenetically have descended from some common ancestral gene in a common ancestral species.

Gene duplication exists => we have to check the assumption carefully.

Page 35: Multiple Alignment by profile HMM training and Phylogenetic Trees

Gene duplication and speciation

By another mechanism, gene duplication, two sequences can also be separated and diverge from the common ancestor.

Genes which diverged because of speciation are called orthologues. Genes which diverged by gene duplication are called paralogues.

Page 36: Multiple Alignment by profile HMM training and Phylogenetic Trees

A tree of orthologues: alpha haemoglobins HBA_ACCGE, HBA_AEGMO, HBA_AILFU, HBA_AILME, HBA_ALCAA, HBA_ALLMI, HBA_AMBME, HBA_ANAPL (SWISS-PROT).

Page 37: Multiple Alignment by profile HMM training and Phylogenetic Trees

A tree of paralogues: HBAT_HUMAN, HBAZ_HUMAN, HBA_HUMAN, HBB_HUMAN, HBD_HUMAN, HBE_HUMAN, HBG_HUMAN, MYG_HUMAN (SWISS-PROT).

Page 38: Multiple Alignment by profile HMM training and Phylogenetic Trees

Background on trees

All trees will be assumed to be binary (an edge that branches splits into two daughter edges).

Each edge of the tree has a certain amount of evolutionary divergence associated with it. We adopt the general term ‘length’, which will be represented by the lengths of edges in figures.

A true biological phylogeny has a ‘root’, or ultimate ancestor of all sequences.

Page 39: Multiple Alignment by profile HMM training and Phylogenetic Trees

Rooted and unrooted tree

Page 40: Multiple Alignment by profile HMM training and Phylogenetic Trees

A tree with a given labelling will be called a labelled branching pattern.

We refer to this as the tree topology and denote it by T.

Lengths of the edges: $t_i$, with a suitable numbering scheme for the indices i.

Page 41: Multiple Alignment by profile HMM training and Phylogenetic Trees

Counting and labelling

Rooted tree:
– n leaves, plus (n-1) branch nodes in addition to the leaves -> we have 2n-1 nodes in all, and 2n-2 edges.
– leaves are labelled 1..n, branch nodes n+1..2n-1; the (2n-1)th node is the root.

Page 42: Multiple Alignment by profile HMM training and Phylogenetic Trees

Counting and labelling

Unrooted tree:
– n leaves, 2n-2 nodes and 2n-3 edges.
– a root can be added at any of its edges => we can get 2n-3 rooted trees.
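The node and edge counts for binary trees are easy to encode and check. The helper name `tree_size` below is a hypothetical choice made for this sketch.

```python
def tree_size(n_leaves, rooted):
    """Return (number of nodes, number of edges) of a binary tree."""
    if n_leaves < 2:
        raise ValueError("need at least two leaves")
    if rooted:
        # n leaves plus n-1 branch nodes (the last one being the root).
        return 2 * n_leaves - 1, 2 * n_leaves - 2
    # Removing the root merges its two edges into one.
    return 2 * n_leaves - 2, 2 * n_leaves - 3


for n in (3, 4, 5):
    print(n, tree_size(n, rooted=True), tree_size(n, rooted=False))
```

The unrooted edge count is also the number of places where a root can be inserted, which is where the factor 2n-3 comes from.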

Page 43: Multiple Alignment by profile HMM training and Phylogenetic Trees

Number of rooted and unrooted trees

A root can be added at any edge, producing 2n-3 rooted trees from unrooted tree => there are (2n-3) times as many rooted trees as unrooted trees, for a given number n of leaves.

Page 44: Multiple Alignment by profile HMM training and Phylogenetic Trees


Instead of the root, we can add an extra edge or ‘branch’ with a distinct label in its leaf.

Page 45: Multiple Alignment by profile HMM training and Phylogenetic Trees

● There are three such trees with four leaves, each with (2n-3)=5 edges – they are distinct labelled branching patterns.

● There are then five ways of adding a further branch labelled with a distinct label (‘5’), giving in all 3x5=15 unrooted trees with five leaves.

● The number of unrooted trees with n leaves is equal to 3*5*...*(2n-5) = (2n-5)!! So, we have (2n-3)!! rooted trees with n leaves.
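The double-factorial counts grow very quickly, as a short sketch shows. The function names are illustrative choices for this example.

```python
def double_factorial(m):
    """m!! = m * (m-2) * (m-4) * ... down to 1 (returns 1 for m <= 1)."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def n_unrooted(n_leaves):
    """Number of labelled unrooted binary trees: (2n-5)!! for n >= 3."""
    if n_leaves < 3:
        raise ValueError("need at least three leaves")
    return double_factorial(2 * n_leaves - 5)


def n_rooted(n_leaves):
    """Number of labelled rooted binary trees: (2n-3)!! for n >= 2."""
    if n_leaves < 2:
        raise ValueError("need at least two leaves")
    return double_factorial(2 * n_leaves - 3)


for n in (3, 4, 5, 10):
    print(n, n_unrooted(n), n_rooted(n))
```

Already for ten leaves there are about two million unrooted and over 34 million rooted trees, so exhaustive search over topologies is only feasible for very small n.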

Page 46: Multiple Alignment by profile HMM training and Phylogenetic Trees

Building phylogenetic trees

Questions?

Page 47: Multiple Alignment by profile HMM training and Phylogenetic Trees

Exercise 7.2

The trees with three and four leaves in Figure 7.3 all have the same unlabelled branching pattern. For both rooted and unrooted trees, how many leaves do there have to be to obtain more than one unlabelled branching pattern? Find a recurrence relation for the number of rooted trees. (Hint: consider the trees formed by joining two trees at their root).

Page 48: Multiple Alignment by profile HMM training and Phylogenetic Trees

Exercise 7.2

Page 49: Multiple Alignment by profile HMM training and Phylogenetic Trees

Exercise 7.3

All trees considered so far have been binary, but one can envisage ternary trees that, in their rooted form, have three branches descending from a branch node. If there are m branch nodes in an unrooted ternary tree, how many leaves are there and how many edges?

Page 50: Multiple Alignment by profile HMM training and Phylogenetic Trees

Exercise 7.4

Consider next a composite unrooted tree with m ternary branch nodes and n binary branch nodes. How many leaves are there, and how many edges? Let $N_{m,n}$ denote the number of distinct labelled branching patterns of this tree. Extend the counting argument for binary trees to show that

$$N_{m,n} = (3m+2n-1)\, N_{m,n-1} + (n+1)\, N_{m-1,n+1}$$

(Hint: the first term after the ‘=’ counts the number of ways that a new edge can be added to an existing edge, thereby creating an additional binary node; the second term corresponds to edges added at binary nodes, thereby producing ternary nodes.)