Dynamic Programming Part III: Global sequence alignment & Scoring matrices Bioinfo I (Institut Pasteur de Montevideo) Dyn. Programming -class 5- July 26th, 2011 1 / 69

Dynamic Programming
Part III: Global sequence alignment & Scoring matrices

Bioinfo I (Institut Pasteur de Montevideo) Dyn. Programming -class 5- July 26th, 2011 1 / 69


Outline

Global sequence alignment
Scoring matrices
Local sequence alignment


From LCS to Alignment: Change up the Scoring

The Longest Common Subsequence (LCS) problem, the simplest form of sequence alignment, allows only insertions and deletions (no mismatches).
In the LCS problem, we scored 1 for matches and 0 for indels.
Consider penalizing indels and mismatches with negative scores.
Simplest scoring schema:

    +1 : match premium
    -µ : mismatch penalty
    -σ : indel penalty


Simple Scoring

When mismatches are penalized by -µ, indels are penalized by -σ, and matches are rewarded with +1, the resulting score is:

#matches − µ(#mismatches) − σ(#indels)
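As a minimal sketch of this formula, the following Python function scores an alignment that is already given as two equal-length rows. The function name `simple_score` and the use of "-" as the gap character are assumptions made for illustration; they do not come from the slides.

```python
def simple_score(row_v, row_w, mu, sigma):
    """Return #matches - mu * #mismatches - sigma * #indels for a given alignment."""
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have equal length")
    matches = mismatches = indels = 0
    for a, b in zip(row_v, row_w):
        if a == "-" or b == "-":
            indels += 1          # a column containing a gap is an indel
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - sigma * indels


# Four matches, one mismatch (G/C) and one indel (G/-), with mu = 1 and sigma = 2.
print(simple_score("GATTAG", "-ATTAC", mu=1, sigma=2))  # 4 - 1 - 2 = 1
```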


The Global Alignment Problem

Goal: Find the best alignment between two sequences (strings) under a given scoring schema
Input: Sequences (strings) v and w and a scoring schema
Output: Alignment of maximum score


Global alignment: Needleman-Wunsch algorithm

The Needleman-Wunsch algorithm [1] is a dynamic program that solves the problem of obtaining the best global alignment of two sequences.
Idea: Build up an optimal alignment using previous solutions for optimal alignments of shorter prefixes.
Given two sequences $X = (x_1, x_2, \dots, x_n)$ and $Y = (y_1, y_2, \dots, y_m)$, we will compute a matrix

$$F : \{0, 1, \dots, n\} \times \{0, 1, \dots, m\} \to \mathbb{R}$$

in which $F(i, j)$ equals the best score of the alignment of the two prefixes $(x_1, x_2, \dots, x_i)$ and $(y_1, y_2, \dots, y_j)$.

[1] Saul Needleman and Christian Wunsch (1970), improved by Peter Sellers (1974).


Needleman-Wunsch algorithm

This will be done recursively by setting $F(0, 0) = 0$ and then computing $F(i, j)$ from $F(i-1, j-1)$, $F(i-1, j)$ and $F(i, j-1)$:

             x_{i-1}            x_i
y_{j-1}      F(i-1, j-1)        F(i, j-1)
                         ↘      ↓
y_j          F(i-1, j)     →    F(i, j)

(Columns are indexed by 0, x_1, ..., x_n and rows by 0, y_1, ..., y_m, with F(0, 0) in the top-left corner.)


The Global Alignment Problem

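We obtain $F(i, j)$ as the largest score arising from these three options:

$$
F(i, j) := \max \begin{cases}
F(i-1, j-1) + s(x_i, y_j) \\
F(i-1, j) - \sigma \\
F(i, j-1) - \sigma
\end{cases}
$$

Under the simple scoring schema, $s(x_i, y_j) = +1$ if $x_i = y_j$ (match) and $s(x_i, y_j) = -\mu$ otherwise (mismatch).

This is applied repeatedly until the whole matrix $F$ is filled with values.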


The recursion

To complete the description of the recursion, we need to set the values of $F(i, 0)$ and $F(0, j)$ for $i \neq 0$ and $j \neq 0$:

We set $F(i, 0) = -i\sigma$ for $i = 0, 1, \dots, n$ and
we set $F(0, j) = -j\sigma$ for $j = 0, 1, \dots, m$.

The final value $F(n, m)$ contains the score of the best global alignment between X and Y.
To obtain an alignment corresponding to this score, we must find the path of choices that the recursion made to obtain the score, using traceback.


Example of a global alignment matrix

Needleman-Wunsch matrix of the sequences GATTAG and ATTAC, with scoring values $s(a, a) = 1$, $s(a, b) = -1$ for $a \neq b$, and a linear gap penalty of $\sigma = 2$ (each indel scores −2):

F     0    G    A    T    T    A    G
0     0   -2   -4   -6   -8  -10  -12
A    -2   -1   -1   -3   -5   -7   -9
T    -4   -3   -2    0   -2   -4   -6
T    -6   -5   -4   -1    1   -1   -3
A    -8   -7   -4   -3   -1    2    0
C   -10   -9   -6   -5   -3    0    1

Score: 1; Alignment:

G A T T A G
- A T T A C
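The matrix for this example can be recomputed with a few lines of Python. This is a sketch under stated assumptions: the function name `nw_matrix` and its keyword parameters are illustrative, and `F[i][j]` is stored with `i` running over GATTAG and `j` over ATTAC, so it is the transpose of the printed layout.

```python
def nw_matrix(x, y, match=1, mismatch=-1, sigma=2):
    """Fill the Needleman-Wunsch matrix F for sequences x and y with a linear gap penalty sigma."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * sigma              # prefix of x aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = -j * sigma              # prefix of y aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x_i with y_j
                          F[i - 1][j] - sigma,   # x_i against a gap
                          F[i][j - 1] - sigma)   # y_j against a gap
    return F


F = nw_matrix("GATTAG", "ATTAC")
print(F[6][5])  # score of the best global alignment: 1
```

Printing `[F[i][j] for i in range(7)]` for each `j` gives the rows in the same orientation as the slide.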


Pseudo code of Needleman-Wunsch algorithm

Input: two sequences X and Y
Output: optimal alignment and score α

Initialization:
    Set F(i, 0) := −i · σ for all i = 0, 1, 2, ..., n
    Set F(0, j) := −j · σ for all j = 0, 1, 2, ..., m
    Set T(i, 0) := (i − 1, 0) for i ≥ 1 and T(0, j) := (0, j − 1) for j ≥ 1

For i = 1, 2, ..., n do:
    For j = 1, 2, ..., m do:
        Set F(i, j) := max { F(i − 1, j − 1) + s(xi, yj),  F(i − 1, j) − σ,  F(i, j − 1) − σ }
        Set backtrace T(i, j) to the maximizing pair (i′, j′)

The best score is α := F(n, m)
Set (i, j) := (n, m)
repeat
    if T(i, j) = (i − 1, j − 1) print the column (xi over yj)
    else if T(i, j) = (i − 1, j) print the column (xi over −)
    else print the column (− over yj)
    Set (i, j) := T(i, j)
until (i, j) = (0, 0).

The columns are printed from the last alignment column to the first, so the output must be reversed to read the alignment from left to right.
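The pseudocode can be turned into a short runnable sketch. The names `needleman_wunsch`, `F` and `T` mirror the slide's notation, but the function signature, the tie-breaking rule (the diagonal move wins ties because it is listed first) and the boundary back-pointers are assumptions of this sketch, not something the slides specify.

```python
def needleman_wunsch(x, y, s, sigma):
    """Return (score, aligned_x, aligned_y) for a global alignment with linear gap penalty sigma."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    T = [[None] * (m + 1) for _ in range(n + 1)]   # back-pointers
    for i in range(1, n + 1):
        F[i][0], T[i][0] = -i * sigma, (i - 1, 0)
    for j in range(1, m + 1):
        F[0][j], T[0][j] = -j * sigma, (0, j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), (i - 1, j - 1)),
                (F[i - 1][j] - sigma, (i - 1, j)),
                (F[i][j - 1] - sigma, (i, j - 1)),
            ]
            # max() keeps the first maximal option, so ties prefer the diagonal.
            F[i][j], T[i][j] = max(options, key=lambda option: option[0])
    # Traceback from (n, m) to (0, 0); columns are collected in reverse order.
    top, bottom = [], []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = T[i][j]
        top.append(x[i - 1] if pi == i - 1 else "-")
        bottom.append(y[j - 1] if pj == j - 1 else "-")
        i, j = pi, pj
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, row_x, row_y = needleman_wunsch(
    "GATTAG", "ATTAC", s=lambda a, b: 1 if a == b else -1, sigma=2)
print(score)
print(row_x)
print(row_y)
```

Storing explicit back-pointers costs a second matrix; an alternative is to recompute at each traceback step which of the three options produced `F[i][j]`.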


Complexity

Complexity of the Needleman-Wunsch algorithm:
We need to store $(n + 1) \times (m + 1)$ numbers. Each number takes a constant number of calculations to compute: three sums and a max.
Hence, the algorithm requires $O(nm)$ time and memory.

Something to think about: if we are only interested in the best score, but not the actual alignment, then it is easy to reduce the space requirement to linear.
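One possible answer to this exercise, sketched in Python: each row of the matrix depends only on the previous row, so two rows of length m + 1 are enough when only the score is needed. The function name `nw_score_linear_space` is hypothetical.

```python
def nw_score_linear_space(x, y, s, sigma):
    """Best global alignment score using O(len(y)) memory instead of O(len(x) * len(y))."""
    prev = [-j * sigma for j in range(len(y) + 1)]       # row i - 1, starting with row 0
    for i in range(1, len(x) + 1):
        curr = [-i * sigma] + [0] * len(y)               # row i; first column is F(i, 0)
        for j in range(1, len(y) + 1):
            curr[j] = max(prev[j - 1] + s(x[i - 1], y[j - 1]),
                          prev[j] - sigma,
                          curr[j - 1] - sigma)
        prev = curr                                      # row i becomes the previous row
    return prev[len(y)]


print(nw_score_linear_space("GATTAG", "ATTAC", lambda a, b: 1 if a == b else -1, 2))  # 1
```

The running time is still O(nm); only the memory drops, and the traceback information is lost.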


Scoring Matrices

To generalize scoring, consider a $(4 + 1) \times (4 + 1)$ scoring matrix $\delta$.
In the case of an amino acid sequence alignment, the scoring matrix would be of size $(20 + 1) \times (20 + 1)$.
The addition of 1 is to include the score for comparison with the gap character "-".
This will simplify the algorithm as follows:

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
$$
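A hedged sketch of how such a matrix could be represented: a Python dictionary keyed by symbol pairs over the alphabet plus the gap character. The helper names `make_delta` and `global_score` are assumptions for illustration, and the entry for ("-", "-") is filled in but never used by the recurrence.

```python
def make_delta(alphabet, match, mismatch, gap):
    """Build a (k + 1) x (k + 1) scoring matrix over alphabet + '-' as a dict."""
    symbols = list(alphabet) + ["-"]
    delta = {}
    for a in symbols:
        for b in symbols:
            if a == "-" or b == "-":
                delta[a, b] = gap                      # includes the unused ('-', '-') entry
            else:
                delta[a, b] = match if a == b else mismatch
    return delta


def global_score(v, w, delta):
    """Global alignment score where every move is scored by a lookup in delta."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + delta[v[i - 1], "-"]
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + delta["-", w[j - 1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j - 1] + delta[v[i - 1], w[j - 1]],
                          s[i - 1][j] + delta[v[i - 1], "-"],
                          s[i][j - 1] + delta["-", w[j - 1]])
    return s[n][m]


delta = make_delta("ACGT", match=1, mismatch=-1, gap=-2)
print(len(delta))                                # 25 entries for DNA
print(global_score("GATTAG", "ATTAC", delta))    # 1
```

With this representation, switching to a different scoring scheme only means supplying a different dictionary; the recurrence itself does not change.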


Measuring Similarity

Measuring the extent of similarity between two sequences

Based on percent sequence identity
Based on conservation


Percent Sequence Identity

The extent to which two nucleotide or amino acid sequences are invariant
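As an illustration, percent identity can be computed from an existing alignment. The slide does not fix a denominator; this sketch assumes the full alignment length (gap columns included), which is only one of several conventions in use. The function name `percent_identity` is hypothetical.

```python
def percent_identity(row_a, row_b):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("rows must be non-empty and of equal length")
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * identical / len(row_a)   # assumption: denominator is the alignment length


print(round(percent_identity("GATTAG", "-ATTAC"), 1))  # 4 identical columns out of 6
```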


Conservation

Amino acid changes that tend to preserve the physico-chemical propertiesof the original residue

Polar to polar: aspartate → glutamate

Nonpolar to nonpolar: alanine → valine

Similarly behaving residues: leucine → isoleucine


The scoring model

The algorithms that compute an alignment critically depend on the choice of the parameters for substitutions, deletions and insertions.
Generally, no existing scoring model can be applied to all situations. The underlying question and/or application always needs to be considered.
Generally, pairwise alignments are conducted when:

Evolutionary relationships between the sequences are reconstructed. Here scoring matrices based on mutation rates are usually applied.
Protein domains are compared. Then the scoring matrices should be based on the composition of domains and their substitution frequencies.


Substitution matrices

To be able to score an alignment, we need to determine score terms for each aligned residue pair.

Definition
A substitution matrix S over an alphabet $\Sigma = \{a_1, \dots, a_\kappa\}$ has $\kappa \times \kappa$ entries, where each entry $(i, j)$ assigns a score for a substitution of the letter $a_i$ by the letter $a_j$ in an alignment.


Making a Scoring Matrix

Scoring matrices are created based on biological evidence.
Alignments can be thought of as two sequences that differ due to mutations.
Some of these mutations have little effect on the protein's function, therefore some penalties, $\delta(v_i, w_j)$, will be less harsh than others.


Scoring matrix example

Notice that although R and K are different amino acids, they have a positive score.
Why? They are both positively charged amino acids → the substitution will not greatly change the function of the protein.


Similarity of AA residues

AA have different properties → substitution probabilities are different for each AA


Scoring matrices

Amino acid substitution matrices

    PAM
    BLOSUM

DNA substitution matrices

    DNA is less conserved than protein sequences
    Less effective to compare coding regions at nucleotide level


Substitution matrices

Consider non-gapped alignments

$$X = x_1 x_2 \dots x_n$$

$$Y = y_1 y_2 \dots y_n$$

Null hypothesis: the two sequences are unrelated (not homologous). The alignment is then random with a probability described by the model R: each letter $a$ occurs independently with some probability $p_a$, and hence the probability of the two sequences is the product:

$$P(X, Y \mid R) = \prod_i p_{x_i} \prod_j p_{y_j}.$$


Substitution matrices

Alternative hypothesis, match model M: the two sequences are related (homologous). In this model, aligned pairs of residues occur with a joint probability $p_{ab}$, which is the probability that $a$ and $b$ have each evolved from some unknown original residue $c$ as their common ancestor. Thus, the probability for the whole alignment is:

$$P(X, Y \mid M) = \prod_i p_{x_i y_i}.$$


Substitution matrices

The ratio of the two gives a measure of the relative likelihood that the sequences are related (model M) as opposed to being unrelated (model R). This ratio is called the odds ratio:

$$\frac{P(X, Y \mid M)}{P(X, Y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i p_{x_i} \prod_i p_{y_i}} = \prod_i \frac{p_{x_i y_i}}{p_{x_i} p_{y_i}}$$


Substitution matrices

To obtain an additive scoring scheme, we take the logarithm (base 2 is usually chosen) to get the log-odds ratio:

$$S = \log\left(\frac{P(X, Y \mid M)}{P(X, Y \mid R)}\right) = \log\left(\prod_i \frac{p_{x_i y_i}}{p_{x_i} p_{y_i}}\right) = \sum_i s(x_i, y_i),$$

with

$$s(a, b) := \log\left(\frac{p_{ab}}{p_a p_b}\right).$$

We thus obtain a matrix $S = (s(a, b))$ that determines a score for each aligned residue pair, known as a score or substitution matrix.
For amino-acid alignments, commonly used matrices are the PAM and BLOSUM matrices.
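To make the additivity concrete, here is a small Python sketch over a two-letter toy alphabet. All probabilities are made-up illustration values (not taken from PAM, BLOSUM or any real data), and the function names `log_odds_matrix` and `alignment_score` are hypothetical.

```python
import math


def log_odds_matrix(p_pair, p_single):
    """s(a, b) = log2(p_ab / (p_a * p_b)) for every pair with a joint probability."""
    return {(a, b): math.log2(p_ab / (p_single[a] * p_single[b]))
            for (a, b), p_ab in p_pair.items()}


def alignment_score(x, y, s):
    """Sum of per-column scores for an ungapped alignment of x and y."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(s[a, b] for a, b in zip(x, y))


# Toy values, chosen only so that the joint distribution has marginals 0.5 / 0.5.
p_single = {"A": 0.5, "B": 0.5}
p_pair = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

s = log_odds_matrix(p_pair, p_single)
total = alignment_score("AAB", "ABB", s)
# The summed score equals the log of the product of the per-column odds ratios.
print(math.isclose(total, math.log2((0.4 / 0.25) * (0.1 / 0.25) * (0.4 / 0.25))))
```

In this toy model identical pairs are more likely under M than under R, so they score positive, while the mismatching pair scores negative.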


Two major scoring matrices for AA sequence comparisons

PAM: derived from sequences known to be closely related (e.g. chimpanzee and human). Ranges from PAM1 to PAM500.
BLOSUM: derived from sequences not closely related (e.g. E. coli and human). Ranges from BLOSUM10 to BLOSUM100.


PAM

Point Accepted Mutation (M. Dayhoff et al., 1978)
A series of matrices describing the extent to which two amino acids have been interchanged in evolution
The PAM-1 scoring matrix was obtained by aligning very similar sequences. Other PAMs were obtained by mathematical extrapolation


PAM

1 PAM = PAM1 = 1% average change of all amino acid positions (one point mutation every 100 AA)

After 100 PAMs of evolution, not every residue will have changed:
    some residues may have mutated several times
    some residues may have returned to their original state
    some residues may not have changed at all


PAM

Other PAM matrices are calculated from PAM1:

$$\mathrm{PAM}x = \underbrace{\mathrm{PAM1} \cdot \ldots \cdot \mathrm{PAM1}}_{x \text{ times}}$$

This assumes that mutations keep the same pattern as in the PAM1 matrix and that multiple substitutions can occur at the same time.
These PAMx matrices are appropriate for evaluating evolutionarily distant sequences.
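The extrapolation can be sketched as repeated matrix multiplication, under the assumption that the matrix being multiplied holds mutation probabilities (each row sums to 1) rather than log-odds scores. The 2 × 2 matrix below is a toy stand-in with a 1% change rate, not the real 20 × 20 PAM1; `mat_mul` and `mat_pow` are illustrative helper names.

```python
def mat_mul(a, b):
    """Plain-Python product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_pow(matrix, x):
    """Multiply matrix by itself x times, starting from the identity."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(x):
        result = mat_mul(result, matrix)
    return result


# Toy "PAM1" over a two-letter alphabet: 1% chance of change per unit of evolution.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

Even after 250 steps the diagonal stays slightly above 0.5 in this toy model, which illustrates why not every residue has changed after many PAMs of evolution.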


PAM

Multiply PAM1 by itself 250 times
Equivalent to 250 substitutions every 100 AA
More substitutions than AA → multiple substitutions!
Valid for long periods of time
PAM100: (matrix figure not reproduced)


BLOSUM

Blocks Substitution Matrix
Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins
Matrix name indicates evolutionary distance
BLOSUM62 was created using sequences sharing no more than 62% identity


BLOSUM

BLOSUM matrices are built from distantly related sequences within conserved blocks, whereas PAM is built from closely related sequences
BLOSUM matrices are built from conserved blocks of aligned protein segments found in the BLOCKS database (the BLOCKS database is a secondary database that derives information from the PROSITE family database)


The Blosum50 Scoring Matrix


BLOSUM

Version 8.0 of the Blocks Database consists of 2884 blocks based on 770protein families documented in PROSITE.

Hypothetical entry in red box in BLOCK record:


Building BLOSUM Matrices

1. To build the BLOSUM62 matrix, one must eliminate sequences that are identical in more than 62% of their AA positions.

2. This is done by either removing sequences from the BLOCK or by finding a cluster of similar sequences and replacing it with a single representative sequence.

3. Next, the probability for a pair of amino acids to be placed in the same column is calculated. In the previous page this would be the probability of replacement of A with A, A with B, A with C, and B with C. This gives the value $p_{ab}$.

4. Next, one calculates the probability of seeing the same pair by chance, given the frequencies with which the two amino acids occur in nature: $p_a \cdot p_b$.


Building BLOSUM Matrices

5. Finally, we calculate the log-odds ratio $s_{a,b} = \log_2\left(p_{ab} / (p_a \cdot p_b)\right)$. This value is entered into the matrix.

Which BLOSUM to use?

If you are comparing sequences that are very similar, use BLOSUM80. Sequence comparisons that are more divergent (dissimilar) than 20% are given very low scores in this matrix.
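Steps 3 to 5 of the construction can be sketched in Python on a tiny made-up block. This is a simplification under stated assumptions: the block contents are hypothetical, clustering at the identity threshold (steps 1 and 2) is skipped, pairs are counted in both orders so that $p_{ab} = p_{ba}$, and the scaling and rounding applied to published BLOSUM matrices are omitted. The function name `log_odds_from_block` is illustrative.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_from_block(block):
    """Log-odds scores from pair counts in the columns of an ungapped block of sequences."""
    pair_counts = Counter()
    for column in zip(*block):                      # one column of the block at a time
        for a, b in combinations(column, 2):        # every pair of sequences in the column
            pair_counts[a, b] += 1                  # count both orders so the table is symmetric
            pair_counts[b, a] += 1
    total = sum(pair_counts.values())
    p_pair = {pair: count / total for pair, count in pair_counts.items()}   # step 3: p_ab
    p_single = Counter()
    for (a, _b), p in p_pair.items():
        p_single[a] += p                            # step 4: background p_a as the marginal
    return {(a, b): math.log2(p / (p_single[a] * p_single[b]))              # step 5
            for (a, b), p in p_pair.items()}


toy_block = ["AB", "AB", "AA"]                      # hypothetical block: 3 sequences, 2 columns
scores = log_odds_from_block(toy_block)
for pair in sorted(scores):
    print(pair, round(scores[pair], 3))
```

Pairs that never occur in the block get no entry here; a real implementation would need pseudocounts or a much larger block collection to score them.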


Dynamic Programming
Part IV: Local Alignment


Local vs. Global Alignment

The Global Alignment Problem tries to find the longest path between vertices (0, 0) and (n, m) in the edit graph.
The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i, j) and (i′, j′) in the edit graph.
In the edit graph with negatively-scored edges, Local Alignment may score higher than Global Alignment.


Local Alignment

Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions.
Example:

    Homeobox genes have a short region called the homeodomain that is highly conserved between species.
    A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence.


The Local Alignment Problem

Goal: Find the best local alignment between two strings

Input: Strings v, w and scoring matrix δ

Output: Alignment of substrings of v and w whose alignment score is maximum among all possible alignments of all possible substrings


The Problem with this Problem

Long run time $O(n^4)$:

    In the grid of size $n \times n$ there are $n^2$ vertices $(i, j)$ that may serve as a source.
    For each such vertex, computing alignments from $(i, j)$ to $(i', j')$ takes $O(n^2)$ time.

This can be remedied by giving free rides.


Local Alignment: Example


Local Alignment: Free Rides

The dashed edges represent the free rides from (0, 0) to every other node.


The Local Alignment Recurrence

Smith-Waterman Algorithm:

The largest value of $s_{i,j}$ over the whole edit graph is the score of the best local alignment.
The recurrence:

$$
s_{i,j} = \max \begin{cases}
0 \\
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
$$

The 0 is the only change from the original recurrence of the Global Alignment.


The Local Alignment Recurrence

Smith-Waterman Algorithm:

The largest value of $s_{i,j}$ over the whole edit graph is the score of the best local alignment.
The recurrence:

$$
s_{i,j} = \max \begin{cases}
0 \\
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
$$

Power of ZERO: the 0 is the only change, since there is only one "free ride" edge entering into every vertex.
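A minimal sketch of this recurrence in Python, returning only the best local score and the cell where it is reached. The default scoring values (match +2, substitution −1, indel −2) are an assumption of this sketch, as are the function name `smith_waterman` and the choice to report the first cell that attains the maximum.

```python
def smith_waterman(v, w, match=2, mismatch=-1, indel=-2):
    """Return (best local alignment score, (i, j) of the cell where it ends)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]      # first row and column stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if v[i - 1] == w[j - 1] else mismatch
            s[i][j] = max(0,                        # free ride: start a new local alignment here
                          s[i - 1][j - 1] + d,
                          s[i - 1][j] + indel,
                          s[i][j - 1] + indel)
            if s[i][j] > best:
                best, best_cell = s[i][j], (i, j)
    return best, best_cell


print(smith_waterman("GATTAG", "ATTAC"))  # (8, (5, 4)): the shared substring ATTA
```

To recover the local alignment itself, one would trace back from `best_cell` until a cell with value 0 is reached.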


Scoring:
    indel −2
    match +2
    subst. −1


Scoring Indels: Naive Approach

A fixed penalty σ is given to every indel:
−σ for 1 indel
−2σ for 2 consecutive indels
−3σ for 3 consecutive indels, etc.
γ(g) = −gσ

This is a linear gap penalty. It can be too severe for a long gap: a series of 100 consecutive indels costs −100σ.


Affine Gap Penalties

In nature, a series of k indels often comes from a single event rather than from a series of k single-nucleotide events.


Accounting for Gaps

Gaps: contiguous sequence of spaces in one of the rows.
Instead of a linear score, an affine score is biologically more plausible.
The score for a gap of length g is given by:

γ(g) = −σ − (g − 1)e,

where σ is the gap open penalty and e is the gap extension penalty.
Usually, e < σ, with the result that fewer isolated gaps are produced, as shown in the following comparison:

Linear gap penalty:
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
GSAQVKGHGKK--------VA--D----A-SALSDLHAHKL

Affine gap penalty:
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
GSAQVKGHGKKVADA---------------SALSDLHAHKL
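To see why the affine scheme prefers the second alignment, one can count the gaps in each gapped row. The sketch below writes the two rows with one '-' per gap position; the helper names and the values σ = 10, e = 1 are illustrative assumptions, not taken from the slides.

```python
import re

# Gapped rows from the comparison above, one '-' per gap position.
LINEAR_ROW = ("GSAQVKGHGKK" + "-" * 8 + "VA" + "-" * 2 + "D"
              + "-" * 4 + "A" + "-" + "SALSDLHAHKL")
AFFINE_ROW = "GSAQVKGHGKKVADA" + "-" * 15 + "SALSDLHAHKL"


def gap_lengths(row):
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", row)]


def gap_score(row, sigma, e):
    """Sum of gamma(g) = -sigma - (g - 1) * e over all gaps in the row."""
    return sum(-sigma - (g - 1) * e for g in gap_lengths(row))


if __name__ == "__main__":
    sigma, e = 10, 1  # assumed values, for illustration only
    for name, row in (("linear", LINEAR_ROW), ("affine", AFFINE_ROW)):
        print(name, gap_lengths(row),
              "linear cost:", gap_score(row, sigma, sigma),
              "affine cost:", gap_score(row, sigma, e))
```

Both rows contain 15 gap positions, so a linear penalty (the case e = σ) cannot tell them apart. With e < σ, the single gap of length 15 is cheaper than four separate gaps of lengths 8, 2, 4 and 1.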


Affine Gap Penalties

Gap penalties: γ(g) = −σ − (g − 1)e

−σ when there is 1 indel
−σ − e when there are 2 indels
−σ − 2e when there are 3 indels
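The two penalty functions can be tabulated side by side to see how differently they grow with the gap length. The function names and the values σ = 10, e = 1 are assumptions chosen for illustration.

```python
def linear_gap(g, sigma):
    """Linear penalty: every indel in the gap costs sigma."""
    return -g * sigma


def affine_gap(g, sigma, e):
    """Affine penalty: sigma to open the gap, e for each further position."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -sigma - (g - 1) * e


if __name__ == "__main__":
    sigma, e = 10, 1  # illustrative values, not from the slides
    for g in (1, 2, 3, 100):
        print(g, linear_gap(g, sigma), affine_gap(g, sigma, e))
```

With these values a gap of length 100 costs −1000 under the linear scheme but only −109 under the affine one.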


Adding "Affine Penalty" Edges to the Edit Graph

To reflect affine gap penalties we have to add "long" horizontal and vertical edges to the edit graph. Each such edge of length g should have weight

−σ − (g − 1)e


There are many such edges!
Adding them to the graph increases the running time of the alignment algorithm by a factor of n, where n is the length of the sequences (the edit graph itself has about n² vertices).
So the complexity increases from O(n²) to O(n³).
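The cost of the long edges is easy to see in code: every cell now has to look back along its whole row and its whole column. The following is a minimal sketch of that naive approach for global alignment; the function names and the default scores (match +1, mismatch −1, σ = 3, e = 1) are assumptions for illustration.

```python
def affine_gap(g, sigma, e):
    """Weight of a long edge of length g."""
    return -sigma - (g - 1) * e


def global_affine_naive(v, w, match=1, mismatch=-1, sigma=3, e=1):
    """Global alignment score using explicit long edges (cubic time)."""
    n, m = len(v), len(w)
    neg_inf = float("-inf")
    s = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    s[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                delta = match if v[i - 1] == w[j - 1] else mismatch
                best = s[i - 1][j - 1] + delta
            # Long vertical edges: a gap of length g ending at row i.
            for g in range(1, i + 1):
                best = max(best, s[i - g][j] + affine_gap(g, sigma, e))
            # Long horizontal edges: a gap of length g ending at column j.
            for g in range(1, j + 1):
                best = max(best, s[i][j - g] + affine_gap(g, sigma, e))
            s[i][j] = best
    return s[n][m]


if __name__ == "__main__":
    print(global_affine_naive("ACGGGT", "ACT"))
```

The two inner loops over g are the extra factor of n: each of the roughly n² cells does up to 2n units of work instead of a constant amount.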


Manhattan in 3 Layers


Affine Gap Penalties and 3 Layer Manhattan Grid

The three recurrences for the scoring algorithm create a 3-layered graph:
The top level creates/extends gaps in the sequence w
The bottom level creates/extends gaps in the sequence v
The middle level extends matches and mismatches


Switching between 3 Layers

Levels:
  The main level is for diagonal edges
  The lower level is for horizontal edges
  The upper level is for vertical edges

A jumping penalty (−σ) is assigned to moving from the main level to either the upper level or the lower level.
There is a gap extension penalty (−e) for each continuation on a level other than the main level.
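The three levels translate directly into three tables filled in together. The sketch below is one possible reading of the description above, assuming rows index v and columns index w, and that returning from the upper or lower level to the main level is free. The function name and default scores are illustrative assumptions.

```python
def global_affine_three_level(v, w, match=1, mismatch=-1, sigma=3, e=1):
    """Global alignment score with affine gaps in O(len(v) * len(w)) time."""
    n, m = len(v), len(w)
    neg_inf = float("-inf")
    main = [[neg_inf] * (m + 1) for _ in range(n + 1)]   # diagonal edges
    upper = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # vertical edges
    lower = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # horizontal edges
    main[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                # Stay on the upper level (-e) or jump up from main (-sigma).
                upper[i][j] = max(upper[i - 1][j] - e,
                                  main[i - 1][j] - sigma)
            if j > 0:
                # Stay on the lower level (-e) or jump down from main (-sigma).
                lower[i][j] = max(lower[i][j - 1] - e,
                                  main[i][j - 1] - sigma)
            if i == 0 and j == 0:
                continue
            diag = neg_inf
            if i > 0 and j > 0:
                delta = match if v[i - 1] == w[j - 1] else mismatch
                diag = main[i - 1][j - 1] + delta
            # Returning to the main level costs nothing.
            main[i][j] = max(diag, upper[i][j], lower[i][j])
    return main[n][m]


if __name__ == "__main__":
    print(global_affine_three_level("ACGGGT", "ACT"))
```

Each cell does a constant amount of work on each of the three levels, so the running time stays quadratic while the gap cost is still −σ − (g − 1)e for a gap of length g.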


The 3-leveled Manhattan Grid


Affine Gap Penalty Recurrences
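
The formulas on this slide did not survive extraction. One standard way to write the three-level recurrences, consistent with the −σ jumping penalty and the −e extension penalty, is given below. The superscripts "up" (vertical edges) and "low" (horizontal edges) are notation assumed here, not taken from the slide.

```latex
\[
\begin{aligned}
s^{\mathrm{up}}_{i,j}  &= \max \{\, s^{\mathrm{up}}_{i-1,j} - e,\;\; s_{i-1,j} - \sigma \,\} \\
s^{\mathrm{low}}_{i,j} &= \max \{\, s^{\mathrm{low}}_{i,j-1} - e,\;\; s_{i,j-1} - \sigma \,\} \\
s_{i,j} &= \max \{\, s_{i-1,j-1} + \delta(v_i, w_j),\;\; s^{\mathrm{up}}_{i,j},\;\; s^{\mathrm{low}}_{i,j} \,\}
\end{aligned}
\]
```

In each of the first two lines, the first term continues an existing gap and the second term opens a new one from the main level.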
