Sequence Alignment

Post on 18-Jan-2016


DESCRIPTION

Sequence Alignment. Arthur W. Chou, Tunghai University, Fall 2005. Input: two sequences over the same alphabet. Output: an alignment of the two sequences. Example: GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA. A possible alignment: -GCGC-ATGGATTGAGCGA over TGCGCCATTGAT-GACC-A.

Transcript

Sequence Alignment

Arthur W. Chou

Tunghai University

Fall 2005

Sequence Alignment

Input: two sequences over the same alphabet

Output: an alignment of the two sequences

Example: GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA

A possible alignment:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Why align sequences?

Lots of sequences don’t have known ancestry, structure, or function. A few of them do.

If they align, they are similar.

If they are similar, they might have the same ancestry, similar structure or function.

If one of them has known ancestry, structure, or function, then alignment to the others yields insight about them.

Alignments

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Each column of an alignment is one of three kinds:

Exact matches

Mismatches

Indels (gaps)

Choosing Alignments

There are many possible alignments.

For example, compare:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

to

------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Which one is better?

Scoring Alignments

Similar sequences evolved from a common ancestor.

Evolution changed the sequences from this ancestral sequence by mutations:

Replacement: one letter replaced by another

Deletion: deletion of a character

Insertion: insertion of a character

Scoring of sequence similarity should examine how many and which operations took place.

Simple Scoring Rule

Score each position independently:

Match: +1

Mismatch: -1

Indel: -2

The score of an alignment is the sum of its position scores.

Example

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Score: (+1 × 13) + (-1 × 2) + (-2 × 4) = 3

------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Score: (+1 × 5) + (-1 × 6) + (-2 × 12) = -25
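The scoring rule above can be checked mechanically. The following is a minimal sketch in Python, not part of the original slides; the function name `alignment_score` and its keyword parameters are assumptions chosen for illustration.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, indel=-2):
    """Sum the per-column scores of two gapped rows of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot hold two gaps")
        if x == "-" or y == "-":
            total += indel      # one character faces a gap
        elif x == y:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


print(alignment_score("-GCGC-ATGGATTGAGCGA",
                      "TGCGCCATTGAT-GACC-A"))  # prints 3
```

The second alignment on the slide can be scored the same way by passing its two rows.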

More General Scores

The choice of +1, -1, and -2 scores is quite arbitrary.

Depending on the context, some changes are more plausible than others:

Exchange of an amino acid by one with similar properties (size, charge, etc.) vs.

Exchange of an amino acid by one with opposite properties

Probabilistic interpretation: How likely is one alignment versus another?

Dot Matrix Method

A dot is placed at each position where two residues match.

It's a visual aid. The human eye can rapidly identify similar regions in sequences.

It's a good way to explore sequence organization: e.g. sequence repeats.

It does not provide an alignment.

[Figure: dot matrix of THEFATCAT (horizontal) against THEFASTCAT (vertical)]

THEFA-TCAT
||||| ||||
THEFASTCAT

This method produces dot-plots with too much noise to be useful.

The noise can be reduced by calculating a score using a window of residues.

The score is compared to a threshold or stringency.
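The window-and-threshold idea can be sketched as follows. This is an illustrative Python sketch, not the method used to produce the slides: the names `dot_plot` and `render` are hypothetical, and the window score here is simply the number of identical positions, whereas the slides score windows with a similarity matrix.

```python
def dot_plot(a, b, window=1, threshold=1):
    """Return the set of (i, j) such that the windows starting at a[i]
    and b[j] agree in at least `threshold` positions."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= threshold:
                dots.add((i, j))
    return dots


def render(a, b, dots):
    """Draw a on the horizontal axis and b on the vertical axis."""
    rows = ["  " + " ".join(a)]
    for j, ch in enumerate(b):
        cells = ["*" if (i, j) in dots else " " for i in range(len(a))]
        rows.append(ch + " " + " ".join(cells))
    return "\n".join(rows)


a, b = "THEFATCAT", "THEFASTCAT"
print(render(a, b, dot_plot(a, b)))                         # every single match
print(render(a, b, dot_plot(a, b, window=3, threshold=3)))  # filtered
```

With `window=1` the isolated matches of T and A scatter across the plot; with a window of three and a threshold of three only the two diagonal runs remain.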

Dot Matrix Representation

Produces a graphical representation of similarity regions.

The horizontal and vertical dimensions correspond to the compared sequences.

A region of similarity stands out as a diagonal.

[Figure: dot plot comparing Tissue-Type Plasminogen Activator with Urokinase-Type Plasminogen Activator]

Dot Matrix or Dot-plot

Each window of the first sequence is aligned (without gaps) to each window of the 2nd sequence.

A colour is set into a rectangular array according to the score of the aligned windows.

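[Figure: dot matrix of THEFATCAT (horizontal) against THEFASTCAT (vertical), with example windows of three residues and their scores]

THE aligned with THE: score 23

THE aligned with HEF: score -5

CAT aligned with THE: score -4

HEF aligned with THE: score -5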

Dot Matrix Display

Diagonal rows of dots reveal sequence similarity or repeats.

Anti-diagonal rows of dots represent inverted repeats.

Isolated dots represent random similarity.

[Figure: example dot matrix with H C G E T F G R W F T P E W along the top and K C G P T F G R I A C G E M down the side]

Dot matrix web server: http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

We can filter the plot by using a sliding window that looks for longer strings of matches and eliminates random matches.

Longest Common Subsequence

Sequence A: nematode_knowledge

Sequence B: empty_bottle

[Figure: the two sequences written one above the other, with vertical bars joining the matched characters]

LCS: alignment with match score 1, mismatch score 0, and gap penalty 0

What is an algorithm?

A step-by-step description of the procedures to accomplish a task.

Properties:

1. Determination of output for each input

2. Generality

3. Termination

Criteria:

1. Correctness (proof, test, etc.)

2. Time efficiency (no. of steps is small)

3. Space efficiency (space used is small)

Naïve algorithm: exhaustive search

G C G A A T G G A T T G A G C G T   (position i)

T G A G C C A T T G A T G A C C A   (position j)

Each candidate is a sequence of moves of i and j: i j j i j i j j i i . . .

For sequences of length “n”, the worst case time complexity is ~ 2^(2n).

Dynamic programming algorithms for pairwise sequence alignment

Similar to Longest Common Subsequence

Introduced for biological sequences by S. B. Needleman & C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443-453 (1970)

Dynamic Programming

Optimal substructure

Reduction to a “small” number of sub-problems

Memoization of solutions to sub-problems in a table

Table look-up and tracing

- G C G C - A T G G A T T G A G C G A

T G C G C C A T T G A T - G A C C - A

Optimality Sub-structure

Two-Yang Finger TM (二陽指)

G C G A A T G G A T T G A G C G T   (position i)

T G A G C C A T T G A T G A C C A   (position j)

sequences of length “n”

- G C G C - A T G G A T T G A G C G A

T G C G C C A T T G A T - G A C C - A

Recursive LCS

/* A and B are the two '\0'-terminated strings; max() returns the larger of two ints. */
int lcs_len(int i, int j) {
    if (A[i] == '\0' || B[j] == '\0') return 0;
    else if (A[i] == B[j])
        return 1 + lcs_len(i + 1, j + 1);
    else
        return max(lcs_len(i + 1, j), lcs_len(i, j + 1));
}

lcs_len( i , j ): length of LCS from i-th position onward in String A and from j-th position onward in String B
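To see why the plain recursion is impractical, one can count how often it is invoked. The sketch below is a Python transcription of the recursion with a call counter added; the name `count_calls` is an assumption made for illustration. The inputs share no characters, which forces both branches at every step.

```python
def count_calls(a, b):
    """Count invocations of the un-memoized LCS recursion on a and b."""
    calls = 0

    def lcs_len(i, j):
        nonlocal calls
        calls += 1
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + lcs_len(i + 1, j + 1)
        return max(lcs_len(i + 1, j), lcs_len(i, j + 1))

    lcs_len(0, 0)
    return calls


for n in (2, 4, 8):
    print(n, count_calls("a" * n, "b" * n))
# prints:
# 2 11
# 4 139
# 8 25739
```

For these inputs the count is 2·C(2n, n) - 1, which grows roughly like 4^n = 2^(2n), in line with the worst-case estimate for exhaustive search.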

Reduction to Subproblems

int lcs_len(String A, String B) {
    return subproblem(0, 0);
}

int subproblem(int i, int j) {
    if (A[i] == '\0' || B[j] == '\0') return 0;
    else if (A[i] == B[j])
        return 1 + subproblem(i + 1, j + 1);
    else
        return max(subproblem(i + 1, j), subproblem(i, j + 1));
}

Memoizing the solutions:

Matrix L[i, j] = -1;   // initialize every entry of the memo table to "not yet computed"

int subproblem(int i, int j) {
    if (L[i, j] < 0) {   // not computed yet
        if (A[i] == '\0' || B[j] == '\0') L[i, j] = 0;
        else if (A[i] == B[j])
            L[i, j] = 1 + subproblem(i + 1, j + 1);
        else
            L[i, j] = max(subproblem(i + 1, j), subproblem(i, j + 1));
    }
    return L[i, j];
}
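The same memoization can be expressed compactly in Python. This is a sketch rather than a transcription of the slide: it uses the standard library's `functools.lru_cache` as the memo table instead of an explicit matrix L, and tests for the end of each string by index instead of a terminating '\0'.

```python
from functools import lru_cache


def lcs_len(a, b):
    """Length of a longest common subsequence of a and b."""

    @lru_cache(maxsize=None)      # plays the role of the matrix L
    def subproblem(i, j):
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + subproblem(i + 1, j + 1)
        return max(subproblem(i + 1, j), subproblem(i, j + 1))

    return subproblem(0, 0)


print(lcs_len("nematode_knowledge", "empty_bottle"))  # prints 7
```

Each pair (i, j) is now evaluated at most once, so the work is proportional to the product of the two lengths.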

Iterative LCS: Table Look-up

To get the length of the LCS of A and B
{
    first allocate storage for the matrix L;
    for each row i from m downto 0
        for each column j from n downto 0
        {
            if (A[i] == '\0' or B[j] == '\0') L[i, j] = 0;
            else if (A[i] == B[j]) L[i, j] = 1 + L[i+1, j+1];
            else L[i, j] = max(L[i+1, j], L[i, j+1]);
        }
    return L[0, 0];
}

Iterative LCS: Table Look-up

int lcs_len(String A, String B)   // find the length
{
    // First allocate storage for the matrix L, with (m+1) x (n+1) entries.
    for (i = m; i >= 0; i--)          // A has m characters plus the terminating '\0'
        for (j = n; j >= 0; j--) {    // B has n characters plus the terminating '\0'
            if (A[i] == '\0' || B[j] == '\0') L[i, j] = 0;
            else if (A[i] == B[j]) L[i, j] = 1 + L[i+1, j+1];
            else L[i, j] = max(L[i+1, j], L[i, j+1]);
        }
    return L[0, 0];
}

Dynamic Programming Algorithm

L[i, j] = 1 + L[i+1, j+1], if A[i] == B[j];

L[i, j] = max(L[i+1, j], L[i, j+1]), otherwise

Matrix L (i indexes A, j indexes B):

           j             j+1
i       L[i, j]       L[i, j+1]
i+1     L[i+1, j]     L[i+1, j+1]

    n  e  m  a  t  o  d  e  _  k  n  o  w  l  e  d  g  e
e   7  7  6  5  5  5  5  5  4  3  3  3  2  2  2  1  1  1  0
m   6  6  6  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0
p   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0
t   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0
y   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0
_   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0
b   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0
o   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0
t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0
t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0
l   2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  0
e   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0
    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Obtain the subsequence

Sequence S = empty;   // the LCS
i = 0; j = 0;
while (i < m && j < n) {
    if (A[i] == B[j]) {
        add A[i] to end of S;
        i++; j++;
    } else
        if (L[i+1, j] >= L[i, j+1]) i++;
        else j++;
}
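Table filling and trace-back fit together as in the following Python sketch. It follows the pseudocode above but is only an illustration: the names `lcs_table` and `lcs` are assumptions, and the table is a list of lists with one extra row and column of zeros in place of the '\0' checks.

```python
def lcs_table(a, b):
    """Fill L bottom-up; L[i][j] is the LCS length of a[i:] and b[j:]."""
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]   # last row and column stay 0
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                L[i][j] = 1 + L[i + 1][j + 1]
            else:
                L[i][j] = max(L[i + 1][j], L[i][j + 1])
    return L


def lcs(a, b):
    """Walk the table from the top-left corner to recover one LCS."""
    L = lcs_table(a, b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif L[i + 1][j] >= L[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


print(len(lcs("nematode_knowledge", "empty_bottle")))  # prints 7
```

When several longest common subsequences exist, the tie-breaking rule in the `elif` decides which one is returned.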

[Figure: grid graph of the subproblems for nematode_knowledge (columns) against empty_bottle (rows); nodes are joined by horizontal and vertical edges, with a diagonal edge wherever the two characters match]

Dynamic Programming with scores and penalties

[Figure: the sequences V A D L …….. K T A and N A K M …. D A L T, with positions i and j and gap lengths x and y marked]

S[i, j] is the best score from the i-th position in A and the j-th position in B onward:

S[i, j] = max of
    s(A[i], B[j]) + S[i+1, j+1]
    max { S[i+x, j] - w(x) : gap x in sequence B }
    max { S[i, j+y] - w(y) : gap y in sequence A }

w : penalty function

s : score

Algorithm for simple gap penalty

If the penalty for each gap is a fixed constant “c”, then

S[i, j] = max of
    s(A[i], B[j]) + S[i+1, j+1]
    S[i+1, j] - c    // one gap
    S[i, j+1] - c    // one gap
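The recurrence with a constant penalty c can be sketched in Python as below. This is an illustration under stated assumptions, not the lecturer's implementation: the function name `global_score` and the scoring function `simple` are hypothetical, and the boundary row and column charge c for every leftover character, which means end gaps are penalized like any other gap.

```python
def global_score(a, b, s, c):
    """Best global alignment score; S[i][j] scores a[i:] against b[j:]."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        S[i][n] = S[i + 1][n] - c       # rest of a faces only gaps
    for j in range(n - 1, -1, -1):
        S[m][j] = S[m][j + 1] - c       # rest of b faces only gaps
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            S[i][j] = max(
                s(a[i], b[j]) + S[i + 1][j + 1],   # align a[i] with b[j]
                S[i + 1][j] - c,                   # a[i] faces a gap
                S[i][j + 1] - c,                   # b[j] faces a gap
            )
    return S[0][0]


def simple(x, y):
    """Match +1, mismatch -1, as in the simple scoring rule."""
    return 1 if x == y else -1


print(global_score("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA", simple, 2))
```

Because the hand-made alignment shown earlier scores 3 under the same rule, the value printed here cannot be smaller than 3.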

Table Tracing

To do table tracing based on a similarity matrix of amino acids, we re-define S[i, j] to be the optimal score of choosing the match of A[i] with B[j].

S[i, j] = s(A[i], B[j]) + max of       // s : score, w : gap penalty
    S[i+1, j+1]
    max { S[i+1+x, j+1] - w(x) : gap x in sequence B }
    max { S[i+1, j+1+y] - w(y) : gap y in sequence A }

Diagram

[Diagram: in matrix S, the cell at row i, column j holds s[i, j]; the cell diagonally below it, at row i+1, column j+1, holds S[i+1, j+1]]

Summation operation

1. Start at lower right corner.

2. Move diagonally up one position.

3. Find largest value in either

    the row segment starting diagonally below the current position and extending to the right, or

    the column segment starting diagonally below the current position and extending down.

4. Add this value to the value in the current cell.

5. Repeat steps 3 and 4 for all cells to the left in current row and all cells above in current column.

6. If we are not in the top left corner, go to step 2.
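The summation can be sketched in Python as follows. This is a hedged illustration of the recurrence for the re-defined S[i, j], not the original program: the names `summation_matrix` and `match_score` are assumptions, cells in the last row and column simply hold s(A[i], B[j]), and the order of evaluation (bottom-up, right to left) differs from the step order above while producing the same values.

```python
def summation_matrix(a, b, s, w):
    """S[i][j]: best score of a path that starts by pairing a[i] with b[j]."""
    m, n = len(a), len(b)
    S = [[0] * n for _ in range(m)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            best = 0                                  # last row or column
            if i + 1 < m and j + 1 < n:
                best = S[i + 1][j + 1]                # no gap
                for x in range(1, m - i - 1):         # column segment, downward
                    best = max(best, S[i + 1 + x][j + 1] - w(x))
                for y in range(1, n - j - 1):         # row segment, rightward
                    best = max(best, S[i + 1][j + 1 + y] - w(y))
            S[i][j] = s(a[i], b[j]) + best
    return S


def match_score(a, b, s, w):
    """Highest value in the first row or first column of S."""
    S = summation_matrix(a, b, s, w)
    return max(max(S[0]), max(row[0] for row in S))


# With match 1, mismatch 0 and no gap penalty this reduces to the LCS length.
print(match_score("nematode_knowledge", "empty_bottle",
                  lambda x, y: 1 if x == y else 0,
                  lambda gap: 0))  # prints 7
```

Scanning a whole row and column segment for every cell makes this cubic in the sequence length for a general penalty function w.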

[Figure: the alignment traced step by step]

----V
HGQKV

----VA
HGQKVA

----VADALTK
HGQKVADALTK

----VADALTKPVNFKFA
HGQKVADALTK------A

----VADALTKPVNFKFAVAH
HGQKVADALTK------AVAH

Use of dynamic programming to evaluate homology between pairs of sequences

If we just want to know the maximum match possible between two sequences, then we don’t need to do trace-back but can just look at the highest value in the first row or column (“match score”). This represents the best possible alignment score.

Gap penalty alternatives:

constant gap penalty for gap > 1

gap penalty proportional to gap size (affine gap penalty)

    one penalty for starting a gap (gap opening penalty)

    different (lower) penalty for adding to a gap (gap extension penalty)

dynamic programming algorithm can be made more efficient
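The alternatives above differ only in the penalty function w. A small Python sketch makes the difference concrete; the function names and the numeric values are hypothetical, and the affine form assumed here charges the opening penalty for the first gap position and the extension penalty for each further one (other conventions exist).

```python
def constant_gap(length, penalty=2):
    """Same penalty for a gap of any positive length."""
    return penalty if length > 0 else 0


def linear_gap(length, per_position=2):
    """Penalty strictly proportional to the gap length."""
    return per_position * length


def affine_gap(length, gap_open=5, gap_extend=1):
    """Opening penalty for the first position, a lower one for each extra position."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


for length in (1, 2, 4):
    print(length, constant_gap(length), linear_gap(length), affine_gap(length))
# prints:
# 1 2 2 5
# 2 2 4 6
# 4 2 8 8
```

Any of these can be passed as the penalty function w in the recurrences of the earlier slides.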

Gap penalty alternatives (cont.)

gap penalty proportional to gap size and sequence

    for nucleic acids, can be used to mimic thermodynamics of helix formation.

    two kinds of gap opening penalties

        one for gap closed by AT, different for GC.

    different gap extension penalty.

End gaps

Some programs treat end gaps as normal gaps and apply penalties; other programs do not apply penalties for end gaps.

End gaps (cont.)

We can determine which a program does by adding extra (unmatched) bases to the end of one sequence and seeing if the match score changes.

Penalties for end gaps are appropriate for aligned sequences where the ends "should match".

Penalties for end gaps are inappropriate when surrounding sequences are expected to be different (e.g., conserved exon surrounded by varying introns).

Global vs. Local Similarity

Should the result of the alignment include all amino acids of the proteins, or just those that match?

If all of them, a global alignment is desired.

If only those that match, a local alignment is desired.

Local alignment is accomplished by including negative scores for “mismatched” positions, so scores get worse as we move away from the region of match.

Instead of starting trace-back with the highest value in the first row or column, start with the highest value in the entire matrix and stop when the score hits zero.

Local Alignment

From the ‘i-th’ pos. in A and the ‘j-th’ pos. in B onward:

H[i, j] = max of
    s(A[i], B[j]) + H[i+1, j+1]
    max { H[i+x, j] - w(x) : gap x in sequence B }
    max { H[i, j+y] - w(y) : gap y in sequence A }
    0

H[i, j] is the best score of any prefix of the subsequence from i, j onward.

w : penalty function

s : score
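A minimal Python sketch of the local recurrence follows. It is an illustration only: the name `local_score` is an assumption, the gap penalty is simplified to a constant c per gap position, and the substitution score used in the example (+1 for a match, -1 otherwise) is the simple rule from the start of the lecture.

```python
def local_score(a, b, s, c):
    """Best local alignment score; H[i][j] never drops below 0."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            H[i][j] = max(
                0,                                   # stop the alignment here
                s(a[i], b[j]) + H[i + 1][j + 1],     # pair a[i] with b[j]
                H[i + 1][j] - c,                     # a[i] faces a gap
                H[i][j + 1] - c,                     # b[j] faces a gap
            )
            best = max(best, H[i][j])                # highest value in the matrix
    return best


print(local_score("XXXACGTYYY", "ZZACGTWW",
                  lambda x, y: 1 if x == y else -1, 2))  # prints 4
```

The shared block ACGT is found although the flanking characters differ, because the zero option lets the alignment start and stop wherever the score would otherwise turn negative.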
