Sequence Alignment

Arthur W. Chou, Tunghai University, Fall 2005

Jan 18, 2016

Page 1: Sequence Alignment

Sequence Alignment

Arthur W. Chou

Tunghai University

Fall 2005

Page 2: Sequence Alignment

Sequence Alignment

Input: two sequences over the same alphabet

Output: an alignment of the two sequences

Example:

GCGCATGGATTGAGCGA
TGCGCCATTGATGACCA

A possible alignment:

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

Page 3: Sequence Alignment

Why align sequences?

• Lots of sequences don't have known ancestry, structure, or function. A few of them do.
• If they align, they are similar.
• If they are similar, they might have the same ancestry, similar structure or function.
• If one of them has known ancestry, structure, or function, then alignment to the others yields insight about them.

Page 4: Sequence Alignment

Alignments

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

Each aligned position is one of three kinds:

• Exact matches
• Mismatches
• Indels (gaps)

Page 5: Sequence Alignment

Choosing Alignments

There are many possible alignments.

For example, compare:

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

to

------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--

Which one is better?

Page 6: Sequence Alignment

Scoring Alignments

• Similar sequences evolved from a common ancestor.
• Evolution changed the sequences from this ancestral sequence by mutations:
    • Replacement: one letter replaced by another
    • Deletion: deletion of a character
    • Insertion: insertion of a character
• Scoring of sequence similarity should examine how many and which operations took place.

Page 7: Sequence Alignment

Simple Scoring Rule

Score each position independently:

• Match: +1
• Mismatch: -1
• Indel: -2

The score of an alignment is the sum of its position scores.

Page 8: Sequence Alignment

Example

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

Score: (+1 × 13) + (-1 × 2) + (-2 × 4) = 3

------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--

Score: (+1 × 5) + (-1 × 6) + (-2 × 12) = -25
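The simple scoring rule can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the function name `score_alignment` and its parameter names are illustrative assumptions, and the default values are the +1 / -1 / -2 scores from the rule above.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, indel=-2):
    """Sum the per-column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += indel        # a gap in either row
        elif x == y:
            total += match        # identical letters
        else:
            total += mismatch     # different letters
    return total


print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # 3
```

Because each column is scored independently, the same function can be used to compare any two candidate alignments of the same pair of sequences.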

Page 9: Sequence Alignment

More General Scores

• The choice of +1, -1, and -2 scores is quite arbitrary.
• Depending on the context, some changes are more plausible than others:
    • exchange of an amino acid by one with similar properties (size, charge, etc.)
      vs.
    • exchange of an amino acid by one with opposite properties
• Probabilistic interpretation: how likely is one alignment versus another?

Page 10: Sequence Alignment

Dot Matrix Method

• A dot is placed at each position where two residues match.
• It's a visual aid. The human eye can rapidly identify similar regions in sequences.
• It's a good way to explore sequence organization, e.g. sequence repeats.
• It does not provide an alignment.

[Figure: dot plot of THEFATCAT (horizontal) against THEFASTCAT (vertical)]

THEFA-TCAT
||||| ||||
THEFASTCAT

• This method produces dot plots with too much noise to be useful.
• The noise can be reduced by calculating a score using a window of residues.
• The score is compared to a threshold or stringency.
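The window-and-threshold idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method of any particular dot-plot program: the names `dot_plot` and `render` are hypothetical, and the window score is simply the number of identical positions (a real tool would typically use a substitution matrix).

```python
def dot_plot(seq_a, seq_b, window=1, threshold=1):
    """Return the set of (i, j) cells where the windows starting at seq_a[i]
    and seq_b[j] contain at least `threshold` identical positions."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= threshold:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + " ".join(seq_a)]
    for j, ch in enumerate(seq_b):
        row = ["*" if (i, j) in dots else "." for i in range(len(seq_a))]
        lines.append(ch + " " + " ".join(row))
    return "\n".join(lines)


plain = dot_plot("THEFATCAT", "THEFASTCAT")
filtered = dot_plot("THEFATCAT", "THEFASTCAT", window=3, threshold=3)
print(render("THEFATCAT", "THEFASTCAT", plain))
print(render("THEFATCAT", "THEFASTCAT", filtered))
```

With `window=1` every single-letter coincidence produces a dot; with a window of three and a threshold of three, only the cells that start a run of three consecutive matches survive, which leaves the two diagonal segments on either side of the inserted S.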

Page 11: Sequence Alignment

Dot Matrix Representation

• Produces a graphical representation of similarity regions.
• The horizontal and vertical dimensions correspond to the compared sequences.
• A region of similarity stands out as a diagonal.

[Figure: dot plot of Tissue-Type plasminogen Activator (horizontal) against Urokinase-Type plasminogen Activator (vertical)]

Page 12: Sequence Alignment

Dot Matrix or Dot-plot

• Each window of the first sequence is aligned (without gaps) to each window of the second sequence.
• A colour is set in a rectangular array according to the score of the aligned windows.

[Figure: dot plot of THEFATCAT (horizontal) against THEFASTCAT (vertical)]

Example window scores:

THE aligned with THE: score 23
THE aligned with HEF: score -5
CAT aligned with THE: score -4
HEF aligned with THE: score -5

Page 13: Sequence Alignment

Dot Matrix Display

• Diagonal rows of dots reveal sequence similarity or repeats.
• Anti-diagonal rows of dots represent inverted repeats.
• Isolated dots represent random similarity.

[Figure: dot plot of HCGETFGRWFTPEW (horizontal) against KCGPTFGRIACGEM (vertical)]

Page 14: Sequence Alignment

Dot matrix web server: http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

We can filter the plot by using a sliding window that looks for longer strings of matches and eliminates random matches.

Page 15: Sequence Alignment

Longest Common Subsequence

Sequence A: nematode_knowledge
Sequence B: empty_bottle

n e m a t o d e _ k n o w l e d g e
e m p t y _ b o t t l e

[Figure: lines join the matched letters of the two sequences]

LCS alignment with match score 1, mismatch score 0, and gap penalty 0

Page 16: Sequence Alignment

What is an algorithm?

A step-by-step description of the procedures to accomplish a task.

Properties:
1. Determination of output for each input
2. Generality
3. Termination

Criteria:
1. Correctness (proof, test, etc.)
2. Time efficiency (no. of steps is small)
3. Space efficiency (space used is small)

Page 17: Sequence Alignment

Naïve algorithm: exhaustive search

G C G A A T G G A T T G A G C G T      (pointer i)
T G A G C C A T T G A T G A C C A      (pointer j)

Sequence of pointer moves: i j j i j i j j i i ...

Worst-case time complexity is ~ 2^(2n) for sequences of length n.

Page 18: Sequence Alignment

Dynamic programming algorithms for pairwise sequence alignment

• Similar to Longest Common Subsequence
• Introduced for biological sequences by S. B. Needleman & C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443-453 (1970)

Page 19: Sequence Alignment

Dynamic Programming

• Optimal substructure
• Reduction to a "small" number of sub-problems
• Memoization of solutions to sub-problems in a table
• Table look-up and tracing

Optimal substructure:

- G C G C - A T G G A T T G A G C G A
T G C G C C A T T G A T - G A C C - A

Page 20: Sequence Alignment

Two-Yang Finger (二陽指) TM

G C G A A T G G A T T G A G C G T      (pointer i)
T G A G C C A T T G A T G A C C A      (pointer j)

sequences of length n

- G C G C - A T G G A T T G A G C G A
T G C G C C A T T G A T - G A C C - A

Page 21: Sequence Alignment

Recursive LCS

// lcs_len(i, j): length of the LCS from the i-th position onward in String A
// and from the j-th position onward in String B
int lcs_len(int i, int j) {
    if (A[i] == '\0' || B[j] == '\0')
        return 0;
    else if (A[i] == B[j])
        return 1 + lcs_len(i + 1, j + 1);
    else
        return max(lcs_len(i + 1, j), lcs_len(i, j + 1));
}
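To see why the plain recursion is too slow, it helps to count how often it calls itself. The Python sketch below follows the same recursion; the call counter and the name `lcs_len_naive` are additions for illustration, and the end-of-string test `i == len(a)` stands in for the `'\0'` check.

```python
def lcs_len_naive(a, b):
    """Return (LCS length, number of recursive calls) without any memo table."""
    calls = 0

    def rec(i, j):
        nonlocal calls
        calls += 1
        if i == len(a) or j == len(b):      # plays the role of the '\0' test
            return 0
        if a[i] == b[j]:
            return 1 + rec(i + 1, j + 1)
        return max(rec(i + 1, j), rec(i, j + 1))

    return rec(0, 0), calls


for n in (2, 4, 8):
    # Two strings with no letter in common force both branches at every step.
    length, calls = lcs_len_naive("a" * n, "b" * n)
    print(n, length, calls)
```

When the strings share no letters, every call branches twice, so the number of calls grows exponentially with the string length even though there are only (n+1)² distinct (i, j) pairs.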

Page 22: Sequence Alignment

Reduction to Subproblems

int lcs_len(String A, String B) {
    return subproblem(0, 0);
}

int subproblem(int i, int j) {
    if (A[i] == '\0' || B[j] == '\0')
        return 0;
    else if (A[i] == B[j])
        return 1 + subproblem(i + 1, j + 1);
    else
        return max(subproblem(i + 1, j), subproblem(i, j + 1));
}

Page 23: Sequence Alignment

Memoizing the solutions:

Matrix L[i, j] = -1;    // initialize the memo table: -1 means "not yet computed"

int subproblem(int i, int j) {
    if (L[i, j] < 0) {
        if (A[i] == '\0' || B[j] == '\0')
            L[i, j] = 0;
        else if (A[i] == B[j])
            L[i, j] = 1 + subproblem(i + 1, j + 1);
        else
            L[i, j] = max(subproblem(i + 1, j), subproblem(i, j + 1));
    }
    return L[i, j];
}
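The effect of the memo table can be measured directly. In this Python sketch, `functools.lru_cache` takes the place of the matrix L (an implementation choice made here, not the slides' data structure), and the name `lcs_len_memo` is hypothetical.

```python
from functools import lru_cache


def lcs_len_memo(a, b):
    """Top-down LCS length; each (i, j) subproblem is solved at most once."""

    @lru_cache(maxsize=None)
    def sub(i, j):
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + sub(i + 1, j + 1)
        return max(sub(i + 1, j), sub(i, j + 1))

    result = sub(0, 0)
    # cache misses = number of subproblems that actually had to be computed
    return result, sub.cache_info().misses


print(lcs_len_memo("nematode_knowledge", "empty_bottle"))
```

The second value returned can never exceed (m+1)(n+1), the number of cells in the table, regardless of how the two strings relate to each other.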

Page 24: Sequence Alignment

Iterative LCS: Table Look-up

To get the length of the LCS of A and B:

{
    first allocate storage for the matrix L;
    for each row i from m downto 0
        for each column j from n downto 0
            if (A[i] == '\0' or B[j] == '\0') L[i, j] = 0;
            else if (A[i] == B[j]) L[i, j] = 1 + L[i+1, j+1];
            else L[i, j] = max(L[i+1, j], L[i, j+1]);
    return L[0, 0];
}

Page 25: Sequence Alignment

Iterative LCS: Table Look-up

int lcs_len(String A, String B)    // find the length
{
    // First allocate storage for the matrix L: (m+1) rows by (n+1) columns
    for (i = m; i >= 0; i--)           // A has length m+1 (m characters plus '\0')
        for (j = n; j >= 0; j--) {     // B has length n+1 (n characters plus '\0')
            if (A[i] == '\0' || B[j] == '\0') L[i, j] = 0;
            else if (A[i] == B[j]) L[i, j] = 1 + L[i+1, j+1];
            else L[i, j] = max(L[i+1, j], L[i, j+1]);
        }
    return L[0, 0];
}
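The same table can be filled in Python with explicit storage. This is a sketch under the assumption that L is a list of lists with one extra all-zero row and column playing the role of the `'\0'` cells; the name `lcs_table` is illustrative.

```python
def lcs_table(a, b):
    """Fill L bottom-up; L[i][j] is the LCS length of a[i:] and b[j:]."""
    m, n = len(a), len(b)
    # Row m and column n correspond to the end-of-string cells and stay 0.
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                L[i][j] = 1 + L[i + 1][j + 1]
            else:
                L[i][j] = max(L[i + 1][j], L[i][j + 1])
    return L


L = lcs_table("nematode_knowledge", "empty_bottle")
print(L[0][0])  # 7
```

Each cell is computed once from three already-filled neighbours, so the running time and the storage are both proportional to (m+1)(n+1).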

Page 26: Sequence Alignment

Dynamic Programming Algorithm

L[i, j] = 1 + L[i+1, j+1]              if A[i] == B[j];
L[i, j] = max(L[i+1, j], L[i, j+1])    otherwise

Matrix L (rows indexed by positions in A, columns by positions in B):

          j            j+1
i      L[i, j]      L[i, j+1]
i+1    L[i+1, j]    L[i+1, j+1]

Page 27: Sequence Alignment

    n  e  m  a  t  o  d  e  _  k  n  o  w  l  e  d  g  e

e   7  7  6  5  5  5  5  5  4  3  3  3  2  2  2  1  1  1  0
m   6  6  6  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0
p   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0
t   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0
y   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0
_   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0
b   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0
o   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0
t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0
t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0
l   2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  0
e   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0
    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Page 28: Sequence Alignment

Obtain the subsequence

Sequence S = empty;    // the LCS
i = 0; j = 0;
while (i < m && j < n) {
    if (A[i] == B[j]) {
        add A[i] to end of S;
        i++; j++;
    } else if (L[i+1, j] >= L[i, j+1])
        i++;
    else
        j++;
}
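Putting the table and the trace together gives the subsequence itself. The Python sketch below is one possible rendering; the name `lcs` is an assumption, and when ties occur the `>=` comparison decides which of several equally long subsequences is returned.

```python
def lcs(a, b):
    """Return one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                L[i][j] = 1 + L[i + 1][j + 1]
            else:
                L[i][j] = max(L[i + 1][j], L[i][j + 1])

    # Walk forward from (0, 0), repeating the choices that produced L[0][0].
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif L[i + 1][j] >= L[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


print(lcs("nematode_knowledge", "empty_bottle"))
```

The trace visits at most m + n cells, so recovering the subsequence is cheap once the table exists.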

Page 29: Sequence Alignment

        n e m a t o d e _ k n o w l e d g e

    e  o-o o-o-o-o-o-o o-o-o-o-o-o-o o-o-o o
       |  \| | | | |  \| | | | | |  \| |  \|
    m  o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
       | |  \| | | | | | | | | | | | | | | |
    p  o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
       | | | | | | | | | | | | | | | | | | |
    t  o-o-o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o
       | | | |  \| | | | | | | | | | | | | |
    y  o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
       | | | | | | | | | | | | | | | | | | |
    _  o-o-o-o-o-o-o-o-o o-o-o-o-o-o-o-o-o-o
       | | | | | | | |  \| | | | | | | | | |
    b  o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
       | | | | | | | | | | | | | | | | | | |
    o  o-o-o-o-o-o o-o-o-o o o o-o-o-o-o-o-o
       | | | | |  \| | | | |  | | | | | | |
    t  o-o-o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o
       | | | |  \| | | | | | | | | | | | |
    t  o-o-o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o
       | | | |  \| | | | | | | | | | | | | |
    l  o-o-o-o-o-o-o-o-o-o-o-o-o-o o-o-o-o-o
       | | | | | | | | | | | | |  \| | | | |
    e  o-o o-o-o-o-o-o o-o-o-o-o-o-o o-o-o o
       |  \| | | | |  \| | | | | |  \| |  \|
       o o o o o o o o o o o o o o o o o o o

Page 30: Sequence Alignment

Dynamic Programming with scores and penalties

[Figure: the sequences V A D L ... K T A and N A K M ... D A L T, with positions i and j and gap lengths x and y marked]

Page 31: Sequence Alignment

Dynamic Programming with scores and penalties

S[i, j]: best score from the i-th position in A and the j-th position in B onward

                { s(A[i], B[j]) + S[i+1, j+1]
S[i, j] = max   { max over x of { S[i+x, j] - w(x) }    // gap of length x in sequence B
                { max over y of { S[i, j+y] - w(y) }    // gap of length y in sequence A

w : penalty function
s : score

Page 32: Sequence Alignment

Algorithm for simple gap penalty

If the penalty for each gap is a fixed constant c, then

                { s(A[i], B[j]) + S[i+1, j+1]
S[i, j] = max   { S[i+1, j] - c      // one gap
                { S[i, j+1] - c      // one gap
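The constant-penalty recurrence translates almost line for line into a table fill. The Python sketch below makes two assumptions that the slide does not state: the boundary cells charge c for every remaining letter once one sequence is exhausted (so end gaps are penalized), and s is the simple +1 / -1 score used earlier. The name `align_score` is illustrative.

```python
def align_score(a, b, match=1, mismatch=-1, c=2):
    """S[i][j]: best score for aligning a[i:] with b[j:], each gap position costing c."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Assumed boundary: the leftover letters of one sequence face gaps.
    for i in range(m - 1, -1, -1):
        S[i][n] = S[i + 1][n] - c
    for j in range(n - 1, -1, -1):
        S[m][j] = S[m][j + 1] - c
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            s = match if a[i] == b[j] else mismatch
            S[i][j] = max(s + S[i + 1][j + 1],   # align a[i] with b[j]
                          S[i + 1][j] - c,       # a[i] against a gap
                          S[i][j + 1] - c)       # b[j] against a gap
    return S[0][0]


print(align_score("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA"))
```

Since the hand-made alignment of these two sequences scored 3 under the same rule, the optimal score printed here is at least 3.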

Page 33: Sequence Alignment

Table Tracing

To do table tracing based on a similarity matrix of amino acids, we re-define S[i, j] to be the optimal score when A[i] is matched with B[j].

                                { S[i+1, j+1]
S[i, j] = s(A[i], B[j]) + max   { max over x of { S[i+1+x, j+1] - w(x) }    // gap of length x in sequence B
                                { max over y of { S[i+1, j+1+y] - w(y) }    // gap of length y in sequence A

s : score
w : gap penalty

Page 34: Sequence Alignment

Diagram

[Figure: matrix S with rows i, i+1 and columns j, j+1; the cells s[i, j] and S[i+1, j+1] are marked]

Page 35: Sequence Alignment

Summation operation

1. Start at lower right corner.
2. Move diagonally up one position.
3. Find the largest value in either
   • the row segment starting diagonally below the current position and extending to the right, or
   • the column segment starting diagonally below the current position and extending down.
4. Add this value to the value in the current cell.
5. Repeat steps 3 and 4 for all cells to the left in the current row and all cells above in the current column.
6. If we are not in the top left corner, go to step 2.
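The steps above can be sketched in Python as follows. This is an illustration under stated assumptions: no gap penalty is applied (the steps do not mention one), both sequences are non-empty, cells are visited row by row from the bottom right rather than in the exact order of steps 2 to 6 (each cell still only reads cells that are already final), and the names `summed_matrix`, `best_match_score` and `identity` are hypothetical.

```python
def summed_matrix(a, b, s):
    """M[i][j]: best score of an alignment that pairs a[i] with b[j] and
    continues towards the lower right, with no gap penalty."""
    m, n = len(a), len(b)
    M = [[s(a[i], b[j]) for j in range(n)] for i in range(m)]
    for i in range(m - 2, -1, -1):
        for j in range(n - 2, -1, -1):
            # Row segment: starts diagonally below (i, j) and extends right.
            row_best = max(M[i + 1][j + 1:])
            # Column segment: starts diagonally below (i, j) and extends down.
            col_best = max(M[k][j + 1] for k in range(i + 1, m))
            M[i][j] += max(row_best, col_best)
    return M


def best_match_score(M):
    """Highest value in the first row or the first column."""
    return max(max(M[0]), max(row[0] for row in M))


def identity(p, q):
    return 1 if p == q else 0


M = summed_matrix("nematode_knowledge", "empty_bottle", identity)
print(best_match_score(M))
```

With the identity score (1 for a match, 0 otherwise) and no gap penalty, the result is the maximum number of matched letters, which coincides with the LCS length of the two strings.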

Page 36: Sequence Alignment
Page 37: Sequence Alignment
Page 38: Sequence Alignment
Page 39: Sequence Alignment
Page 40: Sequence Alignment
Page 41: Sequence Alignment
Page 42: Sequence Alignment
Page 43: Sequence Alignment
Page 44: Sequence Alignment
Page 45: Sequence Alignment
Page 46: Sequence Alignment
Page 47: Sequence Alignment
Page 48: Sequence Alignment
Page 49: Sequence Alignment
Page 50: Sequence Alignment
Page 51: Sequence Alignment

----V
HGQKV

Page 52: Sequence Alignment
Page 53: Sequence Alignment

----VA
HGQKVA

Page 54: Sequence Alignment

----VADALTK
HGQKVADALTK

Page 55: Sequence Alignment

----VADALTK
HGQKVADALTK

Page 56: Sequence Alignment

----VADALTKPVNFKFA
HGQKVADALTK------A

Page 57: Sequence Alignment

----VADALTKPVNFKFAVAH
HGQKVADALTK------AVAH

Page 58: Sequence Alignment

Use of dynamic programming to evaluate homology between pairs of sequences

If we just want to know the maximum match possible between two sequences, then we don't need to do the trace-back; we can just look at the highest value in the first row or column (the "match score"). This represents the best possible alignment score.

Page 59: Sequence Alignment

Gap penalty alternatives:

• constant gap penalty for gap > 1
• gap penalty proportional to gap size (affine gap penalty)
    • one penalty for starting a gap (gap opening penalty)
    • different (lower) penalty for adding to a gap (gap extension penalty)
• dynamic programming algorithm can be made more efficient
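Different gap penalties can be compared by passing the penalty as a function w(x) to the general recurrence with scores and penalties. The Python sketch below is the straightforward cubic-time version, not the more efficient algorithm the slide alludes to; the names `align_score_gaps`, `linear` and `affine`, and the penalty values 3 and 1, are hypothetical choices for illustration.

```python
def align_score_gaps(a, b, s, w):
    """Best global score for a vs b, where a gap of length x costs w(x)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m and j == n:
                continue                      # nothing left to align: score 0
            options = []
            if i < m and j < n:
                options.append(s(a[i], b[j]) + S[i + 1][j + 1])
            # gap of length x in b: a[i:i+x] is aligned against gaps
            options += [S[i + x][j] - w(x) for x in range(1, m - i + 1)]
            # gap of length y in a: b[j:j+y] is aligned against gaps
            options += [S[i][j + y] - w(y) for y in range(1, n - j + 1)]
            S[i][j] = max(options)
    return S[0][0]


def score(p, q):
    return 1 if p == q else -1


def linear(x):
    return 2 * x                              # every gap position costs 2


def affine(x, gap_open=3, gap_extend=1):
    return gap_open + gap_extend * (x - 1)    # opening cost, then cheaper extensions


print(align_score_gaps("AAAGGGTTT", "AAATTT", score, linear))
print(align_score_gaps("AAAGGGTTT", "AAATTT", score, affine))
```

Under the affine function a single gap of length three is cheaper than three separate gap positions, so the alignment that deletes GGG in one piece scores higher than it does under the linear penalty.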

Page 60: Sequence Alignment

Gap penalty alternatives (cont.)

• gap penalty proportional to gap size and sequence
    • for nucleic acids, can be used to mimic thermodynamics of helix formation
    • two kinds of gap opening penalties: one for a gap closed by AT, a different one for GC
    • different gap extension penalty

Page 61: Sequence Alignment

End gaps

Some programs treat end gaps as normal gaps and apply penalties; other programs do not apply penalties for end gaps.

Page 62: Sequence Alignment

End gaps (cont.)

• We can determine which a program does by adding extra (unmatched) bases to the end of one sequence and seeing if the match score changes.
• Penalties for end gaps are appropriate for aligned sequences where the ends "should match".
• Penalties for end gaps are inappropriate when the surrounding sequences are expected to be different (e.g., a conserved exon surrounded by varying introns).

Page 63: Sequence Alignment

Global vs. Local Similarity

• Should the result of the alignment include all amino acids of the proteins or just those that match?
    • If all of them, a global alignment is desired.
    • If just those that match, a local alignment is desired.
• Local alignment is accomplished by including negative scores for "mismatched" positions, so scores get worse as we move away from the region of match.
• Instead of starting the trace-back with the highest value in the first row or column, start with the highest value in the entire matrix and stop when the score hits zero.

Page 64: Sequence Alignment

Local Alignment

H[i, j]: best score of any prefix of the subsequence from the i-th position in A and the j-th position in B onward

                { s(A[i], B[j]) + H[i+1, j+1]
H[i, j] = max   { max over x of { H[i+x, j] - w(x) }    // gap of length x in sequence B
                { max over y of { H[i, j+y] - w(y) }    // gap of length y in sequence A
                { 0

w : penalty function
s : score
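The extra 0 in the recurrence is what lets an alignment stop as soon as continuing would lower the score. The Python sketch below simplifies the general penalty w to a constant cost c per gap position and uses the +1 / -1 score; both are assumptions made for brevity, and the name `local_score` is illustrative.

```python
def local_score(a, b, match=1, mismatch=-1, c=2):
    """Best local alignment score. H[i][j] is the best score of an alignment
    that starts at a[i], b[j] and may stop anywhere (0 means stop at once)."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            s = match if a[i] == b[j] else mismatch
            H[i][j] = max(0,                      # stop here
                          s + H[i + 1][j + 1],    # align a[i] with b[j]
                          H[i + 1][j] - c,        # a[i] against a gap
                          H[i][j + 1] - c)        # b[j] against a gap
            best = max(best, H[i][j])             # highest value in the whole matrix
    return best


print(local_score("XXXACGTXXX", "YYACGTYY"))  # 4
```

The answer is read from the highest cell anywhere in the matrix, so the unrelated flanking letters around the shared ACGT block do not reduce the score.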