Overview of Pairwise Sequence Alignment

1


• Dynamic Programming
    – Applied to optimization problems
    – Useful when
        • Problem can be recursively divided into sub-problems
        • Sub-problems are not independent

• Needleman-Wunsch is a global alignment technique that uses an iterative algorithm and no gap penalty (could extend to fixed gap penalty).

• Smith-Waterman is a local alignment technique that uses a recursive algorithm. The Smith-Waterman algorithm is an extension of the Longest Common Subsequence (LCS) problem and can be generalized to solve both local and global alignment.

Presenter: 林哲鋒

2

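The Longest Common Subsequence (LCS) Problem

• First, what is a subsequence? A subsequence is the sequence obtained by deleting some (possibly zero) characters from a sequence. For example, pred, sdn, and predent are all subsequences of "president".

• Given two sequences, the longest common subsequence (LCS) problem is to find a sequence that (1) is a subsequence of both sequences and (2) is as long as possible.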
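The definition of a subsequence can be checked mechanically. The following is a minimal Python sketch; the function name `is_subsequence` is a hypothetical helper chosen for illustration and does not come from the slides.

```python
def is_subsequence(candidate: str, sequence: str) -> bool:
    """Return True if `candidate` can be obtained by deleting characters from `sequence`."""
    remaining = iter(sequence)
    # `ch in remaining` consumes the iterator up to the first match,
    # so the characters must appear in the same left-to-right order.
    return all(ch in remaining for ch in candidate)


print(is_subsequence("pred", "president"))     # True
print(is_subsequence("sdn", "president"))      # True
print(is_subsequence("predent", "president"))  # True
print(is_subsequence("tn", "president"))       # False: no 'n' follows the 't'
```

A common subsequence of two sequences is then any string for which this check succeeds against both of them.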

3

LCS

For example:

Sequence 1: president

Sequence 2: providence

One LCS is priden (PResIDENt, PRovIDENce).

4

LCS

Another example:

Sequence 1: algorithm

Sequence 2: alignment

One LCS is algm or algt (ALGorithM, ALiGnMent).

5

How to compute LCS?

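• Given two sequences A = a_1 a_2 ... a_m and B = b_1 b_2 ... b_n, let len(i, j) denote the length of an LCS of a_1 a_2 ... a_i and b_1 b_2 ... b_j. The following recurrence computes len(i, j):

\[
len(i, j) =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0, \\
len(i-1, j-1) + 1 & \text{if } i, j > 0 \text{ and } a_i = b_j, \\
\max\bigl(len(i, j-1),\ len(i-1, j)\bigr) & \text{if } i, j > 0 \text{ and } a_i \neq b_j.
\end{cases}
\]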
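The recurrence can be evaluated directly by a memoized recursion. The sketch below is one possible Python rendering; the names `lcs_length` and `length` are illustrative, and Python's 0-based indexing means that a_i corresponds to `a[i - 1]`.

```python
from functools import lru_cache


def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence, computed top-down from the recurrence."""

    @lru_cache(maxsize=None)
    def length(i: int, j: int) -> int:
        # length(i, j) refers to the prefixes a[:i] and b[:j].
        if i == 0 or j == 0:
            return 0
        if a[i - 1] == b[j - 1]:
            return length(i - 1, j - 1) + 1
        return max(length(i, j - 1), length(i - 1, j))

    return length(len(a), len(b))


print(lcs_length("president", "providence"))  # 6
print(lcs_length("algorithm", "alignment"))   # 4
```

Memoization matters here because the sub-problems overlap: without the cache, the same (i, j) pair would be recomputed many times.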

6

procedure LCS-Length(A, B)

1. for i ← 0 to m do len(i, 0) = 0
2. for j ← 1 to n do len(0, j) = 0
3. for i ← 1 to m do
4.     for j ← 1 to n do
5.         if a_i = b_j then len(i, j) = len(i-1, j-1) + 1; prev(i, j) = "↖"
6.         else if len(i-1, j) ≥ len(i, j-1)
7.             then len(i, j) = len(i-1, j); prev(i, j) = "↑"
8.         else len(i, j) = len(i, j-1); prev(i, j) = "←"
9. return len and prev

The two non-matching branches (lines 7 and 8) correspond to an insertion and a deletion.

7

 i       j   0   1   2   3   4   5   6   7   8   9  10
                 p   r   o   v   i   d   e   n   c   e
 0           0   0   0   0   0   0   0   0   0   0   0
 1   p       0   1   1   1   1   1   1   1   1   1   1
 2   r       0   1   2   2   2   2   2   2   2   2   2
 3   e       0   1   2   2   2   2   2   3   3   3   3
 4   s       0   1   2   2   2   2   2   3   3   3   3
 5   i       0   1   2   2   2   3   3   3   3   3   3
 6   d       0   1   2   2   2   3   4   4   4   4   4
 7   e       0   1   2   2   2   3   4   5   5   5   5
 8   n       0   1   2   2   2   3   4   5   6   6   6
 9   t       0   1   2   2   2   3   4   5   6   6   6

Figure: Computing the LCS of president and providence with LCS-Length.

8

procedure Output-LCS(A, prev, i, j)

1. if i = 0 or j = 0 then return
2. if prev(i, j) = "↖" then Output-LCS(A, prev, i-1, j-1); print a_i
3. else if prev(i, j) = "↑" then Output-LCS(A, prev, i-1, j)
4. else Output-LCS(A, prev, i, j-1)
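The two procedures, LCS-Length and Output-LCS, can be sketched together in Python. This is a minimal illustration under a few assumptions: the direction markers are stored as the strings "diag", "up", and "left" (labels chosen here, not taken from the slides), ties between the two non-matching branches go to "up", and the traceback is written as a loop rather than a recursion.

```python
def lcs_length_table(a: str, b: str):
    """Bottom-up fill of the len and prev tables, following LCS-Length."""
    m, n = len(a), len(b)
    length = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 stay 0
    prev = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
                prev[i][j] = "diag"
            elif length[i - 1][j] >= length[i][j - 1]:
                length[i][j] = length[i - 1][j]
                prev[i][j] = "up"
            else:
                length[i][j] = length[i][j - 1]
                prev[i][j] = "left"
    return length, prev


def output_lcs(a: str, prev, i: int, j: int) -> str:
    """Follow prev back from (i, j); an iterative equivalent of Output-LCS."""
    chars = []
    while i > 0 and j > 0:
        if prev[i][j] == "diag":
            chars.append(a[i - 1])  # a_i belongs to the LCS
            i, j = i - 1, j - 1
        elif prev[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))


a, b = "president", "providence"
length, prev = lcs_length_table(a, b)
print(length[len(a)][len(b)])               # 6
print(output_lcs(a, prev, len(a), len(b)))  # priden
```

When several LCSs exist, as with algorithm and alignment, the tie-breaking rule decides which one the traceback reports.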

9

Figure: Traceback path of Output-LCS through the len table for president and providence; the darkly shaded cells (priden) mark where the LCS lies.

Output: priden

10

Identification of Common Molecular Subsequences

T. F. SMITH AND M. S. WATERMAN

J. Mol. Biol. (1981), 147, 195-197

11

ABSTRACT

• The identification of maximally homologous subsequences among sets of long sequences is an important problem.

• The goal is to find a pair of segments, one from each of two long sequences, such that no other pair of segments has greater similarity.

12

Algorithm

• The two molecular sequences are A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m.

• A similarity s(a, b) is given between sequence elements a and b.

• Deletions of length k are given weight W_k.

• Set up a matrix H. First set

H_{k0} = H_{0l} = 0 for 0 ≤ k ≤ n and 0 ≤ l ≤ m

13

Algorithm cont.

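• H_{ij} is the maximum similarity of two segments ending in a_i and b_j.

• These values are obtained from the relationship

\[
H_{ij} = \max\Bigl\{\, H_{i-1,j-1} + s(a_i, b_j),\ \max_{k \ge 1}\{H_{i-k,j} - W_k\},\ \max_{l \ge 1}\{H_{i,j-l} - W_l\},\ 0 \,\Bigr\}, \qquad 1 \le i \le n,\ 1 \le j \le m.
\]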

14

The formula for H_{ij} follows by considering the possibilities for ending the segments at any a_i and b_j.

• (1) If a_i and b_j are associated, the similarity is H_{i-1,j-1} + s(a_i, b_j).

• (2) If a_i is at the end of a deletion of length k, the similarity is H_{i-k,j} - W_k.

• (3) If b_j is at the end of a deletion of length l, the similarity is H_{i,j-l} - W_l.

• (4) Finally, a zero is included to prevent calculated negative similarity, indicating no similarity up to a_i and b_j.

15

• The pair of segments with maximum similarity is found by first locating the maximum element of H.

• The other matrix elements leading to this maximum value are then sequentially determined with a traceback procedure ending with an element of H equal to zero.
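The matrix fill and traceback can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the function name `smith_waterman`, the predecessor dictionary `back`, and the example parameters (match 8, mismatch -5, W_k = 3k) are assumptions chosen for the demonstration. The similarity s and the deletion weight W are passed in as functions so that any weighting scheme can be tried.

```python
def smith_waterman(a, b, s, w):
    """Fill H with general deletion weights w(k), then trace back from its maximum."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # H[k][0] = H[0][l] = 0
    back = {}                                  # (i, j) -> predecessor cell
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # (1) a_i and b_j are associated
            candidates = [(H[i - 1][j - 1] + s(a[i - 1], b[j - 1]), (i - 1, j - 1))]
            # (2) a_i ends a deletion of length k
            candidates += [(H[i - k][j] - w(k), (i - k, j)) for k in range(1, i + 1)]
            # (3) b_j ends a deletion of length l
            candidates += [(H[i][j - l] - w(l), (i, j - l)) for l in range(1, j + 1)]
            score, cell = max(candidates, key=lambda c: c[0])
            if score > 0:                      # (4) otherwise H[i][j] stays 0
                H[i][j] = score
                back[(i, j)] = cell
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)

    # Traceback: start at the maximum element and stop at an element equal to zero.
    i, j = best_cell
    top, bottom = [], []
    while H[i][j] > 0:
        pi, pj = back[(i, j)]
        if (pi, pj) == (i - 1, j - 1):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
        elif pj == j:                          # a[pi:i] is aligned against gaps
            for t in range(i, pi, -1):
                top.append(a[t - 1])
                bottom.append("-")
        else:                                  # b[pj:j] is aligned against gaps
            for t in range(j, pj, -1):
                top.append("-")
                bottom.append(b[t - 1])
        i, j = pi, pj
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Illustrative parameters (assumptions): match 8, mismatch -5, W_k = 3k.
score, top, bottom = smith_waterman(
    "CTTAACT", "CGGATCAT",
    s=lambda x, y: 8 if x == y else -5,
    w=lambda k: 3 * k,
)
print(score)   # 18
print(top)     # A-C-T
print(bottom)  # ATCAT
```

Because every gap length k and l is examined for each cell, this direct form takes time cubic in the sequence lengths; with a weight that is linear in k it reduces to the three-neighbor recurrence.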

16

• In the example of Figure 1, a match (a_i = b_j) scores s(a_i, b_j) = 1, and a mismatch scores minus one-third.

17

Local VS global alignment

18

Global Alignment vs. Local Alignment

• global alignment:

• local alignment:

19

Global Alignment vs. Local Alignment

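Local:

\[
s_{i,j} = \max
\begin{cases}
0 \\
s_{i-1,j} + w(a_i, -) \\
s_{i,j-1} + w(-, b_j) \\
s_{i-1,j-1} + w(a_i, b_j)
\end{cases}
\]

Global:

\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j} + w(a_i, -) \\
s_{i,j-1} + w(-, b_j) \\
s_{i-1,j-1} + w(a_i, b_j)
\end{cases}
\]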
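The two recurrences differ only in the zero floor and in how the first row and column are initialized, which a single function with a flag can make explicit. The sketch below is an illustration under assumptions: the name `alignment_score` and the default weights (match 8, mismatch -5, gap symbol -3) are chosen for the demonstration.

```python
def alignment_score(a: str, b: str, local: bool,
                    match: int = 8, mismatch: int = -5, gap: int = -3) -> int:
    """Best local or global alignment score with a fixed per-symbol gap weight."""
    rows, cols = len(a) + 1, len(b) + 1
    s = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are charged; local: the borders stay at 0.
        for i in range(1, rows):
            s[i][0] = s[i - 1][0] + gap
        for j in range(1, cols):
            s[0][j] = s[0][j - 1] + gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            w = match if a[i - 1] == b[j - 1] else mismatch
            value = max(s[i - 1][j] + gap,    # a_i against a gap
                        s[i][j - 1] + gap,    # b_j against a gap
                        s[i - 1][j - 1] + w)  # a_i against b_j
            if local:
                value = max(0, value)         # the extra 0 term of the local form
            s[i][j] = value
            best = max(best, value)
    # Local: maximum over the whole matrix; global: bottom-right corner.
    return best if local else s[-1][-1]


print(alignment_score("CTTAACT", "CGGATCAT", local=True))   # 18
print(alignment_score("CTTAACT", "CGGATCAT", local=False))  # 14
```

The local score can never be lower than the global one for the same weights, since any global alignment is also a candidate pair of segments.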

20

Local alignment example

Match: 8

Mismatch: -5

Gap symbol: -3

        C   G   G   A   T   C   A   T
    0   0   0   0   0   0   0   0   0
C   0   8   5   2   0   0   8   5   2
T   0   5   3   0   0   8   5   3  13
T   0   2   0   0   0   8   5   2  11
A   0   0   0   0   8   5   3  13  10
A   0   0   0   0   8   5   2  11   8
C   0   8   5   2   5   3  13  10   7
T   0   5   3   0   2  13  10   8  18

A - C - T
A T C A T

8 - 3 + 8 - 3 + 8 = 18

21

global alignment

• Needleman-Wunsch (1970)
• Three steps in dynamic programming
    – Initialization
    – Matrix fill (scoring)
    – Traceback (alignment)

• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
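The three steps can be sketched in Python with the weights listed above. The function name `needleman_wunsch` is illustrative, and the traceback prefers the diagonal move when several predecessors give the same score, which is an assumption; other tie-breaking rules can return a different alignment with the same score.

```python
def needleman_wunsch(a: str, b: str, match: int = 8, mismatch: int = -5, gap: int = -3):
    """Global alignment: initialization, matrix fill, traceback."""
    n, m = len(a), len(b)

    # 1. Initialization: a prefix aligned entirely against gap symbols.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # 2. Matrix fill (scoring).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + w,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # 3. Traceback (alignment) from the bottom-right corner to the origin.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            w = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + w:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, top, bottom = needleman_wunsch("CTTAACT", "CGGATCAT")
print(best)    # 14
print(top)     # CTTAAC-T
print(bottom)  # CGGATCAT
```

Unlike the local form, the traceback always runs back to the origin, so every symbol of both sequences appears in the output.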

22

Global alignment example 1

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14

C T T A A C - T
C G G A T C A T

8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14

23

Global alignment example 2

          G    A    A    T    C    T    G    C
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3   -5   -8  -11  -14   -4   -7  -10  -13
A   -6   -8    3    0   -3   -6   -9  -12  -15
A   -9  -11    0   11    8    5    2   -1   -4
T  -12  -14   -3    8   19   16   13   10    7
T  -15  -17   -6    5   16   14   24   21   18
G  -18   -7   -9    2   13   11   21   32   29
A  -21  -10    1   -1   10    8   18   29   27

C A A T - T G A
G A A T C T G C

-5 + 8 + 8 + 8 - 3 + 8 + 8 - 5 = 27

24

Affine gap penalties

• A gap of length k is penalized x + k·y, where x is the gap-open penalty and y is the gap-symbol penalty.

Three cases for alignment endings:

1. ...x
   ...x    an aligned pair

2. ...x
   ...-    a deletion

3. ...-
   ...x    an insertion

25

Affine gap penalties

• Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion.

• Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion.

• Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj.

26

Affine gap penalties

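\[
D(i, j) = \max
\begin{cases}
D(i-1, j) - y \\
S(i-1, j) - x - y
\end{cases}
\]

\[
I(i, j) = \max
\begin{cases}
I(i, j-1) - y \\
S(i, j-1) - x - y
\end{cases}
\]

\[
S(i, j) = \max
\begin{cases}
S(i-1, j-1) + w(a_i, b_j) \\
D(i, j) \\
I(i, j)
\end{cases}
\]

(A gap of length k is penalized x + k·y.)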

27

Affine gap penalties

• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
• Each gap is charged an extra gap-open penalty: -4.

C - - - T T A A C T
C G G A T C A - - T

+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12

Gap-open penalties for the two gaps: -4 -4

Alignment score: 12 - 4 - 4 = 4
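The affine scheme can be sketched in Python in two parts: a scorer for a given alignment, which reproduces the arithmetic above, and a fill of the S, D, and I tables. Both function names are illustrative. The boundary values for row 0 and column 0 are an assumption, since the slides do not state them; here a prefix aligned entirely against gaps is charged x + k·y.

```python
NEG_INF = float("-inf")


def score_alignment(top: str, bottom: str, match: int = 8, mismatch: int = -5,
                    x: int = 4, y: int = 3) -> int:
    """Score a given alignment: each gap symbol costs y, each new gap costs an extra x."""
    total = 0
    prev_gap = None                     # row that held a gap symbol in the previous column
    for p, q in zip(top, bottom):
        if p == "-" or q == "-":
            gap_row = "top" if p == "-" else "bottom"
            total -= y
            if gap_row != prev_gap:     # a new gap starts here
                total -= x
            prev_gap = gap_row
        else:
            total += match if p == q else mismatch
            prev_gap = None
    return total


def affine_global_score(a: str, b: str, match: int = 8, mismatch: int = -5,
                        x: int = 4, y: int = 3):
    """Best global score under the S / D / I recurrences."""
    n, m = len(a), len(b)
    S = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0
    for i in range(1, n + 1):           # assumed boundary: one gap of length i
        D[i][0] = S[i][0] = -(x + i * y)
    for j in range(1, m + 1):           # assumed boundary: one gap of length j
        I[0][j] = S[0][j] = -(x + j * y)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i - 1][j] - y, S[i - 1][j] - x - y)
            I[i][j] = max(I[i][j - 1] - y, S[i][j - 1] - x - y)
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + w, D[i][j], I[i][j])
    return S[n][m]


print(score_alignment("C---TTAACT", "CGGATCA--T"))  # 4
```

Calling `affine_global_score("CTTAACT", "CGGATCAT")` returns the best score over all alignments, which is at least as high as the score of any single alignment such as the one shown. Setting x = 0 recovers the fixed per-symbol gap penalty.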

28

END
