CS 5263 Bioinformatics

Lecture 4: Local Sequence Alignment, More Efficient Sequence Alignment Algorithms

Roadmap

• Review of last lecture

• Local sequence alignment

• More efficient sequence alignment algorithms

• Given a scoring scheme,
– Match: m
– Mismatch: -s
– Gap: -d

• We can easily compute an optimal alignment by dynamic programming

• Look at any column of an alignment between two sequences X = x1x2…xM, Y = y1y2…yN

• Only three cases:
– xi is aligned to yj
– xi is aligned to a gap
– yj is aligned to a gap

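\[
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + \sigma(x_i, y_j) \\
F(i-1, j) - d \\
F(i, j-1) - d
\end{cases}
\]

where \(\sigma(x_i, y_j)\) is m for a match and -s for a mismatch.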

[Figure: DP matrix with an optimal path from F(0,0) to F(M,N), and the corresponding alignment AGTA / A-TA]

Example

F(i,j)      j = 0   1   2   3   4
                    A   G   T   A
i = 0           0  -1  -2  -3  -4
    1   A      -1   1   0  -1  -2
    2   T      -2   0   0   1   0
    3   A      -3  -1  -1   0   2

Optimal alignment:

AGTA
A-TA
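The table above can be reproduced with a short script. This is a minimal sketch, not the course's reference implementation: the function name `needleman_wunsch` and the default scores (m = 1, s = 1, d = 1) are assumptions, chosen because they are consistent with the values shown in the example table.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Fill the global-alignment table F, then trace back one optimal alignment.

    x indexes the rows and y the columns, as in the example table.
    The scores m, s, d are illustrative defaults (an assumption).
    """
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d          # x1..xi aligned to gaps only
    for j in range(1, N + 1):
        F[0][j] = -j * d          # y1..yj aligned to gaps only
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma,   # xi aligned to yj
                          F[i - 1][j] - d,           # xi aligned to a gap
                          F[i][j - 1] - d)           # yj aligned to a gap

    # Trace back from F(M, N) to F(0, 0), preferring the diagonal move.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sigma = m if x[i - 1] == y[j - 1] else -s
            if F[i][j] == F[i - 1][j - 1] + sigma:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, ax, ay = needleman_wunsch("ATA", "AGTA")
print(F[3][4])   # score of the optimal global alignment
print(ay)
print(ax)
```

When several optimal alignments exist, this sketch reports only one of them; the tie-breaking order in the trace back is an arbitrary choice.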

Equivalent graph problem

[Figure: alignment graph for S1 and S2, with source (0,0) and sink (3,4)]

• Number of steps: length of the alignment

• Path length: alignment score

• Optimal alignment: find the longest path from (0, 0) to (3, 4)

• The general longest path problem cannot be solved by DP. The longest path on this graph can be found by DP because the graph has no cycles: every edge moves right, down, or diagonally.

Edge types in the graph:
– a gap in the 2nd sequence
– a gap in the 1st sequence
– match / mismatch

Value on vertical/horizontal line: -d
Value on diagonal: m or -s

Variants of the Needleman-Wunsch algorithm

• LCS: longest common subsequence
– No penalty for gaps or mutations
– Change: score function

• Overlapping variants
– No penalty for starting/ending gaps
– Change: initial / termination step

• Other variants
– cDNA-genome alignment
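As an illustration of the first variant, the LCS length falls out of the same recurrence once the score function is changed. The sketch below assumes match = 1 and mismatch = gap = 0; the function name `lcs_length` is hypothetical.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y.

    Same DP as global alignment, with an assumed score function:
    match = 1, mismatch = 0, gap = 0 (so nothing is ever penalized).
    """
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]   # borders stay 0: gaps are free
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = 1 if x[i - 1] == y[j - 1] else 0
            F[i][j] = max(F[i - 1][j - 1] + sigma,
                          F[i - 1][j],
                          F[i][j - 1])
    return F[M][N]


print(lcs_length("ATA", "AGTA"))
```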

Local alignment

The local alignment problem

Given two strings X = x1…xM and Y = y1…yN

Find substrings X’, Y’ whose similarity (optimal global alignment value) is maximum

e.g. X = abcxdex, Y = xxxcde
X’ = cxde
Y’ = c-de

Why local alignment

• Conserved regions may be a small part of the whole
– Global alignment might miss them if flanking “junk” outweighs similar regions

• Genes are shuffled between genomes

[Figure: segments labeled A, B, C, D in two genomes]

Naïve algorithm

for all substrings X’ of X and Y’ of Y
    Align X’ & Y’ via dynamic programming
    Retain pair with max value
end;
Output the retained pair

• Time: O(M²) choices for X’, O(N²) for Y’, O(MN) for DP, so O(M³N³) total.
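The brute-force loop above can be written out directly, which also gives a handy cross-check for faster algorithms on tiny inputs. This is only a sketch: the names `global_score` and `naive_local` are hypothetical, and the default scores (match 2, mismatch -1, gap -1) are an assumption.

```python
def global_score(a, b, m=2, s=1, d=1):
    """Optimal global alignment score of a and b (score only, two rows)."""
    prev = [-j * d for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [-i * d]
        for j in range(1, len(b) + 1):
            sigma = m if a[i - 1] == b[j - 1] else -s
            cur.append(max(prev[j - 1] + sigma, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def naive_local(x, y, m=2, s=1, d=1):
    """Try every pair of non-empty substrings and keep the best-scoring pair.

    The empty pair (score 0) is the starting point, so the result is never
    negative. Far too slow for real sequences.
    """
    best = (0, "", "")
    for i in range(len(x)):
        for k in range(i + 1, len(x) + 1):
            for j in range(len(y)):
                for l in range(j + 1, len(y) + 1):
                    score = global_score(x[i:k], y[j:l], m, s, d)
                    if score > best[0]:
                        best = (score, x[i:k], y[j:l])
    return best


print(naive_local("abcxdex", "xxxcde"))
```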

Reminder

• The overlap detection algorithm
– We do not give penalty to gaps in the ends

[Figure: two overlapping sequences, with a free gap at each end]

The local alignment idea

• Do not penalize the unaligned regions (gaps or mismatches)
• The alignment can start anywhere and end anywhere
• Strategy: whenever we get to some low similarity region (negative score), we restart a new alignment
– By resetting alignment score to zero

The Smith-Waterman algorithm

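Initialization: F(0, j) = F(i, 0) = 0

Iteration:

\[
F(i, j) = \max
\begin{cases}
0 \\
F(i-1, j) - d \\
F(i, j-1) - d \\
F(i-1, j-1) + \sigma(x_i, y_j)
\end{cases}
\]

where \(\sigma(x_i, y_j)\) is the match/mismatch score (m or -s).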

The Smith-Waterman algorithm

Termination:

1. If we want the best local alignment: \(F_{OPT} = \max_{i,j} F(i, j)\)

2. If we want all local alignments scoring > t: for all i, j with F(i, j) > t, trace back

Completed table for X = abcxdex (rows) and Y = xxxcde (columns)

Match: 2, Mismatch: -1, Gap: -1

         x  x  x  c  d  e
      0  0  0  0  0  0  0
   a  0  0  0  0  0  0  0
   b  0  0  0  0  0  0  0
   c  0  0  0  0  2  1  0
   x  0  2  2  2  1  1  0
   d  0  1  1  1  1  3  2
   e  0  0  0  0  0  2  5
   x  0  2  2  2  1  1  4

Trace back

[Figure: the table above with the trace-back paths marked, starting from the maximum score F(6, 6) = 5]

Two local alignments, each with score 5:

cxde
| ||
c-de

x-de
| ||
xcde

• No negative values in local alignment DP array

• Optimal local alignment will never have a gap on either end

• Local alignment: “Smith-Waterman”

• Global alignment: “Needleman-Wunsch”
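The Smith-Waterman recurrence and trace back can be sketched in a few lines. This is an illustrative sketch, not the lecture's code: the function name `smith_waterman` is hypothetical, and the defaults use the example scores (match 2, mismatch -1, gap -1).

```python
def smith_waterman(x, y, m=2, s=1, d=1):
    """Best local alignment of x and y under a linear gap penalty.

    Returns (score, aligned_x, aligned_y, F). Only one optimal alignment
    is reported even when several exist.
    """
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]   # first row and column are 0
    best, bi, bj = 0, 0, 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(0,                          # restart the alignment
                          F[i - 1][j - 1] + sigma,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j

    # Trace back from the maximum cell until a cell with value 0 is reached.
    ax, ay = [], []
    i, j = bi, bj
    while F[i][j] > 0:
        sigma = m if x[i - 1] == y[j - 1] else -s
        if F[i][j] == F[i - 1][j - 1] + sigma:
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay)), F


score, ax, ay, F = smith_waterman("abcxdex", "xxxcde")
print(score)
print(ax)
print(ay)
```

The trace back checks the diagonal first, then the vertical move, so on the example it reports the cxde / c-de alignment; a different tie-breaking order would report x-de / xcde instead.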

Analysis

• Time:
– O(MN) for finding the best alignment
– Time to report all alignments depends on the number of sub-optimal alignments

• Memory:
– O(MN)
– O(M+N) possible

More efficient alignment algorithms

• Given two sequences of length M, N

• Time: O(MN)
– ok

• Space: O(MN)
– bad
– 1 Mb seq x 1 Mb seq = 1000 GB memory

• Can we do better?

Bounded alignment

Good alignment should appear near the diagonal

Bounded Dynamic Programming

If we know that x and y are very similar

Assumption: # gaps(x, y) < k

Then, xi aligned to yj implies | i – j | < k

Bounded Dynamic Programming

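Initialization:

F(i, 0), F(0, j) undefined for i, j > k

Iteration:

For i = 1…M
    For j = max(1, i – k)…min(N, i + k)

\[
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + \sigma(x_i, y_j) \\
F(i, j-1) - d, & \text{if } j > i - k \\
F(i-1, j) - d, & \text{if } j < i + k
\end{cases}
\]

where \(\sigma(x_i, y_j)\) is the match/mismatch score (m or -s).

Termination: same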
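A small sketch of the banded recurrence, computing the score only. It follows the loop bounds given above (cells with | i – j | ≤ k); the function name `banded_global_score`, the dictionary-per-row storage, and the default scores are all assumptions made for readability rather than speed.

```python
NEG_INF = float("-inf")


def banded_global_score(x, y, k, m=1, s=1, d=1):
    """Global alignment score restricted to the band |i - j| <= k.

    Cells outside the band are treated as undefined (-infinity).
    Returns -infinity if the end cell (M, N) itself lies outside the band.
    """
    M, N = len(x), len(y)
    if abs(M - N) > k:
        return NEG_INF
    # prev maps column j -> F(i-1, j), only for columns inside the band.
    prev = {j: -j * d for j in range(min(N, k) + 1)}
    for i in range(1, M + 1):
        cur = {}
        if i <= k:
            cur[0] = -i * d                       # F(i, 0) is defined only for i <= k
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            cur[j] = max(prev.get(j - 1, NEG_INF) + sigma,
                         cur.get(j - 1, NEG_INF) - d,   # undefined when j = i - k
                         prev.get(j, NEG_INF) - d)      # undefined when j = i + k
        prev = cur
    return prev[N]


print(banded_global_score("ATA", "AGTA", k=1))
```

Each row touches at most 2k + 1 cells, which is where the O(kM) running time comes from. The result equals the unrestricted optimum only when some optimal alignment stays inside the band.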

[Figure: DP matrix with x1 … xM along one axis and y1 … yN along the other; only a band of width k around the diagonal is computed]

Analysis

• Time: O(kM) << O(MN)

• Space: O(kM) with some tricks

[Figure: the diagonal band of width 2k and length M, stored as a 2k × M array]

Linear space algorithm

• If all we need is the alignment score but not the alignment, easy!

We only need to keep two rows

(You only need one row, with a little trick)
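One way to read the "little trick" is to overwrite a single row in place while remembering the diagonal value that is about to be lost. The sketch below is an assumption about what the trick refers to; the function name `global_score_one_row` and the default scores are hypothetical.

```python
def global_score_one_row(x, y, m=1, s=1, d=1):
    """Optimal global alignment score using a single row of length N + 1."""
    N = len(y)
    row = [-j * d for j in range(N + 1)]      # row 0 of the table
    for i in range(1, len(x) + 1):
        diag = row[0]                         # F(i-1, 0)
        row[0] = -i * d                       # F(i, 0)
        for j in range(1, N + 1):
            up = row[j]                       # F(i-1, j), not yet overwritten
            sigma = m if x[i - 1] == y[j - 1] else -s
            row[j] = max(diag + sigma,        # F(i-1, j-1) + sigma
                         up - d,              # F(i-1, j) - d
                         row[j - 1] - d)      # F(i, j-1) - d
            diag = up                         # becomes F(i-1, j-1) for the next j
    return row[N]


print(global_score_one_row("ATA", "AGTA"))
```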

But how do we get the alignment?

Linear space algorithm

• When we finish, we know how we have aligned the ends of the sequences

Naïve idea: Repeat on the smaller subproblem F(M-1, N-1)

Time complexity: O((M+N)(MN))

[Figure: DP matrix from (0, 0) to (M, N), with the last characters xM and yN marked]

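Key observation: optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown.

• Longest path from (0, 0) to (6, 6) is max_k (LP(0,0,3,k) + LP(6,6,3,k)), where LP(a,b,c,d) is the length of the longest path between (a,b) and (c,d)

[Figure: alignment graph from (0,0) to (6,6), with candidate midpoints (3,0), (3,2), (3,4), (3,6) on row 3]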

Hirschberg’s idea

• Divide and conquer!

F(M/2, k) represents the best alignment between x1x2…xM/2 and y1y2…yk

Forward algorithm
Align x1x2…xM/2 with Y

[Figure: DP matrix with axes X and Y, row M/2 marked]

Backward Algorithm

B(M/2, k) represents the best alignment between reverse(xM/2+1…xM) and reverse(yk+1yk+2…yN)

Backward algorithm
Align reverse(xM/2+1…xM) with reverse(Y)

[Figure: DP matrix with axes X and Y, row M/2 marked]

Linear-space alignment

Using 2 rows of space for each pass (4 in total), we can compute

for k = 0…N, F(M/2, k), B(M/2, k)

Linear-space alignment

Now, we can find k* maximizing F(M/2, k) + B(M/2, k)

Also, we can trace the path exiting row M/2 from k*

Conclusion: In O(NM) time, O(N) space, we found the point where the optimal alignment path crosses row M/2

Linear-space alignment

• Iterate this procedure to the two sub-problems!

[Figure: the two sub-problems, of sizes M/2 × k* and M/2 × (N-k*)]
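The divide-and-conquer procedure can be sketched as a short recursive function. This is a sketch under stated assumptions, not the lecture's implementation: the names `last_row` and `hirschberg`, the default scores (m = 1, s = 1, d = 1), and the explicit base case for a single character of X are all choices made here.

```python
def last_row(x, y, m, s, d):
    """Scores F(len(x), j) for j = 0..len(y), keeping only two rows."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d]
        for j in range(1, len(y) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            cur.append(max(prev[j - 1] + sigma, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev


def hirschberg(x, y, m=1, s=1, d=1):
    """One optimal global alignment of x and y in linear space."""
    M, N = len(x), len(y)
    if M == 0:
        return "-" * N, y
    if N == 0:
        return x, "-" * M
    if M == 1:
        # Base case: pair the single character with its best partner in y,
        # or with a gap if that scores higher.
        j = max(range(N), key=lambda c: m if x == y[c] else -s)
        sigma = m if x == y[j] else -s
        if sigma - (N - 1) * d >= -(N + 1) * d:
            return "-" * j + x + "-" * (N - 1 - j), y
        return x + "-" * N, "-" + y

    mid = M // 2
    fwd = last_row(x[:mid], y, m, s, d)                    # F(M/2, k)
    bwd = last_row(x[mid:][::-1], y[::-1], m, s, d)        # B(M/2, k) = bwd[N - k]
    k_star = max(range(N + 1), key=lambda k: fwd[k] + bwd[N - k])

    # Recurse on the two sub-problems on either side of (M/2, k*).
    left = hirschberg(x[:mid], y[:k_star], m, s, d)
    right = hirschberg(x[mid:], y[k_star:], m, s, d)
    return left[0] + right[0], left[1] + right[1]


ax, ay = hirschberg("ATA", "AGTA")
print(ay)
print(ax)
```

The string slices and reversals copy their inputs, which keeps the sketch short; an implementation that cares about constant factors would pass index ranges instead.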

Analysis

• Memory: O(N) for computation, O(N+M) to store the optimal alignment

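• Time:
– MN for first iteration
– k M/2 + (N-k) M/2 = MN/2 for second
– …

[Figure: sub-problems at successive recursion levels, with total areas MN, MN/2, MN/4, MN/8]

\[
MN + \frac{MN}{2} + \frac{MN}{4} + \frac{MN}{8} + \cdots
= MN \left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} + \tfrac{1}{16} + \cdots\right)
= 2MN = O(MN)
\]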
