CS 5263 Bioinformatics Lecture 4: Local Sequence Alignment, More Efficient Sequence Alignment Algorithms
Jan 23, 2016
CS 5263 Bioinformatics
Lecture 4: Local Sequence Alignment, More Efficient
Sequence Alignment Algorithms
Roadmap
• Review of last lecture
• Local sequence alignment
• More efficient sequence alignment algorithms
• Given a scoring scheme, – Match: m– Mismatch: -s– Gap: -d
• We can easily compute an optimal alignment by dynamic programming
• Look at any column of an alignment between two sequences X = x1x2…xM, Y = y1y1…yN
• Only three cases:– xi is aligned to yj
– xi is aligned to a gap
– yj is aligned to a gap
F(i-1, j-1) + (xi, yj)F(i, j) = max F(i-1, j) - d
F(i, j-1) - d
F(0,0)
F(M,N)
F(0,0)
F(M,N)
A
A
G
-
T
T
A
A
Example
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
F(i,j) j = 0 1 2 3 4
i = 0
1
2
3
A
A
G
-
T
T
A
A
Equivalent graph problem
(0,0)
(3,4)
A G T A
A
A
T
1 1
1
1
S1 =
S2 =
• Number of steps: length of the alignment
• Path length: alignment score
• Optimal alignment: find the longest path from (0, 0) to (3, 4)
• General longest path problem cannot be found with DP. Longest path on this graph can be found by DP since no cycle is possible.
: a gap in the 2nd sequence
: a gap in the 1st sequence
: match / mismatch
Value on vertical/horizontal line: -dValue on diagonal: m or -s
1
Variants of Needleman-Wunsch alg
• LCS: longest common subsequence– No penalty for gaps or mutations– Change: score function
• Overlapping variants– No penalty for starting/ending gaps– Change: initial / termination step
• Other variants– cDNA-genome alignment
Local alignment
The local alignment problem
Given two strings X = x1……xM,
Y = y1……yN
Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum
e.g. X = abcxdex X’ = cxde Y = xxxcde Y’ = c-de
x
y
Why local alignment
• Conserved regions may be a small part of the whole– Global alignment might miss them if flanking “junk”
outweighs similar regions
• Genes are shuffled between genomes
A
A
B C D
B CD
Naïve algorithm
for all substrings X’ of X and Y’ of YAlign X’ & Y’ via dynamic
programmingRetain pair with max valueend ;Output the retained pair
• Time: O(n2) choices for A, O(m2) for B, O(nm) for DP, so O(n3m3 ) total.
Reminder
• The overlap detection algorithm– We do not give penalty to gaps in the ends
Free gap
Free gap
The local alignment idea• Do not penalize the unaligned regions (gaps or mismatches)• The alignment can start anywhere and ends anywhere• Strategy: whenever we get to some low similarity region (negative score), we restart a new alignment
– By resetting alignment score to zero
The Smith-Waterman algorithm
Initialization: F(0, j) = F(i, 0) = 0
0
F(i – 1, j) – d
F(i, j – 1) – d
F(i – 1, j – 1) + (xi, yj)
Iteration: F(i, j) = max
The Smith-Waterman algorithm
Termination:
1. If we want the best local alignment…FOPT = maxi,j F(i, j)
2. If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace
back
x x x c d e
0 0 0 0 0 0 0
a 0
b 0
c 0
x 0
d 0
e 0
x 0
Match: 2
Mismatch: -1
Gap: -1
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0
x 0
d 0
e 0
x 0
Match: 2
Mismatch: -1
Gap: -1
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0
d 0
e 0
x 0
Match: 2
Mismatch: -1
Gap: -1
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 0 0
d 0
e 0
x 0
Match: 2
Mismatch: -1
Gap: -1
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 0 0
d 0 1 1 1 1 3 2
e 0
x 0
Match: 2
Mismatch: -1
Gap: -1
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 0 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0
Match: 2
Mismatch: -1
Gap: -1
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 1 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0 2 2 2 1 1 4
Match: 2
Mismatch: -1
Gap: -1
Trace back
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 1 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0 2 2 2 1 1 4
Match: 2
Mismatch: -1
Gap: -1
Trace back
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 1 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0 2 2 2 1 1 4
Match: 2
Mismatch: -1
Gap: -1
cxde| ||c-de
x-de| ||xcde
• No negative values in local alignment DP array
• Optimal local alignment will never have a gap on either end
• Local alignment: “Smith-Waterman”
• Global alignment: “Needleman-Wunsch”
Analysis
• Time: – O(MN) for finding the best alignment– Time to report all alignments depends on the
number of sub-opt alignments
• Memory:– O(MN)– O(M+N) possible
More efficient alignment algorithms
• Given two sequences of length M, N
• Time: O(MN)– ok
• Space: O(MN)– bad– 1Mb seq x 1Mb seq = 1000G memory
• Can we do better?
Bounded alignment
Good alignment should appear near the diagonal
Bounded Dynamic Programming
If we know that x and y are very similar
Assumption: # gaps(x, y) < k
xi Then,| implies | i – j | < k
yj
Bounded Dynamic Programming
Initialization:
F(i,0), F(0,j) undefined for i, j > k
Iteration:For i = 1…M
For j = max(1, i – k)…min(N, i+k)
F(i – 1, j – 1)+ (xi, yj)
F(i, j) = max F(i, j – 1) – d, if j > i – k
F(i – 1, j) – d, if j < i + k
Termination: same
x1 ………………………… xM
y N …
……
……
……
……
… y
1
k
Analysis
• Time: O(kM) << O(MN)
• Space: O(kM) with some tricks
2k
M
2k
=>M
• Given two sequences of length M, N
• Time: O(MN)– ok
• Space: O(MN)– bad– 1mb seq x 1mb seq = 1000G memory
• Can we do better?
Linear space algorithm
• If all we need is the alignment score but not the alignment, easy!
We only need to keep two rows
(You only need one row, with a little trick)
But how do we get the alignment?
Linear space algorithm
• When we finish, we know how we have aligned the ends of the sequences
Naïve idea: Repeat on the smaller subproblem F(M-1, N-1)
Time complexity: O((M+N)(MN))
XM
YN
(0, 0)
(M, N)
M/2
Key observation: optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown.
• Longest path from (0, 0) to (6, 6) is max_k (LP(0,0,3,k) + LP(6,6,3,k))
(0,0)
(6,6)
(3,2) (3,4) (3,6)(3,0)
Hirschberg’s idea
• Divide and conquer!
M/2 F(M/2, k) represents the best alignment between x1x2…xM/2 and y1y2…yk
Forward algorithmAlign x1x2…xM/2 with Y
X
Y
Backward Algorithm
M/2
B(M/2, k) represents the best alignment between reverse(xM/2+1…xM) and reverse(ykyk+1…yN )
Backward algorithmAlign reverse(xM/2+1…xM) with reverse(Y)
Y
X
Linear-space alignment
Using 2 (4) rows of space, we can compute
for k = 1…N, F(M/2, k), B(M/2, k)
M/2
Linear-space alignment
Now, we can find k* maximizing F(M/2, k) + B(M/2, k)
Also, we can trace the path exiting column M/2 from k*
Conclusion: In O(NM) time, O(N) space, we found optimal alignment path at row M/2
Linear-space alignment• Iterate this procedure to the two sub-problems!
N-k*
M/2
M/2
k*
Analysis
• Memory: O(N) for computation, O(N+M) to store the optimal alignment
• Time: – MN for first iteration– k M/2 + (N-k) M/2 = MN/2 for second– …
k
N-k
M/2
M/2
MN MN/2 MN/4
MN/8
MN + MN/2 + MN/4 + MN/8 + … = MN (1 + ½ + ¼ + 1/8 + 1/16 + …)= 2MN = O(MN)