Comp. Genomics Recitation 2 12/3/09 Slides by Igor Ulitsky
Outline
• Alignment re-cap• End-space free alignment• Affine gap alignment algorithm and proof
• Bounded gap/spaces alignments
Dynamic programming
• Useful in many string-related settings
• Will be repeatedly used in the course
• General idea• Confine the exponential number of possibilities into some “hierarchy”, such that the number of cases becomes polynomial
Dynamic programming for shortest paths
• Finding the shortest path from X to Y using the Floyd Warshall
• Idea: if we know what is the shortest path using intermediate vertices {1,…, k-1}, computing shortest paths using {1,…, k} is easy
wij if k=0
• dij(k)= min{dij
(k-1), dik(k-1)+dkj
(k-1)} otherwise
Alignment reminder
Something1|G
Something2|C
Something1|GSomething2|C
Somethin g1|GSomething2C|-
Something1|GSomething1|C
Something1|GSomething1|C
Something1|GSomething1|C
Something1G|-Somethin g2|C
Global alignment
• Input: S1,S2
• Output: Minimum cost alignment• V(k,l) – score of aligning S1[1..k] with S2[1..l]
• Base conditions: • V(i,0) = k=0..i(sk,-)• V(0,j) = k=0..j(-,tk)
• Recurrence relation: V(i-1,j-1) + (si,tj)
1in, 1jm: V(i,j) = max V(i-1,j) + (si,-)
V(i,j-1) + (-,tj)
Alignment reminder
• Global alignment• All of S1 has to be aligned with all of S2
• Every gap is “payed for”• Solution equals V(n,m)
Alignment score here
Traceback all the way
Local alignment
• Local alignment• Subset of S1 aligned with a subset of S2
• Gaps outside subsets “costless”• Solution equals the maximum score cell in the DP matrix
• Base conditions: • V(i,0) = 0• V(0,j) = 0
• Recurrence relation: V(i-1,j-1) + (si,tj)
1in, 1jm: V(i,j) = max V(i-1,j) + (si,-)
V(i,j-1) + (-,tj)0
Ends-free alignment
• Something between global and local• Consider aligning a gene to a (bacterial) genome• Gaps in the beginning and end of S and T are costless
• But all of S,T should be aligned• Base conditions:
• V(i,0) = 0• V(0,j) = 0
• Recurrence relation: V(i-1,j-1) + (si,tj)
1in, 1jm: V(i,j) = max V(i-1,j) + (si,-)
V(i,j-1) + (-,tj)
• The optimal solution is found at the last row/column
(not necessarily at bottom right corner)
Handling weird gaps
• Affine gap: different cost for a “new” and “old” gaps
Something1|G
Something2|C
Something1|GSomething2|C
Somethin g1|GSomething2C|-
Something1|GSomething1|C
Something1|GSomething1|C
Something1|GSomething1|C
Something1G|-Somethin g2|C
Now we care if there were gaps here
Two new things to keep track Two additional matrices
Alignment with Affine Gap Penalty
Base Conditions:V(i, 0) = F(i, 0) = Wg + iWs
V(0, j) = E(0, j) = Wg + jWs
Recursive Computation:V(i, j) = max{ E(i, j), F(i, j), G(i, j)}
where:• G(i, j) = V(i-1, j-1) + (si, tj)• E(i, j) = max{ E(i, j-1) + Ws , G(i, j-1) + Wg + Ws , F(i, j-1) + Wg + Ws }
• F(i, j) = max{ F(i-1, j) + Ws , G(i-1, j) + Wg + Ws , E(i-1, j) + Wg + Ws }
S.....iT.....j
S.....i------T..............j
S...............iT.....j-------
G(i,j)
E(i,j)
• Time complexity O(nm) - compute 4 matrices instead of one.• Space complexity O(nm) - saving 3 (Why?) matrices. O(n+m) w/ Hir.
When do constant and affine gap costs differ?
• Consider:AGAGACTGACGCTTA
ATATTA
AGAGACTGACGCTTA
----A-T-A---TTA
Constant penalty: Mismatch: -5Gap: -1
AGAGACTGACGCTTA
ATA---------TTA
Affine penalty: Mismatch: -5Gap open: -3Gap extend: -0.5
-9 -14
-12-14.5
Bounding the number of gaps
• Lets say we are allowed to have at most K gaps
• (Gaps ≠ Spaces Gap can contain many spaces)
• Now we keep track of the number of gaps we opened so far
• Also still need to keep track of whether a gap is currently open in S or T (E/F matrices)
Bounding the number of gaps
• A “multi-layer” DP matrix• Actually separate functions – V,E,F, on every layer, keeping track of layer no.
• Every time we open or close a gap we “jump” to the next layer
• Where to look for the solution? (not only
at last layer!)• What is the complexity?
Bounding the number of spaces
• Let’s say that no gap can exceed k spaces
• Of course now cannot also bound number
of gaps as well (why?)• How many matrices do we need now?• Here, no monotone notion of layer like before
• What’s the complexity?
What about arbitrary gap functions?
• If the gap cost is an arbitrary function of its length f(k)
• Thus, when computing Dij, we need to look at j places “back” and i places “up”:
• Complexity?
Something1|GSomething1|C
( 1, 1) ( , )
( , ) min ( ( , ) ( , -) ( )}
min ( ( , ) (-, ) ( )k i
k j
V i j i j
V i j V i k j s f k
V i j k t f k
− − += − + +
− + +min
Special cases
• How about a logarithmic penalty? Wg+Ws*log(k)
• This is a special case of a convex penalty, which is solvable in O(mn*log(m))
• The logarithmic case can be done in O(mn)
• For a piece-wise linear gap function made of K lines, DP can be done in O(mn*log(K))
Supersequence
• Exercise: A is called a non-contiguous supersequence of B if B is a non-contiguous subsequence of A.
• e.g., YABADABADU is a non-contigous supersequence of BABU (YABADABADU)
• Given S and T, find their shortest common supersequence
Reminder: LCS
• Longest common non-contigous subsequence:• Adjust global alignment with similarity scores
• 1 for match• 0 for gaps • -∞ for mismatches
Supersequence
• Find the longest common sub-sequence of S,T
• Generate the string as follows: • for every column in the alignment• Match – add the matching character (once!)
• Gap – add the character aligned against the gap
Supersequence
• For S=“Pride” T=“Parade”: • P-R-IDE• PARA-DE• PARAIDE – Shortest common supersequence
Exercise: Finding repeats
• Basic objective: find a pair of subsequences within S with maximum similarity
• Simple (albeit wrong) idea: Find an optimal alignment of S with itself! (Why wrong?)
• But using local alignment is still a good idea
Variant #1
• Specific requirement: the two sequences may overlap
• Solution: Change the local alignment algorithm:• Compute only the upper triangular submatrix (V(i,j), where j>i).
• Set diagonal values to 0
• Complexity: O(n2) time and O(n) space
Variant #2
• Specific requirement: the two sequences may not overlap
• Solution: Absence of overlap means that k exists such that one string is in S[1..k] and another in S[k+1..n] • Check local alignments between S[1..k] and S[k+1..n] for any 1<=k<n
• Pick the highest-scoring alignment
• Complexity: O(n3) time and O(n) space
Variant #3
• Specific requirement: the two sequences must be consequtive (tandem repeat)
• Solution: Similar to variant #2, but somewhat “ends-free”: seek a global alignment between S[1..k] and S[k+1..n], • No penalties for gaps in the beginning of S[1..k]
• No penalties for gaps in the end of S[k+1..n]
• Complexity: O(n3) time and O(n) space
Variant #4
• Specific requirement: the two sequences must be consequtive and the similarity is measured between the first sequence and the reverse complement of the second - SRC (inverted repeat)
• Tempting (albeit wrong) to use something in the spirit of variant #3 – will give complexity O(n3)