CS 5263 Bioinformatics Lecture 4: Local Sequence Alignment, More Efficient Sequence Alignment Algorithms

CS 5263 Bioinformatics

Jan 23, 2016


fathia

Transcript
Page 1: CS 5263 Bioinformatics

CS 5263 Bioinformatics

Lecture 4: Local Sequence Alignment, More Efficient Sequence Alignment Algorithms

Page 2: CS 5263 Bioinformatics

Roadmap

• Review of last lecture

• Local sequence alignment

• More efficient sequence alignment algorithms

Page 3: CS 5263 Bioinformatics

• Given a scoring scheme,
– Match: m
– Mismatch: -s
– Gap: -d

• We can easily compute an optimal alignment by dynamic programming

Page 4: CS 5263 Bioinformatics

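• Look at any column of an alignment between two sequences X = x1x2…xM, Y = y1y2…yN

• Only three cases:
– xi is aligned to yj
– xi is aligned to a gap
– yj is aligned to a gap

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$

where σ(xi, yj) is the substitution score: m if xi = yj, and -s otherwise.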

Page 5: CS 5263 Bioinformatics

[Figure: dynamic programming matrix with corners F(0,0) and F(M,N)]

Page 6: CS 5263 Bioinformatics

[Figure: dynamic programming matrix with corners F(0,0) and F(M,N)]

Page 7: CS 5263 Bioinformatics

Example

Optimal alignment:

A G T A
A - T A

F(i,j)       j = 0    1    2    3    4
                      A    G    T    A
i = 0            0   -1   -2   -3   -4
i = 1    A      -1    1    0   -1   -2
i = 2    T      -2    0    0    1    0
i = 3    A      -3   -1   -1    0    2
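The table above can be reproduced with a short dynamic programming routine. This is a minimal sketch, not the lecture's own code: the function name `needleman_wunsch` is made up for illustration, and the defaults m = s = d = 1 are an inference (the slide does not state the scores, but these values reproduce every entry of the table).

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Fill the global alignment table F and trace back one optimal alignment.

    m, s, d are the match score, mismatch penalty and gap penalty
    (assumed values; the slide does not state them explicitly).
    """
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):          # first column: x aligned to gaps only
        F[i][0] = -d * i
    for j in range(1, N + 1):          # first row: y aligned to gaps only
        F[0][j] = -d * j
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma,   # xi aligned to yj
                          F[i - 1][j] - d,           # xi aligned to a gap
                          F[i][j - 1] - d)           # yj aligned to a gap

    # Trace back from F(M, N) to F(0, 0), preferring the diagonal move.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sigma = m if x[i - 1] == y[j - 1] else -s
            if F[i][j] == F[i - 1][j - 1] + sigma:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, ax, ay = needleman_wunsch("ATA", "AGTA")
print(F[3][4])   # optimal score
print(ay)
print(ax)
```

With rows indexed by ATA and columns by AGTA, the last cell holds the optimal score 2 and the traceback yields the alignment AGTA over A-TA, as on the slide.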

Page 8: CS 5263 Bioinformatics

Equivalent graph problem

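[Figure: alignment graph on a grid from (0,0) to (3,4), with the two sequences S1 and S2 labeling the axes and weight 1 marked on matching diagonal edges]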

• Number of steps: length of the alignment

• Path length: alignment score

• Optimal alignment: find the longest path from (0, 0) to (3, 4)

• The general longest-path problem cannot be solved efficiently by DP. The longest path on this graph can be found by DP because the graph contains no cycles.

• Horizontal and vertical edges: a gap in the 1st or 2nd sequence
• Diagonal edges: match / mismatch
• Value on vertical/horizontal line: -d
• Value on diagonal: m or -s

Page 9: CS 5263 Bioinformatics

Variants of Needleman-Wunsch alg

• LCS: longest common subsequence
– No penalty for gaps or mutations
– Change: score function
• Overlapping variants
– No penalty for starting/ending gaps
– Change: initial / termination step
• Other variants
– cDNA-genome alignment
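As an illustration of the first variant: with match = 1 and no penalty for gaps or mismatches, the global alignment recurrence returns the length of the longest common subsequence. The sketch below is an assumption-laden example (the name `lcs_length` is hypothetical), keeping only two rows because only the score is needed.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y.

    Same recurrence as global alignment, with the score function
    changed to match = 1, mismatch = 0, gap = 0.
    """
    prev = [0] * (len(y) + 1)                  # row i-1 of the table
    for xi in x:
        cur = [0]                              # F(i, 0) = 0: gaps are free
        for j, yj in enumerate(y, 1):
            cur.append(max(prev[j - 1] + (1 if xi == yj else 0),
                           prev[j],            # xi aligned to a gap, no cost
                           cur[j - 1]))        # yj aligned to a gap, no cost
        prev = cur
    return prev[-1]


print(lcs_length("ATA", "AGTA"))   # ATA is itself a subsequence of AGTA
```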

Page 10: CS 5263 Bioinformatics

Local alignment

Page 11: CS 5263 Bioinformatics

The local alignment problem

Given two strings X = x1…xM, Y = y1…yN

Find substrings X’, Y’ whose similarity (optimal global alignment value) is maximum

e.g. X = abcxdex, X’ = cxde
     Y = xxxcde, Y’ = cde (aligned to X’ as c-de)

Page 12: CS 5263 Bioinformatics

Why local alignment

• Conserved regions may be a small part of the whole
– Global alignment might miss them if flanking “junk” outweighs similar regions
• Genes are shuffled between genomes

[Figure: two genomes containing the blocks A, B, C, D]

Page 13: CS 5263 Bioinformatics

Naïve algorithm

for all substrings X’ of X and Y’ of Y
    Align X’ & Y’ via dynamic programming
    Retain pair with max value
end;
Output the retained pair

• Time: O(M²) choices for X’, O(N²) choices for Y’, O(MN) for each DP, so O(M³N³) total.
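The brute-force procedure can be written out directly. This is only a sketch for small inputs, with hypothetical function names (`global_score`, `naive_local`) and the scores match 2, mismatch -1, gap -1 assumed as defaults; it is useful mainly as a reference to check faster algorithms against.

```python
def global_score(x, y, m, s, d):
    """Optimal global alignment score of x and y (two-row DP)."""
    prev = [-d * j for j in range(len(y) + 1)]
    for i, xi in enumerate(x, 1):
        cur = [-d * i]
        for j, yj in enumerate(y, 1):
            sigma = m if xi == yj else -s
            cur.append(max(prev[j - 1] + sigma, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def naive_local(x, y, m=2, s=1, d=1):
    """Try every pair of non-empty substrings and keep the best-scoring one.

    Returns (score, substring of x, substring of y); the empty pair with
    score 0 is returned when no pair scores above zero.
    """
    best = (0, "", "")
    for a in range(len(x)):
        for b in range(a + 1, len(x) + 1):          # X' = x[a:b]
            for c in range(len(y)):
                for e in range(c + 1, len(y) + 1):  # Y' = y[c:e]
                    score = global_score(x[a:b], y[c:e], m, s, d)
                    if score > best[0]:
                        best = (score, x[a:b], y[c:e])
    return best


print(naive_local("abcxdex", "xxxcde")[0])
```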

Page 14: CS 5263 Bioinformatics

Reminder

• The overlap detection algorithm
– We do not give penalty to gaps in the ends

[Figure: two overlapping sequences; the unaligned ends are free gaps]
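A minimal sketch of overlap detection, assuming the usual formulation: the first row and column are initialized to zero (leading gaps are free) and the answer is the maximum over the last row and last column (trailing gaps are free). The function name `overlap_score` and the default scores are illustrative assumptions.

```python
def overlap_score(x, y, m=1, s=1, d=1):
    """Best alignment score when gaps at either end of x or y cost nothing."""
    M, N = len(x), len(y)
    # Zeros in row 0 and column 0: starting gaps are not penalized.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Ending gaps are free: stop anywhere on the last row or last column.
    return max(max(F[M]), max(row[N] for row in F))


# The suffix "abc" of the first string overlaps the prefix "abc" of the second.
print(overlap_score("xxxabc", "abcyyy"))
```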

Page 15: CS 5263 Bioinformatics

The local alignment idea

• Do not penalize the unaligned regions (gaps or mismatches)
• The alignment can start anywhere and end anywhere
• Strategy: whenever we get to some low-similarity region (negative score), we restart a new alignment
– By resetting alignment score to zero

Page 16: CS 5263 Bioinformatics

The Smith-Waterman algorithm

Initialization: F(0, j) = F(i, 0) = 0

Iteration:

$$F(i, j) = \max \begin{cases} 0 \\ F(i-1, j) - d \\ F(i, j-1) - d \\ F(i-1, j-1) + \sigma(x_i, y_j) \end{cases}$$

where σ(xi, yj) is m for a match and -s for a mismatch.

Page 17: CS 5263 Bioinformatics

The Smith-Waterman algorithm

Termination:

1. If we want the best local alignment: F_OPT = max_{i,j} F(i, j)

2. If we want all local alignments scoring > t: for all i, j find F(i, j) > t, and trace back
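The initialization, iteration and termination steps can be sketched as follows. The function name `smith_waterman_table` is hypothetical, and the default scores (match 2, mismatch -1, gap -1) are taken from the worked example in this lecture.

```python
def smith_waterman_table(x, y, m=2, s=1, d=1):
    """Fill the local alignment table; return it with the best score and its cell."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]    # F(0, j) = F(i, 0) = 0
    best, best_cell = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(0,                          # restart the alignment here
                          F[i - 1][j] - d,
                          F[i][j - 1] - d,
                          F[i - 1][j - 1] + sigma)
            if F[i][j] > best:
                best, best_cell = F[i][j], (i, j)
    return F, best, best_cell


F, best, cell = smith_waterman_table("abcxdex", "xxxcde")
print(best, cell)

# Termination option 2: every cell scoring above a threshold t is a
# candidate end point for a traceback.
t = 3
hits = [(i, j) for i, row in enumerate(F) for j, v in enumerate(row) if v > t]
print(hits)
```

For X = abcxdex and Y = xxxcde the best score is 5 at cell (6, 6), and the cells scoring above t = 3 are (6, 6) and (7, 6).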

Page 18: CS 5263 Bioinformatics

x x x c d e

0 0 0 0 0 0 0

a 0

b 0

c 0

x 0

d 0

e 0

x 0

Match: 2

Mismatch: -1

Gap: -1

Page 19: CS 5263 Bioinformatics

x x x c d e

0 0 0 0 0 0 0

a 0 0 0 0 0 0 0

b 0 0 0 0 0 0 0

c 0

x 0

d 0

e 0

x 0

Match: 2

Mismatch: -1

Gap: -1

Page 20: CS 5263 Bioinformatics

x x x c d e

0 0 0 0 0 0 0

a 0 0 0 0 0 0 0

b 0 0 0 0 0 0 0

c 0 0 0 0 2 1 0

x 0

d 0

e 0

x 0

Match: 2

Mismatch: -1

Gap: -1

Page 21: CS 5263 Bioinformatics

x x x c d e

0 0 0 0 0 0 0

a 0 0 0 0 0 0 0

b 0 0 0 0 0 0 0

c 0 0 0 0 2 1 0

x 0 2 2 2 1 1 0

d 0

e 0

x 0

Match: 2

Mismatch: -1

Gap: -1

Page 22: CS 5263 Bioinformatics

x x x c d e

0 0 0 0 0 0 0

a 0 0 0 0 0 0 0

b 0 0 0 0 0 0 0

c 0 0 0 0 2 1 0

x 0 2 2 2 1 1 0

d 0 1 1 1 1 3 2

e 0

x 0

Match: 2

Mismatch: -1

Gap: -1

Page 23: CS 5263 Bioinformatics

x x x c d e

0 0 0 0 0 0 0

a 0 0 0 0 0 0 0

b 0 0 0 0 0 0 0

c 0 0 0 0 2 1 0

x 0 2 2 2 1 1 0

d 0 1 1 1 1 3 2

e 0 0 0 0 0 2 5

x 0

Match: 2

Mismatch: -1

Gap: -1

Page 24: CS 5263 Bioinformatics

x x x c d e

0 0 0 0 0 0 0

a 0 0 0 0 0 0 0

b 0 0 0 0 0 0 0

c 0 0 0 0 2 1 0

x 0 2 2 2 1 1 0

d 0 1 1 1 1 3 2

e 0 0 0 0 0 2 5

x 0 2 2 2 1 1 4

Match: 2

Mismatch: -1

Gap: -1

Page 25: CS 5263 Bioinformatics

Trace back

x x x c d e

0 0 0 0 0 0 0

a 0 0 0 0 0 0 0

b 0 0 0 0 0 0 0

c 0 0 0 0 2 1 0

x 0 2 2 2 1 1 0

d 0 1 1 1 1 3 2

e 0 0 0 0 0 2 5

x 0 2 2 2 1 1 4

Match: 2

Mismatch: -1

Gap: -1

Page 26: CS 5263 Bioinformatics

Trace back

x x x c d e

0 0 0 0 0 0 0

a 0 0 0 0 0 0 0

b 0 0 0 0 0 0 0

c 0 0 0 0 2 1 0

x 0 2 2 2 1 1 0

d 0 1 1 1 1 3 2

e 0 0 0 0 0 2 5

x 0 2 2 2 1 1 4

Match: 2

Mismatch: -1

Gap: -1

cxde
| ||
c-de

x-de
| ||
xcde
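A traceback for the local case starts at the highest-scoring cell and stops at the first zero. The sketch below (hypothetical name `smith_waterman`) returns a single alignment; when two moves tie it prefers the diagonal, then the vertical move, which is an arbitrary choice. A different tie-break would return the other co-optimal alignment shown above.

```python
def smith_waterman(x, y, m=2, s=1, d=1):
    """Return (score, aligned x, aligned y) for one optimal local alignment."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(0, F[i - 1][j] - d, F[i][j - 1] - d,
                          F[i - 1][j - 1] + sigma)

    # Start from the cell holding the maximum score.
    i, j = max(((a, b) for a in range(M + 1) for b in range(N + 1)),
               key=lambda c: F[c[0]][c[1]])
    score = F[i][j]

    ax, ay = [], []
    while F[i][j] > 0:                  # a zero marks the start of the alignment
        sigma = m if x[i - 1] == y[j - 1] else -s
        if F[i][j] == F[i - 1][j - 1] + sigma:
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return score, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("abcxdex", "xxxcde"))   # (5, 'cxde', 'c-de')
```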

Page 27: CS 5263 Bioinformatics

• No negative values in local alignment DP array

• Optimal local alignment will never have a gap on either end

• Local alignment: “Smith-Waterman”

• Global alignment: “Needleman-Wunsch”

Page 28: CS 5263 Bioinformatics

Analysis

• Time:
– O(MN) for finding the best alignment
– Time to report all alignments depends on the number of suboptimal alignments
• Memory:
– O(MN)
– O(M+N) possible

Page 29: CS 5263 Bioinformatics

More efficient alignment algorithms

Page 30: CS 5263 Bioinformatics

• Given two sequences of length M, N

• Time: O(MN)
– ok
• Space: O(MN)
– bad
– 1Mb seq x 1Mb seq = 10^12 matrix cells, about 1000G memory

• Can we do better?

Page 31: CS 5263 Bioinformatics

Bounded alignment

Good alignment should appear near the diagonal

Page 32: CS 5263 Bioinformatics

Bounded Dynamic Programming

If we know that x and y are very similar

Assumption: # gaps(x, y) < k

Then xi aligned to yj implies | i – j | < k

Page 33: CS 5263 Bioinformatics

Bounded Dynamic Programming

Initialization:

F(i,0), F(0,j) undefined for i, j > k

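Iteration:

For i = 1…M
    For j = max(1, i – k)…min(N, i + k)

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i, j-1) - d, & \text{if } j > i - k \\ F(i-1, j) - d, & \text{if } j < i + k \end{cases}$$

where σ(xi, yj) is m for a match and -s for a mismatch.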

Termination: same

[Figure: DP matrix with x1…xM on one axis and y1…yN on the other; only a band of width k around the main diagonal is filled]
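The banded recurrence can be sketched as below. Two choices here are assumptions rather than part of the slides: the band is taken as |i – j| ≤ k, following the loop bounds of the iteration step, and a full (M+1) × (N+1) table is allocated for readability, so this version saves time but not space. The function name `banded_global_score` is hypothetical.

```python
NEG_INF = float("-inf")


def banded_global_score(x, y, k, m=1, s=1, d=1):
    """Global alignment score restricted to cells with |i - j| <= k.

    Returns None when the end cell (M, N) lies outside the band.
    """
    M, N = len(x), len(y)
    if abs(M - N) > k:
        return None
    F = [[NEG_INF] * (N + 1) for _ in range(M + 1)]   # NEG_INF = undefined
    F[0][0] = 0
    for j in range(1, min(N, k) + 1):     # F(0, j) is defined only for j <= k
        F[0][j] = -d * j
    for i in range(1, min(M, k) + 1):     # F(i, 0) is defined only for i <= k
        F[i][0] = -d * i
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            best = F[i - 1][j - 1] + sigma            # diagonal stays in the band
            if j > i - k:                             # left neighbour is in the band
                best = max(best, F[i][j - 1] - d)
            if j < i + k:                             # upper neighbour is in the band
                best = max(best, F[i - 1][j] - d)
            F[i][j] = best
    return F[M][N]


print(banded_global_score("ATA", "AGTA", k=1))
```

If the optimal unrestricted alignment stays within the band, the banded score equals the full Needleman-Wunsch score; otherwise it is a lower bound on it.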

Page 34: CS 5263 Bioinformatics

Analysis

• Time: O(kM) << O(MN)

• Space: O(kM) with some tricks

[Figure: the band of width 2k along the diagonal of the matrix can be stored as an M × 2k array]

Page 35: CS 5263 Bioinformatics
Page 36: CS 5263 Bioinformatics

• Given two sequences of length M, N

• Time: O(MN)
– ok
• Space: O(MN)
– bad
– 1Mb seq x 1Mb seq = 10^12 matrix cells, about 1000G memory

• Can we do better?

Page 37: CS 5263 Bioinformatics

Linear space algorithm

• If all we need is the alignment score but not the alignment, easy!

We only need to keep two rows

(You only need one row, with a little trick)

But how do we get the alignment?
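Both score-only versions can be sketched in a few lines. The names `score_two_rows` and `score_one_row` are illustrative, the default scores are assumed, and the "little trick" is assumed to be the common one of saving the diagonal value in a temporary variable before it is overwritten.

```python
def score_two_rows(x, y, m=1, s=1, d=1):
    """Global alignment score keeping only rows i-1 and i of the table."""
    prev = [-d * j for j in range(len(y) + 1)]
    for i, xi in enumerate(x, 1):
        cur = [-d * i]
        for j, yj in enumerate(y, 1):
            sigma = m if xi == yj else -s
            cur.append(max(prev[j - 1] + sigma, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def score_one_row(x, y, m=1, s=1, d=1):
    """Same score with a single row, updated in place."""
    row = [-d * j for j in range(len(y) + 1)]
    for i, xi in enumerate(x, 1):
        diag = row[0]                 # F(i-1, 0), about to be overwritten
        row[0] = -d * i
        for j, yj in enumerate(y, 1):
            up = row[j]               # F(i-1, j), still the old value
            sigma = m if xi == yj else -s
            row[j] = max(diag + sigma, up - d, row[j - 1] - d)
            diag = up                 # becomes F(i-1, j) for column j+1
    return row[-1]


print(score_two_rows("ATA", "AGTA"), score_one_row("ATA", "AGTA"))
```

Both functions use O(N) memory, but neither keeps enough information to trace back the alignment itself.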

Page 38: CS 5263 Bioinformatics

Linear space algorithm

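• When we finish, we know how we have aligned the ends of the sequences (xM and yN)

Naïve idea: Repeat on the smaller subproblem, e.g. F(M-1, N-1)

Time complexity: O((M+N)(MN)), since the alignment has at most M+N columns and each one costs an O(MN) pass to recover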

Page 39: CS 5263 Bioinformatics

[Figure: DP matrix from (0, 0) to (M, N) with row M/2 marked]

Key observation: optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown.

Page 40: CS 5263 Bioinformatics

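• Longest path from (0, 0) to (6, 6) is max_k (LP(0,0,3,k) + LP(3,k,6,6)), where LP(a,b,c,d) is the length of the longest path from (a,b) to (c,d)

[Figure: grid from (0,0) to (6,6) with candidate midpoints (3,0), (3,2), (3,4), (3,6) on row 3]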

Page 41: CS 5263 Bioinformatics

Hirschberg’s idea

• Divide and conquer!

F(M/2, k) represents the best alignment between x1x2…xM/2 and y1y2…yk

Forward algorithm: align x1x2…xM/2 with Y

Page 42: CS 5263 Bioinformatics

Backward Algorithm

B(M/2, k) represents the best alignment between reverse(xM/2+1…xM) and reverse(yk+1…yN)

Backward algorithm: align reverse(xM/2+1…xM) with reverse(Y)

Page 43: CS 5263 Bioinformatics

Linear-space alignment

Using 2 rows of space for each direction (4 rows in total), we can compute F(M/2, k) and B(M/2, k) for k = 0…N

Page 44: CS 5263 Bioinformatics

Linear-space alignment

Now, we can find k* maximizing F(M/2, k) + B(M/2, k)

Also, we can trace the path exiting row M/2 from k*

Conclusion: In O(NM) time and O(N) space, we have found where the optimal alignment path crosses row M/2

Page 45: CS 5263 Bioinformatics

Linear-space alignment

• Iterate this procedure to the two sub-problems!

[Figure: the two sub-problems have sizes M/2 × k* and M/2 × (N-k*)]
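Putting the forward pass, the backward pass and the recursion together gives the sketch below. It is one possible way to write the divide-and-conquer procedure, not the lecture's own code: the names `last_row` and `hirschberg` are hypothetical, the default scores are assumed, and the base case assumes non-negative m, s and d.

```python
def last_row(x, y, m, s, d):
    """Scores of aligning all of x with each prefix of y (two-row DP)."""
    prev = [-d * j for j in range(len(y) + 1)]
    for i, xi in enumerate(x, 1):
        cur = [-d * i]
        for j, yj in enumerate(y, 1):
            sigma = m if xi == yj else -s
            cur.append(max(prev[j - 1] + sigma, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev


def hirschberg(x, y, m=1, s=1, d=1):
    """Return one optimal global alignment (aligned x, aligned y)."""
    if len(x) == 0:
        return "-" * len(y), y
    if len(y) == 0:
        return x, "-" * len(x)
    if len(x) == 1:
        j = y.find(x)                      # align x to a matching letter if any
        if j < 0:
            if s > 2 * d:                  # two gaps are cheaper than a mismatch
                return x + "-" * len(y), "-" + y
            j = 0                          # otherwise accept a mismatch
        return "-" * j + x + "-" * (len(y) - j - 1), y

    mid, N = len(x) // 2, len(y)
    fwd = last_row(x[:mid], y, m, s, d)                 # F(M/2, k) = fwd[k]
    bwd = last_row(x[mid:][::-1], y[::-1], m, s, d)     # B(M/2, k) = bwd[N - k]
    k_star = max(range(N + 1), key=lambda k: fwd[k] + bwd[N - k])

    # Recurse on the two sub-problems on either side of (M/2, k*).
    left_x, left_y = hirschberg(x[:mid], y[:k_star], m, s, d)
    right_x, right_y = hirschberg(x[mid:], y[k_star:], m, s, d)
    return left_x + right_x, left_y + right_y


ax, ay = hirschberg("ATA", "AGTA")
print(ay)
print(ax)
```

Each call holds only a constant number of score rows of length at most N+1, plus the alignment strings being assembled, which is where the O(N) working space and O(N+M) output space come from.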

Page 46: CS 5263 Bioinformatics

Analysis

• Memory: O(N) for computation, O(N+M) to store the optimal alignment

• Time:
– MN for first iteration
– k M/2 + (N-k) M/2 = MN/2 for second
– …

[Figure: the two sub-problems have sizes M/2 × k and M/2 × (N-k)]

Page 47: CS 5263 Bioinformatics

[Figure: area of the matrix examined at successive levels of the recursion: MN, MN/2, MN/4, MN/8]

MN + MN/2 + MN/4 + MN/8 + … = MN (1 + ½ + ¼ + 1/8 + 1/16 + …) = 2MN = O(MN)