Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College

Jan 05, 2016

Steven Bell
Page 1:

Sequence Comparison Algorithms

Ellen Walker

Bioinformatics

Hiram College

Page 2:

The Problem

• We have two sequences that we want to compare, based on edit distance

• Edit distance = number of changes needed to get from one string to the other
  – Insertions
  – Deletions
  – Changes

Page 3:

Example

• LOVE => MONEY

• 1. Replace L by M

• 2. Replace V by N

• 3. Add Y at the end

L O V E –

M O N E Y
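
The three steps above can be read directly off the two aligned rows. Here is a minimal sketch, assuming the alignment is given as two equal-length strings with an ASCII "-" standing for the gap; the function name is illustrative only.

```python
def describe_alignment(top, bottom, gap="-"):
    """List the edit operations implied by two aligned, equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    operations = []
    for x, y in zip(top, bottom):
        if x == y:
            continue                      # matching column: no edit needed
        if x == gap:
            operations.append(f"insert {y}")
        elif y == gap:
            operations.append(f"delete {x}")
        else:
            operations.append(f"replace {x} by {y}")
    return operations

steps = describe_alignment("LOVE-", "MONEY")
print(steps)        # ['replace L by M', 'replace V by N', 'insert Y']
print(len(steps))   # 3
```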

Page 4:

Brute Force Solution

• Try all possible alignments between the strings

• Looking at one string, consider:
  – Every possible shift (space before or after)
  – Every possible gap (space within)
  – Gaps of various lengths, bounded by the size of the longest string

Page 5:

How many possibilities are there?

• Consider only single insertions:

_ M _ O _ N _ E _ Y _

  – There are N+1 places to insert, where N is the length of the string

• At each place you have 2 choices (insert or not)
  – Therefore, just this subset already contains 2^(N+1) possibilities
  – So, brute force is exponential!
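
To make the count concrete, the sketch below enumerates every insert-or-not pattern for one string. It is illustrative only: the function name is hypothetical, and it covers just the single-insertion subset described above, not the full brute-force search.

```python
from itertools import product

def single_insertion_patterns(s):
    """Yield every way of placing at most one gap in each of the len(s) + 1 slots."""
    slots = len(s) + 1
    for choice in product((False, True), repeat=slots):
        out = []
        for k, ch in enumerate(s):
            if choice[k]:          # gap in the slot just before character k
                out.append("_")
            out.append(ch)
        if choice[-1]:             # gap in the final slot, after the last character
            out.append("_")
        yield "".join(out)

patterns = list(single_insertion_patterns("MONEY"))
print(len(patterns))   # 64, i.e. 2 ** (5 + 1)
```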

Page 6:

Dynamic Programming

• Score possibilities in an alignment matrix

• Value of any square in the matrix depends on:
  – Value above (if “vertical gap”)
  – Value beside (if “horizontal gap”)
  – Value diagonally above (if match or mismatch)

Page 7:

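Global Alignment Matrix

(–– = horizontal gap, | = vertical gap, \ = match or mismatch; the values are consistent with match = +1, mismatch = -1, gap = -1)

               M       O       N       E       Y
       0   –– -1   –– -2   –– -3   –– -4   –– -5
L   | -1    \ -1    \ -2   –– -3   –– -4   –– -5
O   | -2    \ -2    \  0   –– -1   –– -2   –– -3
V   | -3    \ -3    | -1    \ -1    \ -2   –– -3
E   | -4    \ -4    | -2    \ -2    \  0   –– -1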
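
A minimal sketch of how a global alignment matrix like this one could be filled in. The scoring values (match = +1, mismatch = -1, gap = -1) are an assumption inferred from the numbers on the slide, and the function name is illustrative.

```python
def global_alignment_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Fill a global alignment score matrix (rows follow a, columns follow b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        score[0][j] = j * gap              # top row: only horizontal gaps
    for i in range(1, rows):
        score[i][0] = i * gap              # left column: only vertical gaps
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            vertical = score[i - 1][j] + gap
            horizontal = score[i][j - 1] + gap
            score[i][j] = max(diagonal, vertical, horizontal)
    return score

matrix = global_alignment_matrix("LOVE", "MONEY")
for row in matrix:
    print(row)
print(matrix[-1][-1])   # -1: two matches, two mismatches, one gap
```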

Page 8:

Local Alignment Matrix

          M   O   N   E   Y
      0   0   0   0   0   0
L     0   0   0   0   0   0
O     0   0  \1   0   0   0
V     0   0   0   0   0   0
E     0   0   0   0  \1   0
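
The local matrix differs from the global one in two ways: the borders start at 0, and no cell is allowed to fall below 0. A minimal sketch in the style of Smith-Waterman, again assuming match = +1, mismatch = -1, gap = -1 (values not stated on the slide):

```python
def local_alignment_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Fill a local alignment score matrix; every cell is clamped at 0."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]   # borders stay 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            vertical = score[i - 1][j] + gap
            horizontal = score[i][j - 1] + gap
            score[i][j] = max(0, diagonal, vertical, horizontal)
    return score

local = local_alignment_matrix("LOVE", "MONEY")
for row in local:
    print(row)
```

For LOVE against MONEY only the O/O and E/E cells are nonzero, which matches the slide.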

Page 9:

Computing the Alignment Matrix

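• For each square:
  – Take the minimum of the vertical-gap, horizontal-gap, and (mis)match costs (or the maximum, if the squares hold similarity scores as in the alignment matrices shown earlier): O(1)

• There are N*M squares, where N and M are the lengths of the strings

• Therefore, time and space are both O(N*M) or (for short) O(N^2)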
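
For the edit-distance version of the problem, each square holds a cost and the minimum is taken. A minimal sketch, assuming every insertion, deletion, and change costs 1:

```python
def edit_distance(a, b):
    """Number of insertions, deletions, and changes needed to turn a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i                     # delete all of a[:i]
    for j in range(cols):
        dist[0][j] = j                     # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            change = dist[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            delete = dist[i - 1][j] + 1
            insert = dist[i][j - 1] + 1
            dist[i][j] = min(change, delete, insert)   # O(1) work per square
    return dist[-1][-1]

print(edit_distance("LOVE", "MONEY"))   # 3
```

The two nested loops visit each of the N*M squares once, which is where the O(N*M) bound comes from.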

Page 10:

But, what is N?

• If we’re matching genomes, N is huge!

• N^2 is too much time and space!

• How can we save further?
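
A rough back-of-the-envelope calculation shows the scale. The numbers are assumptions for illustration: N = 3 × 10^9 (roughly the size of a human genome) and 4 bytes per matrix cell.

```python
n = 3 * 10**9                 # assumed sequence length (about one human genome)
bytes_per_cell = 4            # assumed storage per matrix cell

cells = n * n                                 # squares in the full matrix
full_matrix_bytes = cells * bytes_per_cell    # memory for the whole matrix
two_row_bytes = 2 * n * bytes_per_cell        # memory if only two rows are kept

print(cells)                        # 9000000000000000000
print(full_matrix_bytes / 10**18)   # 36.0 (exabytes)
print(two_row_bytes / 10**9)        # 24.0 (gigabytes)
```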

Page 11:

Ordering the Computations

• Each cell can be computed once the cells above, diagonally above, and to the left have been computed. Possible orders:
  – Left to right, top to bottom (row major)
  – Top to bottom, left to right (column major)
  – Across a diagonal wavefront

Page 12:

Saving Space: Row Major

• A row major computation really only needs two rows (the one above, and the current row).

• After each computation, the current row becomes the row above

• Savings: space is O(N) instead of O(N^2)

• Cost: insufficient information for traceback
  – Do a new alignment, limited to a region around the result.
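
A minimal sketch of the two-row idea for the global alignment score. It assumes match = +1, mismatch = -1, gap = -1 and returns only the final score, since the rows needed for traceback are discarded.

```python
def global_score_two_rows(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only the row above and the current row."""
    previous = [j * gap for j in range(len(b) + 1)]      # row 0
    for i in range(1, len(a) + 1):
        current = [i * gap]                              # first column of row i
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            vertical = previous[j] + gap
            horizontal = current[j - 1] + gap
            current.append(max(diagonal, vertical, horizontal))
        previous = current        # the current row becomes the row above
    return previous[-1]

print(global_score_two_rows("LOVE", "MONEY"))   # -1
```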

Page 13:

Saving Time: Wavefront

• Use a parallel processor (effectively N machines at a time)

• Each reverse diagonal is computed at once

• Time is now O(N), but cost is N processors instead of 1

• A computer science theoretician would say “no savings”, but if you’re the one waiting, you might disagree!
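
The sketch below shows the wavefront ordering only; it still runs on one processor. Cells on the same reverse diagonal (same i + j) do not depend on each other, so the inner loop marks where a parallel machine could compute them simultaneously. Function names and the +1/-1/-1 scoring are assumptions.

```python
def wavefront_order(rows, cols):
    """Group the inner cells (i >= 1, j >= 1) by reverse diagonal i + j."""
    waves = []
    for d in range(2, rows + cols - 1):          # i + j runs from 2 to rows + cols - 2
        waves.append([(i, d - i) for i in range(1, rows) if 1 <= d - i < cols])
    return waves

def global_score_wavefront(a, b, match=1, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        score[0][j] = j * gap
    for i in range(rows):
        score[i][0] = i * gap
    for wave in wavefront_order(rows, cols):
        # Every cell in this wave depends only on earlier waves,
        # so this loop is the part that could run in parallel.
        for i, j in wave:
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

print(len(wavefront_order(5, 6)))                # 8 waves for LOVE vs MONEY
print(global_score_wavefront("LOVE", "MONEY"))   # -1
```

With N and M as the string lengths there are N + M - 1 waves, which is the source of the O(N) time when each wave takes one parallel step.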

Page 14:

Saving More Time: Partial Search

• In local alignment, large areas have 0’s.

• Mismatches adjacent to 0’s are also 0’s.

• To get “reasonably large” values, you need longer sequences (BLAST “words”) in common

• So, only search near where there are common subsequences

Page 15:

Finding Common Subsequences

• Pick a sequence length.

• For each subsequence of that length, find all occurrences in each sequence

• If i is the index in one sequence and j is the index in the other sequence, then fill in the region of the alignment matrix near (i, j). The position (i, j) is called the seed.
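
A minimal sketch of exact-match seeding, assuming the chosen subsequence length is k and that seeds are reported as (i, j) index pairs. The function name is illustrative; real BLAST seeding is more elaborate.

```python
from collections import defaultdict

def find_seeds(a, b, k):
    """Return all (i, j) where a[i:i+k] == b[j:j+k]."""
    positions = defaultdict(list)
    for i in range(len(a) - k + 1):
        positions[a[i:i + k]].append(i)          # index every length-k word of a
    seeds = []
    for j in range(len(b) - k + 1):
        for i in positions.get(b[j:j + k], []):  # look up each length-k word of b
            seeds.append((i, j))
    return sorted(seeds)

print(find_seeds("ACGTAC", "TACG", 3))   # [(0, 1), (3, 0)]
print(find_seeds("LOVE", "MONEY", 1))    # [(1, 1), (3, 3)]
```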

Page 16:

BLAST’s Generalization

• Consider a threshold T and a sequence S

• The neighborhood of the sequence S is the set of all sequences that score T or better against S

• BLAST uses neighborhoods to set seeds (areas of the alignment matrix to explore)
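
A minimal sketch of a neighborhood computation. As a stand-in for a real substitution matrix it assumes a DNA alphabet with match = +1 and mismatch = -1; these values and the function name are illustrative, not BLAST's actual scoring.

```python
from itertools import product

def neighborhood(word, threshold, alphabet="ACGT", match=1, mismatch=-1):
    """All same-length words whose score against `word` is at least `threshold`."""
    result = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(match if x == y else mismatch for x, y in zip(word, candidate))
        if score >= threshold:
            result.append("".join(candidate))
    return result

print(neighborhood("ACG", 3))        # ['ACG']
print(len(neighborhood("ACG", 1)))   # 10: the word itself plus 9 one-mismatch words
```

Lowering the threshold enlarges the neighborhood, and with it the number of seeds.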

Page 17:

Consequences of Choices

• Higher T’s are faster, but ignore more potential matches

• Longer sequences are less common
  – Smaller neighborhoods for a given T
  – Fewer areas to search
  – More likelihood of missing good alignments

Page 18:

T vs Sequence Size

• Longer sequences have higher maximum scores (unless normalized)

• But, longer sequences (tend to) have more likelihood of mismatches?

Page 19:

Too Many Seeds

• Even if we pick a sequence length and threshold that are sufficiently sensitive, we still might have too many seeds for reasonable alignment times.

• Two-seed solution:
  – Only consider areas of the table that contain two seeds on the same diagonal, separated by a limited distance
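
A minimal sketch of a two-seed filter. It assumes seeds are distinct (i, j) pairs, that "same diagonal" means equal i - j, and that distance is measured along the diagonal; the function and parameter names are hypothetical.

```python
from collections import defaultdict

def two_hit_seeds(seeds, max_distance):
    """Keep only seeds that have a neighbor on the same diagonal within max_distance."""
    by_diagonal = defaultdict(list)
    for i, j in seeds:
        by_diagonal[i - j].append((i, j))        # seeds with equal i - j share a diagonal
    kept = set()
    for hits in by_diagonal.values():
        hits.sort()
        for first, second in zip(hits, hits[1:]):    # consecutive seeds on one diagonal
            if second[0] - first[0] <= max_distance:
                kept.add(first)
                kept.add(second)
    return sorted(kept)

seeds = [(0, 0), (5, 5), (30, 30), (2, 7)]
print(two_hit_seeds(seeds, 10))   # [(0, 0), (5, 5)]
```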

Page 20:

Extending Alignments

• A seed region is a small alignment

• We want to “grow” the alignments (especially if we can connect to others!)

• To grow an alignment, use Smith-Waterman to compute neighboring values

• Question: when to stop growing?

Page 21:

Score Changes During Growing

• As an alignment is extended, its score changes
  – Score increases when sub-matches connect
  – Score decreases when extended into an unrelated area

• Often score must decrease before increasing!

Page 22:

When to Stop?

• Consider current score, compared to maximum score so far

• When the current score gets sufficiently small relative to the maximum, then stop

• This is another parameter with a tradeoff (stop too soon and get smaller results, stop too late and do useless work)
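
A minimal sketch of such a stopping rule. To stay short it makes several simplifying assumptions: the extension is ungapped and runs only to the right along one diagonal, "sufficiently small" is taken to mean "more than drop_off below the maximum so far", and scoring is match = +1, mismatch = -1. The function and parameter names are hypothetical.

```python
def extend_right(a, b, i, j, drop_off, match=1, mismatch=-1):
    """Extend from (i, j) along the diagonal; return (best_score, best_length)."""
    score = best_score = 0
    length = best_length = 0
    while i + length < len(a) and j + length < len(b):
        score += match if a[i + length] == b[j + length] else mismatch
        length += 1
        if score > best_score:
            best_score, best_length = score, length     # new maximum so far
        elif best_score - score > drop_off:
            break                                       # fallen too far below the maximum
    return best_score, best_length

# The score dips at the X/Y mismatch before rising again.
print(extend_right("ACGTXACGT", "ACGTYACGT", 0, 0, drop_off=2))   # (7, 9)
print(extend_right("ACGTXACGT", "ACGTYACGT", 0, 0, drop_off=0))   # (4, 4)
```

The second call stops too soon and reports the smaller alignment, which is the tradeoff described above.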

Page 23:

One more “trick”

• Suppose that there is a “standard” sequence that many people want to align against

• Run the seeding algorithm with different sequence lengths and thresholds and save the resulting seed locations

• When someone does a search, the seeding part has already been done

Page 24:

Offline vs. Online Algorithms

• Offline Algorithms
  – Execute “standardized” part of algorithm in advance, and save result
  – This is like compilation of a program

• Online Algorithm
  – Use the tables or databases you built offline to answer a specific query
  – This is like running a program
  – User sees only time taken by Online Algorithm
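
A minimal sketch of the offline/online split applied to seeding. The offline step indexes the "standard" sequence once and saves the result (here as a JSON string, purely for illustration); the online step loads that index and only has to process the query. All names are hypothetical.

```python
import json
from collections import defaultdict

def build_index(standard, k):
    """Offline step: map every length-k word of the standard sequence to its positions."""
    index = defaultdict(list)
    for i in range(len(standard) - k + 1):
        index[standard[i:i + k]].append(i)
    return dict(index)

def query_seeds(index, query, k):
    """Online step: look up each length-k word of the query in the prebuilt index."""
    return sorted((i, j)
                  for j in range(len(query) - k + 1)
                  for i in index.get(query[j:j + k], []))

saved = json.dumps(build_index("ACGTACGT", 3))   # done once, in advance
index = json.loads(saved)                        # loaded when a query arrives
print(query_seeds(index, "TACG", 3))             # [(0, 1), (3, 0), (4, 1)]
```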

Page 25:

Common Offline/Online Applications

• Web searching
  – Offline: build indexes of sites vs. keywords
  – Online: retrieve sites from the index

• Neural networks
  – Offline: train the network on many examples of the problem, set the weights
  – Online: run the network once (with fixed weights) on the specific example

Page 26:

Summary

• Smith-Waterman is exact, accurate, and time-consuming (even though it uses dynamic programming to get down to O(N^2))

• BLAST speeds up the search process, but is no longer exact, so it can miss good alignments (even the best one!)

Page 27:

Using BLAST Well

• Importance of setting parameters
  – Sequence length
  – Score threshold
  – Distance (for two-hit method)
  – Stopping condition (for growing seeded alignments)

Page 28:

Exercises

• Given the BLOSUM62 matrix at http://www.ncbi.nlm.nih.gov/Class/BLAST/BLOSUM62.txt
  – What is the neighborhood of HID with threshold 5? 10? 15?

• Create two random sequences of 20 bases each (flip two coins for each base: HH=A, TT=T, HT=C, TH=G)
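
If no coins are at hand, the flipping can be simulated. This is a minimal sketch using the coin-to-base mapping from the exercise; the fixed seed is an arbitrary choice so that reruns give the same sequences.

```python
import random

# Mapping taken from the exercise: two coin flips select one base.
COIN_TO_BASE = {"HH": "A", "TT": "T", "HT": "C", "TH": "G"}

def random_sequence(length, rng):
    """Build a sequence by simulating two coin flips per base."""
    return "".join(COIN_TO_BASE[rng.choice("HT") + rng.choice("HT")]
                   for _ in range(length))

rng = random.Random(0)            # arbitrary seed, for repeatability only
first = random_sequence(20, rng)
second = random_sequence(20, rng)
print(first)
print(second)
```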