Bioinformatics Workshop, Fall 2003 Algorithms in Bioinformatics Lawrence D’Antonio Ramapo College of New Jersey

Dec 22, 2015

Bioinformatics Workshop, Fall 2003

Algorithms in Bioinformatics

Lawrence D’Antonio

Ramapo College of New Jersey

Topics

• Algorithm basics

• Types of algorithms in bioinformatics

• Sequence alignment

• Database Searches

Algorithm basics

• What is an algorithm?

• Algorithm complexity

• P vs. NP

• NP completeness

What is an algorithm?

• An algorithm is a step-by-step procedure to solve a problem

• The word “algorithm” comes from the 9th century Islamic mathematician al-Khwarizmi

Algorithm Complexity

• If the algorithm works with n pieces of data and the number of steps is proportional to n, then we say that the running time is O(n).

• If the number of steps is proportional to log n, then the running time is O(log n).

Example

• Problem: find the largest element in a sequence of n elements.

• Solution idea: Iteratively compare size of elements in sequence.

Algorithm:

1. Initialize first element as largest.

2. For each remaining element: if the current element is larger than the largest so far, make it the new largest.

Running time: O(n)
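
The two steps above can be sketched in Python as follows. The function name `find_largest` is an illustrative choice, not something from the slides; the loop visits each element once, which is where the O(n) running time comes from.

```python
def find_largest(seq):
    """Return the largest element of a non-empty sequence in O(n) time."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    largest = seq[0]          # step 1: the first element is the largest so far
    for item in seq[1:]:      # step 2: visit each remaining element once
        if item > largest:
            largest = item
    return largest


print(find_largest([3, 9, 2, 7]))  # 9
```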

Polynomial Time

• An algorithm is said to run in polynomial time if its running time can be written in the form O(n^k) for some constant power k.

• The underlying problem is said to be of class P.

Polynomial Time Examples

• Searching

Binary Search: O(log n)

• Sorting

Quick Sort: O(n log n) on average, O(n^2) in the worst case
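
As an illustration of the O(log n) bound for searching, here is a minimal binary search sketch. It assumes the input list is already sorted; the name `binary_search` and the convention of returning -1 when the target is absent are choices made for this example. Each pass halves the remaining range, so at most about log2(n) comparisons are needed.

```python
def binary_search(sorted_items, target):
    """Return an index of target in sorted_items, or -1 if it is absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2              # middle of the remaining range
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1                  # discard the left half
        else:
            hi = mid - 1                  # discard the right half
    return -1


print(binary_search([1, 3, 5, 7, 9, 11], 7))  # 3
```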

NP Algorithms

• An algorithm is nondeterministic if it begins with guessing a solution to the problem and then verifies the guess.

• A problem is of category NP if there is a nondeterministic algorithm for that problem which runs in polynomial time.
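
The "guess, then verify" idea can be made concrete with the partition problem (split a list of numbers into two parts with equal sums). The sketch below only shows the verification half: given a guessed set of indices, it checks the guess in time linear in the input. The function name and the index-based certificate format are assumptions made for this example.

```python
def verify_partition(numbers, chosen_indices):
    """Check a guessed certificate: do the chosen items sum to half the total?"""
    chosen = set(chosen_indices)
    # Reject guesses that refer to positions outside the list.
    if not chosen <= set(range(len(numbers))):
        return False
    picked = sum(numbers[i] for i in chosen)
    return 2 * picked == sum(numbers)


numbers = [3, 1, 1, 2, 2, 1]
print(verify_partition(numbers, [0, 3]))  # True: 3 + 2 is half of 10
print(verify_partition(numbers, [0]))     # False
```

Verifying a guess is cheap; the hard part, for which no polynomial-time method is known, is finding a guess that passes.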

NP Complete

• A problem is NP-complete if it is in NP and every other problem in NP can be reduced to it in polynomial time, so a solution to this problem can be used to solve all other NP problems.

• A problem is NP-hard if it is at least as hard as the NP-complete problems

NP Complete Examples

• Traveling salesman

• Knapsack problem

• Partition problem

• Graph coloring

P = NP ?

• P ⊆ NP

• If P ≠ NP, then NP-complete problems have no polynomial-time algorithms; the best known algorithms for them take exponential time.

Polynomial vs. Exponential
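
To give a feel for the gap between polynomial and exponential growth, the following sketch tabulates n^2 against 2^n for a few input sizes. The chosen sizes and the pairing of n^2 with 2^n are illustrative assumptions, not values taken from the slide.

```python
def growth_table(sizes=(10, 20, 40, 80)):
    """Return (n, n**2, 2**n) rows comparing polynomial and exponential growth."""
    return [(n, n ** 2, 2 ** n) for n in sizes]


for n, poly, expo in growth_table():
    print(f"n={n:3d}  n^2={poly:6d}  2^n={expo}")
```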

Algorithms in Bioinformatics

• Algorithms to compare DNA, RNA, or protein sequences

• Database searches to find homologous sequences

• Sequence assembly

• Construction of evolutionary trees

• Structure prediction

Edit operations on sequences

Substitution    Insertion    Deletion
AATAAGC         AAT-AAGC     AATAAGC
ATTAAGC         AATTAAGC     AA-AAGC

What is sequence alignment?

• Compare two sequences using matches, substitutions and indels.

G A A - - T C A T

G - T G G - C A -

• 3 matches, 1 substitution, 5 indels

Complexity of DNA Problems

• 3 billion base pairs in human genome

• Many NP complete problems

• About 10^600 possible alignments for two 1000-character sequences

Types of sequence alignment

• Determine the alignment of two sequences that maximizes similarity (global alignment)

• Determine substrings of two sequences with maximum similarity (local alignment)

• Determine the alignment for several sequences that maximizes the sum of pairs similarity (multiple alignment)

Significance of Alignment

• Functional similarity

• Structural similarity

• Homology

Scoring System

• Assign a score for each possible match, substitution and indel

• Distance functions – Find alignment to minimize distance between sequences

• Similarity functions – Find alignment to maximize similarity between sequences

Edit Distance

G A A - - T C A T

G - T G G - C A -

• Similarity function: 1 for match, -1 for substitution, -2 for indel

• Score: -8
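
The score above can be checked column by column. The sketch below applies the stated similarity function (1 for a match, -1 for a substitution, -2 for an indel) to an already-aligned pair of rows; the function name and the use of "-" for a space are choices made for this example.

```python
def score_alignment(top, bottom, match=1, substitution=-1, indel=-2):
    """Sum the column scores of two aligned rows of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += indel            # a space in either row is an indel
        elif x == y:
            score += match
        else:
            score += substitution
    return score


# 3 matches, 1 substitution, 5 indels: 3 - 1 - 10 = -8
print(score_alignment("GAA--TCAT", "G-TGG-CA-"))  # -8
```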

Dynamic Programming

• Used on optimization problems

• Bottom-up approach

• Recursively builds up solution from subproblem optimal solutions

Dynamic Programming Alignment Algorithm (Needleman-Wunsch)

• Given sequences a1,a2,…,an and b1,b2,…,bm to be aligned:

• Initialize alignment matrix (aligning with spaces)

• Entry [i,j] gives optimal alignment score for sequences a1,a2,…,ai and b1,b2,…,bj (where 1 ≤ i ≤ n, 1 ≤ j ≤ m)

Computing Alignment Matrix

If a1,a2,…,ai and b1,b2,…,bj have been aligned, there are three possible next moves:

• Match a_{i+1} with b_{j+1}

• Match a_{i+1} with a space (—)

• Match b_{j+1} with a space (—)

Choose the move that maximizes the similarity of the two sequences
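
Written as a recurrence, the three moves give the following rule for filling the matrix. The symbols are assumptions introduced here for illustration: S[i,j] is the matrix entry, s(x,y) is the match or substitution score, and g is the (negative) score of aligning a character with a space.

```latex
S[0,0] = 0, \qquad S[i,0] = i \cdot g, \qquad S[0,j] = j \cdot g

S[i,j] = \max
\begin{cases}
S[i-1,j-1] + s(a_i, b_j) & \text{(match } a_i \text{ with } b_j\text{)} \\
S[i-1,j] + g             & \text{(match } a_i \text{ with a space)} \\
S[i,j-1] + g             & \text{(match } b_j \text{ with a space)}
\end{cases}
```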

Global Alignment Matrix

     —    G    G    A    C    A
—    0   -2   -4   -6   -8  -10
G   -2    1   -1   -3   -5   -7
G   -4   -1    2    0   -2   -4
G   -6   -3    0    1   -1   -3
C   -8   -5   -2   -1    2    0
A  -10   -7   -4   -1    0    3
T  -12   -9   -6   -3   -2    1

Optimal Global Alignment

G G G C A T

G G A C A —
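
A minimal Needleman-Wunsch sketch that fills the matrix and traces back one optimal alignment is shown below. It assumes the scoring used earlier in the slides (1 for a match, -1 for a substitution, -2 for a space); the function name, the return format, and the tie-breaking order in the traceback are choices made for this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (matrix, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # S[i][j] = best score for aligning a[:i] with b[:j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap                 # a[:i] aligned entirely with spaces
    for j in range(1, m + 1):
        S[0][j] = j * gap                 # b[:j] aligned entirely with spaces
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # a_i with b_j
                          S[i - 1][j] + gap,       # a_i with a space
                          S[i][j - 1] + gap)       # b_j with a space
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return S, "".join(reversed(top)), "".join(reversed(bottom))


S, top, bottom = needleman_wunsch("GGGCAT", "GGACA")
print(S[-1][-1])  # 1, the bottom-right entry of the matrix
print(top)        # GGGCAT
print(bottom)     # GGACA-
```

The two nested loops touch each matrix entry once, which is the source of the quadratic running time.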

Alignment Running Time

• Assuming two sequences n characters each

• Running time is O(n^2) (each entry of matrix must be calculated)

Variations of Alignment Algorithm

• Gap penalty

• Local alignment

• Multiple alignment

Gap Penalty

• A gap is a number k of consecutive spaces

• k consecutive spaces are more probable than k isolated spaces

• Typical gap penalty function: a + b·k (affine gap penalty)

• Here the first space in a gap is penalized a+b, further spaces are penalized b each.

Gap Penalty Example

• Use gap penalty 1 + k (a = 1, b = 1), with 1 for a match and -1 for a substitution

A - A - C - A

A C T A T C A

• Score: -6

A A C - - - A

A C T A T C A

• Score: -4
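
Both scores can be reproduced with a short scoring routine for affine gap penalties. The sketch assumes 1 for a match and -1 for a substitution alongside the gap penalty a + b·k with a = b = 1; the function name and the "-" gap symbol are choices made for this example.

```python
def score_with_affine_gaps(top, bottom, match=1, substitution=-1, a=1, b=1):
    """Score an aligned pair where a gap of k spaces is penalized a + b*k."""
    score = 0
    gap_row = None                      # row holding the current gap, if any
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            if row != gap_row:
                score -= a              # opening a new gap costs an extra a
            score -= b                  # every space in a gap costs b
            gap_row = row
        else:
            score += match if x == y else substitution
            gap_row = None              # a residue pair ends any open gap
    return score


print(score_with_affine_gaps("A-A-C-A", "ACTATCA"))  # -6: three gaps of length 1
print(score_with_affine_gaps("AAC---A", "ACTATCA"))  # -4: one gap of length 3
```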

Local Alignment

• Find conserved regions in otherwise dissimilar sequences (e.g., viral and host DNA)

• Smith-Waterman algorithm

• Includes a fourth possibility at each step (don’t align)
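
A minimal Smith-Waterman sketch follows. The fourth possibility appears as the 0 inside `max`: a cell may start a fresh alignment instead of extending a poor one. The scoring values (1, -1, -2) are carried over from the global alignment slides as an assumption, and the test sequences are made up for this example.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # fourth option: don't align
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero entry is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTACGTAA", "GGACGTCC"))  # (4, 'ACGT', 'ACGT')
```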

Local Alignment Example

• Align the following

G C T C T G C G A A T A

C G T T G A G A T A C T

Optimal Local Alignment

G C T C T G C G A A T A

C G T T G A G A T A C T

(G C T C) T G C G A A T A

(C G T) T G A G - A T A (C T)

Multiple Alignment

• Find the alignment among a set of sequences that maximizes the sum of scores for all pairs of sequences

• Dynamic programming run-time for k sequences of length n: O(k^2 2^k n^k)

• Multiple alignment is NP-complete

Other Features

• Usually used for protein alignment

• Can be used for global or local alignment

Multiple Alignment Example

P E A A L Y G R F T - - - I K S D V W

P E S L A Y N K F - - - S I K S D V W

P E A L N Y G R Y - - - S S E S D V W

P E A L N Y G W Y - - - S S E S D V W

P E V I R M Q D D N P F S F Q S D V Y

Multiple vs. Pairwise Alignment

• Optimal multiple alignment does not imply optimal pairwise alignment

Multiple alignment:    Induced pairwise alignment:
A T                    A -
A -                    - T
- T

Substitution Matrices

• In homologous sequences certain amino acid substitutions are more likely to occur than others

• Types of substitution matrices

* PAM

* BLOSUM

PAM Matrices

• Defines units of evolutionary distance

• 1 PAM unit represents an average of one mutation per 100 amino acids

• Start with a set of highly similar sequences and compute

* p_a = probability of occurrence of amino acid a

* M_ab = probability of a mutating to b

PAM Matrix Formula

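• Entries in a k-PAM matrix:

$$ 10 \log_{10} \frac{M^k_{ab}}{p_b} $$

where M^k_{ab} is the (a,b) entry of the k-th power of the mutation matrix M.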
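
To make the k-PAM entry concrete, the sketch below raises a mutation matrix to the k-th power and takes 10·log10 of each entry divided by the background probability of the target residue. The two-letter alphabet and the numbers in `toy_M` and `toy_p` are invented purely for illustration; real PAM matrices are estimated from protein data over all 20 amino acids.

```python
import math


def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    size = len(X)
    return [[sum(X[r][t] * Y[t][c] for t in range(size)) for c in range(size)]
            for r in range(size)]


def pam_scores(M, p, k):
    """Return the matrix of 10 * log10(M^k[a][b] / p[b]) scores."""
    size = len(M)
    Mk = [[1.0 if r == c else 0.0 for c in range(size)] for r in range(size)]
    for _ in range(k):                  # Mk becomes M multiplied by itself k times
        Mk = mat_mul(Mk, M)
    return [[10 * math.log10(Mk[a][b] / p[b]) for b in range(size)]
            for a in range(size)]


# Hypothetical two-letter alphabet with a 1% mutation probability per PAM unit.
toy_M = [[0.99, 0.01], [0.01, 0.99]]
toy_p = [0.5, 0.5]
for row in pam_scores(toy_M, toy_p, 1):
    print([round(value, 2) for value in row])
```

With larger k the off-diagonal scores move toward zero, reflecting that substitutions become less surprising at greater evolutionary distance.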

PAM250 Matrix

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2
G  -3   1   0  -1   1   5
N  -4   1   0  -1   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1   0   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   1   3   6
R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17

BLOSUM Matrices

• Uses log-odds ratio similar to PAM

• Uses short highly conserved sequences

• BLOSUM x matrices created after removing sequences that are more than x percent identical

• Better at local alignments

BLOSUM Matrices

• A motif is a conserved amino acid pattern found in a group of proteins with similar biological meaning (PROSITE)

• A block is a conserved amino acid pattern in a group of proteins (no spaces allowed in the pattern) (BLOCKS)

Motif Example

• Motif obtained from a group of 34 tubulin proteins

M[FYW] . . F[VLI]H . [FYW] . . EGM
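
A motif written this way maps almost directly onto a regular expression. The sketch below assumes that "." stands for any single amino acid and that a bracket lists the alternatives allowed at one position; the test sequence is made up for the example and is not a real tubulin.

```python
import re

# The motif from the slide, with the spaces between positions removed.
TUBULIN_MOTIF = re.compile(r"M[FYW]..F[VLI]H.[FYW]..EGM")


def find_motif(protein):
    """Return (start, matched_text) for every motif occurrence in protein."""
    return [(m.start(), m.group()) for m in TUBULIN_MOTIF.finditer(protein)]


# Hypothetical sequence containing one occurrence starting at position 2.
print(find_motif("AAMFAAFVHAWAAEGMKK"))  # [(2, 'MFAAFVHAWAAEGM')]
```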

Defining BLOSUM (I)

• BLOSUMn uses blocks that are n% identical (BLOSUM62 is most common)

• Consider all pairs of amino acids appearing in the same column in the blocks

Defining BLOSUM (II)

• Define n(i,j) to be the frequency that amino acids i,j appear in a column pair

• Define e(i,j) to be the frequency that amino acids i,j appear in any pair

• Define BLOSUM entry

$$ s(i,j) = \log_2 \frac{n(i,j)}{e(i,j)} $$
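
The log-odds entry can be computed from a block as sketched below. Two points are assumptions of this example rather than statements from the slides: e(i,j) is estimated from the residue frequencies implied by the observed pairs (p_i·p_i for identical residues, 2·p_i·p_j otherwise), and the three-sequence block is a toy input. Pairs that never occur in the block are simply left out, since their log-odds score is undefined.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_log_odds(block):
    """Return {(i, j): log2(n(i,j) / e(i,j))} for pairs seen in the block."""
    pair_counts = Counter()
    for column in zip(*block):                  # walk the block column by column
        for x, y in combinations(column, 2):    # every pair within a column
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    n = {pair: count / total for pair, count in pair_counts.items()}
    # Residue frequencies implied by the observed pairs.
    p = Counter()
    for (x, y), freq in n.items():
        if x == y:
            p[x] += freq
        else:
            p[x] += freq / 2
            p[y] += freq / 2
    scores = {}
    for (x, y), freq in n.items():
        expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = math.log2(freq / expected)
    return scores


toy_block = ["AC", "AC", "AD"]                  # hypothetical gap-free block
for pair, score in sorted(blosum_log_odds(toy_block).items()):
    print(pair, round(score, 3))
```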

PAM vs. BLOSUM

• PAM derived from highly similar sequences (evolutionary model)

• BLOSUM derived from protein families sharing a common ancestor (conserved domain model)

Database Searches

• FASTA

• BLAST

FASTA

• Looks for sequences in a database similar to a query sequence

• Heuristic, exclusion method

• Compares query sequence to each database sequence (called the text)

FASTA Algorithm (I)

• Look for small substrings in query and text that exactly match (“hot spots”)

• Find ten best “diagonal runs” of hot spots

Hot Spot Example

    E K L A S R K L
H
A         *
S           *
H
K               *
L                 *
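
The first FASTA step can be sketched as a word lookup followed by grouping on diagonals. The sketch assumes that the row sequence HASHKL is the query, that EKLASRKL is the text, and that hot spots are exact matches of length k = 2; the function name and return format are choices made for this example. Hot spots sharing the same value of (text position - query position) lie on the same diagonal.

```python
from collections import defaultdict


def hot_spots_by_diagonal(query, text, k=2):
    """Group exact k-letter matches between query and text by diagonal."""
    # Index every k-letter word of the query by its start positions.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    # Look up each k-letter word of the text and record the diagonal j - i.
    diagonals = defaultdict(list)
    for j in range(len(text) - k + 1):
        for i in positions.get(text[j:j + k], []):
            diagonals[j - i].append((i, j))
    return dict(diagonals)


print(hot_spots_by_diagonal("HASHKL", "EKLASRKL"))
# {-3: [(4, 1)], 2: [(1, 3), (4, 6)]}
```

Diagonal 2 collects both the AS and the KL hot spots, so it is the natural candidate for a diagonal run.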

FASTA Algorithm (II)

• Find best local alignment for each run

• Combine these into larger alignment

• Do multiple alignment on query and texts having highest score in last step

BLAST

• Basic Local Alignment Search Tool

• Heuristic, exclusion method

• Computes statistical significance of alignment scores

BLAST Algorithm

• Find all w-length substrings in text that align to some w-length substring in query with score above a given threshold (called “hits”)

• Extend these hits as far as possible (“segment pairs”)

• Report the highest scoring segment pairs
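
The three steps can be illustrated with a deliberately simple, brute-force sketch. It is not how BLAST is implemented: real BLAST uses a substitution matrix, an indexed word list, and a drop-off cutoff during extension, whereas this example assumes a match/mismatch score of 1/-1, compares every pair of w-length words directly, and extends each hit without gaps to its best-scoring length. All names and parameter values are illustrative.

```python
def blast_like_search(query, text, w=3, threshold=3, match=1, mismatch=-1):
    """Return sorted (score, query_start, text_start, length) segment pairs."""
    def pair_score(x, y):
        return match if x == y else mismatch

    def best_extension(qi, tj, step):
        """Best ungapped extension score and length going in direction step."""
        best = run = length = steps = 0
        while 0 <= qi < len(query) and 0 <= tj < len(text):
            run += pair_score(query[qi], text[tj])
            steps += 1
            if run > best:
                best, length = run, steps
            qi += step
            tj += step
        return best, length

    segment_pairs = set()
    for i in range(len(query) - w + 1):
        for j in range(len(text) - w + 1):
            seed = sum(pair_score(query[i + d], text[j + d]) for d in range(w))
            if seed < threshold:
                continue                        # not a hit
            right_score, right_len = best_extension(i + w, j + w, +1)
            left_score, left_len = best_extension(i - 1, j - 1, -1)
            segment_pairs.add((seed + left_score + right_score,
                               i - left_len, j - left_len,
                               w + left_len + right_len))
    return sorted(segment_pairs, reverse=True)  # highest score first


print(blast_like_search("GATTACA", "CCGATTACACC"))  # [(7, 0, 2, 7)]
```

Every hit on the same diagonal extends to the same segment pair here, so the set collapses them into one reported result.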

Other Bioinformatics Algorithms

• Palindromes

• Tandem Repeats

• Longest Common Subsequence

• Double Digest (NP complete)

• Shortest Common Superstring (NP complete)
