Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm
Page 1

Pairwise Sequence Alignment (II)

(Lecture for CS498-CXZ Algorithms in Bioinformatics)

Sept. 27, 2005

ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign

Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm

Page 2

Review: Dynamic Programming for LCS

-Edit graph representation of alignment

-Path = alignment

-Incrementally fill in the table

-Backtrack to find the best alignment

Page 3

The LCS Recurrence Revisited

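• The formula can be rewritten by adding zero to the edges that come from an indel, since the penalty for an indel is 0:

\[
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \quad \text{(matching score)} \\
s_{i-1,j} + 0 & \text{(insertion/deletion score)} \\
s_{i,j-1} + 0 & \text{(insertion/deletion score)}
\end{cases}
\]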

How do we improve scoring?

Page 4

How do we improve the scoring of alignments?

Can we still find an alignment efficiently?

Page 5

Outline

• Improve Scoring

– Scoring Matrix

– Affine Gap Penalty

• Variants of Alignment

– Global vs. Local alignment

• Assessing Score Significance

Page 6

Scoring Matrices

To generalize scoring, consider a (4+1) x (4+1) scoring matrix δ for nucleotide sequences.

In the case of an amino acid sequence alignment, the scoring matrix would be of size (20+1) x (20+1). The extra 1 accounts for the scores of comparisons with the gap character “-”.

This will simplify the scoring algorithm as follows:

\[
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
\]

The same dynamic programming algorithm would still work!

Page 7

The Global Alignment Problem

Find the best alignment between two strings under a given scoring matrix

Input : Strings v & w and a scoring matrix δ

Output : Alignment of maximum score

Algorithm: Dynamic programming

\[
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
\]

The only question left is how to define the scoring matrix…
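
The recurrence on this slide can be sketched in Python as follows. This is only a minimal illustration, not the lecture's own code: the function names `global_alignment` and `simple_delta`, and the default penalty values, are assumptions made for the example.

```python
def simple_delta(a, b, mu=1, sigma=2):
    """Hypothetical scoring function: +1 match, -mu mismatch, -sigma indel."""
    if a == "-" or b == "-":
        return -sigma
    return 1 if a == b else -mu


def global_alignment(v, w, delta):
    """Return (score, aligned_v, aligned_w) for a best global alignment.

    delta(a, b) is the score of aligning a with b; "-" is the gap character.
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: prefixes aligned against gaps only.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + delta(v[i - 1], "-")
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + delta("-", w[j - 1])
    # Fill in the table with the three-way recurrence.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                s[i - 1][j] + delta(v[i - 1], "-"),
                s[i][j - 1] + delta("-", w[j - 1]),
            )
    # Backtrack from (n, m) to (0, 0) to recover one best alignment.
    av, aw = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]):
            av.append(v[i - 1])
            aw.append(w[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and s[i][j] == s[i - 1][j] + delta(v[i - 1], "-"):
            av.append(v[i - 1])
            aw.append("-")
            i -= 1
        else:
            av.append("-")
            aw.append(w[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(av)), "".join(reversed(aw))


print(global_alignment("ACGT", "AGT", simple_delta))  # (1, 'ACGT', 'A-GT')
```

Any scoring matrix can be plugged in by passing a different `delta`, for example a dictionary lookup wrapped in a function.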

Page 8

Measuring Similarity

• Measuring the extent of similarity between two sequences

– Based on percent sequence identity

– Based on conservation

Page 9

Percent Sequence Identity

• The extent to which two nucleotide or amino acid sequences are invariant

A C C T G A G - A G
A C G T G - G C A G

70% identical (7 of the 10 columns match; the alignment contains 1 mismatch and 2 indels)
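
As a quick check of the 70% figure, percent identity can be computed column by column. The sketch below assumes identity is measured over all alignment columns (gaps included in the denominator); the function name `percent_identity` is made up for this example.

```python
def percent_identity(row_v, row_w):
    """Percentage of alignment columns in which both rows hold the same residue."""
    if len(row_v) != len(row_w):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for a, b in zip(row_v, row_w) if a == b and a != "-")
    return 100.0 * matches / len(row_v)


print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))  # 70.0
```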

Page 10

Simple Scoring

• When mismatches are penalized by some constant –μ, indels are penalized by some other constant –σ, and matches are rewarded with +1, the resulting score is:

#matches – μ(#mismatches) – σ (#indels)
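
The simple score can be computed directly from the two rows of an alignment. This is a small sketch; the function name `simple_score` and the values chosen for μ and σ in the usage line are illustrative assumptions.

```python
def simple_score(row_v, row_w, mu, sigma):
    """Score = #matches - mu * #mismatches - sigma * #indels."""
    matches = mismatches = indels = 0
    for a, b in zip(row_v, row_w):
        if a == "-" or b == "-":
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - sigma * indels


# 7 matches, 1 mismatch, 2 indels with mu = 1 and sigma = 2: 7 - 1 - 4
print(simple_score("ACCTGAG-AG", "ACGTG-GCAG", mu=1, sigma=2))  # 2
```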

Page 11

Making a Better Scoring Matrix

• Scoring matrices are created based on biological evidence.

• Alignments can be thought of as two sequences that differ due to mutations in the sequence.

• Some of these mutations have little effect on the organism’s function, so some penalties, δ(vi, wj), will be less harsh than others.

Page 12

Scoring Matrix: Example

       A   R   N   K
A      5  -2  -1  -1
R      -   7  -1   3
N      -   -   7   0
K      -   -   -   6

• Notice that although R and K are different amino acids, they have a positive score.

• Why? Both are positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.

Page 13

Scoring matrices

• Amino acid substitution matrices

– PAM

– BLOSUM

• DNA substitution matrices

– DNA: less conserved than protein sequences

– Less effective to compare coding regions at nucleotide level

– Simple scoring is often used

Page 14

PAM

• Point Accepted Mutation (Dayhoff et al.)

• 1 PAM = PAM1 = 1% average change of all amino acid positions

– After 100 PAMs of evolution, not every residue will have changed

• some residues may have mutated several times

• some residues may have returned to their original state

• some residues may have not changed at all

Page 15

PAMX

• PAMx = (PAM1)^x, the x-th power of the PAM1 matrix

– PAM250 = (PAM1)^250

• PAM250 is a widely used scoring matrix:

        Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ...
          A   R   N   D   C   Q   E   G   H   I   L   K ...
Ala A    13   6   9   9   5   8   9  12   6   8   6   7 ...
Arg R     3  17   4   3   2   5   3   2   6   3   2   9
Asn N     4   4   6   7   2   5   6   4   6   3   2   5
Asp D     5   4   8  11   1   7  10   5   6   3   2   5
Cys C     2   1   1   1  52   1   1   2   2   2   1   1
Gln Q     3   5   5   6   1  10   7   3   7   2   3   5
...
Trp W     0   2   0   0   0   0   0   0   1   0   1   0
Tyr Y     1   1   2   1   3   1   1   1   3   2   2   1
Val V     7   4   4   4   4   4   4   4   5   4  15  10

Think of PAM1 as 1-step transitions and PAM250 as 250-step transitions
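
The "x-step transition" view can be illustrated with a toy example. The 3-letter matrix `toy_pam1` below is entirely made up (it is not real PAM data), and `mat_mul` / `mat_pow` are helper names introduced only for this sketch; the point is just that raising a 1-step transition matrix to the 250th power gives 250-step transition probabilities.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, x):
    """Raise a square matrix to the non-negative integer power x."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while x > 0:
        if x & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        x >>= 1
    return result


# Hypothetical 1-step matrix: each residue stays unchanged with probability 0.99.
toy_pam1 = [
    [0.990, 0.007, 0.003],
    [0.004, 0.990, 0.006],
    [0.002, 0.008, 0.990],
]
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row of the result still sums to 1, but the diagonal entries are far below 0.99, reflecting the accumulated changes.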

Page 16

BLOSUM

• Blocks Substitution Matrix

• Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins

• Matrix name indicates evolutionary distance

– BLOSUMx was created using sequences sharing no more than x% identity

– E.g., BLOSUM62 <-> 62% identity

Page 17

The Blosum50 Scoring Matrix

\[
\mathrm{Val}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}
\]

p(x, y): probability of seeing x aligned with y

p(x), p(y): probability of seeing x (or y) alone
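
To see how the sign of such a log-odds entry behaves, here is a tiny sketch. The probabilities in the usage lines are invented for illustration and are not taken from BLOSUM50; the function name `log_odds` is likewise an assumption.

```python
import math


def log_odds(p_xy, p_x, p_y):
    """log( p(x,y) / (p(x) * p(y)) ): positive when x and y are seen aligned
    more often than expected by chance, negative when less often."""
    return math.log(p_xy / (p_x * p_y))


# Hypothetical background frequencies p(x) = p(y) = 0.05, so chance pairing is 0.0025.
print(log_odds(0.0100, 0.05, 0.05) > 0)  # True: pair aligned more often than chance
print(log_odds(0.0005, 0.05, 0.05) < 0)  # True: pair aligned less often than chance
```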

Page 18

Deficiency in Scoring of Indels

• A fixed penalty σ is given to every indel:

– -σ when there is 1 indel, -2σ for 2 consecutive indels, -3σ for 3 consecutive indels, etc.

This can be too severe a penalty for a series of 100 consecutive indels.

Page 19

Deficiency in Scoring of Indels (cont.)

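• In nature, indels often come as a unit spanning several positions, not just 1 nucleotide at a time.

[Figure: two alignments of the same sequences, one with its indels grouped into a single contiguous gap and one with the same number of indels scattered.]

Normal scoring would give the same score for both alignments.

In nature, the alignment with a single contiguous gap is more likely.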

Page 20

Accounting for Gaps

• Gap: a contiguous sequence of spaces in one of the rows

• The score for a gap of length x is -(ρ + σx), where ρ > 0 is the penalty for introducing a gap. ρ will be large relative to σ because you do not want to add too much of a penalty for extending the gap.

Page 21

Affine Gap Penalties

• Gap penalties:

– -ρ-σ when there is 1 indel, -ρ-2σ when there are 2 consecutive indels, -ρ-3σ when there are 3 consecutive indels, etc.

– In general, -ρ - x * σ for a gap of length x (one gap opening plus x gap extensions)

• Reduced penalties (as compared to naïve scoring) are thus given to runs of horizontal and vertical edges
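
A small numeric comparison may help. The sketch below assumes, purely for illustration, a naive per-indel penalty of 2 and an affine scheme with ρ = 4 and a smaller extension penalty σ = 0.5; with these made-up values a single indel costs more under the affine scheme, while a long run costs far less.

```python
def linear_gap_score(x, sigma):
    """Naive scoring: every indel in a run of length x costs sigma."""
    return -sigma * x


def affine_gap_score(x, rho, sigma):
    """Affine scoring: one opening penalty rho plus sigma per gap position."""
    return -(rho + sigma * x)


for x in (1, 3, 100):
    print(x, linear_gap_score(x, sigma=2), affine_gap_score(x, rho=4, sigma=0.5))
# 1 -2 -4.5
# 3 -6 -5.5
# 100 -200 -54.0
```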

Page 22

Affine Gap Penalty Recurrences

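Three scores are kept for each cell: s↓ for alignments ending with a gap in w (deletion), s→ for alignments ending with a gap in v (insertion), and s for the best alignment overall.

\[
s^{\downarrow}_{i,j} = \max \begin{cases}
s^{\downarrow}_{i-1,j} - \sigma & \text{continue gap in } w \text{ (deletion)} \\
s_{i-1,j} - (\rho + \sigma) & \text{start gap in } w \text{ (deletion)}
\end{cases}
\]

\[
s^{\rightarrow}_{i,j} = \max \begin{cases}
s^{\rightarrow}_{i,j-1} - \sigma & \text{continue gap in } v \text{ (insertion)} \\
s_{i,j-1} - (\rho + \sigma) & \text{start gap in } v \text{ (insertion)}
\end{cases}
\]

\[
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) & \text{match or mismatch} \\
s^{\downarrow}_{i,j} & \text{end deletion} \\
s^{\rightarrow}_{i,j} & \text{end insertion}
\end{cases}
\]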

Once again, the same dynamic programming algorithm would work!
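
One way to realize the affine gap recurrences is to keep three tables, one per gap state. The sketch below only computes the score (no backtracking) and treats a leading gap of length x as costing ρ + σx; the names `affine_global_score` and `match_delta` and the parameter values in the usage line are assumptions for illustration.

```python
NEG_INF = float("-inf")


def match_delta(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def affine_global_score(v, w, delta, rho, sigma):
    """Best global alignment score under the gap penalty rho + sigma * length."""
    n, m = len(v), len(w)
    mid = [[NEG_INF] * (m + 1) for _ in range(n + 1)]    # best score overall
    down = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # ends with a gap in w
    right = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with a gap in v
    mid[0][0] = 0
    for i in range(1, n + 1):
        down[i][0] = -(rho + sigma * i)
        mid[i][0] = down[i][0]
    for j in range(1, m + 1):
        right[0][j] = -(rho + sigma * j)
        mid[0][j] = right[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Continue an open gap, or start a new one from the main table.
            down[i][j] = max(down[i - 1][j] - sigma, mid[i - 1][j] - (rho + sigma))
            right[i][j] = max(right[i][j - 1] - sigma, mid[i][j - 1] - (rho + sigma))
            # Match/mismatch, or end a deletion or an insertion here.
            mid[i][j] = max(mid[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                            down[i][j], right[i][j])
    return mid[n][m]


# Six matches and one gap of length 3: 6 - (3 + 1 * 3)
print(affine_global_score("AAATTTAAA", "AAAAAA", match_delta, rho=3, sigma=1))  # 0
```

The running time stays O(nm); only the constant factor grows because three tables are filled instead of one.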

Page 23

Local vs. Global Alignment

• The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph.

• The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph.

Page 24

Local vs. Global Alignment (cont’d)

• Global Alignment

--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
  |  || |  ||  | | | |||    || |  | |  | ||||   |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C

• Local Alignment: better for finding a conserved segment

                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

Page 25

Local Alignments: Why?

• Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions.

• Example:

– Homeobox genes have a short region called the homeodomain that is highly conserved between species.

– A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence

Page 26

The Local Alignment Problem

• Goal: Find the best local alignment between two strings

• Input : Strings v, w and scoring matrix δ

• Output : Alignment of substrings of v & w whose alignment score is maximum among all possible alignments of all possible substrings

Page 27

Local Alignment in Edit Graph

[Figure: edit graph showing a global alignment path and a local alignment path.]

Compute a “mini” global alignment to get a local alignment

Page 28

The Problem with this Problem

• The problem with this approach is its long running time, O(n^4):

- There are ~n^2 vertices (i,j) that can serve as the start of a local alignment.

- For each such start vertex, computing the alignments takes O(n^2) time.

• Solution: Dynamic programming again!

• Question: How do we recursively compute the best score of any local (as opposed to global) alignment for each cell in the edit graph?

Page 29

The Local Alignment Recurrence

• The largest value of si,j over the whole edit graph is the score of the best local alignment.

• The recurrence is shown below:

\[
s_{i,j} = \max \begin{cases}
0 \\
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
\]

Notice that the added 0 is the only change from the original recurrence of a global alignment.

This is the well-known Smith-Waterman local alignment algorithm.
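
A compact sketch of the local alignment recurrence follows. It tracks the best-scoring cell while filling the table and backtracks from there until a cell with score 0 is reached. The names `local_alignment` and `simple_delta`, the penalty values, and the toy input strings are assumptions for illustration.

```python
def simple_delta(a, b, mu=1, sigma=2):
    """Hypothetical scoring function: +1 match, -mu mismatch, -sigma indel."""
    if a == "-" or b == "-":
        return -sigma
    return 1 if a == b else -mu


def local_alignment(v, w, delta):
    """Return (score, aligned_v, aligned_w) for a best local alignment."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                0,  # start a new local alignment at this cell
                s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                s[i - 1][j] + delta(v[i - 1], "-"),
                s[i][j - 1] + delta("-", w[j - 1]),
            )
            if s[i][j] > best:
                best, best_i, best_j = s[i][j], i, j
    # Backtrack from the best cell until the score drops to 0.
    av, aw = [], []
    i, j = best_i, best_j
    while s[i][j] > 0:
        if s[i][j] == s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]):
            av.append(v[i - 1])
            aw.append(w[j - 1])
            i, j = i - 1, j - 1
        elif s[i][j] == s[i - 1][j] + delta(v[i - 1], "-"):
            av.append(v[i - 1])
            aw.append("-")
            i -= 1
        else:
            av.append("-")
            aw.append(w[j - 1])
            j -= 1
    return best, "".join(reversed(av)), "".join(reversed(aw))


print(local_alignment("TTTTACGATTTT", "CCCACGACCC", simple_delta))  # (4, 'ACGA', 'ACGA')
```

The table is filled once in O(nm) time, in contrast to the O(n^4) cost of running a separate global alignment for every pair of start and end vertices.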

Page 30

Assessing Score Significance

• In general, a larger score s is more significant. The question is how large s should be.

• Factors to be considered:

– Sequence length: longer sequences are expected to give higher scores

– # sequences in the database: the score of the best alignment is expected to be higher for a larger DB

– Evolution time: longer evolution causes more mismatches, making a lower score more significant

• The Challenge is how to quantify all these…

Page 31

Two Basic Approaches

• The classical approach: Extreme value distribution (EVD)

– Assume a null (random) model M_R for scores

– P(Score > s | M_R, a(x,y)) = ? (a(x,y) = alignment of x, y)

• The Bayesian approach: Model comparison

– Assume two models for a(x,y): random R; aligned: M

– P(M|a(x,y))/P(R|a(x,y))=?

\[
\log \frac{p(M \mid a(x,y))}{p(R \mid a(x,y))}
= \underbrace{\log \frac{p(a(x,y) \mid M)}{p(a(x,y) \mid R)}}_{\text{log-odds score of the alignment}}
+ \underbrace{\log \frac{p(M)}{p(R)}}_{\text{prior}}
\]

Page 32

EVD of the Best Score in Ungapped Local Alignment

• The number N of unrelated local matches with score higher than S is approximately Poisson distributed, with mean

\[
E(S) = K m n e^{-\lambda S}
\]

where m and n are the sequence lengths and K and λ are parameters, so that

\[
p(N = n) = \frac{e^{-E(S)} E(S)^{n}}{n!}
\]

(in this last formula n is the number of matches, not a sequence length).

• The probability that there is a match of score greater than S is

\[
p(x > S) = 1 - p(N = 0) = 1 - e^{-E(S)}
\]

• K and λ can be fit using randomly generated data

• This gives a way to test statistical significance, e.g., p(x>21) = 0.01 vs. p(x>21) = 0.3
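
The two EVD formulas translate into a few lines of code. In this sketch the values of K and λ are arbitrary placeholders (in practice they would be fit to randomly generated data, as noted above), and the function names are made up for the example.

```python
import math


def expected_matches(score, m, n, K, lam):
    """E(S) = K * m * n * exp(-lam * S): expected number of chance matches."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, K, lam):
    """p(x > S) = 1 - exp(-E(S)): probability of at least one chance match."""
    return -math.expm1(-expected_matches(score, m, n, K, lam))


# Placeholder parameters, not fitted values.
K, lam = 0.1, 0.3
for s in (21, 40, 60):
    print(s, p_value(s, m=100, n=100, K=K, lam=lam))
```

For a fixed pair of sequence lengths the p-value shrinks as the score grows, and for a fixed score it grows with longer sequences, matching the factors listed on the earlier slide.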

Page 33

Bayesian Model Comparison

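\[
\begin{aligned}
\log \frac{p(M \mid a(x,y))}{p(R \mid a(x,y))}
&= \log \frac{p(a(x,y) \mid M)}{p(a(x,y) \mid R)} + \log \frac{p(M)}{p(R)} \\
&= \log \frac{\prod_{i=1}^{n} p(x_i, y_i \mid M)}{\prod_{i=1}^{n} p(x_i, y_i \mid R)} + \log \frac{p(M)}{p(R)} \\
&= \log \frac{\prod_{i=1}^{n} p(x_i, y_i \mid M)}{\prod_{i=1}^{n} p(x_i \mid R)\, p(y_i \mid R)} + \log \frac{p(M)}{p(R)} \\
&= \sum_{i=1}^{n} s(x_i, y_i) + \log \frac{p(M)}{p(R)}
\end{aligned}
\]

Assumptions:

• M is a model for related sequences

• R is a model for unrelated sequences (random)

• Ungapped alignment, n = m

• Alignment of each pair is independent

The decision question is whether the posterior log-odds is positive:

\[
\log \frac{p(M \mid a(x,y))}{p(R \mid a(x,y))}
= \underbrace{\sum_{i=1}^{n} s(x_i, y_i)}_{\text{score } S(x,y)}
+ \underbrace{\log \frac{p(M)}{p(R)}}_{\text{prior (subjective!)}}
> 0\,?
\]

The score S(x,y) corresponds to BLOSUM scoring.

This partially addresses Q1: how to design the scoring function?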
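
The decision rule can be sketched as the sum of per-column log-odds scores plus a prior term. The example assumes only two hypotheses, so that p(R) = 1 - p(M); the per-column scores and prior values below are invented for illustration, and `posterior_log_odds` is a name made up for this sketch.

```python
import math


def posterior_log_odds(pair_scores, prior_related):
    """sum_i s(x_i, y_i) + log(p(M) / p(R)), assuming p(R) = 1 - p(M)."""
    prior_term = math.log(prior_related / (1.0 - prior_related))
    return sum(pair_scores) + prior_term


scores = [1.0, 2.0, -0.5]  # hypothetical log-odds scores of three aligned pairs
print(posterior_log_odds(scores, prior_related=0.5))       # 2.5, prior term is 0
print(posterior_log_odds(scores, prior_related=0.01) > 0)  # False, prior outweighs the score
```

With a small prior probability of relatedness, as when searching a large database, the same alignment score no longer favors the related model.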

Page 34

Pairwise Alignment Summary

X = x1,…,xn

Y = y1,…,ym

Possible alignments of X and Y: A = {a1,…,ak}

Model: scoring function s: A → ℝ

Find the best alignment(s):

\[
a^{*} = \arg\max_{a} \, s(a(X, Y))
\]

Example: S(a*) = 21

Q1: How should we define s? (Modeling evolution)

Q2: How should we define A? (Application-specific)

Q3: How can we find a* quickly? (Dynamic programming)

Q4: Is the alignment biologically meaningful or just the best alignment of two unrelated sequences? (Models for scores)

Q1 & Q4 are related!

Page 35

What You Should Know

• Alignment Scoring Methods (Matrix & Gap)

• Global vs. Local alignments

• How the dynamic programming algorithm solves both local and global alignments with a number of scoring strategies

• Basic idea in assessing score significance