BNFO 602 Lecture 2 Usman Roshan
Jan 13, 2016
DNA Sequence Evolution
[Figure: a DNA sequence evolving along a tree, from one ancestor 3 million years ago to five present-day species]
-3 mil yrs: AAGACTT
-2 mil yrs: AAGGCTT, T_GACTT
-1 mil yrs: _GGGCTT, TAGACCTT, A_CACTT
today: GGCTT (Mouse), TAGCCCTTA (Monkey), TAGGCCTT (Human), ACCTT (Cat), ACACTTC (Lion)

Alignment of the present-day sequences:
TAGGCCTT (Human)
TAGCCCTTA (Monkey)
A_C_CTT (Cat)
A_CACTTC (Lion)
_G_GCTT (Mouse)
Sequence alignments
They tell us about
• Function or activity of a new gene/protein
• Structure or shape of a new protein
• Location or preferred location of a protein
• Stability of a gene or protein
• Origin of a gene or protein
• Origin or phylogeny of an organelle
• Origin or phylogeny of an organism
• And more…
Pairwise alignment
• X: ACA, Y: GACAT
• Match = 8, mismatch = 2, gap = -5

ACA--         -ACA-         --ACA         ACA----
GACAT         GACAT         GACAT         G--ACAT
2+2+2-5-5     -5+8+8+8-5    -5-5+2+2+2    2-5-5-5-5-5-5
Score = -4    Score = 14    Score = -4    Score = -28
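The column-by-column scoring used on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming gaps are written as "-" and both rows have equal length; the function name score_alignment is hypothetical, and the defaults are the slide's match, mismatch, and gap values.

```python
def score_alignment(top, bottom, match=8, mismatch=2, gap=-5):
    """Sum the per-column scores of a pairwise alignment given as two rows."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # one residue paired with a gap
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# Score the four candidate alignments of ACA and GACAT.
for top, bottom in [("ACA--", "GACAT"), ("-ACA-", "GACAT"),
                    ("--ACA", "GACAT"), ("ACA----", "G--ACAT")]:
    print(top, bottom, score_alignment(top, bottom))
```

Enumerating every alignment this way grows exponentially with sequence length, which is why the optimal alignment is found with dynamic programming instead.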
Optimal alignment
• An alignment can be specified by the traceback matrix.
• How do we determine the traceback for the highest scoring alignment?
• Needleman-Wunsch algorithm for global alignment
  – First proposed in 1970
  – Widely used in genomics/bioinformatics
  – Dynamic programming algorithm
Needleman-Wunsch
• Input:
  – X = x1x2…xn, Y = y1y2…ym
  – (X is seq2 and Y is seq1)
• Define V to be a two-dimensional matrix with len(X)+1 rows and len(Y)+1 columns.
• Let V[i][j] be the score of the optimal alignment of x1…xi and y1…yj.
• Let m be the match score, mm the mismatch score, and g the gap score. (Here m is a score, not the length of Y.)
Dynamic programming
Initialization:
  V[0][0] = 0;
  for i = 1 to len(seq2) { V[i][0] = i*g; T[i][0] = 'U'; }
  for j = 1 to len(seq1) { V[0][j] = j*g; T[0][j] = 'L'; }
Recurrence:
  for i = 1 to len(seq2) {
    for j = 1 to len(seq1) {
      V[i][j] = max { V[i-1][j-1] + m (or mm),
                      V[i-1][j] + g,
                      V[i][j-1] + g }
      if (maximum is V[i-1][j-1] + m (or mm)) then T[i][j] = 'D'
      else if (maximum is V[i-1][j] + g) then T[i][j] = 'U'
      else T[i][j] = 'L'
    }
  }
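The initialization and recurrence above can be written as a short Python function. This is a sketch under stated assumptions rather than a reference implementation: the function name nw_fill is hypothetical, the defaults are the values used in the example that follows (m = 5, mm = -4, g = -20), and ties are broken in the if/else order of the pseudocode (diagonal first, then up, then left).

```python
def nw_fill(seq1, seq2, m=5, mm=-4, g=-20):
    """Fill the Needleman-Wunsch score matrix V and traceback matrix T.

    seq2 runs along the rows and seq1 along the columns.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    V = [[0] * cols for _ in range(rows)]
    T = [[""] * cols for _ in range(rows)]
    # First column and first row: align a prefix against gaps only.
    for i in range(1, rows):
        V[i][0] = i * g
        T[i][0] = "U"
    for j in range(1, cols):
        V[0][j] = j * g
        T[0][j] = "L"
    for i in range(1, rows):
        for j in range(1, cols):
            diag = V[i - 1][j - 1] + (m if seq2[i - 1] == seq1[j - 1] else mm)
            up = V[i - 1][j] + g
            left = V[i][j - 1] + g
            best = max(diag, up, left)
            V[i][j] = best
            # Same tie-breaking order as the pseudocode: D, then U, then L.
            T[i][j] = "D" if best == diag else ("U" if best == up else "L")
    return V, T


V, T = nw_fill("GACAT", "ACA")
for row in V:
    print(row)
print("optimal global score:", V[-1][-1])
```

The bottom-right cell of V holds the score of the optimal global alignment; T records which neighbor produced each cell so the alignment itself can be recovered afterwards.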
Example
Input:
  seq2: ACA
  seq1: GACAT
m = 5, mm = -4, gap = -20
seq2 is lined along the rows and seq1 is along the columns
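V
            G     A     C     A     T
       0  -20   -40   -60   -80  -100
A    -20   -4   -15   -35   -55   -75
C    -40  -24    -8   -10   -30   -50
A    -60  -44   -19   -12    -5   -25

T
          G   A   C   A   T
          L   L   L   L   L
A     U   D   D   L   L   L
C     U   U   D   D   L   L
A     U   U   D   D   D   L

Where the diagonal move ties with a gap move (row 1 column 4, row 2 column 1, row 3 column 1), this T records the gap move, whereas the if/else order of the recurrence would record 'D'. Either choice yields an alignment with the optimal score.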
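Reading the alignment off the traceback matrix can be sketched as follows. This is an illustration only: the function name traceback is hypothetical, the matrix literal is copied from the T shown in this example (with an empty string in the top-left cell), and gaps are written as "-".

```python
def traceback(T, seq1, seq2):
    """Follow T from the bottom-right cell to (0, 0) and return the two
    alignment rows (seq2 row first, then seq1 row)."""
    i, j = len(seq2), len(seq1)
    row2, row1 = [], []
    while i > 0 or j > 0:
        move = T[i][j]
        if move == "D":      # align seq2[i-1] with seq1[j-1]
            row2.append(seq2[i - 1])
            row1.append(seq1[j - 1])
            i, j = i - 1, j - 1
        elif move == "U":    # seq2[i-1] paired with a gap
            row2.append(seq2[i - 1])
            row1.append("-")
            i -= 1
        else:                # "L": seq1[j-1] paired with a gap
            row2.append("-")
            row1.append(seq1[j - 1])
            j -= 1
    return "".join(reversed(row2)), "".join(reversed(row1))


# Traceback matrix from the example (rows: ACA, columns: GACAT).
T = [
    ["",  "L", "L", "L", "L", "L"],
    ["U", "D", "D", "L", "L", "L"],
    ["U", "U", "D", "D", "L", "L"],
    ["U", "U", "D", "D", "D", "L"],
]
top, bottom = traceback(T, "GACAT", "ACA")
print(top)
print(bottom)
```

The path visits L, D, D, D, L from the bottom-right corner, which corresponds to one gap at each end of ACA and a score of -20 + 5 + 5 + 5 - 20 = -25, the value in the bottom-right cell of V.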
Affine gap penalties
• Affine gap model allows for long insertions in distant proteins by charging a lower penalty for extension gaps. We define g as the gap open penalty (first gap) and e as the gap extension penalty (additional gaps)
• Alignment:
  – ACACCCT      ACACCCC
  – AC-CT-T      AC--CTT
  – Score = 0    Score = 0.9
• Trivial cost matrix: match=+1, mismatch=0, gapopen=-2, gapextension=-0.1
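The two scores on this slide can be checked with a small affine scoring routine. This is a sketch with assumptions: the function name affine_score is hypothetical, gaps are written as "-", and the gap placements in the two example alignments are assumed (they are chosen so that the totals agree with the scores stated on the slide). The first gap character of a run is charged the gap open penalty and each further character the extension penalty.

```python
def affine_score(top, bottom, match=1.0, mismatch=0.0,
                 gap_open=-2.0, gap_ext=-0.1):
    """Score an alignment under an affine gap model."""
    total = 0.0
    prev_gap_row = None  # which row held a gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row is an extension;
            # anything else opens a new gap.
            total += gap_ext if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            total += match if a == b else mismatch
            prev_gap_row = None
    return total


print(affine_score("ACACCCT", "AC-CT-T"))  # two separate one-character gaps
print(affine_score("ACACCCC", "AC--CTT"))  # one gap of length two
```

Two separate gaps pay the open penalty twice, while one gap of length two pays it once plus a small extension, which is why the model favors fewer, longer gaps.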
Affine penalty recurrence
\[ V(i,j) = \max\{E(i,j),\; F(i,j),\; M(i,j)\} \]
\[ M(i,j) = V(i-1,j-1) + s(x_i, y_j) \]
\[ E(i,j) = \max\{E(i,j-1) + \mathrm{ext},\; V(i,j-1) + g\} \]
\[ F(i,j) = \max\{F(i-1,j) + \mathrm{ext},\; V(i-1,j) + g\} \]
M(i,j) denotes alignments of x1..i and y1..j ending with a match/mismatch. E(i,j) denotes alignments of x1..i and y1..j such that yj is paired with a gap. F(i,j) is defined similarly, with xi paired with a gap. Here g is the gap open penalty and ext is the gap extension penalty. The recursion takes O(mn) time, where m and n are the lengths of x and y respectively.
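The three-matrix recurrence can be sketched in Python as below. This is a minimal version under assumptions: the function name affine_global_score is hypothetical, the boundary conditions (a leading gap of length k costs g + (k-1)*ext) are one common choice that the slide does not spell out, and only the optimal score is returned, not the alignment.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1.0, mismatch=0.0, g=-2.0, ext=-0.1):
    """Optimal global alignment score with affine gaps (V, E, F matrices)."""
    n, m = len(x), len(y)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y_j paired with a gap
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x_i paired with a gap
    V[0][0] = 0.0
    # Assumed boundaries: the first row and column are a single growing gap.
    for j in range(1, m + 1):
        E[0][j] = max(E[0][j - 1] + ext, V[0][j - 1] + g)
        V[0][j] = E[0][j]
    for i in range(1, n + 1):
        F[i][0] = max(F[i - 1][0] + ext, V[i - 1][0] + g)
        V[i][0] = F[i][0]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M = V[i - 1][j - 1] + s
            E[i][j] = max(E[i][j - 1] + ext, V[i][j - 1] + g)
            F[i][j] = max(F[i - 1][j] + ext, V[i - 1][j] + g)
            V[i][j] = max(E[i][j], F[i][j], M)
    return V[n][m]


print(affine_global_score("ACACCCC", "ACCTT"))
```

Each cell does a constant amount of work on three matrices, so the running time stays O(mn) as stated above.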
How do we pick gap parameters?
Structural alignments
• Recall that proteins have 3-D structure.
Structural alignment - example 1
Alignment of thioredoxins from human and fly, taken from Wikipedia. This protein is found in nearly all organisms and is essential for mammals.
PDB ids are 3TRX and 1XWC.
Structural alignment - example 2
Computer generated aligned proteins
Unaligned proteins. 2bbm and 1top are proteins from fly and chicken respectively.
Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html
Structural alignments
• We can produce high-quality alignments by hand if the structures are available.
• These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.
Benchmark alignments
• Protein alignment benchmarks
  – BAliBASE, SABMARK, PREFAB, and HOMSTRAD are frequently used in studies of protein alignment.
  – Protein benchmarks are generally large and have been available to the research community for some time.
  – BAliBASE 3.0
Biologically realistic scoring matrices
• PAM and BLOSUM are the most popular
• PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins
• BLOSUM is more recent and computed from blocks of sequences with sufficient similarity
PAM
• We need to compute the probability transition matrix M which defines the probability of amino acid i converting to j
• Examine a set of closely related sequences which are easy to align---for PAM 1572 mutations between 71 families
• Compute probabilities of change and background probabilities by simple counting
Expected accuracy alignment
• The dynamic programming formulation allows us to find the optimal alignment defined by a scoring matrix and gap penalties. This may not necessarily be the most “accurate” or biologically informative.
• We now look at a different formulation of alignment that allows us to compute the most accurate one instead of the optimal one.
Posterior probability of xi aligned to yj
• Let A be the set of all alignments of sequences x and y, and define P(a|x,y) to be the probability that alignment a (of x and y) is the true alignment a*.
• We define the posterior probability of the ith residue of x (xi) aligning to the jth residue of y (yj) in the true alignment (a*) of x and y as
\[ P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\} \]
Do et al., Genome Research, 2005
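For very short sequences the sum over all alignments can be evaluated by brute force, which makes the definition concrete. The sketch below rests on an assumption that is not part of the definition itself: P(a|x,y) is taken proportional to exp(score(a)/temperature), in the spirit of a partition function approach, with a linear gap score. The function names, the scoring values, and the temperature are all illustrative choices.

```python
import math


def enumerate_alignments(n, m, i=0, j=0):
    """Yield (pairs, gap_count) for every alignment of x[i:n] with y[j:m]."""
    if i == n and j == m:
        yield [], 0
        return
    if i < n and j < m:  # x[i] aligned to y[j]
        for pairs, gaps in enumerate_alignments(n, m, i + 1, j + 1):
            yield [(i, j)] + pairs, gaps
    if i < n:            # x[i] paired with a gap
        for pairs, gaps in enumerate_alignments(n, m, i + 1, j):
            yield pairs, gaps + 1
    if j < m:            # y[j] paired with a gap
        for pairs, gaps in enumerate_alignments(n, m, i, j + 1):
            yield pairs, gaps + 1


def posterior_matrix(x, y, match=5, mismatch=-4, gap=-20, temperature=5.0):
    """P[i][j] = sum of P(a|x,y) over alignments a that contain x_i ~ y_j."""
    n, m = len(x), len(y)
    P = [[0.0] * m for _ in range(n)]
    Z = 0.0  # normalizing constant: total weight of all alignments
    for pairs, gaps in enumerate_alignments(n, m):
        score = gaps * gap + sum(match if x[i] == y[j] else mismatch
                                 for i, j in pairs)
        weight = math.exp(score / temperature)  # assumed form of P(a|x,y)
        Z += weight
        for i, j in pairs:
            P[i][j] += weight  # indicator 1{x_i ~ y_j in a} is 1 here
    return [[p / Z for p in row] for row in P]


for row in posterior_matrix("ACA", "GACAT"):
    print([round(p, 3) for p in row])
```

The number of alignments grows exponentially with sequence length, so this enumeration is only usable on toy inputs; practical methods compute the same quantities with dynamic programming.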
Expected accuracy of alignment
• We can define the expected accuracy of an alignment a as
\[ \frac{1}{\min(|x|, |y|)} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \in a^* \mid x, y) \]
• The maximum expected accuracy alignment can be obtained by the same dynamic programming algorithm
\[ V(i,j) = \max \left\{ \begin{array}{l} V(i-1,j-1) + P(x_i \sim y_j) \\ V(i-1,j) \\ V(i,j-1) \end{array} \right\} \]
Do et al., Genome Research, 2005
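The maximum expected accuracy recurrence can be sketched as follows, given a matrix of posterior probabilities. This is an illustration under assumptions: the function name mea_alignment is hypothetical, P[i][j] is assumed to hold the posterior for the (i+1)th residue of x and the (j+1)th residue of y, and the posterior values in the usage example are made up.

```python
def mea_alignment(P):
    """Return (total posterior, aligned index pairs) of the alignment that
    maximizes the sum of posteriors of its aligned residue pairs."""
    n = len(P)
    m = len(P[0]) if n else 0
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    T = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        T[i][0] = "U"
    for j in range(1, m + 1):
        T[0][j] = "L"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = V[i - 1][j - 1] + P[i - 1][j - 1]
            up = V[i - 1][j]      # gaps add nothing
            left = V[i][j - 1]
            best = max(diag, up, left)
            V[i][j] = best
            T[i][j] = "D" if best == diag else ("U" if best == up else "L")
    # Trace back to collect the aligned pairs (0-based indices).
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if T[i][j] == "D":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif T[i][j] == "U":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return V[n][m], pairs


# Made-up posteriors for a 2-residue x and a 3-residue y.
P = [[0.9, 0.1, 0.0],
     [0.0, 0.2, 0.7]]
print(mea_alignment(P))
```

Dividing the returned total by the length of the shorter sequence gives the expected accuracy of the alignment.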
Example for expected accuracy
• True alignment
• AC_CG
• ACCCA
• Expected accuracy = (1+1+0+1+1)/4 = 1

• Estimated alignment
• ACC_G
• ACCCA
• Expected accuracy = (1+1+0.1+0+1)/4 = 0.775
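The arithmetic in this example can be reproduced with a short helper. This is a sketch, assuming the posterior values are the illustrative per-column numbers from the slide (1 for the pairs of the true alignment, 0.1 for the third residue of x aligned to the third residue of y), that columns containing a gap contribute 0, and that the sum is divided by the length of the shorter sequence. The function name and the 1-based index pairs are hypothetical conventions.

```python
def expected_accuracy(pairs, posterior, len_x, len_y):
    """Sum the posteriors of the aligned pairs and normalize by the
    length of the shorter sequence. Gap columns contribute nothing."""
    return sum(posterior.get(p, 0.0) for p in pairs) / min(len_x, len_y)


# Illustrative posteriors, indexed by (position in x, position in y), 1-based.
posterior = {(1, 1): 1.0, (2, 2): 1.0, (3, 4): 1.0, (4, 5): 1.0, (3, 3): 0.1}

true_pairs = [(1, 1), (2, 2), (3, 4), (4, 5)]       # AC_CG over ACCCA
estimated_pairs = [(1, 1), (2, 2), (3, 3), (4, 5)]  # ACC_G over ACCCA

print(expected_accuracy(true_pairs, posterior, 4, 5))
print(expected_accuracy(estimated_pairs, posterior, 4, 5))
```

The estimated alignment loses accuracy only in the one column where it pairs residues that are unlikely to be aligned in the true alignment.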
Estimating posterior probabilities
• If correct posterior probabilities can be computed then we can compute the correct alignment. It remains to estimate these probabilities from the data.
• PROBCONS (Do et al., Genome Research 2005): estimate probabilities from pairwise HMMs using forward and backward recursions (as defined in Durbin et al. 1998)
• Probalign (Roshan and Livesay, Bioinformatics 2006): estimate probabilities using partition function dynamic programming matrices
Example
Local alignment
• Global alignment recurrence:
\[ V(i,j) = \max \left\{ \begin{array}{l} V(i-1,j-1) + S(x_i, y_j) \\ V(i-1,j) + g \\ V(i,j-1) + g \end{array} \right\} \]
• Local alignment recurrence:
\[ V(i,j) = \max \left\{ \begin{array}{l} 0 \\ V(i-1,j-1) + S(x_i, y_j) \\ V(i-1,j) + g \\ V(i,j-1) + g \end{array} \right\} \]
Local alignment traceback
• Let T be the traceback matrix, V the score matrix, and m and n the lengths of the input sequences.
• Global alignment traceback:
  – Begin from T(m,n) and stop at T(0,0).
• Local alignment traceback:
  – Find i*,j* such that V(i*,j*) is the maximum over all V(i,j).
  – Begin traceback from T(i*,j*) and stop when V(i,j) = 0.
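The local alignment recurrence and its traceback can be put together in a short sketch. Assumptions: the function name local_align is hypothetical, the scoring defaults reuse the earlier example values (match 5, mismatch -4, gap -20), and the traceback starts at the highest-scoring cell and stops at the first cell whose score is 0.

```python
def local_align(x, y, match=5, mismatch=-4, g=-20):
    """Return (best score, aligned part of x, aligned part of y)."""
    n, m = len(x), len(y)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    T = [[""] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = V[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            up = V[i - 1][j] + g
            left = V[i][j - 1] + g
            V[i][j] = max(0, diag, up, left)  # 0 lets an alignment restart here
            if V[i][j] == 0:
                T[i][j] = ""
            elif V[i][j] == diag:
                T[i][j] = "D"
            elif V[i][j] == up:
                T[i][j] = "U"
            else:
                T[i][j] = "L"
            if V[i][j] > best:
                best, bi, bj = V[i][j], i, j
    # Traceback from the maximum cell until the score drops to 0.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and V[i][j] > 0:
        if T[i][j] == "D":
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif T[i][j] == "U":
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(local_align("ACA", "GACAT"))
```

For ACA and GACAT the local alignment drops the unmatched G and T that the global alignment had to pair with gaps.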
BLAST
• Local pairwise alignment heuristic
• Faster than standard pairwise alignment programs such as SSEARCH, but less sensitive.
• Online server: http://www.ncbi.nlm.nih.gov/blast
BLAST
1. Given a query q and a target sequence, find substrings of length k (k-mers) of score at least t --- also called hits. k is normally 3 to 5 for amino acids and 12 for nucleotides.
2. Extend each hit to a locally maximal segment. Terminate the extension when the reduction in score exceeds a pre-defined threshold
3. Report maximal segments above score S.
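Step 2 can be illustrated with an ungapped extension that stops once the running score has fallen too far below the best score seen. This is a simplified sketch, not BLAST's actual implementation: the function names, the +1/-1 scoring, and the drop threshold xdrop are assumptions, and the hit is given as start positions in the query and target plus the word length k.

```python
def _extend(q, t, qi, ti, step, match, mismatch, xdrop):
    """Walk from (qi, ti) in direction step (+1 or -1) without gaps.
    Return (best score gain, number of positions kept)."""
    run = best = best_len = length = 0
    while 0 <= qi < len(q) and 0 <= ti < len(t):
        run += match if q[qi] == t[ti] else mismatch
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run > xdrop:
            break  # score dropped more than xdrop below the best: stop
        qi += step
        ti += step
    return best, best_len


def extend_hit(q, t, qi, ti, k, match=1, mismatch=-1, xdrop=2):
    """Extend a k-mer hit in both directions.
    Return (query start, target start, segment length, segment score)."""
    seed = sum(match if q[qi + d] == t[ti + d] else mismatch for d in range(k))
    right_gain, right_len = _extend(q, t, qi + k, ti + k, +1, match, mismatch, xdrop)
    left_gain, left_len = _extend(q, t, qi - 1, ti - 1, -1, match, mismatch, xdrop)
    return (qi - left_len, ti - left_len,
            k + left_len + right_len, seed + left_gain + right_gain)


# Hit "ACG" at position 2 in both sequences.
print(extend_hit("TTACGTACGTAA", "GGACGTACGTCC", 2, 2, 3))
```

Only the prefix of the extension that reached the best score is kept, so trailing mismatches explored before the cutoff do not lower the reported segment score.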
Finding k-mers quickly
• Preprocess the database of sequences:
  – For each sequence in the database store all k-mers in a hash table.
  – This takes linear time
• Query sequence:
  – For each k-mer in the query sequence look up the hash table of the target to see if it exists
  – Also takes linear time
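The preprocessing and lookup steps can be sketched with a Python dictionary as the hash table. This is a simplified illustration: the function names build_kmer_index and find_hits are hypothetical, and only exact k-mer matches are found, whereas BLAST also accepts similar k-mers that score at least t.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map every k-mer to the list of (sequence id, position) where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_hits(query, index, k):
    """Return (query position, sequence id, target position) for each
    k-mer of the query that also occurs in the database."""
    hits = []
    for qpos in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict.
        for seq_id, pos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, seq_id, pos))
    return hits


database = ["GACAT", "TTACA"]
index = build_kmer_index(database, 3)
print(find_hits("ACA", index, 3))
```

Building the index touches each database position once and each query k-mer costs one expected constant-time lookup, which matches the linear time bounds stated above.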