Sequence Alignments with Indels
Evolution produces insertions and deletions (indels)
– In addition to substitutions
Good example:
MHHNALQRRTVWVNAY    MHHNALQRRTVWVNAY
MHHALQRRTVWVNAY-    MHH-ALQRRTVWVNAY
BLOSUM score = 2 (end = -6)    Score = 79 (gap = -6)
An alignment must have equal-length aligned sequences
– So, we must add gaps at the start and the ends
Finding the best indel solution is a combinatorially difficult problem
Gap
So far we have ignored gaps
A gap corresponds to an insertion or a deletion of a residue
Conventional wisdom dictates that the penalty for a gap must be several times greater than the penalty for a mutation. That is because a gap/extra residue
– Interrupts the entire polymer chain
– In DNA, shifts the reading frame
Gap Penalties
Gaps are penalised
– Write wx to indicate the penalty for a gap of length x
– For example, if each gap position scores -6, then wx = -6*x
One common scheme is
– Score -12 for opening a gap
– And -2 for every subsequent position that extends it
– i.e., wx = -12 - 2*(x-1)
Start and end gap penalties are often set to zero
– But this can leave doubt about evolutionary conclusions
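The two penalty schemes above can be written as small functions. This is a minimal sketch; the function and parameter names are illustrative assumptions, while the numbers (-6, -12, -2) are the ones quoted above.

```python
def linear_gap_penalty(x, per_position=-6):
    """Penalty w_x for a gap of length x when every gap position costs the same."""
    return per_position * x


def affine_gap_penalty(x, gap_open=-12, gap_extend=-2):
    """Penalty w_x = gap_open + gap_extend * (x - 1) for a gap of length x >= 1."""
    if x < 1:
        return 0  # no gap, so nothing to penalise
    return gap_open + gap_extend * (x - 1)


for length in (1, 2, 5):
    print(length, linear_gap_penalty(length), affine_gap_penalty(length))
```

Under the second scheme one long gap is cheaper than several short ones of the same total length, which is the usual reason for separating the opening cost from the extension cost.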
Dot Matrix Representations (Dotplots)
To help visualise best alignments
Plot a dot where each pair is the same, then draw the best line
[Figure: two dotplot panels, each with M N A L S Q L N along the horizontal axis and N A L M S Q N H down the vertical axis]
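A dotplot like the one in the figure can be built by comparing every residue pair. The sketch below assumes the simplest rule (a dot only for identical residues); the function names are illustrative.

```python
def dotplot(horizontal, vertical):
    """Return a grid of booleans: True where the two residues are identical."""
    return [[h == v for h in horizontal] for v in vertical]


def render(horizontal, vertical):
    """Draw the dotplot as text, one row per residue of the vertical sequence."""
    lines = ["  " + " ".join(horizontal)]
    for v, row in zip(vertical, dotplot(horizontal, vertical)):
        lines.append(v + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("MNALSQLN", "NALMSQNH"))
```

Runs of stars along a diagonal are the candidate matching regions; a break between two diagonals is where a gap has to go.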
Getting Alignments from Dotplot Paths
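[Figure: dotplot path with M N A L S Q L N along the horizontal axis and N A L M S Q N H down the vertical axis. Annotations: one marker indicates that M matches with a gap; another indicates that L matches with a gap.]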
Stage 1:
– Align middle
– Use triangles to indicate gaps
NAL-SQLN
NALMSQ-N
Stage 2:
– Sort the ends out
MNAL-SQLN-
-NALMSQ-NH
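Reading an alignment off a path can be sketched as follows. The one-letter move codes ("D", "H", "V") are a hypothetical encoding chosen for this example, not notation from the slides.

```python
def path_to_alignment(horizontal, vertical, moves):
    """Turn a dotplot path into two gapped strings.

    moves uses hypothetical one-letter codes:
      "D" diagonal step: pair the next residue of each sequence
      "H" horizontal step: next horizontal residue against a gap
      "V" vertical step: next vertical residue against a gap
    """
    top, bottom = [], []
    h = v = 0
    for move in moves:
        if move == "D":
            top.append(horizontal[h])
            bottom.append(vertical[v])
            h += 1
            v += 1
        elif move == "H":
            top.append(horizontal[h])
            bottom.append("-")
            h += 1
        elif move == "V":
            top.append("-")
            bottom.append(vertical[v])
            v += 1
        else:
            raise ValueError("unknown move: " + move)
    if h != len(horizontal) or v != len(vertical):
        raise ValueError("path does not use up both sequences")
    return "".join(top), "".join(bottom)


# The path behind the Stage 2 alignment above
top, bottom = path_to_alignment("MNALSQLN", "NALMSQNH", "HDDDVDDHDV")
print(top)
print(bottom)
```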
Dotplots for Real Proteins
Need a way to automatically find the best path(s)
Dynamic Programming Approach
BLAST is quick
– But not guaranteed to find the best alignment
– Gapped BLAST has indels, but no guarantee…
Dynamic Programming:
– For global alignment: the Needleman-Wunsch algorithm
Can use it to draw the Dotplot paths
– From that we can get the alignment
Mathematically guaranteed
– To find the best scoring alignment
– Given a substitution scheme (scoring scheme, e.g., BLOSUM)
– And given a gap penalty
The Needleman-Wunsch algorithm
A smart way to reduce the massive number of possibilities that need to be considered, yet still guarantees that the best solution will be found (Saul Needleman and Christian Wunsch, 1970).
The basic idea is to build up the best alignment by using optimal alignments of smaller subsequences.
The Needleman-Wunsch algorithm is an example of dynamic programming, a discipline invented by Richard Bellman (an American mathematician) in 1953!
Dynamic Programming
A divide-and-conquer strategy:
– Break the problem into smaller subproblems.
– Solve the smaller problems optimally.
– Use the sub-problem solutions to construct an optimal solution for the original problem.
Dynamic programming can be applied only to problems exhibiting the properties of overlapping subproblems and optimal substructure.
Examples include
– Travelling salesman problem
– Finding the best chess move
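To see overlapping subproblems at work on the alignment problem itself, the sketch below counts the paths through a dotplot grid (each step is diagonal, horizontal or vertical). It is an illustration chosen for this note rather than one of the examples listed above; `count_paths` is a hypothetical name. The cache means each subproblem is solved once and then reused.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_paths(m, n):
    """Number of paths across a grid for sequences of lengths m and n."""
    if m == 0 or n == 0:
        return 1  # only one way left: gaps all the way along the edge
    # The last step is diagonal, horizontal or vertical; all three
    # branches share smaller subproblems, which the cache reuses.
    return (count_paths(m - 1, n - 1)
            + count_paths(m - 1, n)
            + count_paths(m, n - 1))


for size in (1, 2, 3, 10):
    print(size, count_paths(size, size))
```

The count grows very quickly with sequence length, which is why checking every path one by one is impractical.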
Overview of Needleman-Wunsch
Four Stages
1. Initialise a matrix for the sequences
2. Fill in the entries of that matrix (call these Si,j), at the same time drawing arrows in the matrix
3. Use the arrows to find the best scoring path(s)
4. Interpret the paths as alignments as before
Illustrate with: MNALQM & NALMSQA
Stage 1: Initialising the Matrix
Draw the grid
Put in increasing gap penalties
Then put in BLOSUM scores
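The grid and its border of increasing gap penalties might be set up as below. This is a sketch assuming a gap penalty of -6 per position (the value used in the earlier example) and a layout with the horizontal sequence across the columns and the vertical sequence down the rows; `init_matrix` is a hypothetical name.

```python
def init_matrix(horizontal, vertical, gap=-6):
    """Build the grid with gap penalties along the top row and left column.

    Inner cells are left as None until Stage 2 fills them in.
    """
    rows, cols = len(vertical) + 1, len(horizontal) + 1
    S = [[None] * cols for _ in range(rows)]
    for c in range(cols):
        S[0][c] = gap * c  # top row: c leading gaps
    for r in range(rows):
        S[r][0] = gap * r  # left column: r leading gaps
    return S


S = init_matrix("MNALQM", "NALMSQA")
print(S[0])
print([row[0] for row in S])
```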
Stage 2: Putting Scores and Arrows in
Put the score in
Draw the arrow
Mathematically, we are calculating:
Where:
– Si,j is the matrix entry at (i,j) [the one we want to fill in]
– Si-1,j-1 is above and to the left of this
– s(ai,bj) is the BLOSUM score for the i-th residue from the horizontal sequence and the j-th residue from the vertical sequence (i.e., just the scores we have written in brackets)
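These definitions fit the standard Needleman-Wunsch recurrence. Written out with a single per-position gap penalty w (an assumption here; w = -6 in the earlier example), it would read:

```latex
S_{i,j} = \max
\begin{cases}
S_{i-1,j-1} + s(a_i, b_j) & \text{(diagonal: align } a_i \text{ with } b_j\text{)} \\
S_{i-1,j} + w             & \text{(from the left: } a_i \text{ against a gap)} \\
S_{i,j-1} + w             & \text{(from above: } b_j \text{ against a gap)}
\end{cases}
```

The arrow drawn into each cell points back to whichever of the three neighbours gave the maximum.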
This diagram might help:
Fill in the next row and column
A Close up View
Continue filling in the Si,j entries
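Filling in the whole matrix can be sketched as below, using the standard three-way maximum with a gap penalty of -6. The score table is an assumption: it is the subset of BLOSUM62 covering the residues of MNALQM and NALMSQA, and all names are illustrative. The matrix is stored row by row, so Si,j in the notation above is `S[j][i]` here.

```python
# Subset of BLOSUM62 for the worked example only (assumed scoring matrix).
# Keys: residues of the horizontal sequence; values follow the order of COLUMNS.
COLUMNS = "NALMSQ"
ROWS = {
    "M": [-2, -1, 2, 5, -1, 0],
    "N": [6, -2, -3, -2, 1, 0],
    "A": [-2, 4, -1, -1, 1, -1],
    "L": [-3, -1, 4, 2, -2, -2],
    "Q": [0, -1, -2, 0, 0, 5],
}


def s(a, b):
    """Substitution score for horizontal residue a against vertical residue b."""
    return ROWS[a][COLUMNS.index(b)]


def fill(horizontal, vertical, gap=-6):
    """Return the score matrix S and the arrow drawn into each cell."""
    cols, rows = len(horizontal) + 1, len(vertical) + 1
    S = [[0] * cols for _ in range(rows)]
    arrows = [[None] * cols for _ in range(rows)]
    for i in range(1, cols):
        S[0][i], arrows[0][i] = gap * i, "left"
    for j in range(1, rows):
        S[j][0], arrows[j][0] = gap * j, "up"
    for j in range(1, rows):
        for i in range(1, cols):
            candidates = [
                (S[j - 1][i - 1] + s(horizontal[i - 1], vertical[j - 1]), "diag"),
                (S[j - 1][i] + gap, "up"),
                (S[j][i - 1] + gap, "left"),
            ]
            # On a tie the first candidate wins, so diagonals are preferred.
            S[j][i], arrows[j][i] = max(candidates, key=lambda c: c[0])
    return S, arrows


S, arrows = fill("MNALQM", "NALMSQA")
for row in S:
    print(" ".join(f"{value:4d}" for value in row))
```

Only one arrow per cell is kept in this sketch; recording every tied arrow would be needed to recover all best paths.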
Stage 3: Finding the best path
Scores Si,j in the matrix
– Are the BLOSUM scores for alignments
However!
– We must take into account final gap penalties
Look down the final column and along the final row
– Find the highest scoring number
– Remembering to take off the gap penalty the correct number of times
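The end-gap adjustment can be sketched as follows. The entries in `final_column` are an assumption: they are what the standard recurrence gives for MNALQM against NALMSQA with BLOSUM62 and a gap penalty of -6, not values copied from the slide's diagram, which may differ. The final row is handled the same way.

```python
def adjusted_scores(entries, gap=-6):
    """Take off one gap penalty for every cell between an entry and the corner.

    entries lists the final column from top to bottom (or the final row from
    left to right); the last entry is the bottom-right corner.
    """
    last = len(entries) - 1
    return [score + gap * (last - k) for k, score in enumerate(entries)]


# Assumed final column for the worked example (see the note above)
final_column = [-36, -24, -14, -4, 7, 7, 2, 0]
adjusted = adjusted_scores(final_column)
print(adjusted)
print("best:", max(adjusted))
```

The raw entry 7 looks best, but once its two trailing gaps are charged it falls below the corner entry.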
Finding the best path
So, the best path is:
Stage 4: Generating the Alignment
Firstly, draw the Dotplot
Secondly, generate the alignment
Using the technique previously mentioned
– This path gives us an alignment with three gaps
M N A L - - Q M
- N A L M S Q A
S = -6 + 6 + 4 + 4 - 6 - 6 + 5 - 1 = 0
You should check that you get the same score
– As on the diagram
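Following the arrows back from the bottom-right corner can be sketched like this. The arrow names and the (row, column) keys are assumptions made for the example; the arrows listed are read off the three-gap alignment above rather than taken from the diagram.

```python
def traceback(horizontal, vertical, arrows):
    """Follow arrows from the bottom-right cell back to the top-left corner.

    arrows maps (row, column) to "diag", "up" or "left".
    """
    top, bottom = [], []
    j, i = len(vertical), len(horizontal)
    while (j, i) != (0, 0):
        arrow = arrows[(j, i)]
        if arrow == "diag":      # residue paired with residue
            top.append(horizontal[i - 1])
            bottom.append(vertical[j - 1])
            j, i = j - 1, i - 1
        elif arrow == "up":      # vertical residue against a gap
            top.append("-")
            bottom.append(vertical[j - 1])
            j -= 1
        elif arrow == "left":    # horizontal residue against a gap
            top.append(horizontal[i - 1])
            bottom.append("-")
            i -= 1
        else:
            raise ValueError("unknown arrow: " + str(arrow))
    # The path was walked backwards, so reverse both strings.
    return "".join(reversed(top)), "".join(reversed(bottom))


# Arrows on the best path only
path_arrows = {
    (7, 6): "diag", (6, 5): "diag", (5, 4): "up", (4, 4): "up",
    (3, 4): "diag", (2, 3): "diag", (1, 2): "diag", (0, 1): "left",
}
top, bottom = traceback("MNALQM", "NALMSQA", path_arrows)
print(top)
print(bottom)
```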
Other Alignments
MNALQ-M-      MNALQM--
-NALMSQA      -NALMSQA
(score=-4)    (score=-5)
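The quoted scores can be checked by summing column by column. A minimal sketch, assuming BLOSUM62 values for the residue pairs that occur and -6 for every gap position, including end gaps; the names are illustrative.

```python
# BLOSUM62 values for just the residue pairs in these alignments (assumed matrix)
PAIR_SCORES = {
    ("N", "N"): 6, ("A", "A"): 4, ("L", "L"): 4, ("Q", "Q"): 5,
    ("M", "A"): -1, ("M", "S"): -1, ("Q", "M"): 0, ("M", "Q"): 0,
}


def score_alignment(top, bottom, gap=-6):
    """Sum substitution scores and gap penalties over the aligned columns."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        else:
            total += PAIR_SCORES[(a, b)]
    return total


for top in ("MNAL--QM", "MNALQ-M-", "MNALQM--"):
    print(top, score_alignment(top, "-NALMSQA"))
```

This reproduces 0 for the three-gap alignment and -4 and -5 for the two alternatives.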
Smith-Waterman Alterations
To make the algorithm find best local alignments
Adjustments only to the scoring scheme for Si,j:
– The scoring scheme must include some negative scores for mismatches
– When Si,j becomes negative, set it to zero, so local paths are not penalised for earlier bad routes
To find best local alignment
– Find highest scoring matrix position (anywhere)
– And work backwards until a zero is reached
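The alterations above can be sketched as follows. To keep the example short it assumes a simple match/mismatch score (+2/-1) and a gap penalty of -2 instead of BLOSUM, and the test sequences are made up for the illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned part of a, aligned part of b) for one best local alignment."""
    rows, cols = len(b) + 1, len(a) + 1
    S = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for j in range(1, rows):
        for i in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # Negative totals are reset to zero
            S[j][i] = max(0, S[j - 1][i - 1] + pair,
                          S[j - 1][i] + gap, S[j][i - 1] + gap)
            if S[j][i] > best:
                best, best_pos = S[j][i], (j, i)
    # Work backwards from the highest entry until a zero is reached
    top, bottom = [], []
    j, i = best_pos
    while S[j][i] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if S[j][i] == S[j - 1][i - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            j, i = j - 1, i - 1
        elif S[j][i] == S[j - 1][i] + gap:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
        else:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("XXNALYY", "ZZNALWW"))
```

The flanking residues never enter the result: any path through them would start from a cell that was reset to zero.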
Local and Global Alignments
Needleman & Wunsch: best global alignments
Smith & Waterman: best local alignments
For illustration purposes only
– Calculations done slightly differently (don’t worry)