We want to be able to say things like “this serine is phosphorylated in the database protein, so in my homologous protein the corresponding serine is likely to be phosphorylated too”.
That requires that the green serine and the purple serine both come from a common ancestor that was phosphorylated too.
And that, in turn, requires that both serines are located at the same location in their respective structures.
To know if positions in two different proteins are equivalent, we need to know both protein structures and compare them with protein structure comparison software.
But by the time you have solved one or two protein structures the four years of your PhD period are over...
So, we need a short-cut, and that, ladies and gentlemen, will be a sequence alignment (i.e. BLAST + ...).
Sequence alignment is a simple concept. You only have to find out which pairs of residues in two homologous sequences are derived from the same residue in the common ancestor.
To score the quality of an alignment you need ‘something’ that compares amino acids: a matrix.
The matrix contains scores for pairs of residues.
So, for protein/protein comparisons we need a 20 x 20 matrix of similarity scores where identical amino acids and those of similar character give higher scores compared to those of different character.
(And next week you will learn which residues are similar)
Not all amino acids are equal
•Residues mutate more easily to similar ones
•Residues at surface mutate more easily
•Aromatics mutate preferably into aromatics
Mutations tend to favor some substitutions
•Core tends to be hydrophobic
Selection tends to favor some substitutions
•Cysteines are dangerous at the surface
•Cysteines in bridges seldom mutate
Given the frequency of Leu and Val in my sequences, and the frequency of mutations, do I see more mutations of V → L than I would expect by chance alone?
Score of mutation A → B = log (observed A → B mutations / expected A → B mutations)
This is called a log odds score and can be negative, zero, or positive. Zero means no information, no contribution to the score of the alignment.
When using a log odds matrix, the total score of the alignment is given by the sum of the scores for each aligned pair of residues.
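The two ideas above (a log odds score per residue pair, and an alignment score that is the sum over aligned pairs) can be sketched in a few lines of Python. This is only an illustration: the two short "sequences" are made up, the base-2 logarithm is an assumption (the slides only say "log"), and the function names are hypothetical. Pairs that never occur in the toy alignment get no score at all, because the log of zero is undefined.

```python
import math
from collections import Counter


def log_odds_scores(seq_a, seq_b):
    """Return {(a, b): log2(observed / expected)} from one gap-free alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")

    pair_counts = Counter()
    for a, b in zip(seq_a, seq_b):
        # Count every aligned pair in both orders so the matrix is symmetric.
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
    total_pairs = sum(pair_counts.values())

    residue_counts = Counter(seq_a + seq_b)
    total_residues = sum(residue_counts.values())

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total_pairs
        # Expected by chance: product of the two background frequencies.
        expected = (residue_counts[a] / total_residues) * (
            residue_counts[b] / total_residues
        )
        scores[(a, b)] = math.log2(observed / expected)
    return scores


def alignment_score(seq_a, seq_b, scores):
    """Total score = sum of the scores of all aligned residue pairs."""
    return sum(scores[(a, b)] for a, b in zip(seq_a, seq_b))


# Toy data (made up): V pairs with V more often than chance predicts.
toy_scores = log_odds_scores("VVVL", "VVLL")
toy_total = alignment_score("VVVL", "VVLL", toy_scores)
```

In this toy alignment the V/V pair gets a positive score (seen more often than expected) and the V/L pair a negative one (seen less often than expected).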
This log odds matrix is called PAM 1. An evolutionary distance of 1 PAM (point accepted mutation) means there has been 1 point mutation per 100 residues
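PAM 1 may be used to generate matrices for greater evolutionary distances by multiplying it repeatedly by itself. Strictly, it is the underlying mutation probability matrix that is multiplied; the log odds are taken afterwards.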
PAM250:
– 2.5 mutations per residue.
– Equivalent to 20% matches remaining between two sequences, i.e. 80% of the amino acid positions are observed to have changed (one or more times).
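Why do 2.5 mutations per residue still leave some positions unchanged? Because mutations hit some positions several times and can mutate back. A minimal sketch, assuming a toy model in which every amino acid changes with probability 0.01 per PAM step and is equally likely to become any of the other 19 (the real Dayhoff matrix has unequal rates, so the numbers differ):

```python
ALPHABET_SIZE = 20   # 20 amino acids
P_CHANGE = 0.01      # 1 accepted point mutation per 100 residues


def uniform_pam1(size=ALPHABET_SIZE, p_change=P_CHANGE):
    """Toy 1-PAM mutation probability matrix (NOT the real Dayhoff data)."""
    off_diagonal = p_change / (size - 1)
    return [
        [1.0 - p_change if i == j else off_diagonal for j in range(size)]
        for i in range(size)
    ]


def mat_mul(x, y):
    """Plain matrix product of two square lists of lists."""
    y_columns = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in y_columns] for row in x]


def mat_power(matrix, n):
    """Multiply the matrix by itself repeatedly, n steps in total."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


toy_pam250 = mat_power(uniform_pam1(), 250)
# Probability that a residue is the same amino acid after 250 PAM steps.
fraction_unchanged = toy_pam250[0][0]
```

Under this uniform toy model `fraction_unchanged` comes out lower than the 20% quoted for PAM250, but it is clearly above zero, which is the point: the matrix keeps track of multiple hits and back-mutations.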
Matrices based on the Dayhoff model of evolutionary rates are derived from alignments of sequences that are at least 85% identical; that might not be optimal…
An alternative approach has been developed by Henikoff and Henikoff using local multiple alignments of more distantly related sequences.
Question: What database sequences are most similar to (or contain the most similar regions to) my own sequence?
•BLAST finds the highest scoring locally optimal alignments between a query sequence and all database sequences.
•Very fast algorithm
•Can be used to search extremely large databases
•Sufficiently sensitive and selective for most purposes
•Robust – the default parameters can usually be used
The program first looks for a series of short, highly similar fragments. It then extends these matching segments in both directions by adding residues. Residues are added until the incremental score drops below a threshold.
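The seed-and-extend idea can be sketched as follows. This is a heavily simplified illustration, not BLAST itself: seeds are exact shared words (BLAST also accepts similar words), the scoring is a made-up match/mismatch scheme instead of a real matrix, no gaps are allowed, and the stop rule (give up once the running score has fallen more than `max_drop` below the best score so far) is one possible reading of the threshold described above. All names and parameter values are hypothetical.

```python
def find_seeds(query, subject, word_size=3):
    """Yield (query_pos, subject_pos) for every exact word shared by both."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            yield i, j


def pair_score(a, b, match=2, mismatch=-1):
    """Toy scoring scheme standing in for a real substitution matrix."""
    return match if a == b else mismatch


def extend(query, subject, i, j, word_size=3, max_drop=3):
    """Extend a seed without gaps; return (score, q_start, q_end, s_start)."""
    best = sum(pair_score(query[i + k], subject[j + k]) for k in range(word_size))

    # Extend to the right of the seed.
    running, right, k = best, 0, word_size
    while i + k < len(query) and j + k < len(subject):
        running += pair_score(query[i + k], subject[j + k])
        if running > best:
            best, right = running, k - word_size + 1
        elif best - running > max_drop:
            break
        k += 1

    # Extend to the left of the seed, starting from the best score so far.
    running, left, k = best, 0, 1
    while i - k >= 0 and j - k >= 0:
        running += pair_score(query[i - k], subject[j - k])
        if running > best:
            best, left = running, k
        elif best - running > max_drop:
            break
        k += 1

    return best, i - left, i + word_size + right, j - left


def best_hsp(query, subject, word_size=3, max_drop=3):
    """Highest-scoring extended segment over all seeds, or None if no seed."""
    hits = [
        extend(query, subject, i, j, word_size, max_drop)
        for i, j in find_seeds(query, subject, word_size)
    ]
    return max(hits) if hits else None


# Toy sequences (made up) sharing the fragment WPWQVTL.
toy_query = "KKKWPWQVTLEEE"
toy_subject = "DDDWPWQVTLHHH"
toy_hit = best_hsp(toy_query, toy_subject)
```

Every seed inside the shared fragment extends to the same segment, so the search only has to stumble on one short word to recover the whole match. That is where the speed comes from.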
•Enter your query sequence (cut-and-paste)
•Select the database(s) you want to search
And, optionally:
•Choose output parameters
•Choose alignment parameters (scoring matrix, filters,….)
Example query:
>something
AFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFC
GGSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDEDNIQVLRIAKVFKQPKYSILTVNND
ITLLKLASPARYSQTISAVCLPSVDDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
NAECKRSWGRRLTDVMICGAASGVSSCMGDSGGPLVCQKDGAYTLVAIVSWASDTCSASS
GGVYAKVTKIIPWVQKILSSN
P-value (probability): Relates the score for an alignment to the likelihood that it arose by chance. The closer to zero, the greater the confidence that the hit is real.
E-value (expect value): The number of alignments with a score at least as high as the one observed that would be expected by chance in that database (e.g. if E=10, 10 matches with scores this high are expected to be found by chance). A match will be reported if its E is below the threshold. Lower E thresholds are more stringent and report fewer matches.
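How score, database size, E and P hang together can be sketched with the Karlin-Altschul formula E = K · m · n · e^(−λS), where m is the query length, n the database length and S the raw alignment score, together with P = 1 − e^(−E). The values of λ and K below are placeholders only; the real ones depend on the scoring matrix and gap penalties, and the function names are hypothetical.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """E = K * m * n * exp(-lambda * S); lam and k are placeholder values."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Toy numbers: a 250-residue query against a database of 1e8 residues.
e_low_score = expect_value(50, 250, 1e8)
e_high_score = expect_value(60, 250, 1e8)
```

Two things to take from the sketch: a higher score gives a much lower E (the dependence is exponential), and the same score becomes less significant when the database grows. For small values E and P are practically the same number; for large E, P simply approaches 1.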
Many sequences contain repeats or stretches that consist predominantly of one type of amino acid.
E.g. many nuclear proteins have a poly-asparagine tail, membrane proteins often consist mainly of hydrophobic amino acids, and many binding proteins have proline-rich stretches.
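Such stretches can be hidden from the search by masking them before the comparison. A minimal sketch of the idea, assuming a sliding window whose Shannon entropy measures how varied the residues are. The window length, the entropy cut-off and the function names are arbitrary illustrative choices, and this is not the filter that BLAST itself applies.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every residue inside a low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < min_entropy:
            for pos in range(start, start + window):
                masked[pos] = "X"
    return "".join(masked)


# Toy sequence (made up): a varied stretch, a poly-asparagine run, a varied stretch.
varied = "ACDEFGHIKLMNPQRSTVWY"
toy_sequence = varied + "N" * 15 + varied
toy_masked = mask_low_complexity(toy_sequence)
```

The poly-N run is replaced by X, so it can no longer produce high-scoring but biologically meaningless hits. A few residues flanking the run are masked as well, because windows that overlap it are still dominated by N.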