Statistical Significance of Alignments
Dr Avril Coghlan [email protected]
Note: this talk contains animations which can only be seen by downloading and using ‘View Slide show’ in Powerpoint
Biological importance of alignments
• A sequence alignment represents a hypothesis about the homology of individual positions in different sequences
Hypothesis: Y in seqs 1,2 is homologous to E in seqs 3,4
• Based on an alignment, we quantify similarity
• Sequence similarity suggests a shared evolutionary history
Furthermore, proteins with very similar sequences probably have similar biological functions
• Once we have an alignment between 2 sequences, we can calculate their similarity over their lengths
A measure of similarity is percent identity, ie. number of identical amino acids * 100 / length of the alignment
eg. the alignment below is 39 amino acids long, & the human & fruitfly sequences differ at 1 position
→ Human & fruitfly sequences have a percent identity of (38*100/39 =) 97% in this part of the Eyeless PAX domain
[Figure: alignment of part of the Eyeless PAX domain from human, mouse, cat, sea squirt and fruitfly, columns 1 to 39. Human and fruitfly Eyeless proteins differ at one marked position.]
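The percent identity calculation can be sketched in a few lines of Python. This is only an illustration: the function name percent_identity and the use of "-" as the gap character are assumptions, not something the slides specify, and the two short sequence pairs are toy inputs.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity = identical columns * 100 / length of the alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # Count columns where both rows carry the same residue.
    # Assumption: "-" marks a gap, and a gap never counts as a match.
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return identical * 100 / len(aligned_a)


# Toy 10-residue pairs: 9 of 10 and 1 of 10 columns are identical.
print(percent_identity("VIVALASVEG", "VIVAVASVEG"))  # 90.0
print(percent_identity("VIVALASVEG", "TSYAVFGRTW"))  # 10.0

# The arithmetic for a 39-column alignment with 1 difference.
print(round(38 * 100 / 39))  # 97
```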
Similarity versus homology
• Homologues are similar because they had a common ancestor eg. eyeless homologues
• After aligning two sequences, we can say they are 99% similar, or 50% similar, etc.
• Any 2 random sequences are similar to some extent, so similarity doesn’t necessarily imply homology
V I V A L A S V E G
V I V A V A S V E G    90% similar

V I V A L A S V E G
T S Y A V F G R T W    10% similar
Very similar sequences are probably homologues
Sequences with very low similarity may be homologues
Similarity versus homology
• Two girls are either sisters or not
• Two sequences are either homologues or not

V I V A L A S V E G
V I V A V A S V E G    90% similar
“90% homologous”: Incorrect!

V I V A L A S V E G
T S Y A V F G R T W    10% similar
“10% homologous”: Incorrect!
A key question is:
• How does one interpret minimal similarity?
Are the sequences actually related, or is the alignment by chance?

Q K G S Y Q E K G Y C
|     |             |
Q Q E S G P V R S T C
Statistical analysis of alignments
• We’ve calculated the score for the best alignment between 2 sequences A and B, but is it due to chance or biology?
• Sequences accumulate substitutions over millions of years, so it is sometimes hard to decide if 2 sequences are homologous
• Unrelated sequences may be somewhat similar due to chance
In humans, mutations in the PTCH2 gene are a cause of brain tumours and skin cancers
In the nematode Caenorhabditis elegans, the tra-2 gene functions in development to determine the sex of the embryo. C. elegans adults can be male (make sperm) or hermaphrodite (make sperm & eggs)
Alignment of human PTCH2 & Caenorhabditis elegans TRA2 (score = 136):
Are human PTCH2 and C. elegans tra-2 homologues?
Statistical significance of the alignment
• To decide if two sequences are likely to be homologues (related), we calculate the statistical significance of the alignment score
• To do this, we first need a null model (background model), ie. a statistical model that will let us calculate what we expect
There are many proteins in all the different species
2 randomly chosen proteins are expected to be unrelated
Our null model should therefore describe the alignment scores expected for pairs of unrelated sequences
• How can we know the alignment scores for pairs of unrelated protein sequences?
We could generate random protein sequences, & calculate alignment scores for pairs of random protein sequences
We can use a multinomial model to generate random protein sequences
ie. make a roulette wheel with different fractions of the wheel labelled for each of the 20 amino acids
Then spin the wheel n times to make a random protein sequence that is n amino acids long
In this multinomial model, p(P)=0.14, p(A)=0.28, p(W)=0.14, p(H)=0.14, p(E)=0.28
All the other amino acids have probabilities of zero here
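Spinning the roulette wheel n times corresponds to drawing n letters independently with fixed probabilities. A minimal Python sketch, using the five probabilities from the example wheel; the names WHEEL and random_sequence are illustrative, not from the slides:

```python
import random

# Probabilities from the example wheel. As written they sum to 0.98, which is
# harmless here because random.choices treats weights as relative.
WHEEL = {"P": 0.14, "A": 0.28, "W": 0.14, "H": 0.14, "E": 0.28}


def random_sequence(probabilities, n, rng=None):
    """Spin the wheel n times: draw n letters independently."""
    if rng is None:
        rng = random.Random()
    letters = list(probabilities)
    weights = [probabilities[letter] for letter in letters]
    return "".join(rng.choices(letters, weights=weights, k=n))


# A seeded generator makes the draw repeatable between runs.
seq = random_sequence(WHEEL, 20, random.Random(0))
print(seq)
```

Amino acids that are absent from the dictionary have probability zero, so they never appear in the generated sequence.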
• A good multinomial model for random sequences should take into account the sequence composition
eg. we could use a multinomial model to generate random sequences of the same composition as C. elegans TRA2
ie. make a roulette wheel where the fraction of the circle labelled with each of the 20 amino acids is set equal to the % of that amino acid in the TRA2 sequence
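Setting the wheel fractions from a real sequence amounts to counting how often each amino acid occurs. A rough sketch follows; EXAMPLE_SEQ is a made-up stand-in (it is not the TRA2 sequence), and the function names composition and random_like are assumptions for illustration.

```python
import random
from collections import Counter

# Hypothetical short sequence used only for illustration (NOT the real TRA2).
EXAMPLE_SEQ = "MKLLAAGE"


def composition(sequence):
    """Fraction of the sequence made up by each amino acid present in it."""
    counts = Counter(sequence)
    return {aa: count / len(sequence) for aa, count in counts.items()}


def random_like(sequence, rng=None):
    """Random sequence of the same length, drawn from the same composition."""
    if rng is None:
        rng = random.Random()
    model = composition(sequence)
    letters = list(model)
    weights = [model[aa] for aa in letters]
    return "".join(rng.choices(letters, weights=weights, k=len(sequence)))


print(composition(EXAMPLE_SEQ))
print(random_like(EXAMPLE_SEQ, random.Random(0)))
```

Each generated sequence has the expected composition only on average; any single draw will differ somewhat from the original counts.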
• One way to see if an alignment score is statistically significant is to compare it to the scores for alignments of random sequences
We make a random sequence of the same length and amino acid composition as one of our original 2 sequences (eg. TRA2)
ie. use our ‘TRA2’ multinomial model to make a sequence
Alignment of human PTCH2 & a random sequence generated using a multinomial model (with the probabilities of amino acids set equal to their fractions in TRA2) (score = 51):
• We can generate 200 random sequences using our TRA2-like multinomial model
For each random sequence, we can calculate the best alignment score for the random sequence and human PTCH2
Compare the scores obtained with the score seen for PTCH2 & TRA2 eg.
What fraction of the random sequences have a score equal to or higher than that for TRA2 & PTCH2? eg. 0.95 in the picture
This method can be used to estimate the significance of alignments in the form of P-values, eg. P=0.05 in the picture
We accept the alignment as significant (indicating probable homology) if the score is in the top 5% (or another chosen value) of the scores for random sequences, ie. if P ≤ 0.05
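[Figure: histogram of alignment scores for random sequences (x-axis: alignment score; y-axis: number of alignments of random sequences). Marked on the plot: the alignment score for proteins PTCH2 & TRA2, and a region labelled "5% of scores for alignments of random sequences".]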
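The whole procedure (score the real pair, generate random sequences, count how many score at least as high) can be sketched as below. The slides do not say which alignment algorithm or scoring scheme was used, so the Smith-Waterman local alignment with match=2, mismatch=-1 and a linear gap penalty of -2 is purely an assumption, as are the function names and the toy sequences; a real analysis would normally use a substitution matrix.

```python
import random
from collections import Counter


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (assumed toy scoring)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if x == y else mismatch)
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def empirical_p_value(fixed, model_seq, n_random=200, rng=None):
    """Estimate P as the fraction of random sequences scoring >= the real pair.

    Random sequences have the length and composition of model_seq
    (a multinomial model) and are each aligned to the fixed sequence.
    """
    if rng is None:
        rng = random.Random()
    observed = local_alignment_score(fixed, model_seq)
    counts = Counter(model_seq)
    letters = list(counts)
    weights = [counts[aa] for aa in letters]
    hits = 0
    for _ in range(n_random):
        rand_seq = "".join(rng.choices(letters, weights=weights, k=len(model_seq)))
        if local_alignment_score(fixed, rand_seq) >= observed:
            hits += 1
    return observed, hits / n_random


# Toy sequences for illustration only (not PTCH2 or TRA2).
score, p = empirical_p_value("MKVLAAGIVALLLAAG", "MKVLSAGIVGLLLASG",
                             rng=random.Random(0))
print(score, p)
```

With n_random random sequences, the smallest non-zero P-value this can report is 1/n_random, so more random sequences are needed to resolve very small P-values.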
eg. for human PTCH2 and C. elegans TRA2:
The alignment score is 136
When 200 random sequences (generated with a ‘TRA2’ multinomial model) were aligned to PTCH2, only 0.36% of alignments had a score of ≥136
Therefore, we estimate a P-value of P=0.0036, ie. we estimate that the probability of getting a score of 136 or higher for PTCH2 and TRA2 due to chance is 0.0036 (36/10,000)
Human PTCH2 and C. elegans tra-2 are probably homologues
Alignment of human PTCH2 & C. elegans TRA2 (score = 136):
In the example below, 0.95 of the random sequences have an alignment score equal to or higher than that for A & B, so P=0.95
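[Figure: histogram of alignment scores for random sequences (x-axis: alignment score; y-axis: number of alignments of random sequences). Marked on the plot: the alignment score for a different A & B, and a region labelled "95% of scores for alignments of random sequences".]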
Alignment of fruitfly Eyeless & C. elegans TRA2 (score = 78):
P = 1, so eyeless and tra-2 are probably not homologues
Further Reading
• Chapter 3 in Introduction to Computational Genomics, Cristianini & Hahn
• Chapter 6 in Deonier et al book Computational Genome Analysis
• Practical on alignment in R in the Little Book of R for Bioinformatics: https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html