Statistical significance of alignments

Dr Avril [email protected]

Note: this talk contains animations which can only be seen by downloading and using ‘View Slide show’ in Powerpoint

Biological importance of alignments

• A sequence alignment represents a hypothesis about the homology of individual positions in different sequences:
Hypothesis: Y in seqs 1,2 is homologous to E in seqs 3,4

• Based on an alignment, we quantify similarity

• Sequence similarity suggests a shared evolutionary history
Furthermore, proteins with very similar sequences probably have similar biological functions

• Once we have an alignment between 2 sequences, we can calculate their similarity over their lengths
A measure of similarity is percent identity, ie. number of identical amino acids * 100 / length of the alignment
eg. the alignment below is 39 amino acids long, & the human & fruitfly sequences differ at 1 position
→ Human & fruitfly sequences have a percent identity of (38*100/39 =) 97% in this part of the Eyeless PAX domain

[Figure: alignment of part of the Eyeless PAX domain from human, mouse, cat, sea squirt and fruitfly, columns 1 to 39. Human and fruitfly Eyeless proteins differ at one marked position.]
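
The percent identity calculation above can be sketched in a few lines of Python. This is a minimal sketch rather than the method used to make the slide: the function name `percent_identity` is hypothetical, and it assumes the two sequences are already aligned (same length, with "-" for gaps).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity = identical positions * 100 / length of the alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # Count columns where both rows hold the same residue; gap columns never count.
    identical = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return identical * 100 / len(aligned_a)


# Toy 10-column alignment with 1 mismatch.
print(percent_identity("VIVALASVEG", "VIVAVASVEG"))  # 90.0

# The slide's arithmetic: 38 identical positions in a 39-column alignment.
print(round(38 * 100 / 39))  # 97
```

Dividing by the alignment length (rather than by the length of one sequence) follows the definition given above; other definitions of percent identity exist and give slightly different values when the alignment contains gaps.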

Similarity versus homology

• Homologues are similar because they had a common ancestor eg. eyeless homologues

• After aligning two sequences, we can say they are 99% similar, or 50% similar, etc.

• Any 2 random sequences are similar to some extent, so similarity doesn’t necessarily imply homology

V I V A L A S V E G
V I V A V A S V E G   90% similar

V I V A L A S V E G
T S Y A V F G R T W   10% similar

Very similar sequences are probably homologues

Sequences with very low similarity may be homologues

Similarity versus homology

• Two girls are either sisters or not

• Two sequences are either homologues or not

V I V A L A S V E G
V I V A V A S V E G   90% similar (“90% homologous” is incorrect!)

V I V A L A S V E G
T S Y A V F G R T W   10% similar (“10% homologous” is incorrect!)

A key question is:

• How does one interpret minimal similarity?
Are the sequences actually related, or is the alignment by chance?

Q K G S Y Q E K G Y C
|     |             |
Q Q E S G P V R S T C

Statistical analysis of alignments

• We’ve calculated the score for the best alignment between 2 sequences A and B, but is it due to chance or biology?

• Sequences accumulate substitutions over millions of years, so it is sometimes hard to decide if 2 sequences are homologous

• Unrelated sequences may be somewhat similar due to chance

In humans, mutations in the PTCH2 gene are a cause of brain tumours and skin cancers

In the nematode Caenorhabditis elegans, the tra-2 gene functions in development to determine the sex of the embryo
C. elegans adults can be male (make sperm) or hermaphrodite (make sperm & eggs)

Alignment of human PTCH2 & Caenorhabditis elegans TRA2 (score = 136):

Are human PTCH2 and C. elegans tra-2 homologues?

Statistical significance of the alignment

• To decide if two sequences are likely to be homologues (related), we calculate the statistical significance of the alignment score

• To do this, we first need a null model (background model), ie. a statistical model that will let us calculate what we expect by chance
There are many proteins in all the different species
2 randomly chosen proteins are expected to be unrelated
Our null model should therefore describe the alignment scores expected for pairs of unrelated sequences

• How can we know the alignment scores for pairs of unrelated protein sequences?
We could generate random protein sequences, & calculate alignment scores for pairs of random protein sequences
We can use a multinomial model to generate random protein sequences
ie. make a roulette wheel with different fractions of the wheel labelled for each of the 20 amino acids
Then spin the wheel n times to make a random protein sequence that is n amino acids long

In this multinomial model, p(P)=0.14, p(A)=0.28, p(W)=0.14, p(H)=0.14, p(E)=0.28
All the other amino acids have probabilities of zero here
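
Spinning the wheel n times can be sketched in Python with the standard `random` module. This is only an illustration under stated assumptions: the names `WHEEL` and `random_protein` are hypothetical, and the probabilities are the rounded values from the example wheel above (they sum to 0.98), so the sketch treats them as relative weights.

```python
import random

# Probabilities from the example wheel; every other amino acid has probability 0.
# The values are rounded, so they are used as relative weights (assumption).
WHEEL = {"P": 0.14, "A": 0.28, "W": 0.14, "H": 0.14, "E": 0.28}


def random_protein(model: dict[str, float], n: int, rng: random.Random) -> str:
    """Spin the wheel n times: draw n residues independently from the model."""
    residues = list(model)
    weights = [model[r] for r in residues]
    return "".join(rng.choices(residues, weights=weights, k=n))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(random_protein(WHEEL, 20, rng))
```

Each position is drawn independently of its neighbours, which is what makes this a multinomial model: only the overall amino acid fractions are modelled, not the order of residues.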

• A good multinomial model for random sequences should take into account the sequence composition
eg. we could use a multinomial model to generate random sequences of the same composition as C. elegans TRA2
ie. make a roulette wheel where the fraction of the circle labelled with each of the 20 amino acids is set equal to the % of that amino acid in the TRA2 sequence
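
Setting the wheel fractions from a sequence's composition can be sketched as follows. The function name `composition_model` is hypothetical, and the 7-residue input is a toy stand-in, since the real TRA2 sequence is not reproduced in these slides.

```python
from collections import Counter


def composition_model(sequence: str) -> dict[str, float]:
    """Return each residue's fraction of the sequence, for use as wheel fractions."""
    if not sequence:
        raise ValueError("sequence is empty")
    counts = Counter(sequence)
    return {residue: count / len(sequence) for residue, count in counts.items()}


# Hypothetical toy sequence, NOT the TRA2 protein.
toy = "PAWHEAE"
model = composition_model(toy)
print(model)  # fractions of 1/7 for P, W, H and 2/7 for A, E
```

Amino acids that never occur in the input sequence get no entry, which is equivalent to giving them a probability of zero on the wheel.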

• One way to see if an alignment score is statistically significant is to compare it to the scores for alignments of random sequences
We make a random sequence of the same length and amino acid composition as one of our original 2 sequences (eg. TRA2)
ie. use our ‘TRA2’ multinomial model to make a sequence

Alignment of human PTCH2 & a random sequence generated using a multinomial model (with the probabilities of amino acids set equal to their fractions in TRA2) (score = 51):

• We can generate 200 random sequences using our TRA2-like multinomial model

For each random sequence, we can calculate the best alignment score for the random sequence and human PTCH2

Compare the scores obtained with the score seen for PTCH2 & TRA2

What fraction of the random sequences have a score equal to or higher than that for TRA2 & PTCH2? eg. 0.05 in the picture

This method can be used to estimate the significance of alignments in the form of P-values, eg. P=0.05 in the picture

We accept the alignment as significant (indicating probable homology) if the score is in the top 5% (or another chosen value) of the scores for random sequences, ie. if P ≤ 0.05

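[Figure: histogram of alignment scores for random sequences. x-axis: alignment score; y-axis: number of alignments of random sequences. The alignment score for proteins PTCH2 & TRA2 is marked, with 5% of the scores for alignments of random sequences at or above it.]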
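
The whole procedure (generate random sequences, align each one, count how many score at least as high as the real alignment) can be sketched in Python. This is a sketch under several assumptions, not the method used for the slides: the slides do not say which alignment algorithm or scoring scheme was used, so the code uses Smith-Waterman local alignment with a simple match/mismatch/gap scheme, whereas real protein alignments normally use a substitution matrix. All function names and the toy sequences are hypothetical.

```python
import random
from collections import Counter


def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def random_like(template: str, rng: random.Random) -> str:
    """Random sequence with the template's length, drawn from its residue fractions."""
    counts = Counter(template)
    residues = list(counts)
    weights = [counts[r] for r in residues]
    return "".join(rng.choices(residues, weights=weights, k=len(template)))


def empirical_p_value(query: str, subject: str,
                      n_random: int = 200, seed: int = 0) -> float:
    """Fraction of random sequences that score >= the real query/subject score."""
    rng = random.Random(seed)
    observed = local_alignment_score(query, subject)
    hits = sum(
        local_alignment_score(query, random_like(subject, rng)) >= observed
        for _ in range(n_random)
    )
    return hits / n_random


# Hypothetical toy sequences, NOT PTCH2 or TRA2.
query = "HEAGAWGHEEPAWHEAE"
subject = "PAWHEAEHEAGAWGHE"
p = empirical_p_value(query, subject)
print(p, "significant" if p <= 0.05 else "not significant")
```

With only 200 random sequences the smallest non-zero estimate is 1/200 = 0.005, so more random sequences are needed to resolve smaller P-values.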

eg. for human PTCH2 and C. elegans TRA2:
The alignment score is 136
When 200 random sequences (generated with a ‘TRA2’ multinomial model) were aligned to PTCH2, only 0.36% of the alignments had a score of ≥136
Therefore, we estimate a P-value of P=0.0036
ie. we estimate that the probability of getting a score of ≥136 for PTCH2 and TRA2 due to chance is 0.0036 (36/10,000)

Human PTCH2 and C. elegans tra-2 are probably homologues

Alignment of human PTCH2 & C. elegans TRA2 (score = 136):

In the example below, 0.95 of the random sequences have an alignment score equal to or higher than that for A & B, so P=0.95

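[Figure: histogram of alignment scores for random sequences. x-axis: alignment score; y-axis: number of alignments of random sequences. The alignment score for a different A & B is marked, with 95% of the scores for alignments of random sequences at or above it.]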

Alignment of fruitfly Eyeless & C. elegans TRA2 (score = 78):

P = 1
→ eyeless and tra-2 are probably not homologues

Further Reading

• Chapter 3 in Introduction to Computational Genomics, by Cristianini & Hahn

• Chapter 6 in Computational Genome Analysis, by Deonier et al

• Practical on alignment in R in the Little Book of R for Bioinformatics: https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html
