Scoring Matrices
June 19, 2008
Learning objectives: Understand how scoring matrices are constructed.
Workshop: Use different BLOSUM matrices in the Dotter program to determine their effects on output when comparing squid p53 and human p53. Create your own scoring matrix and use it to compare two protein sequences. Explain to the instructor the rationale behind your scoring matrix.
Scoring Matrices
Scoring matrices appear in all analyses involving sequence comparisons. Each scoring matrix implicitly represents a particular theory of relationships. Understanding the theory underlying a given scoring matrix helps in choosing the proper matrix.
Scoring Matrices
When we consider scoring matrices, we encounter the convention that matrices have numeric indices corresponding to the rows and columns of the matrix.
For example, M12 refers to the entry at the first row and the second column. In general, Mij refers to the entry at the ith row and the jth column.
Two major scoring matrices for amino acid sequence comparisons
PAM: derived from sequences known to be closely related (e.g. chimpanzee and human). Generally ranges from PAM 1 to PAM 500.
BLOSUM: derived from sequences not closely related (e.g. E. coli and human). Ranges from BLOSUM 10 to BLOSUM 100.
The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix
Started by Margaret Dayhoff, 1978.
A series of matrices describing the extent to which two amino acids have been interchanged in evolution.
The PAM 1 scoring matrix was obtained by aligning very similar sequences. Other PAMs were obtained by mathematical extrapolation.
Assumptions of the model: 1) neighbor independence; 2) positional independence; and 3) historical independence.
Dayhoff, M. O., Atlas of Protein Sequence and Structure Natl. Biomed. Res. Found., Silver Spring MD, 1978.
Protein families used to construct Dayhoff’s scoring matrix
Protein               PAMs per 100 million years
IgG kappa C region    37
Kappa casein          33
Serum albumin         26
Cytochrome c          0.9
Histone H3            0.14
Histone H4            0.10
Calculation of relative mutability of amino acid
1. Find the frequency with which an amino acid changes at a certain position in a protein.
2. Divide this “change frequency” by the frequency with which the amino acid occurs in all proteins. This gives the mutability of the amino acid.
3. Multiply the alanine mutability by a factor that gives the value 100.
4. Multiply the 19 other amino acid mutabilities by the same factor.
Result: relative mutabilities.
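The calculation of relative mutabilities can be sketched in a few lines of Python. This is only an illustration: the three-letter alphabet, the counts, and the function name `relative_mutabilities` are hypothetical and are not Dayhoff's data.

```python
def relative_mutabilities(changes, occurrences, reference="A"):
    """Scale mutabilities so that the reference amino acid (alanine) gets 100."""
    # Mutability = how often an amino acid changes / how often it occurs.
    mutability = {aa: changes[aa] / occurrences[aa] for aa in changes}
    # One common factor puts the reference amino acid at exactly 100.
    factor = 100.0 / mutability[reference]
    return {aa: m * factor for aa, m in mutability.items()}

# Hypothetical counts for three amino acids (A = Ala, N = Asn, W = Trp).
changes = {"A": 30, "N": 40, "W": 5}
occurrences = {"A": 300, "N": 300, "W": 250}

rel = relative_mutabilities(changes, occurrences)
for aa in sorted(rel):
    print(aa, round(rel[aa]))
```

With these made-up counts, the amino acid that changes more often than alanine ends up above 100 and the one that changes less often ends up below 100.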
[Figure: numbers of accepted point mutations, multiplied by 10. Axes: original amino acids and replacement amino acids.]
Creation of a mutation probability matrix
The accepted point mutation data from the previous slide and the mutability of each amino acid are combined to create a mutation probability matrix.
Mij = (mj × Aij) / (sum over all i of Aij)
Here mj is the mutability of amino acid j and Aij is the number of accepted point mutations between amino acids i and j.
Mij is the probability that an original amino acid j (in columns) will be replaced by amino acid i (in rows) over a defined evolutionary interval. For PAM 1, an average of 1% of amino acids have changed.
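A minimal sketch of this construction, assuming a toy three-letter alphabet with hypothetical counts and mutabilities. The mutabilities are assumed to be already scaled to probabilities for the chosen interval, and the diagonal entry (the probability that amino acid j does not change) is set to 1 - mj so that each column sums to 1; both choices are assumptions of this sketch.

```python
def mutation_probability_matrix(counts, mutability):
    """Return M where M[i][j] is the probability that original j becomes i.

    counts[i][j]  : accepted point mutations between i and j (i != j)
    mutability[j] : probability that j changes during the interval
    """
    aas = sorted(mutability)
    M = {i: {} for i in aas}
    for j in aas:
        # Sum over all replacement amino acids i for this original j.
        column_total = sum(counts[i][j] for i in aas if i != j)
        for i in aas:
            if i != j:
                M[i][j] = mutability[j] * counts[i][j] / column_total
        # Assumption: the diagonal is the probability of no change.
        M[j][j] = 1.0 - mutability[j]
    return M

# Hypothetical symmetric counts and mutabilities, not Dayhoff's data.
counts = {
    "A": {"N": 4, "W": 1},
    "N": {"A": 4, "W": 3},
    "W": {"A": 1, "N": 3},
}
mutability = {"A": 0.010, "N": 0.013, "W": 0.002}

M = mutation_probability_matrix(counts, mutability)
for j in sorted(mutability):
    print(j, sum(M[i][j] for i in M))  # each column sums to 1
```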
[Figure: PAM 1 mutation probability matrix, with original amino acids in columns and replacement amino acids in rows. The values in each column sum to 10,000.]
The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix
Observed % amino acid difference    Evolutionary distance in PAMs
1                                   1
5                                   5
10                                  11
20                                  23
40                                  56
50                                  80
60                                  112
70                                  159
80                                  246
Final Scoring Matrix is the Log-Odds Score Matrix
S(a,b) = 10 log10(Mab/Pb)
Here a and b are the original and replacement amino acids, Mab is the corresponding number in the mutation probability matrix (from PAM 250), and Pb is the normalized frequency of amino acid b.
Example with alanine in both sequences: S(alanine, alanine) = 10 log10(0.13/0.087) = 1.7 (round to 2)
At this evolutionary distance, there is a 13% chance that the second sequence will also have an alanine.
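The alanine arithmetic can be checked directly. The function name `log_odds_score` is only illustrative; the two input numbers are the ones quoted in the example.

```python
import math

def log_odds_score(m_ab, p_b):
    """10 * log10 of the mutation probability over the background frequency."""
    return 10 * math.log10(m_ab / p_b)

# 0.13: chance that alanine is still alanine; 0.087: frequency of alanine.
score = log_odds_score(0.13, 0.087)
print(round(score, 1), round(score))
```

A ratio above 1 gives a positive score (the pairing is seen more often than expected by chance); a ratio below 1 gives a negative score.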
Summary of PAM Scoring Matrix
PAM = a unit of evolution (1 PAM = an average of 1 accepted point mutation per 100 amino acids).
An accepted mutation is a point mutation that has become fixed.
Based on a comparison of 71 groups of closely related proteins (>85% identity), yielding 1,572 changes.
Different PAM matrices are derived from the PAM 1 matrix by matrix multiplication.
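As a rough illustration of the matrix multiplication step, the sketch below raises a toy "PAM 1" matrix to the 250th power. The 3 x 3 matrix is hypothetical (a real PAM 1 matrix is 20 x 20), and the helper names `mat_mul` and `mat_power` are made up for this example. Entry [i][j] is taken to be the probability that original amino acid j is replaced by amino acid i, so each column sums to 1.

```python
def mat_mul(X, Y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(M, n):
    """Multiply M by itself n times, starting from the identity matrix."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result

# Hypothetical column-stochastic matrix for a three-letter alphabet.
pam1 = [
    [0.990, 0.008, 0.001],
    [0.008, 0.987, 0.001],
    [0.002, 0.005, 0.998],
]

pam250 = mat_power(pam1, 250)
for row in pam250:
    print([round(x, 3) for x in row])
```

The columns of the result still sum to 1, but the diagonal entries are much smaller than in the toy PAM 1 matrix, reflecting the larger evolutionary distance.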
The matrices are converted to log odds matrices.
BLOSUM Matrix (BLOcks SUbstitution Matrices)
BLOSUM: created from the BLOCKS database.
A series of matrices describing the extent to which two amino acids are interchangeable in conserved structures of proteins.
The number in the series represents the threshold percent identity between sequences that are considered in the calculation
(e.g. BLOSUM62 is built from sequences that are no more than 62% identical).
BLOSUM
BLOSUMs are built from distantly related sequences within conserved blocks of sequences
BLOSUMs are built from the BLOCKS database (the BLOCKS database is a secondary database that derives information from the PROSITE Family database)
BLOSUM (cont.1)
Version 8.0 of the Blocks Database consists of 2884 blocks based on 770 protein families documented in PROSITE.
1. To build the BLOSUM 62 matrix, one must eliminate sequences that are identical at more than 62% of their positions. This is done either by removing sequences from the block or by finding a cluster of similar sequences and replacing the cluster with a single representative sequence.
2. Next, the probability that a pair of amino acids is placed in the same column is calculated. On the previous page this would be the probability of replacement of A with A, A with B, A with C, and B with C. This gives the value qij.
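3. Next, one calculates the frequency with which each amino acid occurs, fi, and from these the frequency eij with which the pair would be expected to align by chance (fi × fi for identical amino acids, 2 × fi × fj for different ones).
Building BLOSUM Matrices (cont.)
4. Finally, we calculate the log odds ratio si,j = log2 (qij/eij). This value is entered into the matrix.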
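The four steps can be sketched on toy data as follows. Everything here is illustrative: the segments are hypothetical, the greedy clustering is a simplification, and the expected pair frequency is assumed to be computed from the individual amino acid frequencies in the manner of Henikoff and Henikoff (f × f for identical amino acids, 2 × f × f for different ones). Published BLOSUM matrices also scale and round the scores, which this sketch does not do.

```python
import math
from collections import Counter
from itertools import combinations

def percent_identity(s1, s2):
    """Percent of positions at which two equal-length segments agree."""
    matches = sum(1 for a, b in zip(s1, s2) if a == b)
    return 100.0 * matches / len(s1)

def representatives(block, threshold=62.0):
    """Step 1: greedily cluster segments that exceed the identity threshold
    and keep the first member of each cluster as its representative."""
    clusters = []
    for seq in block:
        for cluster in clusters:
            if any(percent_identity(seq, m) > threshold for m in cluster):
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return [cluster[0] for cluster in clusters]

def pair_frequencies(block):
    """Step 2: frequency of each unordered amino acid pair within columns."""
    pairs = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pairs[tuple(sorted((a, b)))] += 1
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()}

def log_odds(q):
    """Steps 3 and 4: amino acid frequencies, then log2(observed/expected)."""
    f = Counter()
    for (a, b), freq in q.items():
        # Each pair contributes half of its frequency to each member.
        f[a] += freq / 2
        f[b] += freq / 2
    scores = {}
    for (a, b), freq in q.items():  # pairs never observed get no score here
        expected = f[a] * f[b] if a == b else 2 * f[a] * f[b]
        scores[(a, b)] = math.log2(freq / expected)
    return scores

# Hypothetical segments: the first three are more than 62% identical.
raw_block = ["AVLKGDE", "AVLKGDQ", "AVIRGDE", "TWMRPHE"]
print(representatives(raw_block))

# Hypothetical block scored directly (clustering skipped so that the
# pair counts stay easy to check by hand).
block = ["ASAW", "ASAW", "ASAW", "ASSW"]
scores = log_odds(pair_frequencies(block))
for pair in sorted(scores):
    print(pair, round(scores[pair], 2))
```

In this toy block, identical pairs come out with positive scores and the A/S pair with a negative score, because A and S align with each other less often than their individual frequencies would predict.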
Which BLOSUM to use?
BLOSUM    Identity
80        80%
62        62% (usually the default value)
35        35%
If you are comparing sequences that are very similar, use BLOSUM 80.