BLOSUM Scoring Matrices
• BLOck SUbstitution Matrix
• Based on comparisons of Blocks of sequences derived from the Blocks database
• The Blocks database contains multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins (local alignment versus global alignment)
• BLOSUM matrices are derived from blocks clustered at the percent identity given by the matrix number (e.g., BLOSUM 62 is derived from blocks in which sequences sharing at least 62% identity in ungapped alignment are clustered and counted as a single sequence)
• BLOSUM 62 is the default matrix for the standard protein BLAST program
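The percent-identity threshold in a BLOSUM matrix name can be made concrete with a small sketch. The function name `percent_identity` and the two toy segments are illustrative assumptions, not part of the Blocks database or of any BLOSUM software; the sketch only shows how identity is measured between two ungapped segments of equal length.

```python
def percent_identity(seg_a: str, seg_b: str) -> float:
    """Percent of positions at which two ungapped, equal-length segments agree."""
    if len(seg_a) != len(seg_b):
        raise ValueError("ungapped segments must have the same length")
    if not seg_a:
        raise ValueError("segments must not be empty")
    # Count the columns where both segments carry the same residue.
    matches = sum(a == b for a, b in zip(seg_a, seg_b))
    return 100.0 * matches / len(seg_a)


# Toy (made-up) segments: 4 of 5 positions agree.
identity = percent_identity("ACDEF", "ACDEY")
print(identity >= 62.0)  # True for this pair, so it would meet a 62% threshold
```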
BLOSUM Background
• Prosite database: “dictionary of sites and patterns in proteins”; linked to the Swiss-Prot database
• Goal is to identify “biologically significant” patterns in protein families (with special emphasis on those regions thought to be important to protein function)
• Tries to find good “discriminators” that emphasize reliable identification of known family members while excluding known non-members
• Prosite patterns: signature “motifs”
• Example: Helicase proteins
• involved in unwinding and opening of DNA strands in preparation for transcription
• “Werner’s syndrome”: mutation in a helicase causes affected individuals to age at an accelerated rate
• Hundreds of helicases from different organisms have been sequenced; much of what we know about how they work comes from computer-assisted analysis of these sequences
BLOSUM Background (continued)
• Motifs: features conserved across all sequences from a family (e.g., helicases) or across different subsets of them
• These motifs can be used to search protein/DNA databases to discover previously unknown members
• “Family” typically defined by function: helicases share in common the property ofhelping to unwind DNA
By finding new helicases and asking what they have in common, we can better understand their mechanics
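As a rough illustration of searching sequences with a motif, the sketch below translates a simple PROSITE-style pattern into a Python regular expression. It uses the PROSITE ATP/GTP-binding site motif A (P-loop) pattern, `[AG]-x(4)-G-K-[ST]`, which occurs in many helicases. The function `prosite_to_regex` is a hypothetical helper that handles only a subset of the PROSITE syntax (single residues, `x`, `[...]`, `{...}`, and repeat counts), and the protein sequence is made up for the example.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden},
# optionally followed by a repeat count such as (4) or (2,4).
_ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        residue, low, high = m.groups()
        if residue == "x":                 # x means any amino acid
            core = "."
        elif residue.startswith("{"):      # {P} means any residue except P
            core = "[^" + residue[1:-1] + "]"
        else:                              # a single residue or an [AG]-style set
            core = residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)


p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
toy_sequence = "MSTLAGPSGSGKSTLLK"  # made-up sequence, not a real protein
hit = re.search(p_loop, toy_sequence)
print(p_loop, hit.start(), hit.group())
```

A real database search would scan every sequence in the database this way and report each hit; the regular expression only captures the pattern, not the scoring or statistics that dedicated tools add.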
Why blocks?
• Need to have a multiple alignment; easier to align with similar sequences
• Don’t want insertions and deletions to complicate estimation of substitution probabilities
• Interested in detecting conserved regions of protein sequences, so restrict attention to these regions when computing the scoring matrix
Henikoff and Henikoff (1991) developed a database of “blocks” based on sequences with shared motifs (>2,000 blocks of aligned sequence segments from >500 groups of related proteins)
E.g.:
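1. Count pair frequencies for each pair of amino acids i and j, for each column k of each block:

$$c_{ij}^{(k)} = \begin{cases} \dbinom{n_i^{(k)}}{2} = \dfrac{n_i^{(k)}\left(n_i^{(k)} - 1\right)}{2}, & i = j \\[1ex] n_i^{(k)}\, n_j^{(k)}, & i \neq j \end{cases}$$

where $n_i^{(k)}$ = the number of times residue i was observed in column k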
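The pair-counting step can be sketched in a few lines of Python. This assumes the usual counting rule: a column with n_i copies of residue i contributes n_i(n_i - 1)/2 pairs of i with itself and n_i * n_j pairs of i with a different residue j. The function names and the toy block are illustrative assumptions, not taken from the Blocks database.

```python
from collections import Counter
from itertools import combinations_with_replacement


def column_pair_counts(column: str) -> Counter:
    """Count unordered residue pairs (i, j), with i <= j, within one block column."""
    n = Counter(column)  # n[i] = number of times residue i appears in the column
    counts = Counter()
    for i, j in combinations_with_replacement(sorted(n), 2):
        if i == j:
            counts[(i, j)] = n[i] * (n[i] - 1) // 2  # pairs of identical residues
        else:
            counts[(i, j)] = n[i] * n[j]             # pairs of different residues
    return counts


def block_pair_counts(block: list[str]) -> Counter:
    """Sum the column pair counts over every column of an ungapped block."""
    total = Counter()
    for column in zip(*block):  # each column holds one residue per sequence
        total.update(column_pair_counts("".join(column)))
    return total


# A column with 9 A residues and 1 S residue: 36 AA pairs, 9 AS pairs, 0 SS pairs.
print(column_pair_counts("AAAAAAAAAS"))

# Toy (made-up) block of three aligned segments, two columns wide.
print(block_pair_counts(["AS", "AT", "AS"]))
```

A column of s sequences always contributes s(s - 1)/2 pairs in total, which is a convenient sanity check on the counts.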
Just as with the PAM matrix, we will compute the BLOSUM score as the (log) ratio of the observed probability of substitution of one amino acid by another to the probability expected purely due to chance. First the numerator: