Scoring Matrices

COMPUTATIONAL BIOLOGY

B.Tech – BioTech (VIth Semester)

Module 2

Scoring Matrices

INTRODUCTION

• It is assumed that the sequences being sought share an evolutionary ancestral sequence with the query sequence.

• The best guess at the actual path of evolution is the path that requires the fewest evolutionary events.

• Not all substitutions are equally likely, so they should be weighted to account for this.

• Insertions and deletions are less likely than substitutions and should be weighted to account for this.

INTRODUCTION

• A substitution is more likely to occur between amino acids with similar biochemical properties.

• For example, the hydrophobic amino acids isoleucine (I) and valine (V) get a positive score in substitution matrices, reflecting the likelihood that one will substitute for the other.

• In contrast, the hydrophobic amino acid isoleucine gets a negative score with the hydrophilic amino acid cysteine (C), because this substitution is far less likely to occur in a protein.

• Thus matrices are used to estimate how well two residues of given types would match if they were aligned in a sequence alignment.
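As a minimal sketch of how such scores are used, the snippet below sums pair scores over an ungapped alignment. The score values are illustrative assumptions chosen only to match the signs described above (positive for I/V, negative for I/C); the names SCORES, pair_score and alignment_score are hypothetical.

```python
# Illustrative substitution scores for three residues. The numbers are
# assumptions chosen only to match the signs described in the text.
SCORES = {
    ("I", "I"): 4, ("V", "V"): 4, ("C", "C"): 9,
    ("I", "V"): 3, ("I", "C"): -1, ("V", "C"): -1,
}


def pair_score(a, b):
    """Look up the score for aligning residue a with residue b (symmetric)."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def alignment_score(seq1, seq2):
    """Sum pair scores over an ungapped alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(pair_score("I", "V"))           # 3
print(pair_score("I", "C"))           # -1
print(alignment_score("IVC", "VVI"))  # 3 + 4 + (-1) = 6
```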

IMPORTANCE OF SCORING MATRICES

• Scoring matrices appear in all analyses involving sequence comparison.

• The choice of matrix can strongly influence the outcome of the analysis.

• Scoring matrices implicitly represent a particular theory of evolution.

• Understanding the theory underlying a given scoring matrix can aid in making a proper choice.

TYPES OF SCORING MATRICES

• An amino-acid scoring matrix is a 20x20 table indexed by amino acids, so that position X,Y in the table gives the score of aligning amino acid X with amino acid Y.

• Identity matrix – exact matches receive one score and non-exact matches a different score (for example, 1 on the diagonal and 0 everywhere else).

• Mutation data matrix – a scoring matrix compiled from observed protein mutation rates: some mutations are observed more often than others (PAM, BLOSUM).

• Physical properties matrix – amino acids with similar biophysical properties receive a high score.

• Genetic code matrix – amino acids are scored based on similarities in their coding triplets (codons).
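The simplest of these types, the identity matrix, can be sketched in a few lines. This is only an illustration: the nested-dictionary layout and the names AMINO_ACIDS and identity_matrix are assumptions, not a standard API.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard one-letter codes


def identity_matrix(alphabet=AMINO_ACIDS, match=1, mismatch=0):
    """Build a table with `match` on the diagonal and `mismatch` elsewhere."""
    return {x: {y: (match if x == y else mismatch) for y in alphabet}
            for x in alphabet}


M = identity_matrix()
print(len(M), len(M["A"]))       # 20 20
print(M["I"]["I"], M["I"]["V"])  # 1 0
```

Position X,Y is read as M[X][Y]. The other matrix types would fill the same 20x20 layout with different numbers.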

Matrices used

PSSM = Position Specific Scoring Matrices

PAM matrices

BLOSUM (BLOck Substitution Matrices)

• Publication

– Henikoff and Henikoff, 1992

• Motivation

– PAM matrices do not capture the difference between short-term and long-term mutations

• Method

– For several degrees of sequence divergence, derive mutations from a set of related proteins

– BLOSUM-k is based on related proteins with k% identity or less

BLOSUM METHOD

• Use Blocks – collections of multiple alignments of similar segments without gaps

• Cluster together sequences whenever more than k% identical residues are shared

• Count number of substitutions across different clusters (in the same family)

• Estimate frequencies using the counts
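The four steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the published procedure: the toy block is hypothetical, clustering is plain single linkage, each cluster is weighted as if it were a single sequence, and the final conversion of frequencies into matrix scores is not shown.

```python
from collections import defaultdict
from itertools import combinations


def percent_identity(a, b):
    """Percentage of positions at which two equal-length segments agree."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(block, k):
    """Single-linkage clustering: link sequences sharing more than k% identity."""
    parent = list(range(len(block)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(block)), 2):
        if percent_identity(block[i], block[j]) > k:
            parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(len(block)):
        groups[find(i)].append(i)
    return list(groups.values())


def pair_frequencies(block, k):
    """Weighted counts of residue pairs seen between different clusters,
    normalised so that the frequencies sum to 1."""
    counts = defaultdict(float)
    for ca, cb in combinations(cluster(block, k), 2):
        weight = 1.0 / (len(ca) * len(cb))  # each cluster counts as one sequence
        for i in ca:
            for j in cb:
                for x, y in zip(block[i], block[j]):
                    counts[tuple(sorted((x, y)))] += weight
    total = sum(counts.values())
    if total == 0:  # everything fell into a single cluster
        return {}
    return {pair: c / total for pair, c in counts.items()}


# Hypothetical toy block: four gap-free segments of length 5.
block = ["AVILC", "AVILC", "AVLLC", "GVIMC"]
freqs = pair_frequencies(block, k=70)
for pair, q in sorted(freqs.items()):
    print(pair, round(q, 3))
```

With k = 70 the first three segments fall into one cluster, so only comparisons against the fourth segment contribute to the counts.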

BLOCKS

Each BLOCK represents a conserved region in a group of proteins

              1   5      n
sequence 1    ABPEDG… …FGW
sequence 2    ABSEDQ… …QGW
sequence 3    SBPEDQ… …FGD
    :             :      :
sequence m    ABAEDS… …QGD

BLOSUM = BLOCK SUBSTITUTION MATRIX

The relationship between BLOSUM and PAM substitution matrices

• BLOSUM matrices with higher numbers and PAM matrices with low numbers are both designed for comparisons of closely related sequences.

• BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins.

Position-Specific Scoring Matrix

• A weight matrix or position-specific scoring matrix (PSSM) is a table of numbers containing scores for each residue at each position of a fixed-length (gap-free) motif.

• There are two types of numerical representations:

– Frequency matrix: reflects position-dependent frequencies of residues

– Scoring matrix: contains additive weights for computing a match score

• Weight matrices or PSSMs are quantitative, fixed-length motif descriptors. Unlike regular expressions, they can distinguish between mild and severe mismatches.
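The contrast with regular expressions can be illustrated with a small sketch. The 3-position motif, the uniform background of 0.25 and the use of log-odds weights are all assumptions made for this example, since the slides do not specify how the additive weights are derived.

```python
import math
import re

# Hypothetical 3-position motif, given as relative frequencies per position.
FREQS = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
]
BACKGROUND = 0.25  # assumed uniform background frequency for each nucleotide

# Scoring matrix: log-odds weights, so a match score is a simple sum.
WEIGHTS = [{nt: math.log2(f / BACKGROUND) for nt, f in col.items()}
           for col in FREQS]


def pssm_score(site):
    """Add up the position-specific weights of each nucleotide in `site`."""
    return sum(WEIGHTS[i][nt] for i, nt in enumerate(site))


pattern = re.compile(r"AG[AT]")  # regular-expression view of the same motif

for site in ["AGA", "AGC", "CCC"]:
    print(site, bool(pattern.fullmatch(site)), round(pssm_score(site), 2))
```

The regular expression rejects AGC and CCC alike, while the weight matrix gives AGC (one mild mismatch) a higher score than CCC (a mismatch at every position).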


• A PSSM is a motif descriptor

• The descriptor includes a weight (score, probability) for each symbol occurring at each position along the motif

• Examples of motifs:

– Protein active sites

– Structural elements

– Zinc fingers

– Intron/exon boundaries

– Transcription-factor binding sites, etc.


Construction of a PSSM is a multi-stage process:

1. Architecture of the matrix

2. Create the multiple alignment from which the matrix is derived

3. Calculate frequencies for each position

4. Apply BLAST to the PSSM

Position-Specific Scoring Matrix

• 10 vertebrate donor site sequences aligned at the exon/intron boundary

seq 1 GAGGTAAAC

seq 2 TCCGTAAGT

seq 3 CAGGTTGGA

seq 4 ACAGTCAGT

seq 5 TAGGTCATT

seq 6 TAGGTACTG

seq 7 ATGGTAACT

seq 8 CAGGTATAC

seq 9 TGTGTGAGT

seq 10 AAGGTAAGT

Position-Specific Scoring Matrix

• Calculate the absolute frequency of each nucleotide at each position of the ten aligned sequences above

    1   2   3   4   5   6   7   8   9
A   3   6   1   0   0   6   7   2   1
C   2   2   1   0   0   2   1   1   2
G   1   1   7  10   0   1   1   5   1
T   4   1   1   0  10   1   1   2   6
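The counting step can be reproduced with a short sketch. The sequences are the ten donor sites listed above; the function name count_matrix and the dictionary layout are hypothetical choices for this example.

```python
# The ten aligned donor-site sequences from the slides.
SITES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def count_matrix(sites, alphabet="ACGT"):
    """Return {nucleotide: [count at position 1, count at position 2, ...]}."""
    length = len(sites[0])
    counts = {nt: [0] * length for nt in alphabet}
    for site in sites:
        for pos, nt in enumerate(site):
            counts[nt][pos] += 1
    return counts


counts = count_matrix(SITES)
for nt in "ACGT":
    print(nt, *counts[nt])
```

Each printed row corresponds to one row of the absolute-frequency table.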

Position-Specific Scoring Matrix

• Calculate the relative frequency of each nucleotide at each position by dividing each absolute frequency by the number of sequences (10)

     1    2    3    4    5    6    7    8    9
A  0.3  0.6  0.1  0.0  0.0  0.6  0.7  0.2  0.1
C  0.2  0.2  0.1  0.0  0.0  0.2  0.1  0.1  0.2
G  0.1  0.1  0.7  1.0  0.0  0.1  0.1  0.5  0.1
T  0.4  0.1  0.1  0.0  1.0  0.1  0.1  0.2  0.6

Position-Specific Scoring Matrix

• What is the probability of finding CAGGTTGGA?

– The product of the frequency of each nucleotide at each position:

– C is 0.2 at position 1, A is 0.6 at position 2, etc., giving 0.2 * 0.6 * 0.7 * 1 * 1 * 0.1 * 0.1 * 0.5 * 0.1 = 0.000042
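The same calculation can be sketched in code, starting again from the ten donor sites. The function names frequency_matrix and site_probability are hypothetical, and math.prod requires Python 3.8 or later.

```python
import math

# The ten aligned donor-site sequences from the slides.
SITES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def frequency_matrix(sites, alphabet="ACGT"):
    """Relative frequency of each nucleotide at each position."""
    n = len(sites)
    return [{nt: sum(s[pos] == nt for s in sites) / n for nt in alphabet}
            for pos in range(len(sites[0]))]


def site_probability(site, freqs):
    """Multiply the position-specific frequencies of the site's nucleotides."""
    return math.prod(freqs[pos][nt] for pos, nt in enumerate(site))


freqs = frequency_matrix(SITES)
p = site_probability("CAGGTTGGA", freqs)
print(f"{p:.1e}")  # 4.2e-05
```

Note that any candidate site containing a nucleotide never observed at some position (for example A at position 4) gets probability 0 under this plain frequency model.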

