Enriching the sequence substitution matrix by structural information
Octavian Teodorescu1, Tamara Galor1, Jaroslaw Pillardy2 and Ron Elber1,*
Cornell University, Department of Computer Science1 and Cornell Theory Center2,
Upson Hall 4130, Ithaca NY 14853
*to whom correspondence should be sent
Professor Ron Elber
Department of Computer Science
Cornell University
Upson Hall 4130
Ithaca NY 14853
Phone: (607)255-7416
Fax: (607)255-4428
e-mail: [email protected]
Keywords: sequence alignment, threading, fitness function, sequence-to-structure matching, energy function, Z score.
Abstract
A fundamental step in homology modeling is the comparison of two protein sequences: a
probe sequence with an unknown structure and function and a template sequence for
which the structure and function are known. The detection of protein similarities relies
on a substitution matrix that scores the proximity of the aligned amino acids. Sequence-
to-sequence alignments use symmetric substitution matrices while the threading protocols
use asymmetric matrices, testing the fitness of the probe sequence into the structure of the
template protein. We propose a linear combination of the threading and sequence-alignment
scoring functions to produce a single (mixed) scoring table. By fitting a single parameter
(which is the relative contribution of the BLOSUM50 matrix and the threading energy
table of THOM2) we obtain a significant increase in prediction capacity in the twilight
zone of homology modeling (detecting sequences with less than 25 percent sequence
identity and with very similar structures). For a difficult test of 176 homologous pairs,
with no signal of sequence similarity, the mixed model makes it possible to detect
between 40 and 100 percent more protein pairs than the number of pairs that are detected
by pure threading. Surprisingly, the linear combination of the two models performs
better than either threading or sequence alignment alone when the percentage of sequence
identity is low. Finally, we suggest that further enrichment of substitution matrices,
combining more structural descriptors such as exposed surface area or secondary structure,
is expected to enhance the signal as well.
Introduction
Annotation and classification of proteins rely on accurate and efficient comparison of
pairs of proteins. An essential ingredient of the comparison algorithm is the substitution
matrix, $T$. For a pair of amino acid types $\alpha$ and $\beta$ in environments $x$ and $y$ the
substitution matrix provides a score for their exchange between the two proteins,
$T \equiv T(\alpha, x \mid \beta, y)$. The score of an alignment (ignoring for the moment penalties for
indels) is a sum over all substitution scores.
The “environment” consists of features $(x, y)$ that supplement the direct score of amino acid
substitution, which we denote by $T(\alpha \mid \beta)$. For example, it may include (i) multiple
sequence information [1], (ii) secondary structure data [2], (iii) exposed surface area [3],
and (iv) many other structural and functional fingerprints. Here we consider the
information content of only a pair of proteins. Multiple sequence information (feature (i))
is not discussed here and can be added (in principle) once the scoring of a pair is
optimized.
One class of environment features is structural information. Aligning a
probe sequence to the structure of another protein is called threading and is usually
associated with an energy function that measures the quality of the sequence-to-structure
fit [4]. The amino acids are aligned into a known shape and the three-dimensional
interactions are scored, measuring protein stability. The sequence-to-sequence
and sequence-to-structure alignments are done separately and have their own
corresponding substitution matrices. For sequence alignment we have $T(\alpha \mid \beta)$ and for
sequence-to-structure alignment we use $T'(\alpha \mid y)$.
It is interesting to note that one substitution matrix dominates the scoring of
sequence-to-sequence alignments in proteins (BLOSUM 50 [5]), while there is no
dominant scoring scheme (energy function) for matching sequences to structures. We
use the BLOSUM 50 matrix as an example because we have considerable experience in
using it and comparing its results with threading approaches [4]. We anticipate a similar
enhancement in recognition for other sequence-substitution matrices; however, we did
not perform calculations with other matrices. Part of the reason for the larger diversity of
threading energy functions is the higher complexity of three-dimensional interactions
compared to one-dimensional substitutions, making it more difficult to find the best
choice. Another reason is the significant success of BLOSUM 50 in identifying
evolutionary relationships compared to the much weaker sensitivity of stability energies.
Nevertheless, an interesting complementary relationship was observed in a number of
studies [4]. In the twilight zone of similarity detection by sequence alignment it is
possible to find remote evolutionary relationships by sequence-to-structure matching.
Threading detects a significantly smaller number of similar protein pairs compared to
sequence alignment; however, the set of hits in threading is not a subset of the sequence
alignment hits. Therefore, threading alone is a potentially useful tool when sequence
alignment fails to recover a signal.
Threading and sequence signals are usually merged after separate alignments and scoring
have been performed. The raw scores or the statistical significance measures (e.g. the Z scores
[6]) are combined in an empirical formula [7] or in a neural net [8] to take advantage of
the complementarities of the two techniques.
Here we propose another combination of sequence and structure signals at the level of the
substitution matrix. A new substitution matrix, $M(\alpha \mid \beta, y)$, is defined as a linear
combination of $T(\alpha \mid \beta)$ and $T'(\alpha \mid y)$:
$$M(\alpha \mid \beta, y) = \lambda\, T(\alpha \mid \beta) + (1 - \lambda)\, T'(\alpha \mid y) \qquad (1)$$
The parameter λ is a constant mixing term between zero and one that we optimize (see
the Methods section). The new matrix is used in a dynamic programming algorithm to
determine the optimal alignment.
Mixing a structural score with an amino acid substitution score was done in
the past in the context of secondary structure and sequence alignment [2]. Here we
extend that study by using a threading score as the structural term. The hope is that the mixing
will create positive consensus. That is, if the two measures agree that a partial alignment
is good (even if the positive signal is rather weak for each measure), the combined signal
may still be a match. At the same time when one of the scores strongly objects, the
alignment is in doubt even if the second measure shows a positive signal. The hope is
then to enhance the signal of true positives by consensus of the two measures and to
reduce false signals by score conflicts.
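The consensus argument can be illustrated with a toy calculation. The sketch below applies Eq. (1) to made-up scores for a single aligned position; the numerical values, the choice λ = 0.5, and the assumption that both scores are expressed so that larger means better are illustrative only and are not taken from the paper.

```python
def mixed_score(seq_score, thread_score, lam):
    """Eq. (1): lam * T(alpha|beta) + (1 - lam) * T'(alpha|y)."""
    return lam * seq_score + (1.0 - lam) * thread_score


lam = 0.5  # hypothetical mixing value, chosen only to keep the arithmetic simple

# Both measures weakly favor the match: the mixed score stays positive.
agree = mixed_score(seq_score=1.0, thread_score=1.0, lam=lam)

# The sequence score weakly favors the match but the structural score
# strongly objects: the mixed score turns negative.
conflict = mixed_score(seq_score=1.0, thread_score=-3.0, lam=lam)

print(agree, conflict)
```

Summed over many aligned positions, weak but consistent agreement accumulates, while positions where the two measures disagree pull the total down.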
If one of the signals is extremely strong and considered significant even alone, then the
mixing is not necessarily beneficial. However, when both signals are not strong, then the
proposed scheme may be helpful. We therefore propose the use of the mixed model for
the twilight zone of detection, that is, for sequences with less than 25 percent
sequence identity. In fact, as shown in the Results section, even sequences with
only 25 percent sequence identity can carry a significant sequence-to-sequence signal.
We therefore made the threshold for the twilight zone a bit tighter and consider only pairs
of proteins that are structurally related (as measured by the structural alignment program
CE [9], see below) and have no significant sequence-to-sequence signal (defined as a Z
score smaller than 2 for sequence alignment with the BLOSUM 50 matrix). Some of
these pairs are found directly by threading alone; however, a considerable enhancement
in detection is obtained when the mixed model is used.
The CE (Combinatorial Extension) program is a leading protocol for local structure
alignment of two protein chains that has significant success in detecting remote structural
relationships by overlapping the Cα positions of two proteins, minimizing the RMS
distance between the two structures. In brief, CE uses a dynamic programming algorithm
with empirically determined gap and extension penalties (or rewards) to determine the best
matching local protein segments [9]. A Z score determines the significance of the match.
In this manuscript we compare the mixed model to direct sequence alignment, to direct
threading, and to PSI-BLAST [10]. We show that in the twilight zone of
sequence similarity the mixed model outperforms the other algorithms by wide margins.
Methods
We consider the matching of a probe sequence $S_i$ to a known protein with a sequence $S_j$
and structural environment defined by the vector $X_j$. Just as the
amino acid sequence describes the protein by a one-dimensional list of amino
acids, $S_i \equiv a_{1i} a_{2i} \ldots a_{ni}$, the vector $X_j$ is a one-dimensional sequence of local structure
descriptors, $X_j = x_{1j} x_{2j} \ldots x_{mj}$. Each $a_{kl}$ is one of the twenty amino acids, while
the $x_{kl}$ are drawn from a finite set of local structural features that we use to describe the structural
environment of an amino acid. These local features can be secondary structure, exposed
surface area, number of contacts, etc. Extending the model below to take
these different features into account is straightforward.
Here we rely on our previous experience in designing energy functions for threading as
implemented in the program LOOPP. LOOPP (Learning, Observing and Outputting
Protein Patterns) is a fold recognition program that emphasizes threading for annotation.
In addition to prediction, LOOPP also learns energy parameters from native and decoy
sets with the Mathematical Programming approach [6]. Source code and databases of
LOOPP are available from http://cbsu.tc.cornell.edu/software/loopp.
One of the potentials that we optimized for annotation is THOM2. In THOM2 the energy
$\varepsilon_{\alpha_i}(n, m)$ of a structural site $i$ is determined by the identity of the amino acid at the
site, $\alpha_i$, the number of neighbors to the site, $n$, and the number of neighbors to each of
the direct neighbors of the site, $m$ [6]. The total THOM2 energy, $E_{\mathrm{THOM2}}$, is:
$$E_{\mathrm{THOM2}} = \sum_{i,n} \varepsilon_{\alpha_i}(n, m)$$
In summary, THOM2 describes the environment by two layers of
contacts to the structural site. It is a two-dimensional table that provides a numerical
value using two indices: (i) the type of an amino acid, $\alpha$, and (ii) a pair of layers, $(n, m)$.
The number of structural environments in THOM2 (sixteen possible combinations of
coarse-grained $(n, m)$ pairs) is comparable to the number of amino acid types, making the
size of the THOM2 table comparable to that of BLOSUM 50.
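A minimal sketch of how such a table lookup might be organized is given below. It is not the LOOPP implementation. The text states only that there are sixteen coarse-grained $(n, m)$ environments; the 4 x 4 split, the bin edges, the helper names, and the table values are all hypothetical, and the sum over the direct neighbors of each site is one reading of the description above.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 amino acid types
N_ENV = 16                            # sixteen environments, as stated in the text


def bin_contacts(count, edges=(2, 4, 6)):
    """Map a raw contact count to one of 4 coarse classes (hypothetical edges)."""
    return int(np.searchsorted(edges, count, side="right"))


def env_index(n, m):
    """Index of a coarse-grained (n, m) pair, assuming a 4 x 4 layout."""
    return 4 * bin_contacts(n) + bin_contacts(m)


def thom2_energy(sequence, site_contacts, table):
    """Sum table entries over all sites and over each site's direct neighbors.

    site_contacts[i] = (n, [m_1, m_2, ...]): n is the number of neighbors of
    site i, and each m is the number of neighbors of one direct neighbor.
    table has shape (20, N_ENV).
    """
    total = 0.0
    for aa, (n, neighbor_ms) in zip(sequence, site_contacts):
        a = AMINO_ACIDS.index(aa)
        for m in neighbor_ms:
            total += table[a, env_index(n, m)]
    return total


# Placeholder table: zeros except for one made-up entry.
table = np.zeros((len(AMINO_ACIDS), N_ENV))
table[AMINO_ACIDS.index("A"), env_index(1, 5)] = -1.0
energy = thom2_energy("AG", [(1, [5, 5]), (3, [0])], table)
print(energy)
```

Storing the potential as a 20 x 16 array makes the comparison with BLOSUM 50 concrete: both are small lookup tables indexed by an amino acid type and one other discrete label.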
In LOOPP we use a dynamic programming algorithm [11] to find an optimal alignment of
$S_i$ against $S_j$ and (separately) against $X_j$. We use both local and global alignment, and
in the final evaluation of the significance of the match we use the Z score. The same
alignment algorithm is used in the model described below.
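The Z score is used here without a definition. One common way to estimate it, shown below as a sketch and not necessarily the procedure used in LOOPP, is to compare the score of the actual probe with the distribution of scores obtained from shuffled versions of the probe. The function name, the number of shuffles, and the convention that a higher score is better are assumptions; for an energy the sign would be reversed.

```python
import random
import statistics


def z_score(score_fn, probe, n_shuffles=100, seed=0):
    """Estimate (score - mean_shuffled) / std_shuffled for a probe sequence.

    score_fn maps a sequence string to an alignment score (higher is better).
    """
    rng = random.Random(seed)
    native = score_fn(probe)
    residues = list(probe)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same composition, random order
        shuffled_scores.append(score_fn("".join(residues)))
    mean = statistics.mean(shuffled_scores)
    std = statistics.pstdev(shuffled_scores)
    if std == 0.0:
        return 0.0  # no spread in the background: no measurable signal
    return (native - mean) / std
```

In practice `score_fn` would wrap the optimal alignment score of the probe against a fixed template, so each shuffle requires a new alignment.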
The mixed model defines a combined alignment of a probe sequence $S_i$ with another
protein whose sequence $S_j$ and structure $X_j$ are known. A dynamic programming
algorithm is used. At every step of building the dynamic matrix we consider local
matches. The fitness of an amino acid $a_k$ against the pair $(b_l, x_l)$ is measured with a new
(mixed) substitution matrix
$$M(a_k \mid b_l, x_l) = \lambda\, T_{\mathrm{BLOSUM-50}}(a_k \mid b_l) + (1 - \lambda)\, T_{\mathrm{THOM2}}(a_k \mid x_l) \qquad (2)$$
The entries $a_k$ and $b_l$ are the types of the amino acids in positions $k$ and $l$ along the
sequence. The entry $x_l$ is the structural environment of position $l$. The (single) free
parameter λ is the mixing parameter and was chosen empirically to be 0.125. In the
Results section we explain how λ was computed and demonstrate that the model is not
very sensitive to the choice of the mixing parameter. Ideally, λ would depend
on the actual scores. For example, if the sequence-to-sequence signal is very
high (e.g. above 60 percent sequence identity), then we anticipate the mixing parameter
to be one. However, in the present study we did not perform an optimization of λ for the
complete range and consider its use only for “difficult-to-annotate” sequences.
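The way Eq. (2) enters the dynamic programming step can be sketched as follows. This is an illustration under stated assumptions and not the LOOPP code: the two toy scoring functions stand in for the BLOSUM 50 and THOM2 tables, both are assumed to follow a "higher is better" convention, and the linear gap penalty of -2 and the one-letter environment labels are hypothetical.

```python
def mixed_matrix(t_seq, t_thread, lam=0.125):
    """Return M(a | b, x) = lam * t_seq(a, b) + (1 - lam) * t_thread(a, x)."""
    def m(a, b, x):
        return lam * t_seq(a, b) + (1.0 - lam) * t_thread(a, x)
    return m


def local_alignment_score(probe, template_seq, template_env, m, gap=-2.0):
    """Best local alignment score of probe against (template_seq, template_env).

    Smith-Waterman style recursion with a linear gap penalty (hypothetical).
    """
    assert len(template_seq) == len(template_env)
    rows, cols = len(probe) + 1, len(template_seq) + 1
    h = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for k in range(1, rows):
        for l in range(1, cols):
            match = h[k - 1][l - 1] + m(
                probe[k - 1], template_seq[l - 1], template_env[l - 1]
            )
            h[k][l] = max(0.0, match, h[k - 1][l] + gap, h[k][l - 1] + gap)
            best = max(best, h[k][l])
    return best


# Toy stand-ins for the two tables (not BLOSUM 50 or THOM2 values).
def toy_seq(a, b):
    return 4.0 if a == b else -1.0


def toy_thread(a, x):
    # "B" = buried site, anything else = exposed; hydrophobic residues fit "B".
    return 1.0 if (a in "AVLIFM") == (x == "B") else -1.0


m = mixed_matrix(toy_seq, toy_thread, lam=0.5)
score = local_alignment_score("AL", "AL", "BB", m)
print(score)
```

Because the mixing happens inside the substitution term, the alignment itself can change with λ, which is different from combining the scores of two separately optimized alignments.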
The matrix $T_{\mathrm{BLOSUM-50}}$ is the BLOSUM 50 substitution matrix [5], and $T_{\mathrm{THOM2}}$ is the
threading energy of the THOM2 model [6]. The two matrices that we have used can be
downloaded from http://cbsutest.tc.cornell.edu/var/loopp_testset/
To evaluate the performance of the model and to examine different values of the mixing
parameter, we constructed a test set containing 176 sequences with lengths between
100 and 500 amino acids. The sequences are listed in Table 1, and the complete test set is
available from http://cbsutest.tc.cornell.edu/var/loopp_testset/
Table 1: The 176 sequences of the test set. Each of the sequences has its own set of decoys and homologous structures. The goal is to identify as many homologous pairs as possible.
1iaa 1fat_A 3a3h 1piv_1 1qe5_A 1fsl_A 1eoe_A 1fvw_L 1ita 1iif_H 2snv 3crd 1irs_A 1aag_H 1rcy 1jli 1eod_A 1lxd_A 1neu 1ece_A