Exploiting the Co-evolution of Interacting Proteins to Discover Interaction Specificity
Arun K. Ramani1 and Edward M. Marcotte1,2*
1Institute for Cellular and Molecular Biology, Center for Computational Biology and Bioinformatics, University of Texas at Austin, Austin, TX 78712, USA
2Department of Chemistry and Biochemistry, University of Texas at Austin, Austin, TX 78712, USA
Protein interactions are fundamental to the functioning of cells, and high-throughput experimental and computational strategies are sought to map interactions. Predicting interaction specificity, such as matching members of a ligand family to specific members of a receptor family, is largely an unsolved problem. Here we show that by using evolutionary relationships within such families, it is possible to predict their physical interaction specificities. We introduce the computational method of matrix alignment for finding the optimal alignment between protein family similarity matrices. A second method, 3D embedding, allows visualization of interacting partners via spatial representation of the protein families. These methods essentially align phylogenetic trees of interacting protein families to define specific interaction partners. Prediction accuracy depends strongly on phylogenetic tree complexity, as measured with information-theoretic methods. These results, along with simulations of protein evolution, suggest a model for the evolution of interacting protein families in which interaction partners are duplicated in coupled processes. Using these methods, it is possible to successfully find protein interaction specificities, as demonstrated for >18 protein families.
© 2003 Elsevier Science Ltd. All rights reserved
Keywords: protein interactions; bioinformatics; phylogeny; co-evolution; interaction specificity
*Corresponding author
Introduction
Protein interaction specificity is vital to cell function, but the maintenance of such specificity requires that it persist even through the course of strong evolutionary change, such as the duplication and divergence of genes. Binding specificities of duplicate genes (paralogs) often diverge, such that new binding specificities are evolved. Given that such paralogous gene families abound, such as the >560 serine-threonine kinases in the human genome,1 predicting interaction specificity can be difficult, especially when paralogs exist for both interaction partners. In these cases, the number of potential interactions grows combinatorially. This ambiguity can easily complicate the matching of ligands to specific receptors, and for such reasons, identification of ligands for orphan receptors is an important, but largely unsolved, problem.2–4
Computational methods for discovering specific protein interactions fall into three broad categories: (i) the identification of specific protein sequence or structural features indicative of protein interaction partners, such as sequence signatures,5 correlated mutations,6,7 and surface patches;8,9 (ii) the use of genomic context10 to identify interaction partners, exploiting information such as gene order,11,12 gene fusions,13,14 and phylogenetic profiles;15 and (iii) the use of phylogenetic trees to account for the co-evolution of interacting proteins.16–20
Of these three classes, the third is of specific interest: the hypothesis underlying these approaches is that interacting proteins often exhibit coordinated evolution, and therefore tend to have similar phylogenetic trees. Goh et al.17 demonstrated this by showing that chemokines and their receptors have very similar phylogenetic trees, as do individual domains of a single protein such as phosphoglycerate kinase. Detailed phylogenetic studies of the two-component signal transduction system18 show that a phylogenetic tree constructed from two-component sensor proteins has a similar structure to that from two-component regulator proteins.
Here, we exploit this tendency for interacting proteins to have similar phylogenetic trees, and present a general computational method for the identification of specific interaction partners in such protein families. We provide an information-theoretic interpretation of when the method is appropriate, and present a model that emerges for the evolution of interacting proteins.

0022-2836/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved
E-mail address of the corresponding author: [email protected]
doi:10.1016/S0022-2836(03)00114-1 J. Mol. Biol. (2003) 327, 273–284
Results
Prediction of interactions by matrix alignment
Figure 1(A) presents the phylogenetic trees of two families of interacting proteins, the Ntr-type two-component sensors and their corresponding regulators. There is striking similarity in the relative placement of interacting protein pairs across the two trees: the ntrC proteins from Escherichia coli and Salmonella typhimurium are adjacent in the regulator tree, as are their interaction partners (ntrB) in the sensor tree. Likewise, the ntrC proteins are roughly equidistant in the regulator tree from the hydG regulator proteins; this relationship is maintained by their interacting partners in the sensor tree. Many details of the overall tree structure are shared between the ligand and receptor tree, as noted previously for two-component sensor/regulators18 and for chemokines/chemokine receptors.17
Figure 1(B) presents the simplest such case of interaction partners, in which each interacting protein (e.g. GyrA and GyrB) has a single paralog (e.g. ParC and ParE, respectively, which interact specifically with each other). Again, the trees of the interacting partners are notably similar. In fact, even the halves of the trees specific to each paralog are similar, as the GyrA half strongly resembles both the GyrB and ParE halves. However, a careful examination of branch lengths indicates subtle differences between the halves, such as is indicated by the arrows in Figure 1(B), such that the correct interaction partners (GyrA with GyrB, and ParC with ParE) have the most similar subtrees.

Figure 1. (A) A comparison of the phylogenetic trees of Ntr-family two-component sensor histidine kinases and their corresponding regulators. Circles enclose orthologous genes. Interacting proteins, colored similarly, sit in similar positions in the two trees. (B) A comparison of the phylogenetic tree of the GyrA and ParC proteins with the tree of their corresponding interaction partners, GyrB and ParE, colored as in (A). Bold arrows indicate an example of differing branch lengths, which help to distinguish the Gyr and Par subtrees.
In order to exploit the evolutionary information contained in such interacting protein families, we developed an algorithm that is conceptually equivalent to superimposing the phylogenetic trees of the two protein families. This approach, which we term matrix alignment and which is implemented in the program MATRIX, is diagrammed schematically in Figure 2.
Rather than directly compare the phylogenetic trees, the corresponding similarity matrices are compared to each other, each matrix summarizing the evolutionary relationships between the proteins within one sequence family. One matrix is shuffled, maintaining the correct relationships between proteins but simply re-ordering them in the matrix, until the two matrices maximally agree, minimizing the root mean square difference between elements of the two matrices. Interactions are then predicted between proteins heading equivalent columns of the two matrices. For matrix alignment, MATRIX currently applies a stochastic simulated annealing-based algorithm.
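The shuffling procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MATRIX program: the function names `rmsd` and `align_matrices`, the pairwise-swap move, the geometric cooling schedule, and all parameter values are hypothetical choices made for the sketch.

```python
import math
import random


def rmsd(a, b, perm):
    """Root mean square difference between a and b, with b reordered by perm."""
    n = len(a)
    total = 0.0
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = a[i][j] - b[perm[i]][perm[j]]
            total += diff * diff
            count += 1
    return math.sqrt(total / count)


def align_matrices(a, b, steps=20000, t_start=1.0, t_end=1e-4, seed=0):
    """Search for the reordering of b that minimizes rmsd(a, b, perm)."""
    rng = random.Random(seed)
    n = len(a)
    perm = list(range(n))
    rng.shuffle(perm)
    score = rmsd(a, b, perm)
    best_perm, best_score = perm[:], score
    for step in range(steps):
        # Geometric cooling (an assumption, not the published schedule).
        temp = t_start * (t_end / t_start) ** (step / steps)
        i, j = rng.sample(range(n), 2)
        perm[i], perm[j] = perm[j], perm[i]  # propose swapping two columns
        new_score = rmsd(a, b, perm)
        if new_score <= score or rng.random() < math.exp((score - new_score) / temp):
            score = new_score
            if score < best_score:
                best_perm, best_score = perm[:], score
        else:
            perm[i], perm[j] = perm[j], perm[i]  # reject: undo the swap
    return best_perm, best_score
```

In this sketch, protein `i` of family A is predicted to interact with protein `best_perm[i]` of family B. Repeating the search with different seeds and counting how often each pairing appears would give per-pair frequencies of the kind reported in Table 1.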
Matching two-component sensors to regulators
As a first test of matrix alignment, we examined the Ntr-type two-component sensor and regulator families of Figure 1. Binding partners were assigned according to the KEGG pathway database,21 resulting in a set of 14 interactions, spanning genes from eight organisms. Matrix alignment was performed, testing specifically whether or not the genes from one genome (for example, the four E. coli regulators) could be matched to their correct binding partners (here, the four E. coli sensor proteins).

Figure 2. The matrix alignment method for predicting protein interaction specificity. Proteins in family A interact with those in family B. In each family, a similarity matrix summarizes the proteins’ evolutionary relationships. The algorithm uses the similarity matrices to pair up the genes in the two families. Columns of matrix B are re-ordered (along with their corresponding rows in the matrix) such that the B matrix agrees maximally with matrix A, judged by minimizing the root mean square difference (r.m.s.d.) between elements in the two matrices. Interactions are then predicted between proteins heading equivalent columns of the two matrices.
Table 1. The prediction of protein interactions between interacting protein families by the method of matrix alignment
The top table indicates the predicted interactions between Ntr-type two-component sensors and regulators, and the bottom table indicates the predicted interactions between CKR-type chemokines and chemokine receptors. The diagonal of each matrix represents the correct known interacting pairs based on the assignments of the KEGG database (top) or measured binding affinities (bottom). Each Table entry represents the fraction of matrix alignment runs in which a given interaction was predicted. Filled boxes represent the predicted interaction partners observed in the highest fraction of the runs, while broken line boxes represent the interaction partners predicted when allowing interactions between orthologs. There is an ambiguity in the interaction partners of the chemokine/chemokine receptors, indicated by bold broken boxes, leading to either two correct or two incorrect predictions.

The results following 100 runs of simulated annealing are presented in Table 1 (and later summarized in Figure 4(A)). Diagonal entries in the table correspond to the correct binding partners, and the values reported in each table cell indicate the fraction of simulated annealing runs in which the corresponding proteins were predicted to be binding partners. For example, E. coli atoS is paired correctly with E. coli atoC 95% of the time (in 95 of the 100 runs); as this match outscores any other match to atoS or atoC, these are predicted to be interaction partners. In a typical run, the starting r.m.s.d. between the sensor and regulator similarity matrices was ~0.242; following application of the algorithm, it was ~0.207. For comparison, the correct pairing corresponded to an r.m.s.d. of 0.181, indicating that the algorithm typically found a solution that efficiently minimized the r.m.s.d. but still did not find the global optimum from among the 14!, or ~10^11, possible solutions.
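As a quick check of the size of the search space quoted above, the number of ways to reorder 14 columns can be computed directly. The variable name `n_solutions` is just an illustrative label.

```python
import math

# Number of possible column orderings for a 14-protein similarity matrix.
n_solutions = math.factorial(14)
print(n_solutions)           # 87178291200
print(f"{n_solutions:.1e}")  # 8.7e+10, i.e. on the order of 10^11
```

Exhaustive enumeration at this scale is impractical, which is consistent with the use of a stochastic search.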
To assess the accuracy of the interaction prediction, two values were examined: the stringent accuracy, defined as the accuracy of exact matches of known binding partners, and the effective accuracy, which was evaluated by accepting matches to orthologous protein family members (such as correctly matching ntrB to ntrC, but with the match occurring between the E. coli protein and the S. typhimurium protein, rather than E. coli with E. coli). Because the species is known in every case, we can typically increase the accuracy by considering the orthologs. For the Ntr-type two-component regulator/sensor case, the stringent accuracy was 57% while the effective accuracy was 86%. All four E. coli proteins were correctly matched to their interaction partners, as were the S. typhimurium proteins. Thus, inherent information exists in the phylogenetic trees of the two families that can be automatically extracted to predict protein interaction partners.
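The two accuracy measures defined above can be made concrete with a short sketch. The function name `accuracies`, the dictionary-based data layout, and the protein labels in the usage example are hypothetical and chosen only for illustration; they are not taken from the MATRIX program or its input format.

```python
def accuracies(predicted, known, ortholog_group):
    """Return (stringent, effective) accuracy as fractions of known pairs.

    predicted, known: dicts mapping a family A protein to a family B partner.
    ortholog_group: dict mapping each family B protein to an ortholog label.
    """
    exact = 0
    via_ortholog = 0
    for protein, true_partner in known.items():
        guess = predicted.get(protein)
        if guess == true_partner:
            exact += 1          # exact match counts for both measures
            via_ortholog += 1
        elif guess is not None and ortholog_group.get(guess) == ortholog_group[true_partner]:
            via_ortholog += 1   # matched to an ortholog of the true partner
    n = len(known)
    return exact / n, via_ortholog / n


# Hypothetical example: the E. coli and S. typhimurium ntr pairs are swapped.
known = {"ntrC_Ec": "ntrB_Ec", "ntrC_St": "ntrB_St",
         "atoC_Ec": "atoS_Ec", "hydG_Ec": "hydH_Ec"}
predicted = {"ntrC_Ec": "ntrB_St", "ntrC_St": "ntrB_Ec",
             "atoC_Ec": "atoS_Ec", "hydG_Ec": "hydH_Ec"}
groups = {"ntrB_Ec": "ntrB", "ntrB_St": "ntrB", "atoS_Ec": "atoS", "hydH_Ec": "hydH"}
print(accuracies(predicted, known, groups))  # (0.5, 1.0)
```

In the example, the swapped ntr pairs count against the stringent accuracy but are accepted by the effective accuracy, because each guess is an ortholog of the true partner.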
Visualization of protein interaction partners by 3D embedding
In order to summarize in a clear manner the many evolutionary relationships and interactions, we developed a method, termed 3D embedding and diagrammed in Figure 3, for effectively visualizing the aligned similarity matrices and predicted protein interaction partners: coordinates in three-dimensional space are assigned to proteins in a sequence family such that the spatial separation of the proteins is proportional to the evolutionary distances between the proteins described in the similarity matrix. Protein interaction partners can then be visualized by assigning coordinates to each protein in the two protein families that interact with each other, followed by superposition of one family onto the other by least squares minimization of the distance between interacting partners. During this superposition, the relative distances between the proteins of a sequence family are unchanged. Instead, only the orientation of the resulting “constellation” of proteins in one family is changed relative to the proteins of the other family, as shown in Figure 3.
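The coordinate-assignment step can be illustrated with a small sketch. The text does not say how coordinates are computed, so the approach below (plain gradient descent on the squared mismatch between embedded and target distances, often called stress) is an assumption, as are the names `stress` and `embed_3d` and all parameter values. The rigid-body superposition of two families is not shown.

```python
import math
import random


def stress(coords, dist):
    """Sum of squared differences between embedded and target distances."""
    n = len(coords)
    return sum(
        (math.dist(coords[i], coords[j]) - dist[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


def embed_3d(dist, steps=2000, rate=0.02, seed=0):
    """Assign 3D coordinates by gradient descent on the stress."""
    rng = random.Random(seed)
    n = len(dist)
    coords = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n)]
    for _ in range(steps):
        grads = [[0.0, 0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(coords[i], coords[j]) or 1e-12  # avoid division by zero
                scale = 2.0 * (d - dist[i][j]) / d
                for k in range(3):
                    grads[i][k] += scale * (coords[i][k] - coords[j][k])
        for i in range(n):
            for k in range(3):
                coords[i][k] -= rate * grads[i][k]
    return coords
```

Larger or less regular families generally cannot be embedded exactly in three dimensions, so a nonzero residual stress is expected there. A rigid-body least-squares fit of one family onto the other would then be applied to the resulting coordinates.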
Figure 4(A) shows the application of 3D embedding to the Ntr regulator/sensor proteins. In this example, the proteins are aligned such that the distances between the predicted interaction partners are minimized. As can be seen in the Figure, proteins cluster in distinct regions in space, mirroring the adjacent placement of orthologs in the phylogenetic trees of Figure 1. Interacting protein partners generally sit close to each other in space. Orthologs appear to exhibit little apparent preference for their precise positions within a particular spatial cluster, consistent with the tendency of the matrix alignment algorithm to assign interactions to orthologous protein sequences rather than the sequences of the correct species. From Figure 4(A), it is obvious that matrix alignment succeeds in finding quite complex relationships that successfully satisfy the many constraints, such as matching yfhA to yfhK, rather than the potentially closer hydH, in order that both S. typhimurium and E. coli hydH interactions could be predicted.
Figure 3. To visualize protein families, proteins are plotted in 3D space such that each protein is separated from other proteins in its family by distances d_ij proportional to the evolutionary similarities s_ij in the family’s similarity matrix. To visualize interactions between two protein families (labeled A and B), the families are superimposed by rigid-body least-squares fit of the predicted interaction partners onto each other.

Figure 4(B) shows the application of 3D embedding to the simpler problem of matching interaction partners given the right pair and a homologous pair as competition. The solution demonstrates the extreme robustness of matrix alignment for such simple cases. Here, interactions are mapped between the homologs GyrA and ParC (from ten organisms, as shown in Figure 1(B)) with their respective interaction partners GyrB and ParE. In the Figure, the Gyr proteins are spatially well-separated from the Par proteins, illustrating the ability of 3D embedding to separate members of a protein family into their functional subtypes. In all cases, GyrA proteins are paired with GyrB proteins, while ParC proteins are paired with ParE proteins. As with Figure 4(A), the interacting partners tend to be clustered in space. In all, 14 out of the 20 interactions are predicted correctly; when matches to orthologs are allowed, all 20 interactions (100%) are correctly predicted.
Figure 4. (A) A side-by-side stereo diagram representing the predicted and known interactions between Ntr-type two-component sensors (dark spheres) and regulators (light spheres). For both A and B, continuous lines indicate interactions predicted by matrix alignment and broken lines indicate known interaction partners for cases with incorrect predictions. 12 out of 14 interactions are correctly predicted; if predictions to orthologous proteins are allowed, only the predictions for A. aeolicus are incorrect. (B) Stereo diagram of the interactions between GyrA (dark gray spheres) and its homolog ParC (black spheres) with their respective interaction partners GyrB (light gray spheres) and its homolog ParE (white spheres). The Gyr and Par proteins are separated into distinct spatial regions in the process of 3D embedding. With the exception of the C. crescentus proteins, interaction partners consistently sit adjacent to one another in space.

The effects of phylogenetic tree structure on inferring protein interactions
Since phylogenetic relationships and tree structure form the foundation of this approach, we investigated the importance of tree structure to the method’s success. For example, we expect pairs of proteins in a tree that are highly similar to each other to be difficult to distinguish when assigning interaction partners, as in the case of the E. coli/S. typhimurium ntrC/ntrB proteins of Figure 1(A) that are incorrectly paired up in Table 1. Several such pairs of similar proteins can even lead to alternate, equally scoring solutions, as is the case for the CKR-type chemokines and their receptors in Table 1. In this example, the mouse/rat EOTA chemokines are predicted to bind the mouse/rat CKR2 and CKR3 receptors with equal confidence, so the precise binding partners are obscured by this underlying symmetry in the phylogenetic trees.
In order to systematically test the relationship between tree structure and matrix alignment, protein phylogenetic trees with differing complexities were created by simulating the evolution of a single protein into a protein family. Pairs of trees, representing co-evolved interaction partners, were created in coupled simulations and were analyzed by matrix alignment. By systematically varying the complexity of the trees created, the contribution of tree complexity to the effectiveness of matrix alignment could be examined.
For a given simulation of one protein (the progenitor protein) evolving into a family, tree complexity was controlled by specifying the frequency at which the progenitor protein was duplicated as compared to other proteins in the growing tree. Each new protein was added to the family by duplicating, with mutation, an existing protein under the following rule: the progenitor protein was duplicated with probability p0, and a different protein in the family (chosen at random) was duplicated with probability 1 − p0. In this way, trees generated with p0 ≈ 1 are composed only of direct duplications of the progenitor protein, with all proteins approximately the same evolutionary distance from each other. These trees are quite simple and approximately radial in structure, as illustrated in the inset in the top panel of Figure 5. In contrast, trees generated with p0 ≈ 0 are more complex in structure, since lifting the requirement to duplicate the progenitor protein allows more complex patterns of duplications to occur and produces more diverse evolutionary relationships between the proteins.
To simulate the evolution of protein interaction partners, two families were “evolved” in a coupled fashion from two initial seed sequences, generated randomly as described in Materials and Methods, with the choice of protein to be duplicated at each step forced to be equivalent for the two families. For example, if in protein family A, the second protein was duplicated to create the third, then the second protein would be duplicated to create the third in family B as well. In this manner, the trees would be similar, though not identical, as stochastic mutations were introduced with each duplication as described in Materials and Methods.
Following each simulation, interactions between the two simulated interacting sequence families were predicted by matrix alignment. The results, plotted in Figure 5(A), indicate that tree complexity is strongly correlated with algorithm performance. Predictive accuracy increases with increasing tree complexity, consistent with our intuition that simple trees are ambiguous about relationships between proteins, and therefore are less useful for predicting interactions in the manner we have described.

Figure 5. The accuracy of matrix alignment depends strongly on the complexity of the phylogenetic trees. (A) Simulations of the evolution of interacting proteins indicate that the tree complexity, measured by constraining simulated trees to be more or less radial, limits the accuracy of matrix alignment. As tree complexity increases, accuracy increases. This relationship is exploited in (B) (top panel), which shows that mutual information of similarity matrices correlates with prediction accuracy. Results from simulations involving pairs of protein families of different sizes indicate that as the mutual information of the similarity matrices increases, interaction prediction accuracy increases. Mutual information values are calculated in bins of width 0.1 ((B), bottom panel). This trend is confirmed in 34 actual interacting protein families, listed in Table 2. By allowing matches to orthologous proteins, the effective accuracy of the algorithm (white diamonds) is considerably higher than the stringent accuracy from exact matches (black squares). Matrix alignment significantly outperforms random choices of interaction partners (white squares).
A score that quantitatively predicts the accuracy of matrix alignment
As simulations demonstrate a clear dependence of the success of matrix alignment upon the complexity of the phylogenetic trees, we asked if a measure of agreement between similarity matrices that also considered tree complexity would accurately predict the algorithm’s performance. One such measure is the mutual information22 of the similarity matrices, which is a function of both the entropy of the matrices, taking into account the phylogenetic tree complexity, and the agreement of the two similarity matrices with each other.
Interaction prediction accuracy was compared to the mutual information of the similarity matrices from simulations of pairs of co-evolving families of 10, 15, or 20 proteins of varying tree complexity. Results, plotted in Figure 5(B) (top), indicate that the mutual information correlates well with the prediction accuracy, with higher values of mutual information corresponding to higher prediction accuracy. No significant dependency of the measure on the size of the protein family was observed.
To extend this analysis to real data and test the general applicability of matrix alignment, we evaluated its performance on 34 sets of actual protein interaction partners, listed in Table 2, including the Omp, Nar, Cit, and Lyt-type two-component sensor/regulator proteins, the CKR and CCR-type chemokine/chemokine receptors, and membrane/substrate binding protein and interacting membrane protein components of ABC transporters. We tested simpler binary interactions, such as matching the paralogs GyrA and ParC with their specific partners, GyrB and ParE, respectively. Finally, we also tested the matching of phylogenetic trees composed of single interaction partners but from multiple species to see if they lent themselves to a similar analysis. Each set of interaction partners was analyzed by matrix alignment, and the prediction accuracy from the analyses (reported in Table 2) was compared to the mutual information of the corresponding sequence similarity matrices.
Table 2. The performance of matrix alignment at predicting diverse protein interaction partners

Interacting protein families                                  No. of       Effective      Stringent      Mutual
                                                              proteins(a)  accuracy (%)   accuracy (%)   information
Chemokine/receptor - mouse/human/rat                          31           48.4           12.9           0.59
Chemokine/receptor - human                                    13           NA             23.1           0.55
CKR-type chemokine/receptor - mouse/human/rat                 18           55.5           33.3           0.79
CCR-type chemokine/receptor - mouse/human                     6            100            33.3           0.93
Omp-type regulator/sensors - E. coli                          14           NA             21.4           0.48
Omp-type regulator/sensors - B. subtilis                      13           NA             7.7            0.64
Omp-type regulator/sensors - 5 bacteria                       16           43.8           31.3           0.56
Omp-type regulator/sensors - E. coli/B. subtilis              27           NA             18.5           0.35
Nar-type regulator/sensors - 8 bacteria                       22           36.4           36.4           0.47
Ntr-type regulator/sensors - 8 bacteria                       14           85.7           57.1           0.62
Cit-type regulator/sensors - E. coli/B. subtilis              5            100            100            0.77
Lyt-type regulator/sensors - E. coli/B. subtilis              4            50             50             1.09
Two component sensor/regulators - E. coli                     27           NA             7.4            0.39
Lyt-, Ple-, and “other”-type regulator/sensors - 8 bacteria   20           NA             5              0.35
CheA/CheY - 11 bacteria                                       13           69.2           69.2           0.83
ABC transporter membrane protein 1/2 - E. coli                19           NA             26.3           0.45
ABC transporter memb./binding prot. - E. coli                 17           NA             0              0.43
ABC transporter membrane protein 1/2 - H. influenzae          14           NA             0              0.46
ABC transporter memb./binding prot. - H. influenzae           13           NA             11.1           0.42
GyrA/B,ParC/E - α-proteobacteria                              20           100            70             1.29
GyrA/B,ParC/E - Gram positive bacteria                        28           100            46.4           0.97

Single interaction partners from multiple organisms
CheA/CheB - bacteria                                          8            NA             100            0.86
Acetyl CoA carboxylase α/β - Gram positive bacteria           9            NA             33.3           0.94
Acetyl CoA carboxylase α/β - proteobacteria                   16           NA             75             1.12
Succinate CoA synthetase α/β - proteobacteria                 22           NA             81.8           0.83
Succinate CoA synthetase α/β - archaea                        13           NA             30.8           0.91
GyrA/GyrB - α-proteobacteria                                  20           NA             72.7           1.29
GyrA/GyrB - Gram positive bacteria                            18           NA             50             1.02
GyrA/GyrB - archaea                                           10           NA             20             0.56
Pyruvate dehydrogenase α/β - bacteria                         17           NA             52.9           0.76
ParC/ParE - bacteria                                          26           NA             61.5           1.00
ParC/ParE - α-proteobacteria                                  12           NA             66.6           1.40
ParC/ParE - Gram positive bacteria                            14           NA             57.1           1.26
DNA polymerase III E2/E3 - bacteria                           20           NA             45             0.82

(a) Number of proteins in a family of interacting proteins (e.g. number of columns in the corresponding similarity matrix).

A plot of the mutual information values against the prediction accuracy (bottom panel of Figure 5(B)) shows a clear positive correlation (R = 0.7; accuracy = (63.29 × MI) − 7.35), significantly outperforming random expectations and indicating that mutual information can be used as an independent measure of the prediction accuracy. A mutual information value of 0.9 corresponds roughly with a stringent prediction accuracy of 50%; a mutual information value of 1.3 corresponds to ~75% accuracy. The effective accuracies consistently exceed these values. The trend line from the simulations agrees within error with the actual protein interactions examined, indicating that the mutual information measure correctly models both phylogenetic tree complexity and similarity, and is an appropriate measure for the prediction of protein interaction partners.
Discussion
Here, we present an automated method to predict protein interaction partners based upon similarity between the phylogenetic trees of interacting proteins. The method is effective, especially when combined with a quantitative score, arising from an information-theoretic analysis of the complexity of the phylogenetic trees and their similarity to each other, that correctly predicts the method’s performance. Although we have specifically focused on interacting protein families of identical size, the method is easily generalized to families of different sizes by finding the subset of proteins in the larger family that best matches the proteins in the smaller family. Also, we have presented an approach based on optimization; it is reasonable to expect that methods of lower algorithmic complexity are available. Although we describe the hardest case for the algorithm, in which any protein can interact with any partner, in practice a branch-and-bound approximation is likely to greatly reduce the search space and improve the algorithm’s performance. This improvement could be made by allowing similarity matrix columns to be exchanged only between proteins of the same species. However, for the case in which all proteins derive from one organism (for example, the human chemokines and receptors), such an improvement is ineffective, and algorithmic complexity will have to be reduced by other approaches.
Simulations of protein evolution indicate when the alignment of phylogenetic trees is expected to be informative. For low complexity trees, proteins are not uniquely different from each other; the consequence of this trend is that little information is stored in the tree that allows it to be oriented unambiguously to another tree. For complex phylogenetic trees, proteins have sufficiently unique patterns of similarity that alignments of such trees are unambiguous and more likely to lead to successful predictions, as shown in Figure 5.
These trends reflect not the degree of co-evolution of the interacting partners, but rather the intrinsic ambiguities in matching up trees in this fashion. The mutual information calculation accounts for this trend, providing a quantitative measure of the trees’ agreement with each other as well as their intrinsic complexity. With the mutual information scoring technique, the importance of tree structure can be exploited to improve predictions: the precise proteins included in an analysis, or the organisms from which they derive, can be chosen to maximize the phylogenetic trees’ mutual information, thereby enhancing the accuracy of predicted interactions. Many of the 34 examples in Table 2 represent just such experiments. For example, matching all of the E. coli two-component sensors against all of the two-component regulators produces a low mutual information score (0.39) and a low prediction accuracy (7%), but limiting the analysis to the Cit-type regulator/sensor subfamilies results in a higher mutual information score (0.77) and correspondingly higher accuracy (100%).
When the information content of the trees is high, the correct interaction partners might be easily predictable simply by examining the trees. In practice, manual tree comparisons are often non-trivial and provide no information about the confidence to be placed in the predictions, as illustrated by the Gyr/Par trees of Figure 1(B). The mutual information between these trees is quite high, even though the topologies of the Gyr/Par subtrees are identical to each other. Finding interaction partners by visual examination of the trees requires careful attention to subtle changes in the branch lengths. However, the matrix alignment method offers an objective, quantitative measure of the significance of the predicted interactions. Most important, the approach is automated, allowing it to be applied on a large scale to many protein families.
Accompanying the matrix alignment algorithm is a new method, termed 3D embedding, for visualizing protein families and interactions between them. For one protein family, this method visually summarizes the evolutionary relationships among the proteins. For two interacting protein families, these 3D embeddings can be superimposed, and the potential interaction partners can be directly visualized. 3D embedding opens the possibility of rank-ordering predicted interaction partners, such as by their spatial distance from each other. The method potentially allows the least squares alignment of two families on the basis of known protein interactions, followed by the prediction of interactions between the proteins not specifically used to generate the alignment, allowing the analysis of protein families of unequal sizes, and possibly even proteins with multiple binding partners.
Finally, the 3D embedding method illustrates how matrix alignment sometimes proceeds in a surprising fashion. As an example, it correctly pairs the C. crescentus GyrA and GyrB proteins, in spite of the fact that the two proteins sit in quite dissimilar relationships to the rest of their respective families (Figure 4(B)). However, the interaction is presumably predicted between the C. crescentus proteins because all other protein pairs match better, thereby forcing the C. crescentus proteins together in spite of the poor fit.
A model for the evolution of interacting proteins
Proteins are constrained to maintain their interactions and therefore have to co-evolve with their interaction partners.23 However, the fact that the method presented here works illustrates an additional aspect of the evolution of interacting proteins. Two models can be considered for the evolution of interacting proteins, which contrast in the degree of coupling between the evolution of protein interaction specificity and the ancestral genetic events producing protein families (specifically, we consider the case of paralogs). Both models begin with an ancestral pair of interacting proteins. In the first model, the progenitor proteins are duplicated, and the duplicated proteins (paralogs) are free to evolve new interaction partners, such as by mutation and selection. After multiple duplications and evolution of new interaction specificities, two families of interacting proteins result such that the correlation in position in the phylogenetic trees is lost between pairs of paralogs with their corresponding interaction partners. In short, when gene duplications precede the evolution of interaction specificity, the phylogenetic trees of the interaction partners are no longer alignable in the fashion of the trees examined here.
However, in an alternate model, interacting protein partners
are duplicated in a correlated fashion through the course of
evolution. The interaction specificity is maintained or created in a
process tightly coupled to the process of gene duplication. Only in
this case will the phylogenetic trees of the interacting protein
families be similar. The data presented here support this second
model, suggesting that interacting proteins in these families are not
simply duplicated and freed to evolve new interaction partners, but
rather that interacting partners are duplicated in coupled
processes, leading to a measurable association between the
specificity of protein interaction partners and the genetic
relationships of their corresponding genes.
Materials and Methods
Sequence alignments, similarity matrices, and phylogenetic trees
Sequences from SwissProt²⁴ were aligned using CLUSTALW 1.7.
Similarity matrices were calculated from the multiple sequence
alignment using CLUSTALW.²⁵ Each similarity matrix entry s_ij
represents the evolutionary distance between a pair of proteins in a
sequence family after corrections for multiple mutations per amino
acid residue.²⁶ Similarity matrices for pairs of interacting protein
families were input to the MATRIX matrix alignment algorithm
described below. Unrooted phylogenetic trees were calculated via
neighbor joining using PHYLIP.²⁷ Chemokine interactions were defined
as described by Oppenheim and Feldmann.²⁸ Other interactions were
assigned according to the KEGG database, version 22.0.²¹
Optimal alignment of similarity matrices
Pairs of similarity matrices were compared by their root mean
square difference (r.m.s.d.), calculated as:

$$\mathrm{r.m.s.d.} = \sqrt{\frac{2}{n(n-1)} \sum_{j=2}^{n} \sum_{i=1}^{j-1} (a_{ij} - b_{ij})^{2}},$$

where a_ij and b_ij represent equivalent elements of the two
similarity matrices, and n is the number of proteins in each
family. Smaller r.m.s.d. indicates greater agreement between two
matrices.
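As a concrete illustration of the r.m.s.d. defined above, the following minimal Python sketch evaluates it over the i < j elements of two symmetric matrices. The function name `matrix_rmsd` and the nested-list matrix representation are assumptions made for this example; they are not taken from the MATRIX program.

```python
import math


def matrix_rmsd(a, b):
    """r.m.s.d. over the n(n-1)/2 off-diagonal pairs (i < j) of two n x n matrices."""
    n = len(a)
    if n < 2 or len(b) != n:
        raise ValueError("matrices must have the same size, with n >= 2")
    total = 0.0
    for j in range(1, n):          # j = 2..n in the 1-based notation of the text
        for i in range(j):         # i = 1..j-1
            total += (a[i][j] - b[i][j]) ** 2
    # Dividing by n(n-1)/2 pairs is the same as multiplying by 2 / (n(n-1)).
    return math.sqrt(2.0 * total / (n * (n - 1)))
```

Only the upper triangle is visited because the matrices are symmetric and the diagonal carries no information about relationships between distinct proteins.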
To align matrices, the order of the rows in one matrix (and
therefore of the columns, as each matrix is symmetric) is optimized
with simulated annealing²⁹ to minimize the r.m.s.d. between the
matrices. One similarity matrix (family A in Figure 2) remains
unchanged. In the second similarity matrix (family B in Figure 2),
pairs of rows (and their symmetric columns) are randomly chosen and
swapped, and the resulting change in r.m.s.d. is evaluated. If the
r.m.s.d. decreases, the swap is kept. If the r.m.s.d. increases, the
swap is kept with a probability p governed by an external control
variable T, such that p = exp(-d/T), where d equals the increase in
r.m.s.d. with the swap. The control variable T is initialized such
that p is first set to 0.8; T is then decreased geometrically with
each iteration (T_new = 0.95 T_old). This process is iterated until
the probability of accepting an increase is less than 10%.
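The annealing procedure described above might be sketched in Python as follows. This is an illustrative reconstruction, not the authors' MATRIX code. Several details are assumptions because the text does not specify them: the reference increase `d_ref` used to convert the 0.8 and 10% acceptance probabilities into temperatures (here the mean r.m.s.d. change over 50 trial swaps), the number of swaps attempted at each temperature, and the choice to return the best ordering seen rather than the final one. All function and parameter names are hypothetical.

```python
import math
import random


def rmsd(a, b, order):
    """r.m.s.d. between a and b after re-ordering b's rows/columns by `order`."""
    n = len(a)
    total = 0.0
    for j in range(1, n):
        for i in range(j):
            total += (a[i][j] - b[order[i]][order[j]]) ** 2
    return math.sqrt(2.0 * total / (n * (n - 1)))


def align_matrices(a, b, rng, p_start=0.8, p_stop=0.1, cooling=0.95,
                   swaps_per_temperature=100):
    """Return (order, score): row i of family A is paired with row order[i] of B."""
    n = len(a)
    order = list(range(n))
    current = rmsd(a, b, order)
    best_order, best = order[:], current

    # Assumption: calibrate T on the mean r.m.s.d. change of random trial swaps.
    deltas = []
    for _ in range(50):
        i, j = rng.sample(range(n), 2)
        order[i], order[j] = order[j], order[i]
        deltas.append(abs(rmsd(a, b, order) - current))
        order[i], order[j] = order[j], order[i]   # undo the trial swap
    d_ref = sum(deltas) / len(deltas)
    if d_ref == 0.0:
        return best_order, best                   # every ordering scores the same
    t = -d_ref / math.log(p_start)                # so that exp(-d_ref / t) = p_start

    while math.exp(-d_ref / t) >= p_stop:
        for _ in range(swaps_per_temperature):
            i, j = rng.sample(range(n), 2)
            order[i], order[j] = order[j], order[i]
            trial = rmsd(a, b, order)
            d = trial - current
            if d <= 0 or rng.random() < math.exp(-d / t):
                current = trial                   # keep the swap
                if current < best:
                    best_order, best = order[:], current
            else:
                order[i], order[j] = order[j], order[i]   # reject: undo the swap
        t *= cooling                              # T_new = 0.95 * T_old
    return best_order, best
```

Swapping two entries of `order` is equivalent to swapping the corresponding pair of rows and their symmetric columns in the family B matrix, without copying the matrix itself.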
Following simulated annealing, interactions are predicted
between proteins heading the corresponding rows of the two
similarity matrices. As the number of possible re-ordered matrices
grows factorially with the number of proteins in the matrix, this
method does not guarantee the correct solution for large matrices
(>15 proteins). In these cases, the protocol is repeated 100 times,
and the frequency of occurrence of a given interacting protein
pair is calculated and tabulated in order to test the
reproducibility of the predictions. Interactions are then assigned
between the most frequent protein pairings.
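The tabulation of repeated runs could look like the short Python sketch below. It assumes each run is summarized as a list `order`, in which `order[i]` is the family B protein paired with family A protein `i`; this representation and the function name `consensus_pairings` are illustrative choices, not part of the original protocol.

```python
from collections import Counter


def consensus_pairings(runs):
    """Map each family A index to (most frequent family B partner, frequency)."""
    n = len(runs[0])
    counts = Counter((i, partner)
                     for order in runs
                     for i, partner in enumerate(order))
    result = {}
    for i in range(n):
        # Ties are resolved in favour of the partner seen first.
        partner, hits = max(((p, c) for (row, p), c in counts.items() if row == i),
                            key=lambda pc: pc[1])
        result[i] = (partner, hits / len(runs))
    return result
```

Because each protein's most frequent partner is chosen independently, the result is not guaranteed to be one-to-one when runs disagree; the reported frequencies indicate how reproducible each pairing is.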
3D embedding of protein sequence families
Proteins were represented as mass-less points in space connected
by springs whose equilibrium lengths were equal to the proteins'
pair-wise similarities (s_ij). Each protein in a sequence family was
initially assigned to a random position, then moved in an iterative
fashion to minimize the action of spring forces. At equilibrium, the
proteins are placed such that distances separating the proteins
(d_ij) agree maximally with the similarities in the similarity
matrix, except for the distortion inherent in mapping
high-dimensional relationships into three-dimensional space. Pairs
of interacting protein families visualized in this fashion were
superimposed by rigid body least squares fit of one family onto the
other using SwissPDBViewer,³⁰ minimizing the distance between
predicted or known interaction partners. Note that the
possibility exists for positioning a set of proteins in
mirror-image embeddings, complicating alignment of interacting
proteins. In practice, repeating the embedding to achieve compatible
handedness with the interacting proteins can circumvent this
problem.
Simulations of the evolution of protein interactions
Pairs of amino acid sequences of length 300, representing
ancestral interacting proteins (sequences 1A and 1B), were randomly
generated using naturally occurring amino acid frequencies. The
evolution of a sequence pair into two families of interacting
paralogs was then modeled by successive duplication, with mutation,
of a protein from family A and the corresponding protein from family
B, forcing parallel duplications in the two families. Mutations were
randomly introduced at each duplication with the amino acid
substitution frequencies of a PAM25 substitution matrix,³¹ which has
the effect of mutating ~25% of the amino acid residues per protein
per duplication. In this manner, the underlying pattern of
duplications is held constant between two families, and point
mutations in each sequence are modeled.
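A simplified version of this coupled-duplication simulation is sketched below in Python. It deliberately departs from the procedure above in labeled ways: residues are drawn uniformly rather than from naturally occurring frequencies, mutation replaces each residue with a uniformly chosen one at a fixed rate instead of using PAM25 substitution frequencies, only the new copy is mutated, and the parent for each duplication is chosen uniformly at random. All names are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq, rate, rng):
    """Copy seq, redrawing each residue with probability `rate`.

    Assumption: uniform replacement, so a redraw can return the same residue
    and the effective mutation rate is slightly below `rate`.
    """
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else aa
                   for aa in seq)


def simulate_families(size, length, rate, rng):
    """Grow two families in parallel; member k of A corresponds to member k of B."""
    fam_a = ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))]
    fam_b = ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))]
    while len(fam_a) < size:
        parent = rng.randrange(len(fam_a))    # same parent index in both families
        fam_a.append(mutate(fam_a[parent], rate, rng))
        fam_b.append(mutate(fam_b[parent], rate, rng))
    return fam_a, fam_b
```

For example, `simulate_families(10, 300, 0.25, random.Random(0))` yields two ten-member families whose duplication histories are identical by construction, so the correct pairing is simply index k with index k.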
After a simulation, the family A sequences were aligned to each
other, as were the family B sequences. The similarity matrix for
each family was calculated (as for actual proteins) and matrix
alignment performed. Correct predictions were assigned between
equivalent proteins (e.g. pairing 1A to 1B, the first duplicate of
1A to the first duplicate of 1B, etc.). Simulations were repeated
with a parameter p0 controlling the choice of ancestor for each new
paralog, as described in the text. In Figure 5(A), simulations were
performed ten times per data point plotted for protein families of
ten members; in Figure 5(B), 100 simulations per value of p0 were
performed for a given family size, sampling from p0 = 0.0 to 1.0 in
0.1 increments.
Information theoretic-based measure of agreement between phylogenetic trees
The agreement between pairs of phylogenetic trees was calculated
using an information theory²²-based metric, mutual information,
which accounts both for the similarity matrices' agreement and for
their intrinsic information content. The information content of a
similarity matrix is assessed as the entropy H(x) of the
distribution of values in the similarity matrix, calculated as:

$$H(x) = -\sum_{x} p(x) \log p(x),$$

where x represents bins of values drawn from a similarity
matrix, and p(x) represents the frequency with which those values
are observed in the matrix. Given two similarity matrices, the
joint entropy H(x, y) represents the extent of their agreement,
calculated as:

$$H(x, y) = -\sum_{x, y} p(x, y) \log p(x, y),$$

where x, y represents bins of pairs of values in equivalent
positions of the two similarity matrices, and p(x, y) represents
the relative frequency with which pairs of values are observed in
equivalent positions of the two matrices.
The mutual information (MI) between two matrices, representing
their overall agreement, is calculated as:

$$MI = H(x) + H(y) - H(x, y),$$

accounting both for the complexity of the phylogenetic trees (in
the H(x) and H(y) terms, which are larger with more complex trees)
and their similarity (in the H(x, y) term, which is smaller given
better agreement). A high mutual information score indicates a pair
of complex and mutually consistent phylogenetic trees.
Acknowledgements
The authors acknowledge support from grant F-1515 of the Welch
Foundation, the Texas Advanced Research Program, the National
Science Foundation, and a Dreyfus New Faculty Award and Packard
Fellowship for E.M.M.
References
1. Pruitt, K. D. & Maglott, D. R. (2001). RefSeq and LocusLink:
NCBI gene-centered resources. Nucl. Acids Res. 29, 137–140.
2. Hsu, S. Y., Nakabayashi, K., Nishi, S., Kumagai, J., Kudo, M.,
Sherwood, O. D. & Hsueh, A. J. (2002). Activation of orphan
receptors by the hormone relaxin. Science, 295, 671–674.
3. Saito, Y., Nothacker, H. P., Wang, Z., Lin, S. H., Leslie, F.
& Civelli, O. (1999). Molecular characterization of the
melanin-concentrating-hormone receptor. Nature, 400, 265–269.
4. Chambers, J., Ames, R. S., Bergsma, D., Muir, A., Fitzgerald,
L. R., Hervieu, G. et al. (1999). Melanin-concentrating hormone is
the cognate ligand for the orphan G-protein-coupled receptor SLC-1.
Nature, 400, 261–265.
5. Sprinzak, E. & Margalit, H. (2001). Correlated
sequence-signatures as markers of protein–protein interaction.
J. Mol. Biol. 311, 681–692.
6. Pazos, F. & Valencia, A. (2002). In silico two-hybrid system
for the selection of physically interacting protein pairs.
Proteins: Struct. Funct. Genet. 47, 219–227.
7. Lockless, S. & Ranganathan, R. (1999). Evolutionarily
conserved pathways of energetic connectivity in protein families.
Science, 286, 295–299.
8. Lichtarge, O., Bourne, H. & Cohen, F. (1996). An evolutionary
trace method defines binding surfaces common to protein families.
J. Mol. Biol. 257, 342–358.
9. Jones, S. & Thornton, J. (1997). Prediction of
protein–protein interaction sites using patch analysis.
J. Mol. Biol. 272, 133–143.
10. Huynen, M., Snel, B., Lathe, W. & Bork, P. (2000).
Predicting protein function by genomic context: quantitative
evaluation and qualitative inferences. Genome Res. 10, 1204–1210.
11. Dandekar, T., Snel, B., Huynen, M. & Bork, P. (1998).
Conservation of gene order: a fingerprint of proteins that
physically interact. Trends Biochem. Sci. 23, 324–328.
12. Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G. D. &
Maltsev, N. (1999). The use of gene clusters to infer functional
coupling. Proc. Natl Acad. Sci. 96, 2896–2901.
13. Enright, A., Iliopoulos, I., Kyrpides, N. & Ouzounis, C.
(1999). Protein interaction maps for complete genomes based on gene
fusion events. Nature, 402, 86–90.
14. Marcotte, E., Pellegrini, M., Ng, H., Rice, D., Yeates, T.
& Eisenberg, D. (1999). Detecting protein function and
protein–protein interactions from genome sequences. Science, 285,
751–753.
15. Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg,
D. & Yeates, T. O. (1999). Assigning protein functions by
comparative genome analysis: protein phylogenetic profiles.
Proc. Natl Acad. Sci. 96, 4285–4288.
16. Fryxell, K. J. (1996). The coevolution of gene family trees.
Trends Genet. 12, 364–369.
17. Goh, C., Bogan, A., Joachimiak, M., Walther, D. & Cohen,
F. (2000). Co-evolution of proteins with their interaction partners.
J. Mol. Biol. 299, 283–293.
18. Koretke, K., Lupas, A., Warren, P., Rosenberg, M. &
Brown, J. (2000). Evolution of two-component signal transduction.
Mol. Biol. Evol. 17, 1956–1970.
19. Pazos, F. & Valencia, A. (2001). Similarity of phylogenetic
trees as indicator of protein–protein interaction.
Protein Eng. 14, 609–614.
20. Hughes, A. L. & Yeager, M. (1999). Coevolution of the
mammalian chemokines and their receptors. Immunogenetics, 49,
115–124.
21. Kanehisa, M. (1996). Toward pathway engineering: a new
database of genetic and molecular pathways. Sci. Technol. Jpn, 59,
34–38.
22. Shannon, C. E. (1948). A mathematical theory of
communication. Bell Syst. Tech. J. 27, 379–423.
23. Fraser, H. B., Hirsh, A. E., Steinmetz, L. M., Scharfe, C.
& Feldman, M. W. (2002). Evolutionary rate in the protein
interaction network. Science, 296, 750–752.
24. Bairoch, A. & Apweiler, R. (1997). The SWISS-PROT protein
sequence data bank and its supplement TrEMBL. Nucl. Acids Res. 25,
31–36.
25. Thompson, J., Higgins, D. & Gibson, T. (1994). CLUSTAL W:
improving the sensitivity of progressive multiple sequence
alignment through sequence weighting, position-specific gap
penalties and weight matrix choice. Nucl. Acids Res. 22,
4673–4680.
26. Kimura, M. (1983). The Neutral Theory of Molecular Evolution,
Cambridge University Press.
27. Felsenstein, J. (1993). PHYLIP 3.5c (Phylogeny Inference
Package), University of Washington, Seattle.
28. Oppenheim, J. J. & Feldmann, M. (2001). Cytokine
reference, a compendium of cytokines and other mediators of
host defense. Chemokine Reference, Academic Press, San Diego.
29. Kirkpatrick, S., Gelatt, C. D. & Vecchi, M. P. (1983).
Optimization by simulated annealing. Science, 220, 671–680.
30. Guex, N., Diemand, A. & Peitsch, M. C. (1999). Protein
modelling for all. Trends Biochem. Sci. 24, 364–367.
31. Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. (1978).
Atlas of Protein Sequence and Structure (Dayhoff, M. O., ed.), vol.
5, pp. 345–352, National Biomedical Research Foundation, Washington
DC.
Edited by F. E. Cohen
(Received 17 September 2002; received in revised form 17
December 2002; accepted 10 January 2003)