Coverage of protein domain families with structural protein–protein interactions: Current progress and future trends


Progress in Biophysics and Molecular Biology 116 (2014) 187–193

Contents lists available at ScienceDirect

journal homepage: www.elsevier.com/locate/pbiomolbio

Original research

Alexander Goncearenco, Benjamin A. Shoemaker, Dachuan Zhang, Alexey Sarychev, Anna R. Panchenko*

Computational Biology Branch of the National Center for Biotechnology Information in Bethesda, Maryland, United States

Article info

Article history: Available online 13 June 2014

Keywords: Protein–protein interactions; Protein binding sites; Coverage of protein families; PPI; Structural genomics

* Corresponding author. E-mail address: [email protected] (A.R. Panchenko).

http://dx.doi.org/10.1016/j.pbiomolbio.2014.05.005
0079-6107/Published by Elsevier Ltd.

Abstract

Protein interactions have evolved into highly precise and regulated networks, adding an immense layer of complexity to cellular systems. The most accurate atomistic description of protein binding sites can be obtained directly from structures of protein complexes. The availability of structurally characterized protein interfaces significantly improves our understanding of interactomes, and the progress in structural characterization of protein–protein interactions (PPIs) can be measured by calculating the structural coverage of protein domain families. We analyze the coverage of protein domain families (defined according to the CDD and Pfam databases) by structures, structural protein–protein complexes and unique protein binding sites. Structural PPI coverage of currently available protein families is about 30%, without any signs of saturation in coverage growth dynamics. Given the current growth rates of domain databases and structural PPI deposition, complete domain coverage with PPIs is not expected in the near future. As a result of this study we identify families without any protein–protein interaction evidence (listed on a supporting website, http://www.ncbi.nlm.nih.gov/Structure/ibis/coverage/) and propose them as potential targets for structural studies with a focus on protein interactions.

1. Introduction

To understand the mechanisms of protein function one needs to explore proteins at the molecular level and at the same time analyze their intricate interactions at the interactome level. Although recent advances in experimental high-throughput (HTP) methods have produced unprecedented amounts of protein–protein interaction data, current 'interactome' datasets still suffer from a high rate of false positives and low coverage. As a result of these drawbacks, comprehensive mapping of the protein interactome of a given organism is still a daunting task, with the large majority of protein–protein interactions left to be determined (Venkatesan et al., 2009). Verification of the obtained interactions is essential to avoid the pitfalls of high-throughput studies and the further propagation of interaction annotation errors. Structures from the Protein Data Bank (PDB) (Dutta et al., 2009) and related databases (Davis and Sali, 2005; Juettemann and Gerloff, 2011; Kundrotas and Alexov, 2007; Madej et al., 2012; Xu et al., 2006) may provide the most reliable atomic-resolution data for protein molecules and their complexes. Ideally, high-throughput data on protein partnerships should be complemented with the details of binding site locations and physicochemical properties of interaction interfaces derived from structures.

Since the first structure of myoglobin was solved in 1958, a large number of protein structures have been determined and deposited in the PDB (Dutta et al., 2009). Approximately half of these structures represent protein complexes in which protein–protein interfaces can be directly observed and protein-binding sites can be extracted. Comparative structural analyses of different protein complexes reveal a recurrence of certain sequence motifs and binding arrangements/modes on protein–protein interfaces (Janin and Rodier, 1995; Jones et al., 2000; Shoemaker et al., 2006). Although binding arrangements evolve quite rapidly as proteins diverge, certain binding modes are conserved among homologs and in some cases even among non-homologous proteins (Aloy et al., 2003; Dayhoff et al., 2010; Keskin et al., 2004; Korkin et al., 2005; Zhang et al., 2010). Such conserved binding modes reflect the existence of characteristic features on binding interfaces which, in turn, may modulate binding by stabilizing complexes and by providing specific sites for recognition and/or for post-translational modifications (Bhaskara and Srinivasan, 2011; Hashimoto and Panchenko, 2010; Nishi et al., 2011; Reimand et al., 2012). Moreover, some binding sites are found to be promiscuous and involved in interactions with many different proteins (Nobeli et al., 2009), a key functional feature of hubs in interaction networks. Several methods have been developed that use such recurrent motifs to predict protein–protein interactions and to annotate binding sites (Shoemaker et al., 2010, 2012; Tuncbag et al., 2011; Xu and Dunbrack, 2011).

The progress in structural biology and the trends of PDB growth (Berman et al., 2013) are periodically reviewed (Montelione, 2012) and are typically evaluated by analyzing the structural coverage of protein domain families and superfamilies (Finn et al., 2014; Garcia-Serna et al., 2006; Marchler-Bauer et al., 2013; Mistry et al., 2013). While there seems to be agreement that most structural folds have been sampled and are represented by PDB structures, it is still unclear how fully the protein–protein binding mode arrangements are characterized.

Here we consider protein domains as units of protein–protein interactions and use strict criteria to define protein–protein interfaces and binding sites. Based on the binding-site comparisons stored in the IBIS database (Shoemaker et al., 2010, 2012), we identify unique binding sites in protein domains involved in protein–protein interactions. We comprehensively assess the diversity of protein interactions and binding sites in the context of their growth in protein domain families over the last 55 years. Our analysis of unique binding site coverage within protein families from the Conserved Domain Database (CDD) and Pfam identified families with no structural evidence of protein–protein interactions. We provide a list of these families, which could and should be targeted by structural efforts.

2. Materials and methods

2.1. Dataset

We downloaded biological assemblies (so-called biounits) and structure deposition dates from the MMDB database (Madej et al., 2012). MMDB bioassembly data and deposition dates were in turn taken from the PDB database (Berman et al., 2000). The difference in the number of chains between the asymmetric units (ASU) and biounits is shown in Supplementary Information Fig. S1. The deposition dates were used to reconstruct the growth of structural data. We did not consider structures that were obsolete or had been withdrawn from the PDB. We downloaded the Conserved Domain Database (CDD) (Marchler-Bauer et al., 2013) version 3.11 with 9860 domain models curated at NCBI. We did not consider any CDD models imported from other sources. In this paper the CDD models are called "families". A CDD superfamily represents a set of similar domain models; we only considered those superfamilies which contained at least one manually curated model. We also downloaded the Pfam-A curated subset of the Pfam database version 27, which contains 14,831 families (Finn et al., 2014). It should be mentioned that not all Pfam families correspond to protein domains; some of them correspond to short repeats and sequence motifs. The CDD domains were mapped onto structures using the IBIS database (Shoemaker et al., 2010, 2012). The structural mapping of Pfam families was downloaded from the Pfam FTP server (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/mappings/pdb_pfam_mapping.txt).

2.2. Identifying unique binding sites

We analyzed protein–protein interactions between domains according to the criteria implemented in the IBIS database. Namely, CDD domains were mapped onto sequences of protein chains to create "footprint" regions. A protein–protein interaction was defined between two domains (footprint regions) from two different chains if each domain had at least five contacting residues, with a contact defined by a distance of at most 4 Å between heavy atoms. We did not consider polypeptide chains with fewer than 20 amino acids as interaction partners. A binding site of a domain was defined as the set of interacting residues on one side of the interaction interface.
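The interface criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the IBIS implementation: the function names and the input layout (a dictionary from residue identifier to a list of heavy-atom coordinates) are hypothetical, and the caller is assumed to pass two domain footprints that come from different chains.

```python
from itertools import product
from math import dist

CONTACT_CUTOFF = 4.0      # angstroms between heavy atoms (from the text)
MIN_CONTACT_RESIDUES = 5  # contacting residues required in each domain (from the text)
MIN_CHAIN_LENGTH = 20     # chains shorter than this are not partners (from the text)

def contacting_residues(domain_a, domain_b, cutoff=CONTACT_CUTOFF):
    """Return the binding sites (sets of residue ids) on each side.

    Each domain is a dict: residue id -> list of (x, y, z) heavy-atom
    coordinates. This input layout is an assumption of the sketch.
    """
    site_a, site_b = set(), set()
    for (res_a, atoms_a), (res_b, atoms_b) in product(domain_a.items(),
                                                      domain_b.items()):
        # Two residues are in contact if any pair of heavy atoms is close enough.
        if any(dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
            site_a.add(res_a)
            site_b.add(res_b)
    return site_a, site_b

def is_interface(domain_a, domain_b, chain_len_a, chain_len_b):
    """Apply the chain-length and contact-count criteria to one domain pair."""
    if min(chain_len_a, chain_len_b) < MIN_CHAIN_LENGTH:
        return False
    site_a, site_b = contacting_residues(domain_a, domain_b)
    return (len(site_a) >= MIN_CONTACT_RESIDUES
            and len(site_b) >= MIN_CONTACT_RESIDUES)
```

The all-against-all atom comparison is quadratic and is chosen here for readability; a production pipeline would normally use a spatial index.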

Then we collected domain footprint regions that were mapped to the same CDD domain superfamily, structurally superimposed them and ensured that a sufficient fraction of the domain footprints was structurally aligned. Subsequently, we clustered their corresponding binding sites based on sequence and structural similarity between the sites and on the sequence conservation profile of binding site residues (Shoemaker et al., 2010). Binding sites were clustered by a hierarchical complete-linkage clustering procedure. To choose the sliding cutoff that defines clusters, we used a function which maximizes the mean similarity of members within a cluster and minimizes the complexity of the description provided by cluster membership (the number of bits required to describe the data) (Slonim et al., 2005). Binding sites and conserved binding site clusters can be explored online in the IBIS database (http://www.ncbi.nlm.nih.gov/Structure/ibis/ibis.cgi). Such clustering makes it possible to assess the uniqueness of each binding site. Binding sites from different clusters are hereafter called unique. We consider a binding site novel if no similar binding sites (from the same binding site cluster) were available in the PDB prior to its deposition date. For clarity, we illustrate protein–protein interfaces, binding sites and binding site clusters in Fig. 1.
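Once cluster membership is known, the novelty rule stated above reduces to a comparison of deposition dates within each cluster. The following Python sketch assumes that the clustering has already been done; the record layout (site id, cluster id, deposition date) and the function name are hypothetical. Sites of one cluster deposited on the same earliest date are all flagged as novel, which is one reading of "prior to its deposition date".

```python
from datetime import date

def flag_novel_sites(sites):
    """Map each site id to True if no site of its cluster was deposited earlier.

    sites: list of (site_id, cluster_id, deposition_date) tuples; this
    layout is an assumption made for illustration.
    """
    earliest = {}
    for _, cluster_id, deposited in sites:
        if cluster_id not in earliest or deposited < earliest[cluster_id]:
            earliest[cluster_id] = deposited
    return {site_id: deposited == earliest[cluster_id]
            for site_id, cluster_id, deposited in sites}

# Hypothetical records: two sites in cluster "c1", one in cluster "c2".
example_sites = [
    ("s1", "c1", date(2001, 5, 1)),
    ("s2", "c1", date(2005, 1, 1)),
    ("s3", "c2", date(2010, 3, 3)),
]
novelty = flag_novel_sites(example_sites)
```

In this toy input, `s1` and `s3` are the first members of their clusters and are flagged as novel, while `s2` is not.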

2.3. Analysis of domain database and structure growth

The average yearly growth rate of the CDD database (1200 families per year) was estimated by counting the number of new families and superfamilies that were added during the last five years. We did not count families that were eventually removed from the database, and only considered NCBI-curated families and superfamilies containing at least one NCBI-curated CDD family. We estimated the growth of the Pfam database (900 new families per year) from Pfam release 23 in 2008 to release 27 in 2013. More details on database growth and coverage growth rates are provided in Supplementary Information Table S1.

The structural coverage was analyzed for families in the most recent releases, CDD 3.11 and Pfam 27, using the PDB deposition dates as a reference point, which allowed us to look back in time. Each family was assigned two dates: (i) the year when the first structure matching the family was deposited and (ii) the year when the first structure with a PPI complex was deposited. Then we aggregated the number of covered families by year. The analysis of CDD superfamilies followed the same logic, considering a superfamily to be structurally covered when at least one of its families had a representative structure. We assigned CDD family annotations to structures using the most specific best-matching family models. Due to the hierarchical nature of family classifications in CDD, the models representing intermediate nodes in the hierarchy might appear to lack structural coverage, and therefore the number of CDD families without any structural representatives may be slightly overestimated.
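The two-date bookkeeping and the yearly aggregation can be illustrated with a short Python sketch. The family identifiers and years below are made up, and the function names are hypothetical; the sketch only shows how a cumulative coverage curve and the structure-to-PPI lag follow from the two dates assigned to each family.

```python
def cumulative_coverage(first_year, total_families, years):
    """Fraction of families covered by the end of each listed year.

    first_year: dict family id -> year of the first qualifying deposition;
    families that were never covered are simply absent.
    """
    curve = []
    for year in years:
        covered = sum(1 for y in first_year.values() if y <= year)
        curve.append((year, covered / total_families))
    return curve

def ppi_lag(first_structure, first_ppi):
    """Years between the first structure and the first PPI complex per family."""
    return {fam: first_ppi[fam] - first_structure[fam]
            for fam in first_ppi if fam in first_structure}

# Hypothetical input: four families, one of them ("famD") without any structure.
first_structure = {"famA": 1990, "famB": 1992, "famC": 1992}
first_ppi = {"famA": 1993, "famC": 1992}
years = range(1990, 1994)

structure_curve = cumulative_coverage(first_structure, 4, years)
ppi_curve = cumulative_coverage(first_ppi, 4, years)
lags = ppi_lag(first_structure, first_ppi)
```

Because the denominator is the number of families in the most recent release, the curve describes how today's families became covered over time rather than the coverage of the database as it existed in each year.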

3. Results

3.1. Unique binding sites and interfaces

We define protein–protein interfaces based on the contacts between domains located on different chains in macromolecular assemblies. While the interface is a characteristic of a pair of interacting proteins/domains, a protein-binding site describes each interaction partner. Therefore, we express the diversity of protein interactions via the diversity of binding sites. Fig. 1 illustrates binding sites and their conservation among homologous complexes. cAMP-dependent protein kinase type I (PDB 3tnp) is a heterotetramer and consists of two pairs of identical subunits: catalytic (chains C and F, shown in orange) and regulatory (chains B and E, shown in blue and magenta). Each regulatory subunit consists of two domains from the same CAP_ED family (blue and magenta). We illustrate protein–protein interfaces by showing them as spheres, whereas the rest of the protein is shown as a semitransparent surface (Fig. 1a). Each interface consists of two binding sites and is therefore depicted in two colors.

Although there are six interfaces in the complex, only three of them are distinct or unique (shown as dashed lines with numbers 1–3 in Fig. 1b). Consequently, six of the 12 binding sites in the complex are unique (shown as different shapes with different shades). There can be binding sites in other protein complexes that are similar to the ones observed in the example structure. Fig. 1c shows two clusters of protein binding sites from the regulatory subunit for interfaces number 1 and 2. Domains from other protein complexes are structurally superimposed on domains from the regulatory subunits (those that cannot be structurally superimposed are disregarded), and the gaps (dashes) in the sequence alignment of binding site regions indicate residues that are not structurally aligned. Considering binding sites per domain is important because it provides spatially localized binding sites on each domain (separate binding site patches corresponding to different domains in Fig. 1a and b) and distinct protein interfaces. If we considered a whole protein chain as the unit of interaction, there would be only one interface between STKc_PKA in chain C and the two CAP_ED domains in chain B, which does not fully represent the nature of the interaction between these two chains and makes it rather hard to compare binding sites between different protein families where domain recombination is a common evolutionary event.

Fig. 1. Definition of a binding site. (a) Example of protein interfaces and binding sites in cAMP-dependent protein kinase type I (PDB 3tnp). The structure is a heterotetramer where chains C and F represent catalytic subunits while chains B and E represent regulatory subunits. The former subunit has one domain, STKc_PKA (orange); the latter has two domains belonging to the same CAP_ED domain family (blue and magenta, with the linkers between domains shown in gray). (b) Schematic representation of binding sites (encoded by shapes); each side of the interface is shown in a different shade pattern. There are six interfaces; only three of them are unique (dashed lines labeled with numbers in brackets). Out of a total of 12 binding sites, six sites are unique within the complex (distinguished by shade). (c) Examples of two binding site clusters show that there exist binding sites in other complexes similar to the sites in our example (two sites in CAP_ED domains in chain C). The clusters are shown as alignments of binding residues (the residue numbers in PDB 3tnp are shown as column names). The first row in the alignment corresponds to the 3tnp structure.

Fig. 2. Structural coverage and binding sites of CDD superfamilies. (a) Cumulative CDD superfamily coverage. Inset: lag in years between deposition of the first structure representing the superfamily and the structure with at least one observed protein–protein interaction. (b) Number of families in a superfamily versus the number of binding sites. The superfamilies are shown as circles. The size of the circle is proportional to the number of superfamilies (the largest circle contains 216 superfamilies, the smallest one). The gray diagonal shows one-to-one correspondence between the number of sites and the number of families.

3.2. Growth dynamics of coverage of CDD superfamilies by structural complexes

Here we analyze the coverage of CDD superfamilies by structural complexes and unique binding sites, since this characteristic can be used as an indicator of the success of structural genomics efforts. Domain coverage is defined as the fraction of all domain families (or superfamilies) with at least one structure or with structural evidence of PPI with a well-defined binding site (see binding site definitions in Methods). As shown in Fig. 2a, starting from the early 1990s the structural coverage increased dramatically, with two thirds of all superfamilies exemplified by at least one structure by the year 2000. The rate of improvement in structural coverage remains impressive due to the guiding efforts of structural genomics initiatives (Burley et al., 2008), reaching 90% coverage for currently available superfamilies. Fig. 2a also shows the growth of CDD superfamily coverage with protein–protein interactions. In some cases structures are first solved as monomers and only later deposited as PPIs, which creates significant delays between monomer and assembly deposition dates (the time lag in years is shown in the Fig. 2a inset). It may seem that the current superfamily coverage with PPIs of about 75% should provide a comprehensive description of the diversity of binding sites. However, protein binding sites diverge rather fast in evolution and some of them might be characteristic of only specific protein families. Certain superfamilies have very heterogeneous binding sites, while for others the binding sites and binding partners are much more conserved. In general, the diversity of PPI binding sites in a superfamily should depend on the role of protein–protein interactions in its function. Therefore, in the next section we explore the association between the diversity of superfamilies and the number of binding sites.

3.3. Relationship between the diversity of CDD superfamilies and the number of unique binding sites

Next we compared the number of CDD families in a given superfamily with the number of unique binding sites within the superfamily. Fig. 2b shows that there are many superfamilies (shown below the main diagonal) where the number of binding sites is up to an order of magnitude higher than the number of families within the superfamily. Moreover, there is a significant number of superfamilies with only one family but a large number of binding sites. All of these examples constitute families with multiple binding sites. This is consistent with the previous observation that paralogous proteins from the same family tend to bind different partners using different binding sites (Dessailly et al., 2013; Hamp and Rost, 2012; Lewis et al., 2012; Reid et al., 2010). This binding specialization could prevent undesirable cross-talk between similar pathways involving paralogs with different specificities. Such families might benefit from classification into several functional domain subfamilies. Superfamilies containing several different domain families but only one binding site presumably possess a very conserved interface. However, this could also be the result of the limited structural coverage of protein interfaces in different families. About 25% of the superfamilies have no structural evidence of protein–protein interactions and could be potential targets for structural genomics. Based on the two parameters described so far, we have arranged all superfamilies into 12 groups (Table 1). The lists of superfamilies from each group are available on the supporting website, where we provide links to corresponding examples from the CDD and IBIS databases.
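The arrangement into 12 groups follows from crossing three size classes for the number of families with four classes for the number of binding sites, as in Table 1. A minimal Python sketch of that binning is given below; the bin labels mirror the table headings, while the function names and the example superfamilies are hypothetical.

```python
from collections import Counter

def size_bin(count):
    """Assign a count to one of the size classes used in Table 1."""
    if count == 0:
        return "none"
    if count <= 10:
        return "1-10"
    if count <= 100:
        return "11-100"
    return "101+"

def group_superfamilies(superfamilies):
    """Count superfamilies per (family-count bin, binding-site-count bin).

    superfamilies: iterable of (name, n_families, n_binding_sites).
    Every superfamily has at least one family, so the family bin is never
    "none"; that leaves 3 x 4 = 12 possible groups.
    """
    groups = Counter()
    for _, n_families, n_sites in superfamilies:
        groups[(size_bin(n_families), size_bin(n_sites))] += 1
    return groups

# Hypothetical superfamilies, for illustration only.
example = [("sfA", 1, 0), ("sfB", 3, 12), ("sfC", 150, 101), ("sfD", 11, 100)]
groups = group_superfamilies(example)
```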

3.4. Growth dynamics of coverage of CDD and Pfam families with structural complexes

Since the number of families within one superfamily varies greatly, we calculated the structural coverage on the domain family level (Fig. 3a). While about 75% of the superfamilies are covered by structures of complexes (Fig. 2), on average about 45% of the families within a superfamily are covered by at least one structural complex. As is evident from the inset of Fig. 3a, superfamilies with a large number of families are not well covered by either structures or structural complexes. This is especially pronounced for PPI data, where the average coverage barely reaches 20% in large superfamilies.

Considering the large coverage variance within superfamilies in the CDD database, we decided to compare CDD coverage at the domain level with another widely used high-quality domain annotation set, the Pfam database. In terms of absolute numbers, the coverage trends for CDD and Pfam agree surprisingly well with each other. Namely, the average rates of structural coverage are 218 CDD and 210 Pfam families per year. Despite the different philosophies and construction methods employed in these databases, a similar trend persists for the coverage by structural protein complexes (157 CDD and 161 Pfam families are covered by structural protein complexes per year). This similarity suggests that our rate estimates are robust. However, if we consider the fractional structural coverage (the fraction of all CDD and Pfam families with structural evidence), there are certain differences between CDD and Pfam (Fig. 3b), which can be explained by the different sizes of these databases (Table 2). As one can see from comparing Fig. 3a and b, the growth of CDD structural coverage is slower for families than for superfamilies, reaching 50% structural coverage and 30% PPI coverage of families.
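The fractional coverage figures quoted here can be recomputed from the counts reported in Table 2. The short Python sketch below does only that arithmetic; the dictionary layout and function name are illustrative choices, and the numbers are copied from Table 2.

```python
# Counts from Table 2: (with structure and PPI, with structure but no PPI,
# without structure).
TABLE_2 = {
    "CDD superfamilies": (794, 133, 133),
    "CDD families": (2912, 1838, 5110),
    "Pfam families": (3740, 2009, 9082),
}

def coverage(with_ppi, structure_only, no_structure):
    """Return (structural coverage, PPI coverage) as fractions of the total."""
    total = with_ppi + structure_only + no_structure
    return (with_ppi + structure_only) / total, with_ppi / total

for name, counts in TABLE_2.items():
    structural, ppi = coverage(*counts)
    print(f"{name}: structural {structural:.1%}, PPI {ppi:.1%}")
```

For CDD families this gives roughly 48% structural and 30% PPI coverage, in line with the rounded 50% and 30% quoted in the text. The corresponding Pfam fractions are lower, since Pfam has more families overall.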

Table 1
The number of CDD superfamilies grouped by the number of families (rows) and protein–protein interaction binding sites (columns).

Number of families   Total number       Number of binding sites per superfamily
per superfamily      of superfamilies   No known sites   1–10 sites   11–100 sites   101+ sites
1–10                 874                262              421          178            13
11–100               170                9                53           93             15
101+                 16                 0                1            10             5
Total                1060               271              475          281            33

We identified families and superfamilies that are largely underrepresented by solved structural complexes and unique binding interfaces. Such proteins could serve as potential targets for future structural discoveries. We provide the complete lists of families on the supplementary website (http://www.ncbi.nlm.nih.gov/Structure/ibis/coverage/). The families without any structures are probably first on the priority list, followed by the families with structures but without evidence of protein interactions. Families in large superfamilies are of particular interest because they are likely to have diverse binding sites.

4. Discussion

The ultimate objective of improving the structural coverage of protein–protein interfaces and binding sites is an understanding of the molecular mechanisms of protein function and protein recognition. Characterization and classification of structures of protein complexes and protein–protein binding sites are at least as important as the classification of folds. Now that most folds are already exemplified by experimentally determined structures and more than 70% of sequences can be structurally modeled at least in part (Levitt, 2009), the focus in structural biology is gradually shifting towards the characterization of protein assemblies, especially assemblies with novel protein interfaces. Novel complexes and interfaces are especially important for the design of drugs that inhibit protein interactions and for the rational design of protein complexes with novel specificities (Huang et al., 2007; Khare and Fleishman, 2013).

Interactomes represent the networks characterizing all potential protein–protein interactions of a given species. The unit of interaction can correspond to a protein or a domain, and the availability of high-quality structural information on protein–protein and domain–domain interactions is critical for building reliable interactomes. Here we analyze how the structural and PPI coverage of domain families changes over time. We use currently available sets of domain families and trace their structural coverage back in time using the structure deposition dates.

Given that roughly 30% of currently available domain familieshave at least one structural PPI representative, we investigatedwhether the current growth (well approximated by a linearfunction over the last ten years) can be extrapolated to find thepoint in time with near-complete structural coverage of alldomain families. Certainly, we do not expect that all domainfamilies will be covered by structures and/or structural com-plexes. It is a rather unrealistic assumption given that there arefamilies which do not function while interacting with otherprotein partners and there are families comprising intrinsicallydisordered proteins which are not present in PDB. If we take intoaccount the growth rate of domain family databases (about 1200new CDD and 900 new Pfam domain families added per year onaverage), we can estimate that about 3900 CDD families will haveat least one structural PPIs by the year 2020 (2912 CDD familiesare currently covered by PPIs). If the growth rate persists, theCDD and Pfam databases may grow to approximately 17,000 and21,000 families by 2020, respectively. In addition, many struc-tures of proteins with unknown functions are currently beingsolved, which may provide an evidence for new CDD or Pfamfamilies (Marchler-Bauer et al., 2013; Mistry et al., 2013). There-fore, given a current 30% structural PPI coverage and a 23% pro-jected PPI coverage by the year 2020, we do not foresee in thenear future any considerable increase in the percentage of fam-ilies with structural PPIs. This conclusion only holds true if thecurrent domain and structure deposition rates do not change.However, we might expect that the domain family databases willsoon reach a saturation point and will not grow anymore, while,with the help of structural initiatives, the sampling of differentPPIs will continue to increase. Right now, as protein domain

Fig. 3. Structural coverage of CDD and Pfam families. (a) Cumulative average coverage of CDD families within the CDD superfamilies with structural data and with structural evidence of protein interactions; the error bars show standard errors. An inset shows the coverage (for the year 2013) for different groups of superfamilies depending on the number of families in them. Superfamilies with more families tend to have worse coverage: even though almost half of the families in large superfamilies have structures, PPI data are available for only 15–20% of them. (b) Growth of coverage of CDD (solid line) and Pfam families (dashed line) in terms of structures (blue) and protein interactions (orange).

Table 2. Coverage of CDD superfamilies and CDD/Pfam families with structures and structural protein–protein interactions.

| | Total | With structure and PPI | With structure, no PPI | Without structure |
|---|---|---|---|---|
| CDD superfamilies | 1060 | 794 | 133 | 133 |
| CDD families | 9860 | 2912 | 1838 | 5110 |
| Pfam families | 14,831 | 3740 | 2009 | 9082 |
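The counts in Table 2 can be turned into coverage fractions with a few lines of arithmetic. The following is a minimal sketch, not the procedure used in the study; the dictionary `TABLE_2` and the helper `coverage_fractions` are illustrative names introduced here, and the numbers are transcribed from the table.

```python
# Counts transcribed from Table 2:
# (total, with structure and PPI, with structure but no PPI, without structure).
TABLE_2 = {
    "CDD superfamilies": (1060, 794, 133, 133),
    "CDD families": (9860, 2912, 1838, 5110),
    "Pfam families": (14831, 3740, 2009, 9082),
}


def coverage_fractions(row):
    """Return (structural coverage, structural PPI coverage) for one table row."""
    total, with_ppi, structure_only, no_structure = row
    # The three categories are disjoint, so they should add up to the total.
    if with_ppi + structure_only + no_structure != total:
        raise ValueError("row categories do not sum to the total")
    structural = (with_ppi + structure_only) / total  # any structure at all
    ppi = with_ppi / total                            # structure with a PPI
    return structural, ppi


for name, row in TABLE_2.items():
    structural, ppi = coverage_fractions(row)
    print(f"{name}: structure {structural:.1%}, structural PPI {ppi:.1%}")
```

For CDD families this gives a structural PPI coverage of 2912/9860, or about 29.5%, in line with the figure of roughly 30% quoted in the text.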

A. Goncearenco et al. / Progress in Biophysics and Molecular Biology 116 (2014) 187–193

family databases continue to grow, complete structural PPI coverage of domain families remains a moving target.
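The linear extrapolation for CDD families can be sketched as follows. This is only an illustration under stated assumptions: the six-year horizon (2014 to 2020) and the yearly gain in PPI-covered families, back-calculated here from the 3900 families quoted above, are not taken from the original analysis, and the function and variable names are hypothetical.

```python
def project_linear(covered_now, total_now, covered_per_year, total_per_year, years):
    """Extrapolate covered and total family counts linearly; return both and their ratio."""
    covered = covered_now + covered_per_year * years
    total = total_now + total_per_year * years
    return covered, total, covered / total


# Figures quoted in the text for CDD families.
CDD_TOTAL_NOW = 9860      # current number of CDD families
CDD_PPI_NOW = 2912        # families with at least one structural PPI
CDD_NEW_PER_YEAR = 1200   # average number of new CDD families per year
CDD_PPI_2020 = 3900       # estimated PPI-covered families by 2020

# Assumption: a six-year horizon, 2014 to 2020.
YEARS = 6
# Assumption: yearly gain in covered families implied by the 2020 estimate.
ppi_per_year = (CDD_PPI_2020 - CDD_PPI_NOW) / YEARS

coverage_now = CDD_PPI_NOW / CDD_TOTAL_NOW
covered_2020, total_2020, coverage_2020 = project_linear(
    CDD_PPI_NOW, CDD_TOTAL_NOW, ppi_per_year, CDD_NEW_PER_YEAR, YEARS
)
# Share of newly added families that would also gain a structural PPI.
marginal_ratio = ppi_per_year / CDD_NEW_PER_YEAR

print(f"current coverage:   {coverage_now:.1%}")
print(f"projected total:    {total_2020:.0f} families")
print(f"projected coverage: {coverage_2020:.1%}")
print(f"marginal ratio:     {marginal_ratio:.1%}")
```

Under these assumptions the projected total is close to the 17,000 families quoted above and the projected coverage rounds to 23%. The marginal ratio explains the decline: if only about 14% of newly added families gain a structural PPI while current coverage is near 30%, the overall percentage must fall for as long as both rates stay constant.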

5. Conclusions

The availability of structurally characterized protein interfaces is critical and significantly improves the reliability of interactomes. We analyze the coverage of protein domain families and superfamilies with structures and structural PPIs and assess their growth dynamics. We show that protein interaction coverage of domain families lags behind structural coverage. While the overall number of possible protein interactions is unknown, the PPI coverage of protein domain families with structural data serves as a good measure of progress. Currently, only about 30% of protein families have structural PPI coverage, and this number is not likely to improve in the near future because of the high growth rate in the number of protein domain families. Finally, we identify families and superfamilies without protein–protein interaction evidence and families without any structural data (listed on a supporting website, http://www.ncbi.nlm.nih.gov/Structure/ibis/coverage/). These families could be used as potential targets for structural initiatives with a focus on protein interactions.

Acknowledgments

We thank Thomas Madej and Christopher Lanczycki for help with the MMDB database. This work was supported by the Intramural Research Program of the National Library of Medicine at the U.S. National Institutes of Health.

Appendix A. Supplementary data

Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.pbiomolbio.2014.05.005.

References

Aloy, P., Ceulemans, H., Stark, A., Russell, R.B., 2003. The relationship between sequence and interaction divergence in proteins. J. Mol. Biol. 332, 989–998.

Berman, H.M., Coimbatore Narayanan, B., Di Costanzo, L., Dutta, S., Ghosh, S., Hudson, B.P., Lawson, C.L., Peisach, E., Prlic, A., Rose, P.W., Shao, C., Yang, H., Young, J., Zardecki, C., 2013. Trendspotting in the Protein Data Bank. FEBS Lett. 587, 1036–1045.

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E., 2000. The Protein Data Bank. Nucleic Acids Res. 28, 235–242.

Bhaskara, R.M., Srinivasan, N., 2011. Stability of domain structures in multi-domain proteins. Sci. Rep. 1, 40.

Burley, S.K., Joachimiak, A., Montelione, G.T., Wilson, I.A., 2008. Contributions to the NIH-NIGMS protein structure initiative from the PSI production centers. Structure 16, 5–11.

Davis, F.P., Sali, A., 2005. PIBASE: a comprehensive database of structurally defined protein interfaces. Bioinformatics 21, 1901–1907.

Dayhoff, J.E., Shoemaker, B.A., Bryant, S.H., Panchenko, A.R., 2010. Evolution of protein binding modes in homooligomers. J. Mol. Biol. 395, 860–870.

Dessailly, B.H., Dawson, N.L., Mizuguchi, K., Orengo, C.A., 2013. Functional site plasticity in domain superfamilies. Biochim. Biophys. Acta 1834, 874–889.

Dutta, S., Burkhardt, K., Young, J., Swaminathan, G.J., Matsuura, T., Henrick, K., Nakamura, H., Berman, H.M., 2009. Data deposition and annotation at the worldwide protein data bank. Mol. Biotechnol. 42, 1–13.

Finn, R.D., Bateman, A., Clements, J., Coggill, P., Eberhardt, R.Y., Eddy, S.R., Heger, A., Hetherington, K., Holm, L., Mistry, J., Sonnhammer, E.L., Tate, J., Punta, M., 2014. Pfam: the protein families database. Nucleic Acids Res. 42 (D1), D222–D230.

Garcia-Serna, R., Opatowski, L., Mestres, J., 2006. FCP: functional coverage of the proteome by structures. Bioinformatics 22, 1792–1793.

Hamp, T., Rost, B., 2012. Alternative protein-protein interfaces are frequent exceptions. PLoS Comput. Biol. 8, e1002623.

Hashimoto, K., Panchenko, A.R., 2010. Mechanisms of protein oligomerization, the critical role of insertions and deletions in maintaining different oligomeric states. Proc. Natl. Acad. Sci. U S A 107, 20352–20357.

Huang, P.S., Love, J.J., Mayo, S.L., 2007. A de novo designed protein–protein interface. Protein Sci. 16, 2770–2774.

Janin, J., Rodier, F., 1995. Protein-protein interaction at crystal contacts. Proteins 23, 580–587.

Jones, S., Marin, A., Thornton, J.M., 2000. Protein domain interfaces: characterization and comparison with oligomeric protein interfaces. Protein Eng. 13, 77–82.

Juettemann, T., Gerloff, D.L., 2011. BISC: binary subcomplexes in proteins database. Nucleic Acids Res. 39, D705–D711.

Keskin, O., Tsai, C.J., Wolfson, H., Nussinov, R., 2004. A new, structurally nonredundant, diverse data set of protein-protein interfaces and its implications. Protein Sci. 13, 1043–1055.

Khare, S.D., Fleishman, S.J., 2013. Emerging themes in the computational design of novel enzymes and protein–protein interfaces. FEBS Lett. 587, 1147–1154.

Korkin, D., Davis, F.P., Sali, A., 2005. Localization of protein-binding sites within families of proteins. Protein Sci. 14, 2350–2360.

Kundrotas, P.J., Alexov, E., 2007. PROTCOM: searchable database of protein complexes enhanced with domain-domain structures. Nucleic Acids Res. 35, D575–D579.

Levitt, M., 2009. Nature of the protein universe. Proc. Natl. Acad. Sci. U S A 106, 11079–11084.

Lewis, A.C., Jones, N.S., Porter, M.A., Deane, C.M., 2012. What evidence is there for the homology of protein-protein interactions? PLoS Comput. Biol. 8, e1002645.

Madej, T., Addess, K.J., Fong, J.H., Geer, L.Y., Geer, R.C., Lanczycki, C.J., Liu, C., Lu, S., Marchler-Bauer, A., Panchenko, A.R., Chen, J., Thiessen, P.A., Wang, Y., Zhang, D., Bryant, S.H., 2012. MMDB: 3D structures and macromolecular interactions. Nucleic Acids Res. 40, D461–D464.

Marchler-Bauer, A., Zheng, C., Chitsaz, F., Derbyshire, M.K., Geer, L.Y., Geer, R.C., Gonzales, N.R., Gwadz, M., Hurwitz, D.I., Lanczycki, C.J., Lu, F., Lu, S., Marchler, G.H., Song, J.S., Thanki, N., Yamashita, R.A., Zhang, D., Bryant, S.H., 2013. CDD: conserved domains and protein three-dimensional structure. Nucleic Acids Res. 41, D348–D352.

Mistry, J., Kloppmann, E., Rost, B., Punta, M., 2013. An estimated 5% of new protein structures solved today represent a new Pfam family. Acta Crystallogr. D. Biol. Crystallogr. 69, 2186–2193.

Montelione, G.T., 2012. The protein structure initiative: achievements and visions for the future. F1000 Biol. Rep. 4, 7.

Nishi, H., Hashimoto, K., Panchenko, A.R., 2011. Phosphorylation in protein-protein binding: effect on stability and function. Structure 19, 1807–1815.

Nobeli, I., Favia, A.D., Thornton, J.M., 2009. Protein promiscuity and its implications for biotechnology. Nat. Biotechnol. 27, 157–167.

Reid, A.J., Ranea, J.A., Orengo, C.A., 2010. Comparative evolutionary analysis of protein complexes in E. coli and yeast. BMC Genom. 11, 79.

Reimand, J., Hui, S., Jain, S., Law, B., Bader, G.D., 2012. Domain-mediated protein interaction prediction: from genome to network. FEBS Lett. 586, 2751–2763.

Shoemaker, B.A., Panchenko, A.R., Bryant, S.H., 2006. Finding biologically relevant protein domain interactions: conserved binding mode analysis. Protein Sci. 15, 352–361.

Shoemaker, B.A., Zhang, D., Thangudu, R.R., Tyagi, M., Fong, J.H., Marchler-Bauer, A., Bryant, S.H., Madej, T., Panchenko, A.R., 2010. Inferred biomolecular interaction server – a web server to analyze and predict protein interacting partners and binding sites. Nucleic Acids Res. 38, D518–D524.

Shoemaker, B.A., Zhang, D., Tyagi, M., Thangudu, R.R., Fong, J.H., Marchler-Bauer, A., Bryant, S.H., Madej, T., Panchenko, A.R., 2012. IBIS (Inferred Biomolecular Interaction Server) reports, predicts and integrates multiple types of conserved interactions for proteins. Nucleic Acids Res. 40, D834–D840.

Slonim, N., Atwal, G.S., Tkacik, G., Bialek, W., 2005. Information-based clustering. Proc. Natl. Acad. Sci. U S A 102, 18297–18302.

Tuncbag, N., Gursoy, A., Nussinov, R., Keskin, O., 2011. Predicting protein-protein interactions on a proteome scale by matching evolutionary and structural similarities at interfaces using PRISM. Nat. Protoc. 6, 1341–1354.

Venkatesan, K., Rual, J.F., Vazquez, A., Stelzl, U., Lemmens, I., Hirozane-Kishikawa, T., Hao, T., Zenkner, M., Xin, X., Goh, K.I., Yildirim, M.A., Simonis, N., Heinzmann, K., Gebreab, F., Sahalie, J.M., Cevik, S., Simon, C., de Smet, A.S., Dann, E., Smolyar, A., Vinayagam, A., Yu, H., Szeto, D., Borick, H., Dricot, A., Klitgord, N., Murray, R.R., Lin, C., Lalowski, M., Timm, J., Rau, K., Boone, C., Braun, P., Cusick, M.E., Roth, F.P., Hill, D.E., Tavernier, J., Wanker, E.E., Barabasi, A.L., Vidal, M., 2009. An empirical framework for binary interactome mapping. Nat. Methods 6, 83–90.

Xu, Q., Canutescu, A., Obradovic, Z., Dunbrack Jr., R.L., 2006. ProtBuD: a database of biological unit structures of protein families and superfamilies. Bioinformatics 22, 2876–2882.

Xu, Q., Dunbrack Jr., R.L., 2011. The protein common interface database (ProtCID) – a comprehensive database of interactions of homologous proteins in multiple crystal forms. Nucleic Acids Res. 39, D761–D770.

Zhang, Q.C., Petrey, D., Norel, R., Honig, B.H., 2010. Protein interface conservation across structure space. Proc. Natl. Acad. Sci. U S A 107, 10896–10901.