Top Banner
PERSPECTIVE Towards a unied open access dataset of molecular interactions Pablo Porras 1 , Elisabet Barrera 1 , Alan Bridge 2 , Noemi del-Toro 1 , Gianni Cesareni 3,4 , Margaret Duesbury 1,5 , Henning Hermjakob 1 , Marta Iannuccelli 3 , Igor Jurisica 6,7,8 , Max Kotlyar 6 , Luana Licata 3 , Ruth C. Lovering 9 , David J. Lynn 10,11 , Birgit Meldal 1 , Bindu Nanduri 12 , Kalpana Paneerselvam 1 , Simona Panni 13 , Chiara Pastrello 6 , Matteo Pellegrini 14 , Livia Perfetto 1 , Negin Rahimzadeh 5 , Prashansa Ratan 5 , Sylvie Ricard-Blum 15 , Lukasz Salwinski 5 , Gautam Shirodkar 5 , Anjalia Shrivastava 1,16 & Sandra Orchard 1 The International Molecular Exchange (IMEx) Consortium provides scientists with a single body of experimentally veried protein interactions curated in rich contextual detail to an internationally agreed standard. In this update to the work of the IMEx Consortium, we discuss how this initiative has been working in practice, how it has ensured database sus- tainability, and how it is meeting emerging annotation challenges through the introduction of new interactor types and data formats. Additionally, we provide examples of how IMEx data are being used by biomedical researchers and integrated in other bioinformatic tools and resources. P roteins do not function in isolation, but rather operate through complex networks of interactions with other molecules. To truly understand cell signalling, it is necessary to get a handle on the temporal and spatial contexts within which molecular interactions occur, and how signals ow through the resulting dynamic molecular networks. Perturbations in this information ow often result in disease. We therefore also need to recognise how molecular interaction networks are recongured in disease, for example, via nonsynonymous mutations that alter the interactions of mutant proteins with other molecules. https://doi.org/10.1038/s41467-020-19942-z OPEN 1 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Campus, Hinxton, Cambridge CB10 1SD, UK. 2 SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, CH-1211 Geneva, Switzerland. 3 University of Rome Tor Vergata, Rome, Italy. 4 IRCCS Fondazione Santa Lucia, 00143 Rome, Italy. 5 UCLA-DOE Institute, University of California, Los Angeles, CA 90095, USA. 6 Osteoarthritis Research Program, Division of Orthopedic Surgery, Schroeder Arthritis Institute, and Krembil Research Institute, University Health Network, 60 Leonard Avenue, 5KD-407, Toronto, ON M5T 0S8, Canada. 7 Departments of Medical Biophysics, and Computer Science, University of Toronto, Toronto, ON, Canada. 8 Institute of Neuroimmunology, Slovak Academy of Sciences, Bratislava, Slovakia. 9 Functional Gene Annotation, Preclinical and Fundamental Science, UCL Institute of Cardiovascular Science, University College London, London WC1E 6JF, UK. 10 Computational and Systems Biology Program, Precision Medicine Theme, South Australian Health and Medical Research Institute, Adelaide, SA 5000, Australia. 11 College of Medicine and Public Health, Flinders University, Bedford Park, SA 5042, Australia. 12 Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University, Starkville, MS, USA. 13 Università della Calabria, Dipartimento di Biologia, Ecologia e Scienze della Terra, Via Pietro Bucci Cubo 6/C, Rende, CS, Italy. 14 Department of Molecular, Cell and Developmental Biology, UCLA, Box 951606Los Angeles, CA 90095-1606, USA. 15 ICBMS, UMR 5246 University Lyon 1 - CNRS, Univ. Lyon, 69622 Villeurbanne, France. 16 Open Targets, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK. email: [email protected] NATURE COMMUNICATIONS | (2020)11:6144 | https://doi.org/10.1038/s41467-020-19942-z | www.nature.com/naturecommunications 1 1234567890():,;
12

Towards a unified open access dataset of molecular interactions

Feb 22, 2023

Download

Documents

Khang Minh
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Towards a unified open access dataset of molecular interactions

PERSPECTIVE

Towards a unified open access dataset ofmolecular interactionsPablo Porras 1, Elisabet Barrera1, Alan Bridge2, Noemi del-Toro 1,

Gianni Cesareni 3,4, Margaret Duesbury1,5, Henning Hermjakob 1,

Marta Iannuccelli 3, Igor Jurisica6,7,8, Max Kotlyar6, Luana Licata 3,

Ruth C. Lovering 9, David J. Lynn 10,11, Birgit Meldal 1, Bindu Nanduri12,

Kalpana Paneerselvam1, Simona Panni 13, Chiara Pastrello 6,

Matteo Pellegrini14, Livia Perfetto1, Negin Rahimzadeh5, Prashansa Ratan5,

Sylvie Ricard-Blum 15, Lukasz Salwinski 5, Gautam Shirodkar5,

Anjalia Shrivastava 1,16 & Sandra Orchard 1✉

The International Molecular Exchange (IMEx) Consortium provides scientists with a single

body of experimentally verified protein interactions curated in rich contextual detail to an

internationally agreed standard. In this update to the work of the IMEx Consortium, we

discuss how this initiative has been working in practice, how it has ensured database sus-

tainability, and how it is meeting emerging annotation challenges through the introduction of

new interactor types and data formats. Additionally, we provide examples of how IMEx data

are being used by biomedical researchers and integrated in other bioinformatic tools and

resources.

Proteins do not function in isolation, but rather operate through complex networks ofinteractions with other molecules. To truly understand cell signalling, it is necessary to geta handle on the temporal and spatial contexts within which molecular interactions occur,

and how signals flow through the resulting dynamic molecular networks. Perturbations in thisinformation flow often result in disease. We therefore also need to recognise how molecularinteraction networks are reconfigured in disease, for example, via nonsynonymous mutationsthat alter the interactions of mutant proteins with other molecules.

https://doi.org/10.1038/s41467-020-19942-z OPEN

1 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Campus, Hinxton, Cambridge CB10 1SD, UK. 2 SIBSwiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, CH-1211 Geneva, Switzerland. 3 University of Rome Tor Vergata,Rome, Italy. 4 IRCCS Fondazione Santa Lucia, 00143 Rome, Italy. 5 UCLA-DOE Institute, University of California, Los Angeles, CA 90095, USA.6Osteoarthritis Research Program, Division of Orthopedic Surgery, Schroeder Arthritis Institute, and Krembil Research Institute, University Health Network,60 Leonard Avenue, 5KD-407, Toronto, ON M5T 0S8, Canada. 7 Departments of Medical Biophysics, and Computer Science, University of Toronto, Toronto,ON, Canada. 8 Institute of Neuroimmunology, Slovak Academy of Sciences, Bratislava, Slovakia. 9 Functional Gene Annotation, Preclinical and FundamentalScience, UCL Institute of Cardiovascular Science, University College London, London WC1E 6JF, UK. 10 Computational and Systems Biology Program,Precision Medicine Theme, South Australian Health and Medical Research Institute, Adelaide, SA 5000, Australia. 11 College of Medicine and Public Health,Flinders University, Bedford Park, SA 5042, Australia. 12 Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University, Starkville, MS,USA. 13 Università della Calabria, Dipartimento di Biologia, Ecologia e Scienze della Terra, Via Pietro Bucci Cubo 6/C, Rende, CS, Italy. 14 Department ofMolecular, Cell and Developmental Biology, UCLA, Box 951606Los Angeles, CA 90095-1606, USA. 15 ICBMS, UMR 5246 University Lyon 1 - CNRS, Univ.Lyon, 69622 Villeurbanne, France. 16Open Targets, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK. ✉email: [email protected]

NATURE COMMUNICATIONS | (2020) 11:6144 | https://doi.org/10.1038/s41467-020-19942-z | www.nature.com/naturecommunications 1

1234

5678

90():,;

Page 2: Towards a unified open access dataset of molecular interactions

Studying the interactome ― the set of all intermolecularinteractions within a cell ― enables researchers to interrogatethe functional consequences of variation, and gain insight intodisease processes1. However, interactome descriptions currentlysuffer from two fundamental problems: noise (false positives) andlack of coverage (false negatives and unexplored interactomespace)2,3. In the early days of technique development there wereserious concerns about the reliability of methods such as proteincomplementation assays, exemplified by yeast 2-hybrid, or affi-nity purification techniques4, but these technical issues havelargely been overcome. Nowadays experts in the field are moreconcerned with a lack of understanding of to what extent givenprotein–protein interactions (PPIs) are determined by the specifictissues, cell types or experimental conditions under which theyare observed3.

Various techniques are used to identify PPIs, which oftendetect different subsets of the interactions potentially occurringwithin the same targeted interaction space5,6. Methodologicalissues may thus partly explain the frequent lack of significantoverlap between large-scale PPI datasets. Therefore, in order tocorrectly interpret PPI data, it is important to understand thecontext in which the data was collected. This includes not onlythe experimental technique and the type of relationship it willdetect (such as direct binding between two partners identified byX-ray crystallography, affinity purification of multiple proteinscomplexes or colocalization of two proteins in the same cellularcompartment) but also the experimental conditions and mod-ifications made to the participating molecules such as affinity tagsor sequence mutations. All these metadata will impact the com-position of the interactome generated. Therefore, this metadataneed to be recorded in a computationally accessible manner,enabling researchers to make informed decisions as to the qualityof data they are working with. The International MolecularExchange (IMEx) Consortium7 was formed in 2005 with the goalto provide users with a dataset enhanced with controlled voca-bulary (CV) terms to enable scoring, filtering and sophisticatedsearching of the information.

Here, we review the advances made in curation practices anddata formats since the IMEx Consortium was first described in20127. We illustrate additional ways in which the data can bescored and filtered; and describe use cases where researchers havemoved beyond simple high-coverage, gene-centric networks touse the additional level of detail provided by IMEx Consortiumdata in both analysis and visualisation.

The IMEx consortiumThe IMEx Consortium is open to any group or resource inter-ested in curating physical molecular interactions, current mem-bers include IntAct8, MINT8, DIP9, UniProt10, MatrixDB11,InnateDB12, HPIDB13, UCL Functional Gene Annotation teamand IID14. The consortium comprises the majority of the existingdatabase resources who have agreed to collaborate on the cura-tion of published, experimentally derived interaction data. TheIMEx Consortium members have agreed on a set of curation rulesand map interaction data to a limited set of defined moleculeidentifiers to provide the user with a single and consistent dataset,with each interaction being assigned a unique and persistentidentifier.

While the IMEx Consortium is a global effort with contributingmembers from Europe, North America and Australasia, itscommon rigorous curation rules and standards have allowedIMEx to be selected as one of the Core Data Resources of theEuropean Life Sciences Infrastructure for Biological Information(ELIXIR)15, which are considered essential for the long-termpreservation of biological data. At the same time the IMEx

Consortium continues to provide an enhanced service to bothresearch funders and data users16.

The IMEx data distribution modelThe formation of the IMEx Consortium was a natural progressionfrom the work of the Molecular Interaction (MI) working groupof the Human Proteome Organization Proteomics StandardsInitiative (HUPO-PSI). This group has developed the now widelyimplemented PSI-MI XML2.5 data interchange format17. Theyrecently published an update to enable the description of morecomplex data types such as cooperative/allosteric interactions anddynamic interactions (PSI-MI XML3.0)18, and also producedsimpler, tab-delimited representations (MITAB), which can bemore rapidly parsed or downloaded. In addition to a tool suiteand libraries designed to utilise these formats, HUPO-PSImaintains the associated MI CV (www.ebi.ac.uk/ols/ontologies/mi) that contains terms to describe all aspects of an interactionrecord. All IMEx data are made publicly available in the HUPO-PSI standard formats, making them Findable, Accessible, Inter-operable and Reusable (FAIR)19.

The initial data distribution model consisted of interactiondatabases retaining their data locally and making their IMExdataset available through the Proteomics Standard InitiativeCommon QUery InterfaCe (PSICQUIC)20,21 (see also section“TOOLS TO VISUALIZE AND QUERY IMEx DATA” below).However, even at the IMEx website, each resource’s records werelisted separately and users had to cluster the results of their searchto merge different evidences for the same interacting pair ofmolecules. Although a tool to enable this was supplied at the site,it was restricted to operating on <5000 records.

As database infrastructure funding has become more difficultto obtain, members of the IMEx Consortium agreed 3 years agoto centralise their IMEx-compliant data-storage and curationefforts in the IntAct database maintained at the EMBL-EuropeanBioinformatics Institute (EBI). This enables resources to con-centrate on the curation effort, rather than development ofcuration pipelines and annotation tools16, and also increasescuration consistency. Members enter data through a web-basededitorial platform designed to allow collaborative curation byphysically remote partners. A sophisticated institute managermodule links individual curators to their resource and/or fundingbody to enable full accreditation of the curation effort. The IntActteam is responsible for updating the data and producing a regulardata release. The full IMEx dataset is made publicly availableunder an open Creative Commons Attribution 4.0 Internationallicence (CC-BY4.0) as a single PSICQUIC service that can beaccessed and searched via the IMEx Consortium website (ww.imexconsortium.org) and any other resource implementing theIMEx PSICQUIC webservice. These include the IntAct website(www.ebi.ac.uk/intact/) and the mentha web resource22, main-tained by the MINT group.

Member databases may import some or all of the IMEx datasetback into their own resources. Partners may add other informa-tion to the IMEx dataset on their own websites or choose to onlyprovide access to a subset of data that is of interest to theirspecific resource. For example, IID complements the IMEx datawith interactions predicted by multiple machine learning anddata mining algorithms, tissue and disease annotation con-text23,24, while MatrixDB only provides IMEx data pertaining toextracellular matrix proteins and glycosaminoglycans25. Thismodel enables not only large interaction databases to contributeto the overall effort of IMEx but also allows annotation ofmolecular interactions by groups in data resources that do notmaintain an interaction database (e.g. the UniProt Consortium).It also ensures that the dataset will continue to be maintained

PERSPECTIVE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-19942-z

2 NATURE COMMUNICATIONS | (2020) 11:6144 | https://doi.org/10.1038/s41467-020-19942-z | www.nature.com/naturecommunications

Page 3: Towards a unified open access dataset of molecular interactions

should funding be withdrawn or a resource decide to focus on adifferent area of research. For example, the data curated by theMicrobial Protein Interaction Database (MPIDB), a former IMExmember, has been maintained and updated within the IntActdatabase since this resource ceased curation in 2013. The IntActdatabase additionally holds a considerable number of recordscurated by several of the IMEx Consortium members (primarilyIntAct, MINT, DIP and UniProt), which pre-date the IMExagreement. There is currently no concerted effort planned tobring this legacy data up to IMEx standards, although individualrecords may be re-curated on demand.

The Consortium benefits as a whole from the expertise ofindividual member databases. In addition to the extracellularmatrix expertise of MatrixDB, HPIDB bring host-pathogeninteraction experience and InnateDB focuses on the curation ofmolecular interactions involved in the innate immune system.The consortium structure enables a rapid response to new areasof biology, as demonstrated by the early release of a coronavirusinteractome curated to full IMEx specifications based on pooledcuration resources and the knowledge of individual memberssuch as UniProtKB, which provided early access to proteinsequence data26.

The IMEx curation modelDuring the past 8 years, the IMEx curation model has beenenriched and refined with new data types, methodologies andadditional levels of detail. Every interaction curated by the IMExConsortium reflects a piece of experimental evidence manuallycurated from a publication or directly submitted by a data pro-ducer. The IMEx Consortium adheres to a detailed curationmodel, which comprises all aspects of an interaction experiment.The IMEx record includes information on host organism (withdetails about the cell line or tissue in which the assay was per-formed), methods for interaction detection and participantidentification, full details of the constructs (binding domains,effects of site-directed mutations, etc.), and further contextualinformation (e.g. any treatment of the host organism). All thisinformation is mapped to CV terms, in particular those describedby the HUPO PSI-MI CV (www.ebi.ac.uk/ols/ontologies/mi).Curated data records are linked to the source text from bothfigure legends and the main text body of the paper. This hasenabled the use of these data to develop and assess deep learningapproaches for text mining27.

The Consortium agreement7 was originally restricted to thecuration of PPIs, but this proved limiting to some resources andcompromised the ability of users to fully understand biologicalprocesses. Therefore, the remit of the Consortium has morerecently extended to cover protein–protein complex,protein–small molecule, protein–carbohydrate, protein–nucleicacid and nucleic acid–nucleic acid interactions. Curation guide-lines have been developed to enable the description of tran-scription factor-transcribed gene interactions. Furthermore,experimental techniques to capture ncRNA–protein andmiRNA–mRNA interactions have been included in the CVs andcuration guidelines. The Consortium is currently also examiningthe structured description of downstream effects of molecularinteractions, such as the up- or down-regulation of geneexpression.

Proteins are curated at the sequence level, using the UniProtKBdatabase as the reference resource for proteins and peptides. Theuse of UniProtKB enables the curator, for each publication, toaccurately describe the level of detail provided about the proteinsand to use identifiers for the unambiguous annotation of eachprotein interactor. For instance, a publication may only giveenough detail for an interactor to be mapped to any or all of the

protein isoform products of a specific gene, or more specifically toa single protein isoform, or to a post-translationally cleavedpeptide chain. UniProtKB supplies appropriate identifiers for allof these, and in each case supplies the corresponding underlyingsequence. Binding regions can be aligned to known proteindomains, as described by InterPro28. The effects of point muta-tions can be captured down to the amino acid level, using a CV todescribe their effect on an interaction. To capture this level ofdetail, the use of a high-quality protein reference resource isessential. Reverse engineering protein to gene identifiers to enablenetwork analysis of, for example, RNA-Seq data is a relativelytrivial task but it is considerably more difficult, if not impossible,to map isoforms and binding domain data directly to a genemodel or genomic sequence. Databases that curate PPI datadirectly to gene identifiers simply do not capture this wealth ofinformation.

The detailed biocuration of binding domains, mutations29 andpost-translational modifications (PTMs) requires that thesecoordinate-level mappings are kept synchronised with changes tothe underlying protein sequence database. An update to a pre-dictive gene model may result in a corresponding change to theprotein sequence(s) derived from it. Interactions involvingdomains and/or residues of that protein sequence then require acorresponding update to ensure that the mapping to the updatedsequence is correct. Update pipelines need to be run regularly, inline with the release cycle of the sequence database, namely every8 weeks in the case of UniProtKB. This is a computationallycomplex set of procedures run at the EMBL-EBI on the entireIMEx content, ensuring its consistency and representing oneadvantage of maintaining this as a single dataset. Similarly, allCVs used to describe an aspect of an interaction are updated withevery release.

Small molecule interactors, including carbohydrates and lipids,are mapped to the ChEBI database30, protein complexes to theComplex Portal31, mRNAs to Ensembl32 or ENA33, ncRNAs toRNAcentral34 and genes to Ensembl. Whilst in many cases thedatabase representation of these entities is largely more stablethan that of proteins, update pipelines for each of these will beenhanced and improved with time.

For analyses to be meaningful, data quality and full repre-sentation of experimental detail are of paramount importance.This is particularly required when working in an area of highdata complexity and heterogeneity. However, capturing thesedata poses a significant challenge for curators who need to befamiliar with an ever-growing set of experimental techniques.In order to minimise curation errors, all IMEx records aredouble-checked by a second curator prior to release. A morerecent innovation is cross-database checking, which ensuresthat curation standards remain consistent between memberdatabases. This is reinforced by Consortium workshops whererule changes and extensions are agreed, and joint curationexercises undertaken.

IMEx data content and interactome coverageThe IMEx data is constantly growing, with a new data releaseapproximately every 8 weeks. Whilst there are interactions cap-tured for a wide range of species, Table 1 highlights the pre-dominance of human PPI data. The significant fraction ofhuman-other species interactions includes a considerable numberof interactions tested with human proteins against close mam-malian (primarily mouse) orthologues, but also host–pathogeninteractions between human and viral/bacterial proteins. Othermodel organisms, such as S. cerevisiae, E. coli or A. thaliana arealso well represented in the dataset and a curation focus for IMExpartners.

NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-19942-z PERSPECTIVE

NATURE COMMUNICATIONS | (2020) 11:6144 | https://doi.org/10.1038/s41467-020-19942-z | www.nature.com/naturecommunications 3

Page 4: Towards a unified open access dataset of molecular interactions

The human IMEx data set is dominated by PPIs. The numberof interactions involving other molecule types is considerablylower (Table 2), but is steadily increasing and will provideinteresting extensions to biologically relevant networks.

The curation performed by members of the IMEx Consortiumrepresents a significant proportion of the interaction data foundin publicly available databases. There is a degree of data redun-dancy with other databases that directly curate experimental datasuch as BioGRID35, and the IMEx dataset has been imported bymeta-databases such as STRING36, mentha22, IID14 and HiP-PIE37. Figure 1 shows the overlap between unique humaninteracting pairs in all these different resources, plus legacy datacurated by IMEx partners as an intersection plot38. The greatestoverlap exists between IMEx and BioGRID, which is well repre-sented in the meta-databases considered, and between BioGRIDand the rest of meta-databases. Figure 1 also highlights that thenumber of interacting pairs in BioGRID is larger than the IMExdataset. This is due to the less detailed curation model adopted byBioGRID, which allows for faster curation and inclusion of datasets that are not considered as indicating physical interactions bythe stringent IMEx rules (e.g., co-fractionation studies)35.

Figure 2 compares the number of unique interacting pairs inthe main model organisms (Homo sapiens, Mus musculus, Ara-bidopsis thaliana, Drosophila melanogaster, Caenorhabditis ele-gans, and Saccharomyces cerevisiae), highlighting also the type ofstudies from which the data was curated. Most of the interactingpairs hosted in primary databases come from high-throughputpublications (>100 interacting pairs in an experiment or >100interactors in an n-ary interaction). A comparison with BioGRIDis included as this is the only other publicly available, manuallycurated interaction database.

Finally, interactions in IMEx involve most of the representa-tive/canonical proteins present in the human, mouse and S. cer-evisiae proteomes as represented in the reviewed UniProtKB/Swiss-Prot (Fig. 3). Lower coverage is found in other modelorganisms such as D. melanogaster and C. elegans, partiallybecause smaller fractions of these proteomes are represented inUniProtKB/Swiss-Prot. Many mappings are to proteins in theunreviewed UniProtKB/TrEMBL section and do not necessarilytranslate into UniProtKB/SwissProt when reviewed by UniProtcurators. This makes the estimate of proteome coverage moredifficult.

Tools to visualize and query IMEx dataComplexViewer. The ComplexViewer39 has been designedspecifically to visualise detailed data annotated by the IMExConsortium. Its capabilities include a visual representation of arange of biomolecules as interactor types (proteins, smallmolecules and nucleic acids), interactions with more than twoparticipants (n-ary interactions), sequence features relevant tothe interaction (e.g., binding domains) and stoichiometryinformation. The data is taken from a JavaScript ObjectNotation format MI-JSON, which can be generated from anyPSI-MI compliant data source using the Java MolecularInteractions library (JAMI, see below)40. This tool has beenincorporated into the Complex Portal31, HumanMine41

(http://www.humanmine.org) and YeastMine42 (http://yeastmine.yeastgenome.org) data warehouses, and is freelyavailable for implementation by additional resources (https://www.npmjs.com/package/complexviewer; http://biojs.io/d/complexviewer).

ProtVista. The UniProt team has developed ProtVista43, aninteractive tool for visualisation of a wide range of proteinsequence features together in the same space. ProtVista isimplemented using JavaScript and makes extensive use of D3(https://d3js.org/), a library for producing dynamic, interactivedata visualisations in web browsers. Work is currently underwayto introduce an interaction track using the IMEx Consortiumdata, to visualise protein binding domains within the sequence ofthe protein represented in a UniProtKB entry, and to show theproteins to which it binds.

PSICQUIC. PSICQUIC is a webservice created to enable com-putational access to standards-compliant molecular interactiondata resources20,21. PSICQUIC defines a minimum set of stan-dard SOAP (Simple Object Access Protocol) and REST (Repre-sentational state transfer) methods which accept a MolecularInteractions Query Language (MIQL) query as input and returnmolecular interaction information in one of the standard formats(PSI-XML2.5, MITAB). PSICQUIC enables the IMEx Con-sortium to make data available to the research community forrapid search and download. The current PSICQUIC imple-mentation provides only limited access, however, to the wealth ofdata on molecule binding features, such as binding domains and

Table 2 Number of binary molecular interactions fordifferent interactor types present in the IMEx dataset, May2020 (human–human data only).

Binary interaction Interaction count

Protein–protein 490,061Protein–protein complexa 155Protein–small molecule 6469Protein–DNA 7649Protein–gene 1055Protein–RNA (all types) 3511RNA–RNA (all types) 490miRNA–mRNA 121

aProtein complex refers to a stable macromolecular functional unit which can be linked to acorresponding entry in the Complex Portal. It is used by curators when the interactoridentification can only be made at the complex level.

Table 1 Number of binary protein–protein interactions forselected model organisms present in the IMEx dataset,May 2020.

Binary pair No. interactions IMEx/IMEx+legacy

Human–human (non-redundant) 490,061/521,353 (259,962/278,983)

Human–mouse 29,510/31,478Human–any mammala 526,772/561,062Human–bacteria 10,720/10,800Human–virus 21,480/22,811Mouse–any mammal 51,578/80,230Rat–any mammal 11,236/14,105Drosophila melanogaster–Drosophilamelanogaster

48,067/51,741

Caenorhabditis elegans–Caenorhabditiselegans

12,526/16,970

Arabidopsis thaliana–Arabidopsis thaliana 52,131/55,661Saccharomyces cerevisiae–Saccharomycescerevisiae

72,227/132,104

Escherichia coli–Escherichia coli 19,318/28,513

aIncludes human–human.

PERSPECTIVE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-19942-z

4 NATURE COMMUNICATIONS | (2020) 11:6144 | https://doi.org/10.1038/s41467-020-19942-z | www.nature.com/naturecommunications

Page 5: Towards a unified open access dataset of molecular interactions

the effect of site-directed mutations, and plans are in place toextend its capabilities.

JAMI. The recently published JAMI library has been developed toimport and export molecular interaction data in a variety offormats and versions, providing a single change-resilient softwarecomponent40. Software and tools developed on top of the JAMIframework are able to integrate and support all versions ofMITAB, PSI-MI XML, MI-JSON and XGMML. JAMI’s model

interfaces are abstracted from the underlying format, masking therequirements of each data format from developers. JAMI iscapable of serving the full richness of IMEx data to tools andresources built on this framework. It has been implemented bythe IntAct and Complex Portal databases in addition to theInterMine data warehouse.

RpsiXML. RpsiXML44 provides an R interface to PSI-MI XML2.5files and can therefore be used to query the IMEx data. It provides

60,000

Inte

rsec

tion

size

40,000

20,000

0IMEx

IID.exp

Database size

3e + 05 2e + 05 1e + 05 0e + 00

STRING.exp

BioGRID

mentha

HIPPIE

IMEx + legacy

Fig. 1 Overlap between Unique human PPIs from different resources. The intersection plot shows several PPI resources as well as legacy datacurated by IMEx partners. Individual resource sizes are displayed as horizontal bars on the lower left corner of the image. Intersection sizes areshown as individual bars, with those including IMEx data highlighted in grey. Specific resources involved in each intersection are identified withconnected solid black circles under the vertical bars, with unconnected circles representing pairs that are found exclusively in the correspondingdatabase. Shading in the background of the circles helps differentiate between primary databases (green) and meta-databases (orange). Only theexperimentally-derived content of the IID and STRING databases was used, excluding all predicted or text-mined interaction data. Only the 20 largestof all possible intersections are shown for clarity.

3e + 05

Homosapiens

Musmusculus

Arabidopsisthaliana

Drosophilamelanogaster

Caenorhabditiselegans

Saccharomycescerevisiae

Referenceproteome size(includes isoforms)

20,000

40,000

60,000

Publication type

High- and low-throughput

High-throughput

Low-throughput

No

of in

tera

ctin

g pa

irs

2e + 05

1e + 05

0e + 00

Databse

IME

x

IME

x +

lega

cy

Bio

GR

ID

IME

x

IME

x +

lega

cy

Bio

GR

ID

IME

x

IME

x +

lega

cy

Bio

GR

ID

IME

x

IME

x +

lega

cy

Bio

GR

ID

IME

x

IME

x +

lega

cy

Bio

GR

ID

IME

x

IME

x +

lega

cy

Bio

GR

ID

Fig. 2 Number of unique interacting pairs in selected model organisms present in interaction databases. Differently coloured bars indicate the type ofstudies from which the data were curated. Data were accessed in January 2020.

NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-19942-z PERSPECTIVE

NATURE COMMUNICATIONS | (2020) 11:6144 | https://doi.org/10.1038/s41467-020-19942-z | www.nature.com/naturecommunications 5

Page 6: Towards a unified open access dataset of molecular interactions

a link between the detailed information held in these files and thediverse range of analytical tools provided in the Bioconductorsoftware environment.

Examples of uses of the IMEx datasetBuilding high-quality networks for data analysis. Proteininteraction networks are used by biologists to understand theinterconnectivity of signalling systems inside and outside of thecellular environments, exploring both physiologically normal anddisease states. Network-based analysis is a powerful technique forextracting biological insights from large datasets. It enablesresearchers to associate proteins of unknown functions tohuman-curated pathways or identify clusters of interactingmolecules, which may participate in the same biological processor belong to the same physical complex45. Network topology cansuggest the biological properties and function of componentmolecules. Algorithms exist to analyse aspects of network struc-tures (such as their scale-free, small world, geometric random orhierarchical nature), to calculate network metrics (e.g., cen-tralities, shortest path, clustering coefficient and graphlets) or toinvestigate interaction motifs driving specific interactionsbetween different biomolecules. The results of any network ana-lysis will inevitably depend on the coverage and quality of theprotein interaction network and the experimental conditionsunder which the interacting network was derived. IMEx data-derived networks have been incorporated into several clinicalresources such as RD-Connect GPAP (https://platform.rd-connect.eu), DISNOR46, CancerGeneNET47 and OpenTargets48,enabling researchers to investigate genomic data and the con-sequence of disease variants.

The detailed IMEx curation model enables data selection andfiltering on many levels and thus can contribute to the building of

high-quality and context-specific networks. At the simplest level,such filtering can be performed using the manually curated‘interaction type’ described by a series of terms contained in theMI CV. For example, it is possible to filter for ‘direct interaction’,or child terms, thus making specific smaller networks of the veryhighest quality. This approach was taken by Sacco et al.49 whoqueried IMEx data via the mentha interactome browser to buildand analyse phosphatase sub-networks to identify new phospha-tase substrates. Conversely, data types can be filtered out of anetwork. For example, computationally expanded data could beremoved from a network using CV terms from the ‘curationcontent’ (MI:1045) branch which contains child terms such as‘spoke expansion’ (MI:1060). IID supports filtering PPIs based onsource, number of studies, number of bioassays, broad or detailedtissue, disease, cellular localisation and druggability. TheMatrixDB resource supports filtering PPIs based on interactiondetection method, gene expression and protein levels, disease,Gene Ontology terms, and UniProtKB keywords to build specificinteractomes25, e.g. tissue-specific basement membrane inter-actomes, and to define consensus interactomes composed of theinteractions common to all basement membranes11. Furthermore,it is possible to search the IntAct website or PSICQUICwebservice using the hierarchical structure of the ontology. Forexample, the term ‘lipid’ (CHEBI:18059) will identify all lipid-binding molecules even when the detailed annotation is to acholesterol (CHEBI:16113) or phosphatidyl 3,4,5 inositol trispho-sphate (CHEBI:16618) binding event.

Whilst these simple filters have been possible since the firstrelease of IMEx curated data, for the last 6 years IMEx data havebeen scored using an implementation of MIscore50, thus enablingmore sophisticated filtering. MIscore relies on the availableannotation evidence associated with an interaction and representsthe degree of confidence in the existence of a particular

Homo sapiens

Org

anis

m

UniProtKB/Swiss-Prot

% of proteome coverage0

IMEx IMEx + legacy BioGRIDDatabase

25 7550 0 25 7550

UniProtKB/TrEMBL

Mus musculus

Arabidopsis thaliana

Drosophila melanogaster

Caenorhabditis elegans

Saccharomyces cerevisiae

Fig. 3 Coverage of UniProtKB model organism proteomes in interaction databases. Shown is the fraction of all proteome entries of key model organismsrepresented in UniProtKB/Swiss-Prot (manually reviewed) and UniProtKB/TrEMBL (computationally inferred) that are found in primary interactiondatabases (IMEx, IMEx + legacy data and BioGRID).

PERSPECTIVE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-19942-z

6 NATURE COMMUNICATIONS | (2020) 11:6144 | https://doi.org/10.1038/s41467-020-19942-z | www.nature.com/naturecommunications

Page 7: Towards a unified open access dataset of molecular interactions

interaction. The scoring system takes three factors into account,and uses the CV terms added by the IMEx curators:

(1) How the interaction was observed (interaction detectionmethod; MI:0001)

(2) The type of interaction: e.g., direct interaction, physicalassociation and colocalization. (interaction type; MI:0190)

(3) The number of publications reporting a specific interaction

The results are normalised on a 0–1 scale.Searching the IMEx dataset with the query “intact-miscore:[x

TO y]” enables the user to select data subsets by confidencescore. At the time of writing, the authors recommend a MIscorerange of 0.45–1 to identify medium confidence interactions and0.6–1 for high confidence sets. These thresholds approximately

correspond to interactions found with at least two distinctpieces of evidence (MIscore > 0.45) or those found with three ormore pieces of evidence, obtained with different methodologies(MIscore > 0.6). The MIscore functionality is used by theReactome pathways database Molecular Interaction over-lay50,51, which allows protein–protein or protein–small mole-cule interactions to be superimposed onto a pathway diagram.For example, Fig. 4a shows the tyrosine kinase ZAP70(UniProtKB: P43403) in the Reactome TCR signaling pathwayoverlaid with 9 protein interactors imported from IMEx. Thedefault setting provides fast access to a quarterly updated andlocally hosted version of IMEx data with a MIscore threshold of0.45, which selects ~50% of all interactions in IMEx. Interac-tions with a confidence level equal to or above this thresholdwill be visible in the viewport. The Reactome pathway analysisservice also gives the user the option to set a more stringentMIscore filter using a slider feature, to select alternativeinteraction databases, and to perform an extended analysisthat includes the IMEx interaction dataset served from theIntAct database.

Two studies independently applied data filtering to essentiallythe same network to investigate the biology of LRRK2, a proteinlinked to familial forms of Parkinson’s disease, using the datamade publicly available by IMEx partners. Porras et al.52 filteredthe dataset using MIscore to generate a high-confidence sub-network, which was used to produce a draft list of highconfidence LRRK2 interacting partners. Manzoni et al.53 filteredby the number of publications reporting an interaction and thenperformed Gene Ontology network analysis on the LRRK2interactome with edges identified by 2 or more publications. Bothgroups showed that the LRRK2 network was associated withterms referring to transport, cellular organization, vesicles and thecytoskeleton. Experimental data have since shown LRRK2 to beassociated with selected Rab GTPases, and are now also present inthe IMEx dataset54.

MIscore was designed to be customisable and a differentversion of the algorithm is used to identify binary interactionsfor export to the UniProtKB Interaction lines and to the GeneOntology database. All binary interaction evidence in theIntAct database, including the data generated by spokeexpansion of co-complex data, are clustered to produce anon-redundant set of protein pairs. Each binary pair is thenscored using a variation of MIscore to give a simple addition ofthe accumulated value of a weighted score for the interactiondetection method and the interaction type for each interactionevidence associated with that binary pair. Once the interactionshave been scored, a threshold is established, below which theinteraction is not exported to UniProtKB and to the GeneOntology annotation files. The threshold is stringent enough toensure no interaction is exported based on only a single pieceof experimental evidence. Additional rules ensure that anyprotein pair scoring above the cut-off must also include at leastone piece of evidence of a binary pair, excluding spokeexpanded and co-localisation data, to be exported to Uni-ProtKB and Gene Ontology annotation. The UniProtKBdataset is then displayed in the appropriate entries in anadjacency viewer (Fig. 4b).

The IMEx dataset has also been used as a gold-standardtraining set in text-mining exercises such as the Biocreativecompetitions. These exercises make use of IMEx data partlybecause a linkage between figures and experiments has beenmaintained during the process of IMEx biocuration55.Furthermore, in a subset of IMEx entries, sentences identify-ing interacting entities and the interaction detection methodhave been systematically captured in the ‘source-text’ annota-tion field. These sentences are highlighted for readers on

KIT_HUMAN

P43403 has binary interactions with 10 proteinsb

c

a

MET_HUMAN

LCK_HUMANCBL_HUMAN

CD3E_HUMAN

CD3Z_HUMANTHMS1-1_HUMAN

EGFR_HUMAN

sso1_yeast

snc1_yeast

1

1 200 400 651

290

117

sec9_yeast

PTN22_HUMAN

UBS3B_MOUSE

KIT

_HU

MA

N

ME

T_H

UM

AN

LCK

_HU

MA

NC

BL_

HU

MA

N

CD

3E_H

UM

AN

CD

3Z_H

UM

AN

TH

MS

1-1_

HU

MA

N

EG

FR

_HU

MA

N

PT

N22

_HU

MA

N

UB

S3B

_MO

US

E

ZAP70_HUMAN

ZA

P70

_HU

MA

N

Fig. 4 Presentation of IMEx data in different resources. a, b IMExdata for the human ZAP70 protein (P43403) represented in a Reactomeand b UniProt. c IMEx-derived binding domain data for the S. cerevisiaevesicular SNARE complex SSO1-SEC9-SNC1 (CPX-1365) used in theComplex Portal.

NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-19942-z PERSPECTIVE

NATURE COMMUNICATIONS | (2020) 11:6144 | https://doi.org/10.1038/s41467-020-19942-z | www.nature.com/naturecommunications 7

Page 8: Towards a unified open access dataset of molecular interactions

abstracts and full text articles in Europe PMC via the SciLitetool56.

Characterising protein isoforms and features. Most eukaryoticprotein-coding genes transcribe more than one isoform. Thedifferent functions of isoforms are sometimes known or can beinferred (for example specific isoforms do/do not contain certainfunctional domains), but in many cases the biological significanceof multiple isoforms derived from the same gene is not under-stood. However, the different interaction patterns of associatedisoforms may provide an indication of their different biologicalfunctions by analysing their respective binding partners. In 2013,Talavera et al.57 published an editorial stating that “it is crucial tothe advance of basic and medical research that interactions arereported on an isoform-to-isoform basis and that databasesswitch to a similar approach”. The IMEx databases curate thisinformation, whenever the data is made available by authors,making isoform comparisons possible. UniProtKB identifiersenable curators to differentiate between transcripts being identi-fied at the isoform or canonical (reference sequence) level. Over100,000 interactions in IMEx (~12% of IMEx data) contain spe-cific isoform information, with more than 11,000 records con-taining specific isoform–isoform interactions. The UniProtKBdatabase recently (release 2020_02) refactored the Interactionsection of their records to improve the display of isoform dataimported from IMEx. It is anticipated that the availability of suchdata will increase as protein identification techniques improve oras authors realise the value of such data and include this level ofdetail in publications.

IMEx also captures so-called negative interactions, which willbe of increasing use in the future. These data largely pertain toisoform-specific interactors, and describe cases where certainisoforms of a gene bind to a bait protein, while other isoforms ofthe same gene do not bind to the same bait in the same assaysystem. IMEx curation rules mandate publication of the proteinexpression levels of the negative interactors to exclude poorprotein expression as a reason for the lack of interaction.

To fully comprehend protein interactions, researchers fre-quently need to identify the sequence region to which a moleculebinds and any modification to that sequence. Any change to anamino acid sequence has the potential to influence the moleculeswith which the protein interacts. The IMEx Consortium capturesthese variations, thereby supporting the analysis of their down-stream effects as shown in the examples below.

Binding domains. One critical piece of information captured bythe information-rich IMEx curation model is the minimum‘sufficient binding region’ (MI:0442) or ‘necessary binding region’(MI:0429) of a protein derived from an interaction experiment.When a binding domain maps to a known protein domain, across-reference to the appropriate InterPro entry is added. Cap-ture of data to this level of detail has enabled, for example, animproved understanding of the role of the SH2 domain includinga classification of its target protein specificity58 and the identifi-cation of the WD40 domain as potentially being directly involvedin ncRNA interactions59. The binding regions captured by theIMEx Consortium have also been used for the precise mapping ofbinding domains within protein complexes in the EBI ComplexPortal31, e.g. in the Saccharomyces cerevisiae vesicular SNAREcomplex SSO1-SEC9-SNC1 (CPX-1365) formed by SNARE-SNARE domain binding (Fig. 4c).

Post-translational modifications. PTMs, such as the phosphor-ylation of amino acid side chains, increase the complexity of theproteome and are essential for driving molecular interactions.

These modifications can change PPIs by causing protein oligo-merization and aggregation, binding to or dissociation from otherproteins, protein conformational changes or local unfolding. TheIMEx Consortium differentiates between a ‘prerequisite-ptm’(MI:0638), which is required for an interaction to occur, and an‘observed-ptm’ (MI:0925), which has been experimentally vali-dated but not shown to be required for the interaction. Forexample, a phosphorylation event can introduce a charge in ahydrophobic environment, destabilising an interaction. This canbe systematically described using CV terms such as ‘ptm dis-rupting an interaction’ (MI:1225).

Post-translational cleavage of a polypeptide is also a PTMyielding a mature protein or bioactive peptide chains, which mayhave interaction repertoires very different from those of theoriginating full-length transcript. Using the mature protein chainidentifiers supplied by the UniProtKB database allows IMExcurators to accurately capture the form of the protein used in theassay. This is of particular importance for the annotation of theinteractions of viral proteins where one gene in the viral genomemay encode multiple proteins. The protein interactions of thesepost-processed protein and peptide chains cannot be mean-ingfully described when protein interactions are only captured atthe gene level.

Reversible and transient PTMs transmit and amplify signals ina highly regulated manner by reversible site-specific modulation,and thus play a key role in signal transduction60. PTMs are oftenthe result of an enzyme acting on a substrate and theenzyme–substrate reaction can be taken as evidence of a directinteraction in the IMEx data model. The PTM resulting from thisinteraction is additionally captured, using the ‘resulting-PTM’(MI:0639) term. Cell signalling resources, such as the SIGNORdatabase61 have used the relationships between enzymes andsubstrates from the IMEx dataset and the effects of resultingPTMs on interactions to derive causal interactions.

More recently, the IMEx Consortium has also started tocapture the effects of chemical modifications of RNA molecules,several of which undergo specific nucleotide modifications duringtheir maturation.

Point mutations. To understand how amino acid variationsinfluence protein function and stability, researchers have formany years examined the effect of induced point mutations onprotein interactions. These targeted changes to the amino acidsequence of a protein may mimic known sequence variants,remove post-translational modification sites, disrupt regionsrequired for protein stability or alter the properties of proteinbinding domains. The IMEx Consortium has been collectingthese data29 using CV terms to describe the observed effects suchas ‘mutation decreasing interaction strength’ (MI:0116), ‘muta-tion increasing interaction rate’ (MI:1131) or ‘mutation causingan interaction’ (MI:2227). The curation rules have recently beenextended to include deep mutational scanning data such asdescribed by Woodsmith et al.62. At the time of writing, this setconsists of 58,000 point mutations representing 20,000 interac-tion evidences annotated with differentially reported effects. Tomake these data more readily available to the user community,the IMEx Consortium has recently concatenated this dataset andmade it available in a tab-delimited format (FeatureTAB)29. Thisnew data format includes details of the position and the aminoacid change of the mutation, the interacting molecules and theeffect of the mutation on the interaction.

The IMEx Consortium mutation-specific dataset is available todownload at http://ftp.ebi.ac.uk/pub/databases/intact/current/various/mutations.tsv. The data have already been used toprovide potential mechanisms of action for disease related amino

PERSPECTIVE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-19942-z

8 NATURE COMMUNICATIONS | (2020) 11:6144 | https://doi.org/10.1038/s41467-020-19942-z | www.nature.com/naturecommunications

Page 9: Towards a unified open access dataset of molecular interactions

acid variants, to investigate the destabilising effect of mutationson structural models of protein–protein interfaces, and tobenchmark tools predicting the effect of SNPs on proteinfunction29. In another recent study, both the protein interactionnetwork from IID23 and the mutation-specific interaction datasethave been used to explain survival and treatment resistance inovarian cancer63. Patients with a TP53 p.Arg175His mutationhave a poor prognosis, which may be explained by chemoresis-tance due to a disrupted TP53 interaction with the BCL2complex64,65 as well as several additional mechanisms, ashighlighted in Fig. 5.

The IMEx database also collects information concerning thenucleotides involved in miRNA-mRNA interactions and theeffect of nucleic acid substitutions on binding. These data areaccurately mapped on the mRNA sequence, making them avaluable resource for modelling miRNA interactions and theregulation of gene expression at specific targets. To the best of ourknowledge, the IMEx Consortium is the only group capturing thisinformation.

Protein tags. Protein tags are chemical additions to a molecule toenable its identification and/or purification. They often consist of

peptide sequences genetically attached onto a recombinant pro-tein. As the nature and position of a tag may affect its interactionprofile, this information must be recorded to enable researchersto fully comprehend the consequences of such changes66. Forexample, a tag expressed as either an N- or C-terminal fusion ofEbola Virus VP24 protein identified 48 and 51 interacting humanproteins, respectively, of which only 40 proteins are common toboth fusions67,68.

Contextualising protein interactions with metadata. As detailedabove, the interaction repertoire of a protein depends on manyfactors, not least the cellular environment in which they areexpressed. The detailed metadata now captured in an IMExrecord have been significantly expanded since the work of theConsortium was first described. Most techniques for measuringinteractions do not use native conditions and, when described bythe author in the publication, these data are captured in a sys-tematic way using ontologies and CV terms. In addition to fulldetails on the host organism, experimental methodology andconstruct details, IMEx curators now routinely capture expressionlevel, any full or partial purification of a molecule, and themethod by which constructs are delivered or engineered into a

BindingCatalytic activityChannel regulator activityEnzyme regulator activityRegulation of molecular functionTranscription factor activityReceptor regulator activityStructural molecule activityTranslation regulator activityTransporter activity

Uncategorized

Node color: go molecular functionMutation increased interaction

Mutation disrupted interaction

Mutation affected interaction

Mutation decreasing interaction

BLM

MDM

2

TP53

HD

AC

1

KAT

2BPIN

1MED1

VRK1

NFY

A

NFYB

KM

T2E

VDR

DA

XX

BCL2L1

SMAD2

ET

S2

PIAS4

SAFB

TP63

TOP3A

SREBF2

SREBF1PARD3

DR

OS

HA

MPDZ

NRDC

siH5

71gr

A

Arg

273His

Arg

248Trp

Arg

248T

rp

Arg273His

Arg273His

Arg273His

Arg273His

Arg273His

Arg273His

Arg273His

Arg273His

Arg17

5His

Arg175His

Arg175His

Arg175His

Arg175H

is

Arg175His

Arg175His

Arg

175H

is

Arg

175H

isArg175H

is Arg

175H

is

Arg175His

Arg

175H

is

Fig. 5 Mutation-specific TP53 interaction network. Survival-related patient mutation data57 overlaid onto the mutation-specific TP53 network derivedfrom interactions from IID (version 2020-05) and mutation-specific interaction data from IMEx. Survival-related mutations are shown on the networkedges. It should be noted that many interactions are affected by many more mutations, but these are not necessarily linked to ovarian cancer. The networkis visualised in NAViGaTOR73 ver 3.0.13. Node colour corresponds to GO molecular function, edge colour represents the effect of the mutation on theinteraction.

NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-19942-z PERSPECTIVE

NATURE COMMUNICATIONS | (2020) 11:6144 | https://doi.org/10.1038/s41467-020-19942-z | www.nature.com/naturecommunications 9

Page 10: Towards a unified open access dataset of molecular interactions

cell or expression system. This information allows analysing theeffects of environmental change on interaction patterns. Forexample, it has been noted that the Huh7.5 cell line differs fromthe 7.0 parent cells by a single mutation (p.Thr55Ile) in Retinoicacid-inducible gene I (RIG-I) protein, which impairs interferonsignaling69. Analyses by the MacCarthy group using datauniquely provided by IMEx suggest notable differences in thepattern of host proteins interacting with HCV proteins from thesetwo different cell lines67. Huh7 proteins interacting with proteinsfrom HCV are involved in cancer, apoptosis, immune defenceresponse and cell cycle functions, whilst the equivalent Huh7.5proteins are enriched for protein folding, localisation, andtransport.

Recent changes to both the database model and PSI-MIXML3.0 download format18 allow curators to capture dynamicinteractions, such as changes in sub-network composition atdifferent stages of the cell cycle or in response to changingconcentrations of an agonist/antagonist, pH changes and otherfactors. It is also possible to describe the directionality of areaction or binding event and the result of a directional bindingevent, such as an up- or down-regulation of the target’s activity.Once sufficient data have been accumulated, they will be madeavailable to users in a new tab-delimited format70.

Future and sustainabilitySince first described in 2012, the IMEx Consortium has gainednew member data resources and curation groups; these includethe UniProt Consortium and the Functional Gene Annotationgroup at University College London. Some previous memberresources ― such as MPIDB ― are no longer in existence,but the data remain in the IntAct database and are updated witheach release of UniProt. The IMEx Consortium has releasedalmost 900,000 interaction evidences to IMEx curation standardsand continues to provide access to another 100,000 legacy binaryevidences. It has expanded the IMEx dataset to include newinteractor types, new methodologies and new data types such asdynamic interactions. The Consortium remains open to theparticipation of new partners, and will make access to the IntActeditorial tool, curation training and data quality control availableon request. For detailed information on both IMEx membershipand on data deposition, please see https://www.imexconsortium.org.

Data producers can contribute to the IMEx project in three keyways. First, by depositing interaction data with one of the Con-sortium partners as an integral part of the publication process.Second, by always clearly identifying all the constructs used inany interaction experiment71, ideally by the addition of anaccession number from a database such as UniProtKB or bymaking the species of origin of any clone very clear (for example“Human hemagglutinin (HA)-tagged RRP1B (Q14684)….”).Additional sequence detail will enable mapping to the correctisoform, when relevant. Third, data producers can request thecuration of papers, particularly when these supply interactionsmissing from an interactome, or bring additional details to aninteraction that is not already present in the dataset. Researchersseeking assistance with these requirements may contact [email protected].

The IMEx Consortium received EC funding to establish itselfand has more recently received UK BBSRC and NIH grants.Currently, the consortium relies on localised, national fundingand research grants to maintain resources. IMEx is the firstConsortium to be recognised as an ELIXIR core resource, high-lighting the importance this organisation places on databasecollaboration and data sharing. ELIXIR has supplied local fundingto support member databases and to fund Consortia meetings. It

is hoped that the recognition by ELIXIR will result in longer termfunding for Consortium-wide activities. The Global BiodataCoalition72 is currently looking to extend this model of sustain-able funding for core-data resources in the life sciences givinghope for the long-term future of key resources such as the IMExConsortium.

Received: 17 April 2020; Accepted: 9 November 2020;

References1. Das, J. et al. Exploring mechanisms of human disease through structurally

resolved protein interactome networks. Mol. Biosyst. 10, 9–17 (2014).2. Correia, F. B., Coelho, E. D., Oliveira, J. L. & Arrais, J. P. Handling noise in

protein interaction networks. Biomed. Res. Int. 2019, 8984248 (2019).3. Luck, K., Sheynkman, G. M., Zhang, I. & Vidal, M. Proteome-scale human

interactomics. Trends Biochem. Sci. 42, 342–354 (2017).4. von Mering, C. et al. Comparative assessment of large-scale data sets of

protein-protein interactions. Nature 417, 399–403 (2002).5. Venkatesan, K. et al. An empirical framework for binary interactome

mapping. Nat. Methods 6, 83–90 (2009).6. Choi, S. G. et al. Maximizing binary interactome mapping with a minimal

number of assays. Nat. Commun. 10, 3907 (2019).7. Orchard, S. et al. Protein interaction data curation: the International

Molecular Exchange (IMEx) consortium. Nat. Methods 9, 345–350 (2012).First description of the IMEx Consortium and detailing of the curationmodel.

8. Orchard, S. et al. The MIntAct project–IntAct as a common curation platformfor 11 molecular interaction databases. Nucleic Acids Res. 42, D358–D363(2014).

9. Xenarios, I. et al. DIP: The Database of Interacting Proteins: 2001 update.Nucleic Acids Res. 29, 239–241 (2001).

10. UniProt Consortium. UniProt: a worldwide hub of protein knowledge. NucleicAcids Res. 47, D506–D515 (2019).

11. Clerc, O. et al. MatrixDB: integration of new data with a focus onglycosaminoglycan interactions. Nucleic Acids Res. 47, D376–D381 (2019).

12. Breuer, K. et al. InnateDB: systems biology of innate immunity andbeyond–recent updates and continuing curation. Nucleic Acids Res. 41,D1228–D1233 (2013).

13. Ammari, M. G., Gresham, C. R., McCarthy, F. M. & Nanduri, B. HPIDB 2.0: acurated database for host-pathogen interactions. Database 2016, baw103(2016).

14. Pastrello, C., Kotlyar, M. & Jurisica, I. Informed use of protein-proteininteraction data: a focus on the Integrated Interactions Database (IID).Methods Mol. Biol. 2074, 125–134 (2020).

15. Drysdale, R. et al. The ELIXIR Core Data Resources: fundamentalinfrastructure for the life sciences. Bioinformatics 36, 2636–2642 (2020).

16. Orchard, S. & Hermjakob, H. Shared resources, shared costs–leveragingbiocuration resources. Database. https://doi.org/10.1093/database/bav009(2015).

17. Kerrien, S. et al. Broadening the horizon–level 2.5 of the HUPO-PSI formatfor molecular interactions. BMC Biol. 5, 44 (2007). Description of the formats,standards and controlled vocabularies which enable release of the IMEx dataset in a single format which is re-usable by other resources.

18. Sivade Dumousseau, M. et al. Encompassing new use cases - level 3.0 of theHUPO-PSI format for molecular interactions. BMC Bioinform. 19, 134 (2018).

19. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific datamanagement and stewardship. Sci. Data 3, 160018 (2016).

20. Aranda, B. et al. PSICQUIC and PSISCORE: accessing and scoring molecularinteractions. Nat. Methods 8, 528–529 (2011).

21. del-Toro, N. et al. A new reference implementation of the PSICQUIC webservice. Nucleic Acids Res. 41, W601–W606 (2013).

22. Calderone, A., Castagnoli, L. & Cesareni, G. mentha: a resource for browsingintegrated protein-interaction networks. Nat. Methods 10, 690–691 (2013).

23. Kotlyar, M., Pastrello, C., Malik, Z. & Jurisica, I. IID 2018 update: context-specific physical protein-protein interactions in human, model organisms anddomesticated species. Nucleic Acids Res. 47, D581–D589 (2019).

24. Kotlyar, M., Pastrello, C., Sheahan, N. & Jurisica, I. Integrated interactionsdatabase: tissue-specific view of the human and model organism interactomes.Nucleic Acids Res. 44, D536–D541 (2016).

25. Launay, G., Salza, R., Multedo, D., Thierry-Mieg, N. & Ricard-Blum, S.MatrixDB, the extracellular matrix interaction database: updated content, anew navigator and expanded functionalities. Nucleic Acids Res. 43,D321–D327 (2015).

PERSPECTIVE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-19942-z

10 NATURE COMMUNICATIONS | (2020) 11:6144 | https://doi.org/10.1038/s41467-020-19942-z | www.nature.com/naturecommunications

Page 11: Towards a unified open access dataset of molecular interactions

26. Perfetto, L. et al. The IMEx Coronavirus interactome: an evolving map ofCoronaviridae-Host molecular interactions. Database (Oxford). 2020:baaa096.https://doi.org/10.1093/database/baaa096 (2020).

27. Burns, G. A., Li, X. & Peng, N. Building deep learning models for evidenceclassification from the open access biomedical literature. Database 2019,baz034 (2019).

28. Mitchell, A. L. et al. InterPro in 2019: improving coverage, classification andaccess to protein sequence annotations. Nucleic Acids Res. 47, D351–D360(2019).

29. IMEx Consortium Curators. Capturing variation impact on molecularinteractions in the IMEx Consortium mutations data set. Nat. Commun. 10,10 (2019). IMEx dataset describing the effects of targeted mutations of theamino acid sequence of a protein on molecular interactions.

30. Hastings, J. et al. ChEBI in 2016: improved services and an expandingcollection of metabolites. Nucleic Acids Res. 44, D1214–D1219 (2016).

31. Meldal, B. H. M. et al. The complex portal–an encyclopaedia ofmacromolecular complexes. Nucleic Acids Res. 43, D479–D484 (2015).

32. Yates, A. D. et al. Ensembl 2020. Nucleic Acids Res. 48, D682–D688 (2020).33. Amid, C. et al. The European Nucleotide Archive in 2019. Nucleic Acids Res.

48, D70–D76 (2020).34. The RNAcentral Consortium. RNAcentral: a hub of information for non-

coding RNA sequences. Nucleic Acids Res. 47, D1250–D1251 (2019).35. Oughtred, R. et al. The BioGRID interaction database: 2019 update. Nucleic

Acids Res. 47, D529–D541 (2019).36. Szklarczyk, D. et al. STRING v11: protein-protein association networks with

increased coverage, supporting functional discovery in genome-wideexperimental datasets. Nucleic Acids Res. 47, D607–D613 (2019).

37. Alanis-Lobato, G., Andrade-Navarro, M. A. & Schaefer, M. H. HIPPIE v2.0:enhancing meaningfulness and reliability of protein-protein interactionnetworks. Nucleic Acids Res. 45, D408–D414 (2017).

38. Lex, A., Gehlenborg, N., Strobelt, H., Vuillemot, R. & Pfister, H. UpSet:Visualization of Intersecting Sets. IEEE Trans. Vis. Comput. Graph. 20,1983–1992 (2014).

39. Combe, C. W. et al. ComplexViewer: visualization of curated macromolecularcomplexes. Bioinformatics 33, 3673–3675 (2017).

40. Sivade Dumousseau, M. et al. JAMI: a Java library for molecular interactionsand data interoperability. BMC Bioinform. 19, 133 (2018).

41. Smith, R. N. et al. InterMine: a flexible data warehouse system for theintegration and analysis of heterogeneous biological data. Bioinformatics 28,3163–3165 (2012).

42. Balakrishnan, R. et al. YeastMine–an integrated data warehouse forSaccharomyces cerevisiae data as a multipurpose tool-kit. Database 2012,bar062 (2012).

43. Watkins, X., Garcia, L. J., Pundir, S. & Martin, M. J. & UniProt Consortium.ProtVista: visualization of protein sequence annotations. Bioinformatics 33,2040–2041 (2017).

44. Gentleman, R. C. et al. Bioconductor: open software development forcomputational biology and bioinformatics. Genome Biol. 5, R80 (2004).

45. Rahmati, S. et al. pathDIP 4: an extended pathway annotations andenrichment analysis resource for human, model organisms and domesticatedspecies. Nucleic Acids Res. 48, D479–D488 (2020).

46. Lo Surdo, P. et al. DISNOR: a disease network open resource. Nucleic AcidsRes. 46, D527–D534 (2018).

47. Iannuccelli, M. et al. CancerGeneNet: linking driver genes to cancer hallmarks.Nucleic Acids Res. 48, D416–D421 (2020).

48. Koscielny, G. et al. Open Targets: a platform for therapeutic targetidentification and validation. Nucleic Acids Res. 45, D985–D994(2017).

49. Sacco, F. et al. Combining affinity proteomics and network context to identifynew phosphatase substrates and adapters in growth pathways. Front. Genet. 5,115 (2014).

50. Villaveces, J. M. et al. Merging and scoring molecular interactions utilisingexisting community standards: tools, use-cases and a case study. Database2015, bau131 (2015).

51. Croft, D. et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 42,D472–D477 (2014).

52. Porras, P. et al. A visual review of the interactome of LRRK2: Using deep-curated molecular interaction data to represent biology. Proteomics 15,1390–1404 (2015).

53. Manzoni, C., Denny, P., Lovering, R. C. & Lewis, P. A. Computational analysisof the LRRK2 interactome. PeerJ 3, e778 (2015).

54. Steger, M. et al. Phosphoproteomics reveals that Parkinson’s disease kinaseLRRK2 regulates a subset of Rab GTPases. Elife 5, e12813 (2016).

55. Burns, G. A. P. C., Dasigi, P., de Waard, A. & Hovy, E. H. Automateddetection of discourse segment and experimental types from the text of cancerpathway results sections. Database 2016, 1–12 (2016).

56. Levchenko, M. et al. Europe PMC in 2017. Nucleic Acids Res. 46,D1254–D1260 (2018).

57. Talavera, D., Robertson, D. L. & Lovell, S. C. Alternative splicing and proteininteraction data sets. Nat. Biotechnol. 31, 292–293 (2013). Publicationarguing that isoform-to-isoform protein interactions should be reported toreflect the isoform-based specificity of interactions.

58. Tinti, M. et al. The SH2 domain interaction landscape. Cell Rep. 3, 1293–1305(2013).

59. Panni, S., Prakash, A., Bateman, A. & Orchard, S. The yeast noncoding RNAinteraction network. RNA 23, 1479–1492 (2017).

60. Lee, M. J. & Yaffe, M. B. Protein regulation in signal transduction. Cold SpringHarb. Perspect. Biol. 8, a005918 (2016).

61. Licata, L. et al. SIGNOR 2.0, the SIGnaling Network Open Resource 2.0: 2019update. Nucleic Acids Res. 48, D504–D510 (2020).

62. Woodsmith, J. et al. Protein interaction perturbation profiling at amino-acidresolution. Nat. Methods 14, 1213–1221 (2017).

63. Mandilaras, V. et al. TP53 mutations in high grade serous ovarian cancer andimpact on clinical outcomes: a comparison of next generation sequencing andbioinformatics analyses. Int. J. Gynecol. Cancer. https://doi.org/10.1136/ijgc-2018-000087. (2019).

64. Ferlini, C. et al. Bcl-2 down-regulation is a novel mechanism of paclitaxelresistance. Mol. Pharmacol. 64, 51–58 (2003).

65. Makhija, S., Taylor, D. D., Gibb, R. K. & Gerçel-Taylor, C. Taxol-induced bcl-2 phosphorylation in ovarian cancer cell monolayer and spheroids. Int. J.Oncol. 14, 515–521 (1999).

66. Rajagopala, S. V., Hughes, K. T. & Uetz, P. Benchmarking yeast two-hybridsystems using the interactions of bacterial motility proteins. Proteomics 9,5296–5302 (2009).

67. Ammari, M., McCarthy, F. & Nanduri, B. Leveraging experimental details foran improved understanding of host-pathogen interactome. Curr. Protoc.Bioinform. 61, 8.26.1–8.26.12 (2018).

68. García-Dorival, I. et al. Elucidation of the Ebola virus VP24 cellularinteractome and disruption of virus biology through targeted inhibition ofhost-cell protein function. J. Proteome Res. 13, 5120–5135 (2014).

69. Liu, H. M. & Gale, M. Hepatitis C virus evasion from RIG-I-dependenthepatic innate immunity. Gastroenterol. Res. Pract. 2010, 548390 (2010).

70. Perfetto, L. et al. CausalTAB: the PSI-MITAB 2.8 updated format forsignalling data representation and dissemination. Bioinformatics 35,3779–3785 (2019).

71. Orchard, S. et al. The minimum information required for reporting amolecular interaction experiment (MIMIx). Nat. Biotechnol. 25, 894–898(2007).

72. Anderson, W. P. & Global Life Science Data Resources Working Group.Data management: a global coalition to sustain core data. Nature 543, 179(2017).

73. Brown, K. R. et al. NAViGaTOR: Network analysis, visualization and graphingToronto. Bioinformatics 25, 3327–3329 (2009).

AcknowledgementsThe IntAct team at EMBL-EBI received funding from EMBL core funding, Open Targets(grant agreements OTAR-044 and OTAR02-048) and the Wellcome Trust (BiomedicalResources grant INVAR #3367). This work by UniProt was supported by the NationalEye Institute (NEI), National Human Genome Research Institute (NHGRI), NationalHeart, Lung, and Blood Institute (NHLBI), National Institute on Aging (NIA), NationalInstitute of Allergy and Infectious Diseases (NIAID), National Institute of Diabetes andDigestive and Kidney Diseases (NIDDK, National Institute of General Medical Sciences(NIGMS), National Cancer Institute (NCI) and National Institute of Mental Health(NIMH) of the National Institutes of Health under Award Number [U24HG007822] (thecontent is solely the responsibility of the authors and does not necessarily represent theofficial views of the National Institutes of Health) and EMBL core funding and OpenTargets (grant agreements OTAR02-048). Support for IID was supplied in part byOntario Research Fund (#34876), Natural Sciences Research Council (NSERC #203475)and Canada Foundation for Innovation (CFI #29272, #225404 and #33536). MatrixDBwas supported by the Fondation pour la Recherche Médicale (grant DBI20141231336 toS.R.B.) and by the Institut Français de Bioinformatique (GlycoMatrixDB project- AAPIFB 2015 to S.R.B.). InnateDB was previously supported by Genome BC through thePathogenomics of Innate Immunity (PI2) project and by the Foundation for the NationalInstitutes of Health and the Canadian Institutes of Health Research under the GrandChallenges in Global Health Research Initiative [Grand Challenges ID: 419]. Furtherfunding was also provided by AllerGen grants 12ASI1 and 12B&B2. D.J.L. is currentlysupported by an EMBL Australia Group Leader award. The Database of InteractingProteins (DIP) was supported by National Institute of General Medical Sciences grantR01GM123126.

Author contributionsS.O. and P.P. wrote this manuscript with revisions by A.B., H.H. D.J.L, I.J, L.L., R.C.L.and M.P.; P.P., A.B., M.D., G.C., M.I., L.L., M.K., B.M., B.N., C.P., K.P., L.P., S.P., N.R.,

NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-19942-z PERSPECTIVE

NATURE COMMUNICATIONS | (2020) 11:6144 | https://doi.org/10.1038/s41467-020-19942-z | www.nature.com/naturecommunications 11

Page 12: Towards a unified open access dataset of molecular interactions

P.R., S.R.-B., G.S. and S.O. contributed to the dataset generation; and E.B., N.d.-T., A.S.and L.S. contributed to its maintenance and release.

Competing interestsThe authors declare no competing interests.

Additional informationCorrespondence and requests for materials should be addressed to S.O.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations.

Open Access This article is licensed under a Creative CommonsAttribution 4.0 International License, which permits use, sharing,

adaptation, distribution and reproduction in any medium or format, as long as you giveappropriate credit to the original author(s) and the source, provide a link to the CreativeCommons license, and indicate if changes were made. The images or other third partymaterial in this article are included in the article’s Creative Commons license, unlessindicated otherwise in a credit line to the material. If material is not included in thearticle’s Creative Commons license and your intended use is not permitted by statutoryregulation or exceeds the permitted use, you will need to obtain permission directly fromthe copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2020

PERSPECTIVE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-19942-z

12 NATURE COMMUNICATIONS | (2020) 11:6144 | https://doi.org/10.1038/s41467-020-19942-z | www.nature.com/naturecommunications