R. Meersman, Z. Tari, P. Herrero et al. (Eds.): OTM Workshops 2006, LNCS 4277, pp. 710 – 719, 2006. © Springer-Verlag Berlin Heidelberg 2006

Reactome – A Knowledgebase of Biological Pathways

Esther Schmidt, Ewan Birney, David Croft, Bernard de Bono, Peter D'Eustachio, Marc Gillespie, Gopal Gopinath, Bijay Jassal, Suzanna Lewis, Lisa Matthews, Lincoln Stein, Imre Vastrik, and Guanming Wu

European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
{eschmidt, birney, croft, bdb, bj1, vastrik}@ebi.ac.uk

Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
{eustachi, gillespm, gopinath, lisa.matthews, lstein, wugm}@cshl.edu

Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, California, USA
[email protected]

Abstract. Reactome (www.reactome.org) is a curated database describing very diverse biological processes in a computationally accessible format. The data are provided by experts in the field and are subject to a peer review process. The core unit of the Reactome data model is the reaction. The entities participating in reactions form a network of biological interactions. Reactions are grouped into pathways. Reactome data are cross-referenced to a wide selection of publicly available databases (such as UniProt, Ensembl, GO, PubMed), facilitating overall integration of biological data. In addition to the manually curated, mainly human reactions, electronically inferred reactions for a wide range of other species are presented on the website. All Reactome reactions are displayed as arrows on a Reactionmap. The Skypainter tool allows visualisation of user-supplied data by colouring the Reactionmap. Reactome data are freely available and can be downloaded in a number of formats.

Keywords: knowledgebase, pathways, biological processes.

1 Introduction

The Human Genome Project has provided us with huge amounts of data, including a good approximation of the encoded components that make up a living cell [1]. This was an important step, but the next challenge is to understand how all these components actually interact in a cell, and how they bring about the many processes essential for life. Many of these individual processes have already been studied extensively, but the resulting information is spread throughout the scientific literature. There is an urgent need to bring this information together, allowing the scientist to see connections and understand dependencies in this complex network of interacting entities.

The Reactome database has been developed to provide such a platform, not only to collect information in one place, but also to present it in a systematic, computationally accessible manner. An important aspect of this project is the free availability of all data, and the integration with other publicly available databases, so that the user can extract as much relevant information as possible about a process of interest.

2 Contents and Curation Process

Reactome aims at comprehensive coverage of human cellular processes, whether metabolic reactions catalyzed by an enzyme, complex formation, transport across a membrane, DNA repair or signal transduction. At present, the topics included are apoptosis, cell cycle, transcription, mRNA processing, translation, post-translational modification, signalling pathways (insulin, Notch), hemostasis, energy metabolism (TCA cycle, glycolysis), amino acid, lipid and nucleotide metabolism as well as xenobiotic metabolism (see Fig. 1).

Reactome is a manually curated database. Data are obtained directly from experts. Suitable topics for inclusion in Reactome are identified, and independent researchers who are recognized in the field are approached. A Reactome curator and the expert then work together to agree on an outline and to structure the data to conform to the Reactome data model. A major emphasis is put on confirming the exact identity of the entities involved by assigning the appropriate identifiers from UniProt [2] for proteins or ChEBI (www.ebi.ac.uk/chebi/) for chemical entities. All reactions need to be backed up by a literature reference, stating the PubMed identifier whenever available. The curator also makes sure that appropriate Gene Ontology (GO) [3] terms are cross-referenced for catalytic activities, cellular locations and biological processes. This curation process is followed by an internal review to ensure consistency, and by peer review by another expert familiar with the topic.

3 Website

The front page of the Reactome website (Fig. 1) displays a Reactionmap, followed by a section listing the high-level topics represented in Reactome, and a section giving some general information on the project including the latest news. In the Reactionmap each reaction is represented as an arrow. The topics are arranged as distinct patterns in the map so that users can easily identify their area of interest. A separate Reactionmap can be displayed for each species, providing a quick overview of which pathways are present or absent in a given species. Mousing over the Reactionmap or the topic section highlights the corresponding area in the other section, and both can serve as entry points into the detailed content pages of Reactome.

The detailed content pages (Fig. 2) present the event hierarchy, display diagrams indicating the entities involved in the event as well as preceding and following events, and are often accompanied by an author-supplied illustration. A textual description of the event, literature references, and species and compartment details are given. Orthologous events in other species and relevant GO biological process terms are found here as well. Physical entities are given along with relevant links to publicly available databases such as UniProt, Ensembl [4], KEGG [5] and ChEBI. Internal links take the user to more detailed pages on the entity in question, for example giving information on the composition of a complex.

Fig. 1. The Reactome front page. The list of topics in Reactome and the Reactionmap are shown for human by default. Electronically inferred events for 22 other species can be displayed by choosing the species from the drop-down menu above the Reactionmap.

Other features of the website include a User Guide, information on the data model as well as citing and linking to Reactome, and an editorial calendar, indicating topics planned for inclusion in the future. Simple searches can be performed on every content page, and there is an extended search option for users familiar with the data model.

Fig. 2. A detailed content page

Fig. 3. A Skypainter result page for time series data

The Skypainter tool (Fig. 3) gives the user the option of submitting a list of identifiers, for example from a microarray experiment, in order to visualise reactions that match these identifiers on the Reactionmap. A statistical analysis based on the hypergeometric test (http://en.wikipedia.org/wiki/Hypergeometric_distribution) is performed to colour pathways according to the statistical likelihood that they would contain the listed genes by chance. This highlights those pathways in which the uploaded genes are overrepresented. The Skypainter recognises a large number of gene identifier types, including EntrezGene names, accession numbers and Affymetrix probe sets. It also accepts numeric values, such as expression levels from a microarray experiment. For example, a researcher who is using a microarray to compare a cancerous tissue to a normal control can upload the intensity values from the two experiments to the Skypainter, and it will colour the Reactionmap red and green to indicate which reactions have genes that are increased or decreased in the malignant cells relative to the normal controls. When time series data are submitted, the changes across the Reactionmap can be displayed as an animation, showing the coloured Reactionmaps in succession. Reactions hit by an identifier are listed below the Reactionmap, with links to the corresponding detailed content pages.
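The overrepresentation test described above can be illustrated with a short Python sketch. It is not Skypainter's implementation; it only shows a one-sided hypergeometric test of the kind the text refers to. The function name, its parameters and the numbers in the usage line are assumptions made for illustration.

```python
from math import comb


def hypergeom_pvalue(population: int, pathway_size: int,
                     sample_size: int, hits: int) -> float:
    """Probability of seeing at least `hits` pathway genes by chance.

    population   -- total number of genes that could have been submitted
    pathway_size -- number of those genes that belong to the pathway
    sample_size  -- number of genes in the user's list
    hits         -- number of listed genes that fall in the pathway
    """
    upper = min(pathway_size, sample_size)
    total = comb(population, sample_size)
    # Sum the hypergeometric probabilities for hits, hits + 1, ..., upper.
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, sample_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Illustrative numbers only: 20000 genes, a 100-gene pathway,
# a 200-gene list with 5 genes in the pathway.
p = hypergeom_pvalue(20000, 100, 200, 5)
print(f"p-value: {p:.3g}")
```

A small p-value means the pathway contains more of the submitted genes than chance would predict, which is the property used to choose its colour.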

4 Data Model

The Reactome data model has been developed to allow the representation of a variety of different cellular processes: in metabolic reactions, a chemical entity is transformed into a different chemical entity by an enzyme acting as catalyst; molecules are transported from one cellular compartment into another across membranes; proteins undergo post-translational modifications, or form complexes; a protein, modified in one reaction, may act as a catalyst in a following reaction; a chemical entity, produced by a metabolic reaction, may be required as an activator of a reaction in a signaling cascade. Thus, the data model must not only describe individual reactions; it must also handle different kinds of reactions in a consistent manner so that the relationships and interactions of the entities in this network can be expressed.

Reactome uses a frame-based knowledge representation. Concepts (such as reaction, complex, regulation) are expressed in classes, and the knowledge is represented as instances of these classes. Classes have attributes attached that hold properties of the instances, e.g. the identity of entities participating in a reaction.

One of the main classes of the Reactome data model is the Reaction. Its main attributes are input, output (both accepting instances of the PhysicalEntity class) and catalystActivity (accepting an instance of the CatalystActivity class, which in turn holds a Gene Ontology molecular function term under activity and an instance of the PhysicalEntity class under the physicalEntity attribute). Other attributes include compartment, which holds a Gene Ontology cellular component term to indicate the cellular location of the reaction, literatureReference, summation, figure, species, goBiologicalProcess and precedingEvent. Related reactions in other species are attached via the orthologousEvent and inferredFrom attributes. Reactions are grouped into Pathways, which can again be components of higher-level pathways.
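As a rough illustration of the classes and attributes just described, the following Python sketch models a few of them as dataclasses. The snake_case field names mirror the attribute names above. The `components` field on `Pathway`, the `reactions_in` helper and all example values are hypothetical names chosen for this sketch, not part of the Reactome schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class PhysicalEntity:
    name: str
    compartment: str  # GO cellular component term


@dataclass
class CatalystActivity:
    activity: str  # GO molecular function term
    physical_entity: PhysicalEntity


@dataclass
class Reaction:
    name: str
    input: List[PhysicalEntity]
    output: List[PhysicalEntity]
    catalyst_activity: Optional[CatalystActivity] = None
    literature_reference: List[str] = field(default_factory=list)
    preceding_event: List[Reaction] = field(default_factory=list)
    inferred_from: Optional[Reaction] = None


@dataclass
class Pathway:
    name: str
    # Hypothetical field name: a pathway holds reactions and sub-pathways.
    components: List[Union[Reaction, Pathway]] = field(default_factory=list)


def reactions_in(pathway: Pathway) -> List[Reaction]:
    """Flatten a pathway hierarchy into its reactions, depth first."""
    found: List[Reaction] = []
    for part in pathway.components:
        if isinstance(part, Pathway):
            found.extend(reactions_in(part))
        else:
            found.append(part)
    return found


# Placeholder entities and terms, for illustration only.
substrate = PhysicalEntity("substrate", "cytosol")
product = PhysicalEntity("product", "cytosol")
enzyme = PhysicalEntity("enzyme", "cytosol")
step = Reaction("substrate -> product", [substrate], [product],
                CatalystActivity("GO molecular function (placeholder)", enzyme))
top = Pathway("top-level pathway", [Pathway("sub-pathway", [step])])
print([r.name for r in reactions_in(top)])
```

Allowing a `Pathway` to contain other pathways is what lets the same structure represent both a single reaction step and a high-level topic.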

Another important class is the PhysicalEntity. This class is subdivided into GenomeEncodedEntity to hold species-specific molecules, SimpleEntity for other chemical entities such as ATP, Complex and Polymer for entities with more than one component, and EntitySet for groups of entities that can function interchangeably in a given context. Post-translational modifications are expressed through the hasModifiedResidue attribute, which holds an instance of the class ModifiedResidue that in turn contains attributes describing the nature of the modification. Modified proteins are treated as distinct entities from unmodified proteins, and molecules in one cellular compartment are distinct entities from molecules in another compartment.

PhysicalEntity instances that represent the same chemical entity in different compartments, or different modified forms of the same protein, share numerous invariant features such as names, molecular structure and links to external databases like UniProt or ChEBI. To store this shared information in a single place, and to create an explicit link among all the variant forms of what can also be seen as a single chemical entity, Reactome creates instances of the separate ReferenceEntity class. A ReferenceEntity instance captures the invariant features of a molecule. A PhysicalEntity instance is then the combination of a ReferenceEntity attribute (e.g., Glycogen phosphorylase UniProt:P06737) and attributes giving specific conditional information (e.g., localization to the cytosol and phosphorylation on serine residue 14).
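The split between invariant and conditional information can be sketched in Python as follows, using the glycogen phosphorylase example from the text. The class layout and field names are assumptions made for this sketch. In particular, representing a modification as a plain string is a simplification of the ModifiedResidue class.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReferenceEntity:
    """Invariant features of a molecule, stored once."""
    database: str
    identifier: str
    name: str


@dataclass(frozen=True)
class PhysicalEntity:
    """One concrete form of a molecule: reference plus conditions."""
    reference_entity: ReferenceEntity
    compartment: str
    modified_residues: Tuple[str, ...] = ()  # simplified ModifiedResidue


glycogen_phosphorylase = ReferenceEntity(
    "UniProt", "P06737", "Glycogen phosphorylase")

unmodified = PhysicalEntity(glycogen_phosphorylase, "cytosol")
phosphorylated = PhysicalEntity(
    glycogen_phosphorylase, "cytosol", ("phosphorylation on serine 14",))

# The two forms are distinct entities that share one reference.
print(unmodified != phosphorylated)
print(unmodified.reference_entity is phosphorylated.reference_entity)
```

Because both forms point at the same `ReferenceEntity`, a lookup by UniProt accession can reach every variant of the protein.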

5 Quality Assurance

For any database, data consistency is of utmost importance. In order to provide reliable data, a series of quality assurance tests is performed before data are publicly released in Reactome. The main features of this procedure are a check for the presence of attributes that are mandatory for a given class, and a check for imbalances between input and output protein entities. The latter is based on the principle that proteins that act as input for a reaction need to be present in some (possibly modified) form in the output of the reaction as well, except in synthesis or degradation reactions. Other checks ensure consistency in terms of cellular compartments or species origin for the entities involved. These automated checks are performed in addition to the external and internal review processes mentioned above.
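The input/output balance check can be sketched in a few lines of Python. This is a simplified illustration of the principle stated above, not the actual Reactome test suite. Each entity is reduced to the set of protein accessions it contains, and the function name and the accessions `PROT_A` and `PROT_B` are hypothetical.

```python
from typing import Iterable, Set


def proteins_missing_from_output(
    input_entities: Iterable[Set[str]],
    output_entities: Iterable[Set[str]],
) -> Set[str]:
    """Return input proteins that do not reappear in any output entity.

    Each entity is given as the set of protein accessions it contains,
    so a complex contributes all of its components. Modified forms share
    the accession of the unmodified protein and therefore still match.
    """
    inputs = set().union(*input_entities)
    outputs = set().union(*output_entities)
    return inputs - outputs


# Complex formation: two proteins bind, both are found in the product.
balanced = proteins_missing_from_output(
    [{"PROT_A"}, {"PROT_B"}], [{"PROT_A", "PROT_B"}])

# A curation slip: PROT_B was left out of the output.
unbalanced = proteins_missing_from_output(
    [{"PROT_A"}, {"PROT_B"}], [{"PROT_A"}])

print(balanced, unbalanced)
```

A non-empty result would flag the reaction for review, unless it is a synthesis or degradation reaction, for which the check does not apply.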

6 Orthology Inference

The main focus of Reactome is manual curation of human biological processes. In some cases, the experimental evidence for a biological reaction has been obtained in another species, and the occurrence of this reaction in human can only be inferred by the experts. Reactome deals with this scenario by describing the reaction in the other species, backed up by a literature reference. A human reaction is then described as well, pointing to the reaction in the other species via the inferredFrom relationship. Thus the evidence can always be traced back to the original experiment.

In addition to these manually curated events in other species, Reactome also provides electronically inferred reactions. All human reactions that involve at least one protein with known sequence and are not themselves inferred from the other species under consideration, are eligible for orthology inference. Eligible reactions are submitted to an automated protocol, attempting inference to 22 other species. The rationale for orthology inference is that if all proteins involved in a human reaction have an orthologous protein in the other species, the reaction is likely to occur in the other species as well. A Reactome reaction is then created for the other species, with all species-unspecific entities copied over, and species-specific entities replaced by the respective orthologous entities. Such reactions are marked as electronically inferred reactions and point to the manually curated human reaction via the inferredFrom relationship.

Orthology relationships between proteins are obtained from the OrthoMCL database, which provides orthologue groups based on sequence similarity clustering [6] [7].

When applying the strict inference criteria where each protein needs to have an orthologue, reactions involving large complexes often get excluded from inference. To allow for inter-species differences in complex composition, the criteria are relaxed for complexes such that reactions are inferred to the other species when at least 75% of the protein components in a complex have orthologues.
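The inference criteria described in this section can be summarised in a small Python sketch. It assumes a simplified representation in which each reaction participant is a list of protein identifiers plus a flag saying whether it is a complex. The names `Participant` and `can_infer` and the identifiers are hypothetical, and the real protocol also handles steps not shown here, such as building the inferred reaction.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Participant:
    proteins: List[str]       # empty for species-independent entities
    is_complex: bool = False


def can_infer(participants: List[Participant],
              with_orthologue: Set[str]) -> bool:
    """Decide whether a human reaction is inferred to another species.

    with_orthologue -- human proteins that have an orthologue there.
    """
    if not any(p.proteins for p in participants):
        return False  # not eligible: no protein is involved
    for p in participants:
        if not p.proteins:
            continue  # e.g. a small molecule, copied over unchanged
        covered = sum(1 for prot in p.proteins if prot in with_orthologue)
        if p.is_complex:
            # Relaxed rule: at least 75% of the components (integer test).
            if 4 * covered < 3 * len(p.proteins):
                return False
        elif covered != len(p.proteins):
            return False  # strict rule for a single protein
    return True


orthologues = {"P1", "P2", "P3", "P5"}
complex_of_four = Participant(["P1", "P2", "P3", "P4"], is_complex=True)
small_molecule = Participant([])

print(can_infer([complex_of_four, small_molecule], orthologues))
print(can_infer([complex_of_four, Participant(["P6"])], orthologues))
```

In the first call three of the four complex components have orthologues, which meets the 75% threshold. In the second, the single protein `P6` has none, so the strict rule blocks inference.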

Fig. 4. Result of orthology inference. The percentage of eligible human reactions inferred to each species is given. The total number of eligible human reactions in release 18 was 1590.

The species included in the Reactome inference procedure cover a wide range of phylogenetic groups and include model organisms such as mouse, Drosophila and C. elegans. For Reactome release 18, 1442 of the 1590 eligible human reactions (91%) were inferred to mouse, 904 (57%) to Drosophila and 348 (22%) to E. coli. Fig. 4 shows a graph with inference figures for all species.

Such electronic predictions need to be considered with caution, as sequence similarity does not necessarily imply functional equivalence. However, they can serve both as a starting point for manual curation in these species and as entry points into the database via protein identifiers from non-human species.

7 Downloads

All Reactome data and software are available free of charge to all users. Various download formats are available via the website. SBML [8] and BioPAX (http://www.biopax.org/index.html) are exchange formats for systems biology and pathway databases, respectively. Textual descriptions of pathways together with illustrations can also be downloaded in PDF format. Programmatically generated pathway diagrams can be saved in scalable vector graphics (SVG) format (Fig. 5). Various lists can be obtained, e.g. a list of all protein identifiers involved in a pathway or reaction. The entire database is available as a MySQL dump, and the website can be installed locally. Downloads for the data entry tools used by authors and curators are also available.

Fig. 5. Pathway diagram in SVG format, generated for FasL/CD95L signaling

8 Discussion

There are a number of other human pathway resources available in the public domain. HumanCyc [9] [10] is a database with a focus on metabolism, as much of it is computationally created on the basis of the EcoCyc [11] template. Its data model is very similar to the Reactome data model. KEGG [5] is a curated database covering metabolic reactions and signal transduction pathways. However, different data models are used to describe these, and therefore no connections can be made between the entities involved across metabolism and signalling. Another drawback is the reliance on Enzyme Commission (EC) numbers for connecting catalysts to metabolic reactions, which can lead to ambiguous or incorrect assignments. Panther Pathways [12] is a collection of curated signaling pathways with a data model similar to that of Reactome. It differs, though, in its data acquisition approach, which consists of more rapid but shallower curation. BioCarta (http://www.biocarta.com) and GenMAPP [13] are human pathway resources with an emphasis on data visualisation. Finally, there are the protein interaction databases like BIND [14], MINT [15] and IntAct [16], whose emphasis is on collecting high-throughput protein interaction data rather than describing the ‘mechanics’ of reactions and pathways.

When comparing the Reactome database to these other resources, it is unique in covering a wide variety of pathways found in the cell and in using a uniform data model across these pathways. This enables the user to look at proteins within an entire network of interactions rather than isolated pathways only, allowing the identification of connections that may be missed otherwise.

In conclusion, Reactome is a curated database of biological processes, describing reactions in a systematic, computationally accessible format. Reactome data are cross-referenced extensively to ensure good integration with other publicly available databases. User-supplied data can be uploaded and interpreted within the Reactionmap. All Reactome data are freely available and can be downloaded in a variety of data formats.

References

1. Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Science and Society: A 2003 Primer (2003)

2. Wu, C.H., Apweiler, R., Bairoch, A., Natale, D.A., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Mazumder, R., O'Donovan, C., Redaschi, N., Suzek, B.: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucl. Acids Res. 34 (2006) D187-D191

3. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25 (2000) 25-29

4. Birney, E., Andrews, D., Caccamo, M., Chen, Y., Clarke, L., Coates, G., Cox, T., Cunningham, F., Curwen, V., Cutts, T., Down, T., Durbin, R., Fernandez-Suarez, X.M., Flicek, P., Graf, S., Hammond, M., Herrero, J., Howe, K., Iyer, V., Jekosch, K., Kahari, A., Kasprzyk, A., Keefe, D., Kokocinski, F., Kulesha, E., London, D., Longden, I., Melsopp, C., Meidl, P., Overduin, B., Parker, A., Proctor, G., Prlic, A., Rae, M., Rios, D., Redmond, S., Schuster, M., Sealy, I., Searle, S., Severin, J., Slater, G., Smedley, D., Smith, J., Stabenau, A., Stalker, J., Trevanion, S., Ureta-Vidal, A., Vogel, J., White, S., Woodwark, C., Hubbard, T.J.: Ensembl. Nucl. Acids Res. 34 (2006) D556-561

5. Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M.: The KEGG resource for deciphering the genome. Nucl. Acids Res. 32 (2004) D277-280

6. Chen, F., Mackey, A.J., Stoeckert, C.J. Jr., Roos, D.S.: OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucl. Acids Res. 34 (2006) D363-368

7. Li, L., Stoeckert, C.J. Jr., Roos, D.S.: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13 (2003) 2178-2189

8. Hucka, M., Finney, A., Sauro, H.M., Bolouri, H., Doyle, J.C., Kitano, H., Arkin, A.P., Bornstein, B.J., Bray, D., Cornish-Bowden, A., Cuellar, A.A., Dronov, S., Gilles, E.D., Ginkel, M., Gor, V., Goryanin, I.I., Hedley, W.J., Hodgman, T.C., Hofmeyr, J.H., Hunter, P.J., Juty, N.S., Kasberger, J.L., Kremling, A., Kummer, U., Le Novere, N., Loew, L.M., Lucio, D., Mendes, P., Minch, E., Mjolsness, E.D., Nakayama, Y., Nelson, M.R., Nielsen, P.F., Sakurada, T., Schaff, J.C., Shapiro, B.E., Shimizu, T.S., Spence, H.D., Stelling, J., Takahashi, K., Tomita, M., Wagner, J., Wang, J.: The Systems Biology Markup Language (SBML): A Medium for Representation and Exchange of Biochemical Network Models. Bioinformatics 19 (2003) 524-531

9. Caspi, R., Foerster, H., Fulcher, C.A., Hopkinson, R., Ingraham, J., Kaipa, P., Krummenacker, M., Paley, S., Pick, J., Rhee, S.Y., Tissier, C., Zhang, P., Karp, P.: MetaCyc: A multiorganism database of metabolic pathways and enzymes. Nucl. Acids Res. 34 (2006) D511-D516

10. Romero, P., Wagg, J., Green, M.L., Kaiser, D., Krummenacker, M., Karp, P.D.: Computational prediction of human metabolic pathways from the complete human genome. Genome Biol. 6 (2005) R2

11. Karp, P.D., Riley, M., Paley, S.M., Pellegrini-Toole, A.: EcoCyc: an encyclopedia of Escherichia coli genes and metabolism. Nucl. Acids Res. 24 (1996) 32-39

12. Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., Guo, N., Muruganujan, A., Doremieux, O., Campbell, M.J., Kitano, H., Thomas, P.D.: The PANTHER database of protein families, subfamilies, functions and pathways. Nucl. Acids Res. 33 (2005) D284-D288

13. Dahlquist, K.D., Salomonis, N., Vranizan, K., Lawlor, S.C., Conklin, B.R.: GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat. Genet. 31 (2002) 19-20

14. Bader, G.D., Betel, D., Hogue, C.W.: BIND: the Biomolecular Interaction Network Database. Nucl. Acids Res. 31 (2003) 248-250

15. Zanzoni, A., Montecchi-Palazzi, L., Quondam, M., Ausiello, G., Helmer-Citterich, M., Cesareni, G.: MINT: a Molecular INTeraction database. FEBS Lett. 513 (2002) 135-140

16. Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., Mudali, S., Kerrien, S., Orchard, S., Vingron, M., Roechert, B., Roepstorff, P., Valencia, A., Margalit, H., Armstrong, J., Bairoch, A., Cesareni, G., Sherman, D., Apweiler, R.: IntAct: an open source molecular interaction database. Nucl. Acids Res. 32 (2004) D452-455