Department of Computing Science, and the Division of Infection and Immunity, Institute of Biomedical and Life Sciences
The Development of Data Standards and a Database to Aid Proteomic Research
Andrew Jones
Submitted for the degree of Doctor of Philosophy in Computing Science at the University of Glasgow
October 2004
© 2004, Andrew Jones
The thesis reports new developments in the area of database support for proteomics exper-
iments. We have developed a proposal for a data standard that will facilitate sharing and
archival of data. We have also developed a database implementing the standard, which is a
prototype of a public repository capable of storing large volumes of data. Our technology
allows for the integration of results from both microarrays and proteomics. The database
has been evaluated in the context of two investigations performed by collaborating biolo-
gists. We have demonstrated that our technology enables the discovery of new results by
facilitating complex queries and providing novel visualisations of experimental data.
Thesis statement
This work will highlight the requirements of proteomic research for standard formats and
centralised databases that allow results to be well annotated and queried. We have developed
a proposal for a data standard, and a prototype of a public repository, and the thesis will
demonstrate how they facilitate the research process.
Declaration
I declare that this thesis describes my own work, that it has not been accepted in a pre-
vious application for a degree, and that all sources of information have been specifically
acknowledged. The work reported in Chapters 4 (FGE-OM) and 5 (RAPAD) was initiated
during a two-week period I spent at the Computational Biology and Informatics Laboratory,
University of Pennsylvania, working with Prof. Chris Stoeckert and Angel Pizarro. During
the two weeks, the framework for FGE-OM was developed and the SQL database schema
for RAPAD was designed. The subsequent development of RAPAD, including refinements
to the schema, the creation of the web interface and software for data visualisation, was
performed by myself at the University of Glasgow.
Chapter 3 contains a revised version of material published in [176]. The material in Chapter
4 has been revised from [175].
Andrew Jones
Thesis Overview
There is a new research paradigm in molecular biology in which large data sets are obtained
about genes and proteins, and the results enable researchers to formulate new hypotheses
about the system they are studying. This methodology reverses the classical approach,
in which an experiment is designed to test a hypothesis. The field of research is collectively
known as functional genomics, as researchers attempt to assign functions to all genes that
can be discovered in the genome sequence. The experiments can also give insights into the
factors that are crucial in particular processes, such as disease, by discovering the differences
between results from a diseased sample and a normal sample. The methods that investigate
protein abundance, interactions and localisation on a large scale are known as proteomics.
Proteomic investigations present significant computational challenges because data sets are
very large and contain heterogeneous information from different laboratories, which could
be useful to researchers working in a variety of domains. The thesis will describe proposals
for data standards for proteomics, and a new relational database, which will alleviate some
of the computational challenges presented by the experiments. The proposals for a standard
should ensure that proteome data can be archived and will be accessible to querying in the
future.
Chapter 1 will describe the experimental techniques of functional genomics, three case
studies of proteomic research and the requirements for central databases and standardisation.
There has been significant work in both bioinformatics and computing science research to
improve methods for making data accessible and open to a wide range of queries, which will
lead to the next generation of the Web. Chapter 2 will focus on the new developments in
computing science, and will cover previous work on data standards for life sciences that allow
information to be exchanged between research groups and deposited in central databases.
There are a large number of databases for functional genomics that have different capabilities
and access methods. The chapter will present the challenges in data integration that arise
from the number of different systems that exist. An area that has attracted much recent
attention in computing science is ontology development. Ontologies are structured controlled
vocabularies of terms with definitions that describe a domain in a way that ensures there is
a shared understanding of the concepts by different people. An ontology can also contain
rules associated with the terms that allow computer systems to ask logical questions of the
relationships between different parts of a data set. Chapter 2 will describe the ontologies
that currently exist for life sciences.
Chapter 3 will focus on the standardisation of data formats for proteomics. There will
be a description of the previous work in this area, which consists of an object model¹ to
describe the experimental methodology. We have developed an alternative proposal for a
data standard, which was released in October 2003 to describe additional information that
should be captured in a community standard. It is essential that the finalised standard
contains sufficient description of the results, and the methods that were used to obtain data,
to ensure that future re-evaluation and statistical analysis is possible. The chapter will
describe our proposal and will give an overview of the current progress towards a community
accepted standard for proteomics.
There is an established data standard for gene expression studies using microarrays. It
is becoming feasible for researchers to perform both proteomic and microarray investiga-
tions on the same starting samples. In other cases, the results from different investigations
using microarray or proteomic techniques could be integrated, leading to a much better un-
derstanding about the genes and proteins that are important in the sample conditions. We
believe that microarray and proteomic data sets could be integrated more easily, and queried
in parallel, if they have a single shared data standard. Therefore, we have integrated the
microarray standard with the current models of proteomic data to form a single proposal for
a data standard, known as FGE-OM (Functional Genomics Experiment - Object Model),
which will be described in Chapter 4. Chapter 4 will also contain a discussion of the impor-
tance of using ontologies to describe the experimental protocols, to allow future comparison
and querying of different data sets.
We have developed a database for storage of proteomic results, experimental protocols
and details of the biological samples on which the experiments were performed, known as
RAPAD (RNA And Protein Abundance Database), which will be described in Chapter
5. RAPAD is an extension of a microarray database system developed at the University
of Pennsylvania. We have extended a microarray database into proteomics because we
hypothesise that data integration across the two fields will be facilitated if the technologies
are captured in a shared database schema and they have a similar user interface. There is
a very close correspondence between FGE-OM and RAPAD, described in Chapter 5, which
allows RAPAD to be used to test that FGE-OM correctly captures the data semantics.
RAPAD also acts as a prototype of a public repository, and demonstrates that proteome
data can be visualised and queried in complex ways using real data sets. Two investigations
are supported by the current implementation of RAPAD, which will be described in Chapters
6 and 7. The investigations allow the core facilities of the database to be evaluated.
¹ An object model is a platform independent notation for describing a software system. The importance of object models for developing data standards will be described in Chapter 2.
Chapter 6 will describe how the database assists an investigation performed in the labo-
ratory of Dr Jonathan Wastling at the Institute of Biomedical and Life Sciences, University
of Glasgow. The investigation aims to discover the proteins that are differentially expressed
in a human cell culture when invaded with the intracellular parasite Toxoplasma gondii, com-
pared with non-invaded cells. The results will enable a better understanding of host-parasite
interactions. The chapter will demonstrate how gene expression and protein abundance
values have been compared in practice.
There will be a description of another project at the Institute of Biomedical and Life
Sciences, which is supported by RAPAD, in Chapter 7. The project is attempting to cat-
alogue all the expressed proteins in the disease-causing parasite Trypanosoma brucei, using
a variety of experimental techniques. The genome sequence is nearing completion but the
level of functional annotation is poor. The proteome catalogue facilitates the genome an-
notation, and the experiments give insights into the dynamic nature of proteins within the
system. Chapter 7 will describe visualisation software written by the author that allows new
conclusions to be drawn from the results.
Chapter 8 will summarise and extend our arguments on standardisation, ontologies and
archiving of data in public repositories. There will be a comparison of our approach with
alternative methods that could have been employed. There will be a description of the work
that is still required to solve the research challenges that follow directly from the thesis, and
a summary of our contribution.
There are four appendices at the end of the thesis. The first, Appendix A, will describe
an investigation performed by the author into indexing large collections of biological data
represented in Extensible Markup Language (XML), as an alternative to relational database
storage. Appendix B contains detailed diagrams of FGE-OM, which supplement the work
presented in Chapter 4. The RAPAD database schema is included in Appendix C. Finally,
Appendix D will describe how difference gel electrophoresis data can be represented in Gla-
PSI, FGE-OM and RAPAD.
Acknowledgements
I give thanks to my supervisors Ela Hunt and Jonathan Wastling. Throughout my PhD, Ela
has given me great support, spending inordinate lengths of time discussing ideas, reading
my work, and giving me encouragement to persevere with my ideas. At the outset of my
research, Jonathan’s enthusiasm was infectious, which gave me great interest in the subject.
I am very grateful to the MRC for funding my research through first an MRes degree, and
then the PhD.
I would like to thank Chris Stoeckert for giving me the opportunity to visit his lab in
Philadelphia, and thanks to Angel Pizarro for giving up so much of his time while I was
there. The time spent in Philadelphia provided a big impetus for my work, for which I am
very grateful. My thanks also to Mike Turner for giving valuable feedback on my work. I give
thanks to Morag Nelson and Anne Faldas, who generated the data I have used in Chapters
6 and 7, for taking time to explain their experiments, for trying out all my software and for
appearing interested when I talk about databases!
Finally, my biggest thanks to my partner Clare, for all her love and support.
List of Figures
1.1 A conceptual view of the data flow in functional genomics.
1.2 The data flow in a proteomics experiment.
1.3 A sample image from 2-DE separation of proteins from Toxoplasma gondii
1.4 A schematic of a difference gel electrophoresis experiment.
1.5 An MS trace viewed with Voyager software [339].
1.6 A sample trace from tandem mass spectrometry
1.7 Two dimensional liquid chromatography coupled with MS
1.8 The ICAT method for quantitative proteomics
1.9 A two dimensional gel highlights possible different phosphorylation states of Protein disulfide isomerase
1.10 A summary of the technique involved in the creation of Affymetrix microarrays
1.11 A summary of Yeast Two-Hybrid experiments
1.12 Affinity methods for assaying protein interactions
2.1 A partial record from the PIR database, in the native PIR format.
2.2 A partial record from the PIR database, released in XML format.
2.3 An example partial PIR record stored in a relational database
2.4 The main components of a UML class diagram for a hospital computer system.
2.5 The top level of MAGE-OM
2.6 The BioMaterial package in MAGE-OM
2.7 A screenshot of the Protege editor displaying the Gene Ontology for Yeast.
2.8 The entry for actin in the Gene Ontology, displayed in the AmiGo browser
3.1 The data flow in a proteomics experiments. The parts of the analysis covered by Gla-PSI are boxed.
3.2 The complete PEDRo model represented in UML
3.3 The classes that record biological samples in PEDRo
3.4 The part of PEDRo covering protein separation techniques
3.5 The model of MS ionisation and protocol in PEDRo
3.6 MS data and database searches modelled in PEDRo
3.7 The complete Gla-PSI object model represented as a UML class diagram.
3.8 A model of 2-DE data, and a scanned gel image.
3.9 The classes capture data from image analysis applications, including multiple analysis across a number of gels.
3.10 The relationship between spot data (Spot) and identified proteins (Protein)
3.11 Classes for storing difference gel electrophoresis data.
3.12 The part of Gla-PSI modelling statistical analysis of a proteomics experiment.
3.13 Several classes are subclasses of Identifiable
3.14 A draft version of the main components of PSI-OM.
3.15 Part of PSI-OM showing the relationships between spots identified on a gel and the corresponding protein records.
3.16 A draft version of the protein data model in PSI-OM
4.1 A time line displaying the emergence of microarray and proteomics technology, and the efforts to standardise data formats.
4.2 An overview of the FGE-OM object model. The model is divided into three namespaces: BioOM, ArrayOM and ProteomicsOM.
4.3 A screenshot of the term “Age” in the MGED Ontology viewed with OilEd.
4.4 A complete listing of the packages within FGE-OM.
4.5 The packages and classes in the BioOM namespace of FGE-OM
4.6 The packages in the ArrayOM namespace
4.7 The ProteomicsOM namespace.
4.8 The ProteinSeparation package
4.9 The ProteomeBioAssay package
4.10 The ProteinData package.
4.11 The model of MS data and protocols, adapted from PEDRo.
4.12 The ProteinRecord package.
4.13 A workflow for a proteomics experiment involving 2-DE or liquid chromatography to separate proteins, followed by MS to identify proteins
4.14 A subset of classes in the QuantitationType package from SysBio-OM
4.15 The CommonBioAssayData package from SysBio-OM
4.16 The top image shows a small subset of classes from the Measurement package in SysBio-OM, the lower is the Measurement package in FGE-OM.
4.17 The Protocol package from SysBio-OM
4.18 The BioMaterial package from SysBio-OM.
4.19 The BioAssay package from SysBio-OM.
5.1 A summary of several workflows in functional genomics to illustrate the requirements for data integration.
5.2 A mapping from classes in FGE-OM to database tables in RAPAD.
5.3 The architecture of RAPAD.
5.4 The user interaction with RAPAD for entering a 2-DE experiment.
5.5 The interface for entering protocol information into RAPAD.
5.6 A web page for specifying sources of biological materials
5.7 A summary of the database schema for storing information about the design of a study
5.8 The database schema for protein separation techniques and the relationships to the BioAssayTreatment table.
5.9 Screenshots for loading 2-DE, scanning and image analysis data into RAPAD
5.10 The tables present in the database schema store data from gel spots, image analysis and the scanning of a 2-D gel
5.11 The database schema for linking protein records to gel spots
5.12 The database schema for mass spectrometry, adapted from PEDRo.
5.13 A screen shot of the 2-D Gel Viewer that provides search capabilities over protein data and links to MS results
5.14 A form for entering annotation about a gel spot and linking to protein records
5.15 A table displaying all the proteins identified on a single gel.
5.16 The query interface for searching for specific protein records.
6.1 The process of matching microarray data to protein abundance data.
6.2 Output from GoMiner, displaying the GO tree browser open for the gene
6.3 Output from FatiGO showing the classification of up and down-regulated proteins in the Biological Process branch of GO
6.4 The interface for visualising spots across replicate gels
6.5 The interface for displaying data combined across replicates
6.6 The protein record for Cathepsin B in RAPAD has external links to various databases.
6.7 The table in RAPAD displaying protein abundance and gene expression values
6.8 Spots matched to vimentin from infected and non-infected samples
6.9 Spots matched to actin beta from infected and non-infected samples
6.10 Superoxide dismutase from infected and non-infected samples
6.11 Potential PTMs of protein disulphide isomerase
6.12 The result of a search for potential post-translational modification of protein disulphide isomerase.
6.13 A summary page displays all the gels present in the experiment, and a link exists to display the experimental protocols used for each gel.
7.1 The life cycle of Trypanosoma brucei
7.2 An electron micrograph of the bloodstream form of Trypanosoma brucei
7.3 The span of peptides that have been matched within a protein sequence
7.4 Protein spots matched to β-tubulin, overlaid with a graphic displaying the span of peptide hits
7.5 Protein spots matched to α-tubulin, overlaid with a graphic displaying the span of peptide hits
7.6 Protein spots matched to five different Elongation Factors
7.7 Protein spots matched to Elongation factor 1-α
7.8 Protein spots matched to EF-β and EF (putative) are displayed with the corresponding span of peptide hits
7.9 The span of peptide hits for protein spots matched to Elongation Factor 2
7.10 A multiple alignment of five Hsp 70 protein sequences from T. brucei
7.11 Protein spots matched to five different Hsp70 protein sequences
7.12 The interface for publishing T. brucei proteome data
7.13 A search using the Gel Viewer reveals 100 proteins, annotated as “hypothetical”
7.14 The protein spots that have been matched to different hypothetical proteins
7.15 Four spots containing arginine kinase. The MS results for spots 575 and 535 reveal possible modifications
7.16 There are four spots that match initiation factor 5, of which possible modifications were found for spots 554 and 575
8.1 A possible model for future data sharing and exchange
A.1 Index A has four components: the Data Path Tree, Data Stores, XML Locater Lists and an XML Dictionary.
A.2 Index B has four components: the Data Path Tree, Data Stores, the Structure Container and the XML Dictionary (not shown).
A.3 The method used to implement a join query in Index B is implemented in a six stage algorithm.
A.4 A prototype interface for querying an indexed store of XML data.
List of Tables
1.1 Software available for image analysis of 2-D gels.
1.2 Software available for searching mass spectrometry data.
2.1 Summary table displaying features of microarray databases
3.1 A summary of the interviews held with researchers to formulate an understanding of proteomics research.
6.1 The correspondence between gene and protein abundance for HFF cells infected with T. gondii
A.1 Build times in seconds for Index A and B for four different sizes of data set
A.2 Summary of query timings for Index A, values are time in seconds
A.3 Summary of query timings for Index B, with different caching procedures.
D.1 Experimental plan for Cy labelling of proteins in the DIGE experiment
Commonly used abbreviations
2-DE - Two dimensional gel electrophoresis
API - Application Programming Interface
cDNA - complementary DNA
EST - Expressed Sequence Tag
FG - Functional Genomics
FGE-OM - Functional Genomics Experiment Object Model
Gla-PSI - Glasgow proposal for the Proteomics Standards Initiative
GO - Gene Ontology
HUPO - Human Proteome Organisation
IPG - Immobilized pH Gradient
LC-MS - Liquid Chromatography-Mass Spectrometry
LIMS - Laboratory Information Management System
MAGE-ML - Microarray and Gene Expression Markup Language
MAGE-OM - Microarray and Gene Expression Object Model
MALDI - Matrix-Assisted Laser Desorption Ionisation
MGED Society - Microarray Gene Expression Data Society
MIAME - Minimum Information About a Microarray Experiment
MIAPE - Minimum Information About a Proteomics Experiment
MO - MGED Ontology
mRNA - messenger RNA
MS - Mass Spectrometry
MW - Molecular weight
NMR - Nuclear Magnetic Resonance
PEDRo - Proteomics Experiment Data Repository
pI - Isoelectric point
PSI - Proteomics Standards Initiative
PSI-OM - Proteomics Standards Initiative Object Model
PSI-Ont - Proteomics Standards Initiative Ontology
PTM - Post-Translational Modification
RAD - RNA Abundance Database
RAPAD - RNA And Protein Abundance Database
RDF - Resource Description Framework
RDMS - Relational Database Management Systems
RNAi - RNA interference
SAGE - Serial Analysis of Gene Expression
TOF - Time Of flight
UML - Unified Modeling Language
URI - Uniform Resource Identifier
URL - Uniform Resource Locator
W3C - The World Wide Web Consortium
XMI - XML Metadata Interchange
XML - Extensible Markup Language
Chapter 1
Investigations in Functional Genomics
1.1 Introduction
In recent years, the sequencing of the human genome has gained much deserved publicity
[164, 334]. The sequence of man, and all the model organisms, has generated a vast amount
of information about the basis of life at the molecular level. This was only possible due
to progress in the way in which DNA sequencing is performed [276, 305], and the work of
bioinformaticians to produce software that can assemble the huge genome sequences, find
genes and determine similarity between genes in different organisms. We can state to a
reasonable level of accuracy how many genes there are in man (23758 genes are currently
predicted in Ensembl [94]), mouse (26762 in Ensembl), and yeast (approximately 6,000 [190])
and new genomes can be sequenced on relatively short time scales. However, the genome
sequence is only a starting point: the actual DNA sequences comprising the genes tell us
nothing about how living systems function, and what happens when they go wrong, causing
disease. This knowledge requires information about the molecular function performed by
the proteins encoded by every gene, the interaction partners for the proteins, and the subtle
changes that are propagated to the whole system when a protein malfunctions, or is not
present. One of the most conclusive arguments about how far there is to go in molecular
biology is provided by the surprisingly small difference in the total number of genes between
the nematode worm Caenorhabditis elegans (about 20000 [353]) and humans (20000 - 40000
depending on different estimates) [59]. C. elegans contains only 959 cells and the difference
in biological complexity between it and man is vast, yet this is not caused by the number of
genes. We must ask how such a small number of genes in humans gives rise to the number of
different cell types, the complex development of organs and ultimately the intricacy of brain
circuitry that leads to consciousness. The answer must lie in several phenomena: the actual
number of functional proteins being far larger than the number of genes, caused by differential
splicing creating multiple products from a single gene; modifications to proteins that alter
their function; protein interactions giving rise to complex new functions not achieved by
single proteins; and exquisite regulation of when and where genes are expressed. Therefore,
simply assigning a single function to a gene is a major oversimplification, as it fails to
capture the richness of the whole system, including the possibility for a gene to encode more
than one protein. Furthermore, each protein form may have several different functions in
different physiological locations.
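The one-to-many relationships described above (one gene, several protein forms, several functions per location) can be made concrete with a small data structure. The following is only an illustrative sketch: the class names `Gene` and `ProteinForm`, the method `all_functions` and all of the example values are hypothetical placeholders introduced here, and do not correspond to any schema or data described in the thesis.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ProteinForm:
    """One product of a gene, e.g. a splice variant or a modified form."""
    name: str
    # Maps a physiological location to the functions observed there.
    functions_by_location: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class Gene:
    """A gene annotated with every protein form it is known to encode."""
    identifier: str
    forms: List[ProteinForm] = field(default_factory=list)

    def all_functions(self) -> Set[str]:
        """Collect the distinct functions across all forms and locations."""
        return {
            function
            for form in self.forms
            for functions in form.functions_by_location.values()
            for function in functions
        }


# Placeholder data, invented purely to exercise the structure.
gene = Gene(
    identifier="geneX",
    forms=[
        ProteinForm("geneX form 1", {"cytoplasm": ["function A"]}),
        ProteinForm(
            "geneX form 2",
            {"nucleus": ["function B"], "membrane": ["function A", "function C"]},
        ),
    ],
)
```

A flat gene-to-function mapping would collapse the two forms and three locations in this example into a single entry, which is the oversimplification the text warns against.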
1.1.1 Experimental methodology
A number of new experimental approaches that perform large scale analysis of
systems have arisen as a result of technological developments, and are collectively known as
functional genomics (FG). FG involves the analysis of very large data sets, to find the genes
or proteins that are implicated in disease processes or the changes that result from external
stimuli, and to aid efforts to annotate all genes with information about their biological
function. The workflow displayed in Figure 1.1 gives an overview of how different experiments
can be used to gain insights into gene function. FG includes studies that determine gene
expression, protein abundance (in proteomics), protein localisation and others. The different
methods can be classified into seven categories [360], which can be used to assign a function
to a protein by investigating:
• The extent of expression of a protein under different conditions and in different loca-
tions.
• The interaction partners for a protein.
• The gene neighbourhood, including any co-expressed genes, such as bacterial operons.
• The phenotype of the gene knockout.
• The biochemical activity of the protein once isolated.
• Any post-translational modifications that are observed.
• The three dimensional structure of the protein.
[Figure 1.1: A conceptual view of the data flow in functional genomics. Workflows shown: global gene expression, global protein expression, positional expression profile and metabolite profile.]
The experiments present significant computational challenges due to the vast sizes of data
sets, the heterogeneity in the information generated by each different lab, and the frequency
at which new laboratory techniques are developed. It is vital that functional genomics data
sets can be integrated and adequately queried, linked to gene databases, and exchanged
between research groups [322]. This requires: (i) the development of new database
technologies, and (ii) standard formats to which published data must adhere. The focus for the
work presented in the thesis is to address these two questions for proteomic studies.
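As a rough illustration of what integrating data sets through links to gene databases can mean in practice, the sketch below joins two sets of measurements on a shared gene identifier. It assumes that both data sets have already been mapped to the same identifiers; the function names `integrate` and `measured_in_both`, the identifiers and the ratio values are all hypothetical and are not taken from the database described in the thesis.

```python
from typing import Dict, List, Optional, Tuple

Ratios = Dict[str, float]
Merged = Dict[str, Tuple[Optional[float], Optional[float]]]


def integrate(expression: Ratios, abundance: Ratios) -> Merged:
    """Join two measurement tables on a shared gene identifier.

    Each gene maps to (expression ratio, abundance ratio); a value is
    None when the gene was not measured in that data set.
    """
    merged: Merged = {}
    for gene_id in sorted(set(expression) | set(abundance)):
        merged[gene_id] = (expression.get(gene_id), abundance.get(gene_id))
    return merged


def measured_in_both(merged: Merged) -> List[str]:
    """Return the identifiers that have a value in both data sets."""
    return [
        gene_id
        for gene_id, (expr, abund) in merged.items()
        if expr is not None and abund is not None
    ]


# Placeholder ratios, invented for illustration only.
microarray = {"gene_a": 2.0, "gene_b": 0.5}
proteomics = {"gene_b": 0.8, "gene_c": 1.5}

merged = integrate(microarray, proteomics)
shared = measured_in_both(merged)
```

The join is only possible because both inputs use the same identifiers, which is the role a standard format plays when data come from different laboratories.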
1.1.2 Systems biology
Systems biology is a new research area in the life sciences that attempts to understand all
the components and interactions that comprise an entire system. Systems biology
and functional genomics are not synonymous but there is a large overlap between the two
domains. Functional genomics is the acquisition of data about the function of genes on a
large scale, using various technologies. Systems biology is the discipline of trying to order
all the available information into an understanding about how components interact. One
of the main sources of data can be from functional genomics studies, although that alone
is not enough to build up a complete picture of the system. Critically, in many functional
genomic studies there is no information about causality. If a group of genes is up-regulated
under a particular biological condition, it is not possible to say if the genes are regulated
in response to the condition, or if the condition is caused by the change in gene regulation
[188]. A complete understanding of metabolic pathways requires experiments that assay the
biochemical reactions, such as the flux in the pathways under a certain condition, compared
with the steady state. New technological advances will enable single molecule measurements
and visualisation of molecular interactions that will be crucial to systems biology, by allow-
ing researchers to derive insights into cellular processes at previously impossible resolution.
These new technologies will require significant database support.
1.1.3 Overview
The scope of our work is restricted to developing technology to aid functional genomics
research. The main focus is the development of a database and a data standard for the
proteomic techniques that are used to detect and measure the abundance of proteins in
complex samples, and to integrate these data with results from other types of experiments.
In this chapter, the main techniques in functional genomics research are described, along
with the computational challenges they present. An outline of the experimental techniques
in proteomics, and three case studies that have been performed, are given in Section 1.2. The
experimental techniques that measure gene expression are described in Section 1.3. Other
types of functional genomics research are described in Section 1.4, and a summary of major
functional genomics investigations is given in Section 1.5.
1.2 Proteomics
The proteome of a sample is the complete set of expressed proteins in a sample of interest,
or the entire set of proteins that could be found in an organism. The term “proteomics”
was first used in the mid 1990s to refer to a newly emerging approach of analysing large
numbers of proteins expressed in a sample [345, 349]. Knowledge of the proteins expressed
in a sample can aid understanding of the entire system if the functions of proteins are well
understood. Alternatively, proteomics experiments can give insights into the functions of
proteins that have little annotation, for example if a protein is strongly expressed in one
condition compared with another [362]. Researchers aim to define the proteome of a cellular
sample, tissue, organ or organism using various techniques. The proteome is highly dynamic:
the volumes of different proteins change, proteins are translocated to different organelles,
chemical modifications alter the behaviour of proteins and protein-protein interactions give
rise to complex new functions. Researchers are often limited to taking a snapshot of the
system at one time, but as the size of data sets continues to increase, it will be possible to
gain a more complete understanding [137]. Data sets produced by different laboratories may
comprise heterogeneous file formats produced from different sources, which are difficult to
compare; the requirements for bioinformatics support therefore continue to grow. Data sets
must be made publicly accessible, and software must be designed that allows researchers
to perform detailed re-analysis of data, using various statistical packages. This area is the
focus of Chapter 3, which describes our work on the development of a standard data format
for proteomics. A second issue is that there are currently no major public databases for
publishing proteome data sets, although several are in development. In Chapter 5, there
is a description of a database for proteomics that we have developed as a prototype for a
public repository. The database supports two on-going projects at the University of Glasgow,
described in Chapters 6 and 7.
The emergence of proteomics has been achieved through the development of new
technologies, although one of the most commonly used approaches is still protein
separation by two dimensional gel electrophoresis (2-DE). 2-DE was first developed in the 1970s,
and pioneered in the 1980s by Angelika Görg and colleagues [136], and while 2-DE techniques
have improved, the experimental basis remains the same today [135]. The main technique
for identifying proteins is mass spectrometry (MS), in which there have been major technical
advances, coupled with the development of software, enabling clear identification of proteins,
even in mixed samples. Gel based proteomics is described in Section 1.2.1.
MS techniques are outlined in Section 1.2.2, other proteomics techniques are described in
Section 1.2.3, and investigations into post-translational modifications are outlined in Section
1.2.4.
1.2.1 Gel based proteomics
The majority of proteomics experiments involve a stage of protein separation, followed by a
technique for identifying proteins once isolated from the mixture. One of the most common
processes is the use of gel electrophoresis, coupled with mass spectrometry. Figure 1.2
displays a workflow from an experiment to determine the abundance of a large set of proteins.
Initially, proteins are extracted from a starting sample and solubilised using a protocol that
is dependent upon the origin of sample and the technique used. Proteomics is not restricted
to a particular area of the life sciences, but can be performed on almost any type of biological
substance, such as microbial cultures, tissues, organs, whole organisms and environmental
samples.
Figure 1.2: The data flow in a proteomics experiment. [Diagram labels recovered from the figure: Experiment Design; Samples A, B and C; Protein Solubilisation; 2D-PAGE; Image Analysis; Compare abundance across gels; Statistical Analysis; Digest with trypsin; Mass Spectrometry (MALDI, MS/MS); Sequence Database Search; Protein Identification; Add protein ID to abundance data; Protein Expression Profile. Each gel yields a table of spots with the columns ID, Vol, X, Y and Protein. Legend: Sample Flow, Data Flow.]
Figure 1.3: A sample image from 2-DE separation of proteins from Toxoplasma gondii (courtesy of A. M. Cohen). [Axis labels: pH 4 to pH 7; MW.]
The solubilised protein mixture is applied to an IPG (Immobilized pH Gradient) strip
and an electric current is applied. A protein migrates to a specific position in the pH gradient
where it has no net charge, in a process known as isoelectric focusing. In the second dimen-
sion, the strip is placed on top of a polyacrylamide gel1 and a second current is applied. The
gel contains a denaturing agent, such as SDS, which causes the three dimensional structure
of the protein to unfold, and gives each protein a net negative charge. In this dimension
the proteins migrate into the gel to a distance that is dependent on their molecular weight.
Smaller proteins migrate furthest and tend to appear at the bottom of gels in most images.
The proteins can be visualised by staining (Figure 1.3). Different IPG strips can be used
to separate proteins with different pI (isoelectric point) values, for example a standard IPG
strip may separate on a pH 4 - 7 gradient. To achieve finer resolution of spots, a pH 5.5
- 6.5 gradient more accurately resolves spots with a pI value in this range. Proteins
with pI values at the extremes of the pH gradient may not be observed on a 2-D gel.
This issue is discussed in the Limitations section.
1 The abbreviation 2D-PAGE (Two dimensional PolyAcrylamide Gel Electrophoresis) is often used in the literature.
Image analysis and quantification of protein spot volume
A 2-D gel can be stained to visualise proteins, using Coomassie blue or silver (discussed
below), and scanned with a flat bed scanner. The scanned image is analysed with specialised
software that detects properties of protein spots, including their coordinates within the image
and an estimate of the volume of protein in the gel. Coordinates are usually specified as the
central point of a circular spot with a particular diameter, or as a set of boundary points
that specify the exact shape of the spot in two dimensions. The volume is estimated from the
darkness of each pixel within the spot. Different software packages have different methods
for quantifying the volume of protein in a spot, and most apply a strategy to normalise the
values across the gel, or a set of replicate gels. The software can match spots produced on
different gels which correspond to the same protein, and determine the relative difference in
the spot size and intensity across two or more gels. One problem that arises is that there
has been little work comparing the algorithms used for quantifying protein spots, or on the
relationship between the amount of visible spot and the actual volume of protein, which
is dependent upon the stain used. Generally fluorescent dyes give the best sensitivity and
linearity. Other stains include Coomassie blue and silver staining. Silver stains allow lower
volumes of protein to be visualised, but there is poor digestion of silver stained proteins with
trypsin and the stains are notoriously non-linear. Coomassie blue offers reasonable linearity
[200] and is widely used due to low cost, although it is less sensitive than either silver staining
or fluorescent dyes.
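As a rough illustration of the quantification step described above, the following sketch estimates the volume of a circular spot by summing the darkness of the pixels within its radius, and then normalises each spot to the total volume on the gel. The image representation (rows of darkness values, with 0 as background), the function names and the choice of total-volume normalisation are assumptions made for illustration only; the commercial packages use their own algorithms, which are often unpublished.

```python
def spot_volume(image, cx, cy, radius):
    """Sum pixel darkness inside a circular spot centred at (cx, cy)."""
    total = 0.0
    for y, row in enumerate(image):
        for x, darkness in enumerate(row):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                total += darkness
    return total


def normalise(volumes):
    """Express each spot volume as a fraction of the total volume on the gel."""
    total = sum(volumes.values())
    return {spot_id: volume / total for spot_id, volume in volumes.items()}


# Hypothetical 5 x 5 scan containing one dark spot centred at (2, 2).
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 2, 1, 0],
    [0, 2, 4, 2, 0],
    [0, 1, 2, 1, 0],
    [0, 0, 0, 0, 0],
]

# "spot2" is a made-up second spot so that normalisation has something to do.
raw = {"spot1": spot_volume(image, 2, 2, 1), "spot2": 4.0}
relative = normalise(raw)
print(raw, relative)
```

Because the estimate depends on pixel darkness, any non-linearity in the stain carries directly into the reported volume, which is why the choice of stain matters for quantification.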
A goal of computational research is to perform analysis of protein abundance values from
2-D gels produced by different laboratories, as is happening in the microarray field [86].
However, this cannot occur without significant efforts to determine how different software
packages perform gel image analysis. The ProteomeGRID is attempting this kind of anal-
ysis by creating an automated infrastructure for analysing and comparing 2-D gels, using
high performance distributed computing [256]. Large scale analysis of images from different
sources requires software companies to have an open approach to the algorithms or statis-
tical techniques offered by their software, or they must collaborate to create a standardised
output. An alternative would be for researchers to release the original high-resolution scans
of images, in addition to lists of protein volumes, to enable future re-evaluation of large
collections of images in a single analysis. One analysis has been performed to compare the
quality of spot detection in two software packages (Z3 [366] and Melanie 3 [210]) [264]. It
was discovered that both perform reasonably well at detecting spots (approximately 90%
accuracy), and moderately well for detecting ratios of volumes where the ratio is not great
(less than 1:6). A more detailed analysis is required of all the different software packages that
perform image analysis. This work is beyond our scope, but a list of the software packages
available for image analysis is given in Table 1.1.
• ImageMaster published by Amersham Biosciences, http://www.amershambiosciences.com
• Melanie 4 - developed at the Swiss Institute for Bioinformatics, http://ca.expasy.org/melanie/
• DeCyder published by Amersham Biosciences, http://www.amershambiosciences.com
• PDQuest published by Bio-Rad, http://www.bio-rad.com/
• Z3 published by Compugen, http://www.2dgels.com/
• ProGenesis published by Prolific, Inc., http://www.prolificinc.com/progenesis.html
• Delta 2D published by Bio Imaging, http://www.raytest.de/bio imaging/products/delta2D/delta2d.html
Table 1.1: Software available for image analysis of 2-D gels.
In the current situation there is little quality control over protein volume values, so
the values have limited use outside of the original experiment. There have been several
efforts to automate the process of comparing large collections of gel images, such as Veeser
et al. 2001 [331] and Rogers et al. 2003 [272]. These efforts are similar to the comparisons
that are being performed across large numbers of microarrays to detect patterns of gene
expression [86, 319]. However, there are several challenges that must be overcome before large
scale comparisons can be made over 2-D gels. There is variability in the appearance of
gel spots, causing difficulties matching spots across a series of gels [338], different staining
protocols affect the signal strength, and errors can be made in correct protein identification.
A review of current progress in the area of algorithms for detecting and quantifying protein
spots is given by Dowsey, Dunn and Yang [83]. This is an area in which significant future
research is required.
Difference gel electrophoresis
A major new technology in gel based proteomics is two-dimensional difference gel elec-
trophoresis [327], or DIGE2, in which two samples are labelled with different fluorescent
dyes, mixed and separated on a single gel. The gel is scanned at different wavelengths,
2 Ettan DIGE™: Fluorescence 2D Difference Gel Electrophoresis [98], produced by Amersham Biosciences.
creating two images that can be compared. This removes the variability in resolving spots
on different gels thereby improving the matching of spots between gels. The system can
be adapted to use three dyes. The third dye is used to label a mixture of proteins formed
by pooling the two samples in the experiment, to improve normalisation of protein vol-
umes between different images, allowing smaller changes in protein level to be determined
as significant (Figure 1.4) [8].
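The role of the pooled sample in normalisation can be sketched as follows. Each spot volume is divided by the volume of the same spot in the pooled standard on the same gel, so that values from different gels become comparable. The simple ratio, the function names and the example volumes are assumptions for illustration; the analysis software supplied for DIGE may apply further normalisation steps.

```python
import math


def standardised_abundance(sample_volume, pooled_volume):
    """Spot volume relative to the pooled internal standard on the same gel."""
    return sample_volume / pooled_volume


def cross_gel_log2_ratio(vol_1, pooled_1, vol_2, pooled_2):
    """log2 ratio between two samples run on different gels, each corrected
    by the pooled standard that was run on its own gel."""
    ratio = standardised_abundance(vol_1, pooled_1) / standardised_abundance(vol_2, pooled_2)
    return math.log2(ratio)


# Hypothetical case: gel 2 gives twice the signal of gel 1 for every spot.
# Comparing raw volumes suggests a two-fold change; the corrected ratio does not.
raw_log2 = math.log2(400.0 / 200.0)
corrected_log2 = cross_gel_log2_ratio(400.0, 200.0, 200.0, 100.0)
print(raw_log2, corrected_log2)
```

In this made-up case the apparent two-fold difference is entirely a gel-to-gel effect, and it disappears once both volumes are expressed relative to the pooled standard.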
Limitations of gel electrophoresis
There are several limitations of 2-DE technology. Firstly, membrane and nuclear proteins
tend to be highly hydrophobic and difficult to solubilise, therefore they often do not appear
on a gel [3]. Secondly, high molecular weight proteins do not migrate well through gels and
may not be detected. Thirdly, 2-DE tends to detect high abundance proteins and many
functionally important proteins may be present only in small quantities. Finally, it is fairly
common for multiple proteins to co-migrate to the same spot, causing problems quantifying
the volume of individual proteins. However, this limitation can be avoided by the use of
narrow range pH gels, or zoom gels that improve the resolution of gel spots. Another
advance in gel electrophoresis is sample prefractionation. A protocol reported by Zuo and
Speicher in 2002 [370] can resolve complex mixtures of proteins by first separating proteins
into separate pools based on the charge of proteins. Each fraction of the sample is analysed
by 2-DE, performed over several overlapping narrow range pH gradient gels. This technique
allows more low abundance proteins to be detected as there is a general improvement in
spot resolution, and high abundance proteins are less likely to mask or interfere with other
protein spots. The detection of membrane proteins by 2-DE has been improved by systematic
analysis of the different variables and constituents of buffers to maximise the solubility of
membrane proteins, allowing improved loading of the proteins onto gels [277]. A review
of optimised solubilisation procedures for resolving membrane proteins is given by Molloy
[217]. The poor reproducibility of 2-DE is often discussed as a major limitation; however,
the gradual improvements in protocols for the two dimensions mean that reproducibility of
2-DE is now fairly high [317].
Figure 1.4: A schematic of a difference gel electrophoresis experiment. [Diagram labels recovered from the figure: Sample A; Sample B; Sample pooling; Extract proteins; Attach blue label; Attach green label; Attach red label; Recombine samples; Separate by 2-DE; Scan gel at three wavelengths; Image 1 (blue); Image 2 (green); Image 3 (red); Combined Image.]
1.2.2 Mass spectrometry
Ionisation types
The most common method of protein identification in proteomics is mass spectrometry (MS,
a review of techniques is given by Mann [203]). In gel based proteomics, a protein spot
is excised from the gel and digested with a protease that cleaves the protein at specific,
predictable positions along its length to form a set of peptides. The most commonly used
protease is trypsin. The peptide mixture can be applied to a matrix and a laser is fired at
a particular wavelength. A matrix is used that absorbs at the chosen wavelength, causing
the peptides to become ionised. This process is matrix-assisted laser desorption ionisation
(MALDI), as developed by Karas and Hillenkamp in the late 1980s [180, 151], which is often
used for identifying proteins in conjunction with gel electrophoresis. An alternative ionisation
approach is electrospray, first developed by Fenn and colleagues [102], in which a liquid
containing the peptide mixture is forced through a gold or platinum plated glass capillary
with a fine tip, at a high voltage, causing small droplets to form in a spray. The droplets
evaporate, imparting their charge to the peptides.
Detection
There are various methods for detecting the mass of the peptides that have been ionised.
Time of flight (TOF) is often coupled with MALDI (MALDI-TOF), and functions in the
following way. A laser fires at the matrix, imparting a fixed amount of kinetic energy to the
peptides. The ionised peptides travel through the mass spectrometer and reach the detector
in an amount of time that is dependent on the mass of the peptide, hence smaller peptides
travel faster. Therefore, the mass of each peptide can be determined from the length of time
taken to reach the detector.
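The relationship between flight time and mass can be made concrete with a small sketch. If an ion of charge z is accelerated through a potential V, it gains kinetic energy zeV = mv²/2, so the time to cross a field-free tube of length L is t = L·sqrt(m / (2zeV)). The accelerating voltage (20 kV) and tube length (1 m) below are assumed values for illustration, and the idealised model ignores initial velocity spread and any reflectron.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms


def flight_time(mass_da, charge=1, voltage=20000.0, tube_length=1.0):
    """Time in seconds for an ion to traverse the flight tube (idealised)."""
    mass_kg = mass_da * DALTON
    velocity = math.sqrt(2 * charge * ELEMENTARY_CHARGE * voltage / mass_kg)
    return tube_length / velocity


def mass_from_time(time_s, charge=1, voltage=20000.0, tube_length=1.0):
    """Invert flight_time: recover the ion mass in daltons from its flight time."""
    velocity = tube_length / time_s
    mass_kg = 2 * charge * ELEMENTARY_CHARGE * voltage / velocity ** 2
    return mass_kg / DALTON


# Lighter peptides arrive first; quadrupling the mass doubles the flight time.
for mass in (1000.0, 2000.0, 4000.0):
    print(mass, flight_time(mass))
```

Under these assumed settings the flight times are of the order of tens of microseconds, which is why the detector must resolve arrival times very precisely to distinguish peptides of similar mass.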
A quadrupole detector is commonly used with electrospray ionisation. A quadrupole
consists of four electrically charged rods to which an oscillating current is applied. Pep-
tides travel through the quadrupole but only at a particular amplitude of electric field can
a peptide, of a given mass, reach the detector. Therefore, a range of amplitudes is scanned,
allowing the mass of a peptide to be inferred from the amplitude at that time. A similar
system is the quadrupole ion trap in which ions enter a device that comprises several elec-
trodes trapping the ions inside. Various voltages are applied to the electrodes to eject ions
according to their mass:charge ratios. The ions are focused and detected using an electron
multiplier [177].
Figure 1.5: An MS trace viewed with Voyager software [339].
A recent advance in detection is Fourier Transform Ion Cyclotron Resonance (FTICR)
mass spectrometry [205]. FTICR can be coupled with either MALDI or electrospray ionisation.
Ions are collected in a cell (ICR trap), which is surrounded by a large electromagnet
that causes the ions to resonate. The resonance can be detected by an electrode and
converted into a mass:charge ratio, producing a similar spectrum to that produced from TOF
or quadrupole detection.
Data interpretation
Regardless of the method of ionisation, the result is a list of peptide masses on an MS
trace (Figure 1.5). Initially, a noise reduction procedure may be performed on a trace to
remove very weakly detected masses that are unlikely to be the result of genuine peptides.
The software supplied with the mass spectrometer can perform this task automatically but
the researcher may also manually select the strong peaks that they believe correspond to
peptides. The complete set of peptide masses, called the peptide mass fingerprint, can be
used to identify the protein. The list of masses is entered into a search engine that queries a
database of protein sequences, or translated DNA sequences, on which a theoretical digest
is performed. The search engine allows the researcher to specify which protease was used for
digesting the protein and calculates, for every protein in the database, the expected peptide
masses that would result from using that protease. Table 1.2 displays some of the software
that is available for searching peptide mass data. The software finds the proteins in the
database that have a set of predicted peptide masses that match most closely the observed
peptide masses. The software produces output that includes a statistical score indicating the
likelihood of a correct match, the number of peptides matched, and the percentage coverage
of the peptides matched out of the entire protein sequence. Each value has a statistical
basis, but the researcher uses a combination of these measures that is dependent on various
criteria, to decide if a protein has been correctly matched. In some cases, obtaining complete
coverage of the proteome may be of primary importance, and a low threshold will be used
that allows some false positives. In other situations, finding the exact identity of a single
protein is crucial and a high threshold will be used.
• PROWL - http://prowl.rockefeller.edu/
• MOWSE - http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse
• ProteinProspector - http://prospector.ucsf.edu/
• MASCOT - http://www.matrixscience.com
• SEQUEST - http://fields.scripps.edu/sequest/
• PepMAPPER - http://wolf.bms.umist.ac.uk/mapper/
Table 1.2: Software available for searching mass spectrometry data.
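The theoretical digest and mass matching performed by a search engine can be sketched as follows. The sketch cleaves each database sequence with the commonly used trypsin rule (after K or R, but not before P), computes neutral monoisotopic peptide masses, and ranks proteins by the number of observed masses matched within a tolerance. The two-entry database, the tolerance and the match-count score are assumptions for illustration; the real search engines use more elaborate statistical scoring and work with the masses of the ions actually observed.

```python
WATER = 18.01056  # monoisotopic mass of H2O added to each free peptide

# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed, sequence, tolerance=0.5):
    """Number of observed masses explained by the theoretical digest."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(1 for m in observed
               if any(abs(m - t) <= tolerance for t in theoretical))


def rank_proteins(observed, database, tolerance=0.5):
    """Rank database entries by number of matched peptide masses."""
    scores = {name: count_matches(observed, seq, tolerance)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical database and a fingerprint generated from "protA".
database = {"protA": "GASPVKTCLR", "protB": "MHFRYWK"}
observed = [peptide_mass("GASPVK"), peptide_mass("TCLR")]
ranking = rank_proteins(observed, database)
print(ranking)
```

Raising or lowering the minimum number of matches that is accepted corresponds to the choice of threshold discussed above: a low threshold favours coverage at the cost of false positives, and a high threshold favours confidence in individual identifications.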
Tandem mass spectrometry
The peptide mass fingerprint method does not always identify a protein with sufficient confi-
dence. In these cases, an alternative approach called tandem mass spectrometry, or MS/MS,
can be used. MS/MS is so called because it involves two sequential MS stages. The first
stage separates the peptides by their mass but, rather than the ionised
peptides hitting a detector, one peptide is selected and collided with an inert gas such as
argon or nitrogen. The collision causes the bonds between amino acids to split, resulting in
a range of ionised fragments. The mass of each ionised fragment is detected in the second
MS stage. For example, if the selected peptide contains eight amino acids, the fragmentation
would produce new peptides containing 8, 7, 6, 5, 4 amino acids and so on in the second
stage. The difference in mass between consecutive fragments corresponds to the exact mass of
the amino acid that is lost between the two fragments. The masses of the fragments can be
read from right to left on a trace, revealing the amino acid sequence of the peptide (Figure
1.6). The peptide sequence, or the set of masses from the MS/MS trace, can then be searched
against a sequence database to find an exact match (or near exact) that will conclusively
identify the protein.
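Reading a sequence from the fragment ladder can be sketched as follows. The sketch sorts the fragment masses, takes the difference between each consecutive pair, and looks up the residue whose mass matches that difference. The starting mass of the ladder, the tolerance and the function names are assumptions for illustration. Leucine and isoleucine have identical masses and so are reported together, and the order returned follows increasing mass, which may be the reverse of the peptide sequence depending on which ion series is read.

```python
# Monoisotopic residue masses in daltons; L and I cannot be distinguished by mass.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L/I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_for(delta, tolerance=0.01):
    """Return the residue whose mass matches delta, or '?' if none does."""
    for name, mass in RESIDUE_MASS.items():
        if abs(delta - mass) <= tolerance:
            return name
    return "?"


def read_sequence(fragment_masses, tolerance=0.01):
    """Infer residues from differences between consecutive fragment masses."""
    ladder = sorted(fragment_masses)
    return [residue_for(b - a, tolerance) for a, b in zip(ladder, ladder[1:])]


# Hypothetical ladder: start from an assumed smallest fragment and add
# serine, alanine and glycine in turn.
ladder = [147.11280]
for residue in ["S", "A", "G"]:
    ladder.append(ladder[-1] + RESIDUE_MASS[residue])

sequence = read_sequence(ladder)
print(sequence)
```

A tight tolerance is needed because some residues differ only slightly in mass (glutamine and lysine differ by about 0.036 Da); with a looser tolerance the lookup would become ambiguous.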
Figure 1.6: Three traces from a tandem mass spectrometry experiment, reproduced from [189]. Image (a) displays the first MS stage from which the two strongest peptides are selected for fragmentation. The results of the second stage fragmentation are shown in (b) and (c). The difference between the mass of the peaks, shown on the y-axis, corresponds to the mass of the individual amino acids that form the peptide sequences shown.
Standardising mass spectrometry
One of the major limitations of MS is that there is neither any standardisation across the
methods employed by different instruments to measure peptide masses, nor in the input
parameters for the instruments. One effort to remedy this situation is provided in a study
by Purvine and colleagues [259]. They created a standard mixture of peptides and proteins,
which they assayed by liquid chromatography and MS (LC-MS), coupled with a database
search engine. The system correctly identified 23 peptides and 12 proteins from the mixture.
The experimental methodology has been released as a standard for assessing the quality of
studies, to see how effectively other systems can identify different proteins from within the
mixture.
The peak list generated from MS is usually entered into a search engine to identify the
protein. Each different application has its own measures of the quality of a protein match
and a researcher often decides, using a combination of measures, whether an identification is
correct. The measures of correct matching often depend upon the software being used, and a
cut-off is determined by each laboratory, using their own criteria that depend upon the type
of experiment. This means that there is no standard method for comparing the likelihood
of a correct match between data produced from different laboratory setups. Therefore, it
is very difficult to ascertain in large data sets the statistical probability that a protein has
been correctly identified. The efforts of the Proteomics Standards Initiative to solve some of
the standardisation problems are described in the following chapter.
1.2.3 Other proteomics techniques
One of the main criticisms of 2-DE based proteomics is the unreliability of estimates of
protein volume made by image analysis. The stain used to visualise protein spots greatly
affects the linearity of the relationship between true protein volume and the spot density
measured by analysis software. There is little information in the published literature about
the accuracy of measurement of protein volumes, therefore in the past results have often
been qualitative: spots are present on one gel and absent on another, or clearly up or down
regulated with large fold differences. However, recent advances in staining or labelling of
proteins, such as DIGE analysis, and improvements in software have enabled quantitative
measurement of protein volume from 2-D gels [181, 328]. In the microarray domain there
has been substantial work on the quantification and statistical analysis of results to be able
to say what differences are statistically significant (examples include [130, 267, 312]). The
interpretation of results would be easier if quantitative analysis of proteomics data sets could
be performed. Towards this goal a set of new experimental techniques has been devised for
quantifying protein volumes in samples, as described below.
A limitation of 2-DE based proteomics is that highly abundant proteins are identified
much more readily than low abundance proteins. Many functionally significant proteins,
such as transcription factors, are present in low copy number in the cell, and it is vital
that these proteins can be assayed. Therefore, techniques have been developed that perform
proteomic analysis using separation techniques other than 2-DE, which detect proteins that
are expressed at low levels.
Liquid chromatography and tandem mass spectrometry
A technique has been developed in the labs of John Yates at the Scripps Institute, for identi-
fying large numbers of expressed proteins. This technique is unbiased with regard to protein
volume, protein charge or molecular weight, and can identify membrane proteins [344]. The
technique is known as MudPIT (Multidimensional Protein Identification Technology). Mud-
PIT is a further development of a technique reported in 1999, in which two dimensional
liquid chromatography (LC) is coupled directly with MS (LC-MS, Figure 1.7) [195].
There are many variations in the functionality of LC but the principle is that a solution
containing the proteins or peptides to be separated is applied to a column. The column
contains substances that create a gradient to fractionate the mixture based on the charge or
hydrophobicity of the proteins [290]. Reverse phase (RP) chromatography is often performed
in proteomics, in which a column is filled with an aqueous solution and there is an increasing
gradient of an organic solvent. Different fractions are eluted from the column according
to their hydrophobicity as the gradient of solvent increases. The fractions can be collected
for further separations or analyses, such as mass spectrometry, because RP can be directly
coupled to electrospray ionisation. One of the limitations of this technique is that complex
mixtures of proteins, such as the entire proteome of a sample, often cannot be adequately
resolved. This problem can be overcome by performing two-dimensional chromatography
in which two sequential stages are performed, which separate on different properties of the
mixture. The first stage is often ion-exchange chromatography, for instance eluting particular
proteins using different concentrations of KCl in stages, causing proteins or peptides to
separate differentially according to their charge.
Figure 1.7: Two dimensional liquid chromatography coupled with MS for identifying large numbers of proteins from a mixture, reproduced from [195]. Two phases of LC are performed: (i) strong cation exchange (SCX) for separating by charge, (ii) reversed phase (RP) separating by hydrophobicity, followed by tandem mass spectrometry. [Diagram steps recovered from the figure: Denatured protein complex; Peptides (pH < 3); 2D chromatographic separation of peptides; Peptide fragmentation using tandem mass spectrometry; Computational translation of tandem mass spectra to amino acid sequences using genomic sequences; Identified proteins in complex.]
MudPIT was used with the SEQUEST software for performing database searches [93] in
a study reported in 2001 [344]. The technique was used to identify almost 1500 proteins from
the Saccharomyces cerevisiae proteome, including proteins with extremes in pI, MW, abun-
dance and hydrophobicity. Many studies have been performed to determine the proteome
of S. cerevisiae by 2-DE and MS; however, prior to this analysis the largest study had
resolved only 279 proteins [245]. A later refinement of the process was reported by Peng et al.
2003 [243] in another study of the yeast proteome, using two dimensional chromatography,
coupled with tandem mass spectrometry. The study identified a similar number of proteins,
approximately 1500, and reported a very low rate of false positives (less than 1%).
ICAT
The technique of mass spectrometry for protein identification has been discussed above, but
if performed using a standard protocol, MS does not produce quantitative output. This is
because the heights of peaks on a trace are very poorly reproducible, and do not generally
correlate well with the amount of protein in the starting sample. In 1999, a new technique was
reported by Gygi and colleagues [142], in which MS was coupled with liquid chromatography
for protein separation, and proteins from two different samples could be compared concur-
rently. The scheme is shown in Figure 1.8 and consists of labelling proteins from two different
conditions with ICAT reagents (Isotope-Coded Affinity Tags). ICAT has a component that
binds cysteine residues in proteins, with an isotopically heavy reagent binding proteins in
one sample, and an isotopically light reagent binding proteins in the other sample. The sam-
ples are combined, and enzymatically cleaved to produce peptides. The ICAT reagent also
includes biotin which allows peptides to be extracted with an avidin affinity column because
avidin binds biotin with a very high affinity. Peptides labelled with the ICAT reagent are
captured in the affinity column. The peptides are then analysed in a mass spectrometer
which reveals a pair of adjacent peaks for each peptide. The adjacent peaks are separated
by a difference of 8 Da, which is the difference in mass between the heavy and light isotope.
The difference in peak height represents the relative volume of protein that was present in
the two samples. At this stage, there is no information about protein identity. The peptides
undergo a second stage of MS, in which peptides are fragmented into amino acids (MS/MS
described above) to reveal the amino acid sequence that in many cases can be used to search
a sequence database, correctly identifying the protein. In the original paper describing the
method, ICAT was used to analyse the volume of proteins in two cultures of yeast growing
in different media. The authors were able to identify subtle changes in protein expression
that correlate well with previously published data. One limitation of this method is that the
reagents bind cysteine residues, and cysteine is one of the rarest amino acids. However, the
first publication about ICAT suggests that the percentage of cysteine-free proteins in yeast
is only 8%.
Figure 1.8: The ICAT method for quantitative proteomics.
[Figure panels: Cell State 1 (light ICAT); Cell State 2 (heavy ICAT); Combine samples and proteolyse; Affinity isolation of ICAT peptides; Quantify relative protein abundance by measuring ratio of peaks; MS/MS analysis to identify protein from sequence of peptide A. Axes: relative abundance against mass/charge.]
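The pairing of peaks and the calculation of relative abundance described above can be sketched in Python. This is an illustrative sketch only, not the software used in [142]. The function name find_icat_pairs, the tolerance and the example peak list are assumptions introduced here, and the sketch assumes singly charged peptides that each carry a single ICAT label, so that partner peaks lie exactly 8 Da apart.

```python
from typing import List, Tuple

ICAT_SHIFT_DA = 8.0  # heavy/light mass difference stated in the text


def find_icat_pairs(
    peaks: List[Tuple[float, float]],
    shift: float = ICAT_SHIFT_DA,
    tolerance: float = 0.05,  # assumed matching tolerance in Da
) -> List[Tuple[float, float, float]]:
    """Return (light_mass, heavy_mass, heavy/light intensity ratio) for each pair.

    `peaks` is a list of (mass, intensity) tuples from one spectrum.
    """
    ordered = sorted(peaks)
    pairs = []
    for i, (light_mass, light_int) in enumerate(ordered):
        for heavy_mass, heavy_int in ordered[i + 1:]:
            delta = heavy_mass - light_mass
            if delta > shift + tolerance:
                break  # peaks are sorted, so no later peak can match
            if abs(delta - shift) <= tolerance and light_int > 0:
                pairs.append((light_mass, heavy_mass, heavy_int / light_int))
    return pairs


# Hypothetical peak list: two labelled pairs and one unpaired peak.
example_peaks = [
    (1000.50, 200.0), (1008.50, 400.0),
    (1500.75, 90.0), (1508.75, 30.0),
    (1700.00, 50.0),
]
for light, heavy, ratio in find_icat_pairs(example_peaks):
    print(f"{light:.2f} / {heavy:.2f}: heavy/light = {ratio:.2f}")
```

A ratio above 1 would suggest that the protein is more abundant in the sample labelled with the heavy reagent. Real spectra would also require handling of multiply charged ions and of peptides containing more than one cysteine.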
SILAC
A similar approach for quantifying protein abundance is SILAC (stable isotope labelling by amino
acids in cell culture), presented by Blagoev in 2003 [36]. In this approach, a heavy form of
arginine or leucine, labelled with 13C, is incorporated into the medium in which cells are
growing. A separate culture is grown in normal medium for a different condition. The
proteins are then extracted, digested with a protease and analysed by mass spectrometry.
Each peptide that contains an arginine residue is represented by a pair of adjacent peaks,
caused by a slight increase in mass of the peptide in the heavy carbon medium. It is expected
that all proteins contain arginine residues. The method was utilised to examine the EGFR
(Epidermal Growth Factor Receptor) pathway. One culture was stimulated with EGF, the
other was not stimulated. The cells from both cultures were lysed, mixed in a 1:1 ratio, and
an affinity column was used to extract proteins that interact with EGFR. The differences
in volume for proteins implicated in EGFR processes were accurately determined from pairs
of peptide peaks, as for the ICAT method. However, SILAC can only be used for cell cultures
growing in a medium, whereas ICAT reagents are used to label the proteins after they have
been extracted from the sample; there are therefore fewer restrictions on the samples that
can be analysed with ICAT.
Other differential labelling strategies
ICAT and SILAC were two of the first procedures reported for labelling proteins to quantify
their abundance on a large scale by mass spectrometry. However, there are various other
labels that have been used to create “heavy” and “light” isotopic forms that can be detected by
MS. An example is iTRAQ (isobaric Tags for Relative and Absolute Quantitation), which
functions in a similar way to ICAT but has the advantage that more than two samples can
be compared concurrently [15]. Labelling with H2 16O/18O [214], with deuterium, and with various
other tags attached to amino acid side chains has also been applied to protein quantification (reviewed
in [284]). It is likely that these methods will begin to overtake gel based quantitation of
protein abundance as they do not suffer from the same limitations in the range of proteins
that will be identified.
1.2.4 Post-translational modifications
The genome sequence is a static representation of biology, and while it is possible to predict
the amino acid sequence of proteins with a high degree of accuracy, this does not reflect the
complete picture of proteins as functional units in cells. The chemical alteration of proteins,
known as post-translational modification (PTM), is a common phenomenon that occurs in a
time and signal controlled manner. Modifications include the addition and removal of phos-
phate groups (phosphorylation and dephosphorylation), which are well known mechanisms
for controlling the catalytic and signalling activity of proteins [172]. For example, receptor
tyrosine kinases (RTKs) transmit external signals to the inside of cells. RTKs reside in
cell membranes and, when bound by a ligand, change in conformation, switching on their
kinase activity. The RTK subsequently binds and phosphorylates other proteins within the
cell, transmitting the signal downstream [206].
The addition of carbohydrate molecules to proteins, termed glycosylation, is the most
common type of modification. Analysis of glycosylation, or “glycomics”, describes studies to
find all the carbohydrate molecules produced by a protein, and already 5000 genes have been
assigned as having a potential role in the synthesis of carbohydrates across all the sequences
deposited in GenBank [337]. Other types of modification are acetylation, methylation and
cysteine oxidation. In general, modifications cause proteins to change in conformation, lead-
ing to the protein translocating to another part of the cell, or causing new protein interactions
to form. Modifications play a role in maintaining the tertiary (the 3-D conformation of a
single protein unit) and the quaternary (multi-protein complex) structure of proteins, and
are therefore ultimately associated with function.
Identifying PTMs
Protein modifications can be identified by 2-DE coupled with MS, and various other methods
for their detection have been developed (a review of techniques is given by Mann and Jensen
[204]). Distinct protein spots can be observed on a 2-D gel that correspond to differentially
modified forms of the same protein. Phosphorylation can be observed in the case of different
spots positioned in a horizontal line, due to a change in the protein’s charge (pI) with only a
negligible change in molecular weight (Figure 1.9). Glycosylation of proteins (the addition of
chains of carbohydrates) causes a change in molecular weight and pI, causing variant forms of
proteins to appear in a diagonal line. MS can conclusively identify modifications, for example
if the peptide mass fingerprint reveals a peptide with a shift in mass that corresponds exactly
to the known mass of a modification type. Tandem mass spectrometry is even more accurate,
and can reveal the exact amino acid position of the modification if one amino acid displays a
characteristic increase in mass. However, there are several problems using this technique on a
large scale. Firstly, phosphopeptides are low in abundance and extract poorly from gel slices.
Secondly, during MALDI-TOF only a proportion of peptides reach the detector, therefore
often they may not be detected. Thirdly, while using electrospray ionisation, phosphorylated
peptides ionise poorly in acidified solvents. Finally, in MS/MS the situation is worse, as only
a few peptides in the entire sequence may be detected, therefore the majority of the protein
sequence is not analysed, and modifications on the rest of the protein are silent.
Figure 1.9: A two dimensional gel highlights possible different phosphorylation states of protein disulfide isomerase from a human cell line (image courtesy of M. Nelson).
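The idea of recognising a modification from a shift in peptide mass can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions rather than a description of any particular search engine: the function name match_modification, the tolerance and the three-entry lookup table are introduced here for the example, and a real search would use a curated list of modifications and would consider several modifications per peptide.

```python
from typing import Optional

# Monoisotopic mass shifts in Da for a few common modifications.
# Illustrative subset only.
MODIFICATION_SHIFTS = {
    "phosphorylation": 79.96633,
    "acetylation": 42.01057,
    "methylation": 14.01565,
}


def match_modification(
    observed_mass: float,
    expected_mass: float,
    tolerance: float = 0.02,  # assumed mass accuracy in Da
) -> Optional[str]:
    """Name the modification whose mass shift best explains the difference
    between the observed and the expected (unmodified) peptide mass, or
    return None if no shift in the table matches within the tolerance."""
    delta = observed_mass - expected_mass
    best_name = None
    best_error = tolerance
    for name, shift in MODIFICATION_SHIFTS.items():
        error = abs(delta - shift)
        if error <= best_error:
            best_name, best_error = name, error
    return best_name


# Hypothetical peptide with an expected mass of 1000.00 Da.
print(match_modification(1079.97, 1000.00))
print(match_modification(1000.00, 1000.00))
```

The first call differs from the expected mass by approximately 79.97 Da and is therefore reported as a candidate phosphopeptide; the second shows no shift and returns None.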
There are methods for improving the detection of modifications including the use of
affinity columns that bind phosphorylated proteins [103], to enrich for these proteins as they
often occur as a small proportion of the total amount of a single protein. One such method
is Immobilized Metal Affinity Chromatography (IMAC) in which columns are loaded with a
metal ion-containing resin that causes phosphopeptides to bind under acidic conditions [354].
Other techniques used to identify modifications include Western blot analysis whereby pro-
teins are treated with specific antibodies that are known to bind particular phosphorylation
sites on peptides. The antibodies can be fluorescently labelled, allowing differences in fluo-
rescence signal to detect the amount of phosphorylated protein. A similar approach is the
use of autoradiography, whereby radiolabelled 32P is incorporated into proteins, which can
then be quantified [179].
The development of new techniques means that data sets of PTMs are rapidly increas-
ing in size, and good database support is required to make the information available to
researchers to avoid manual analysis of the literature. One estimate suggests that there are
at least 200,000 published PTMs in PubMed [285]. It is a major research challenge to make
the information on PTMs available in the context of large scale investigations.
1.2.5 Case studies of proteomics research
In this section, examples are given of proteomic investigations we have studied. Chapters
3 and 4 will return to this topic and discuss the development of standard data formats for
proteomics, and Chapter 5 will outline a database system that has been implemented to aid
research.
A major part of the development process of the standard was the capture of the requirements
of proteomic research. Three case studies of current research activity at the
University of Glasgow, which use proteomic techniques, have been performed. Two case
studies of research in parasitology are summarised below (case studies 1 and 3), which
ultimately contributed to the work described in Chapters 6 and 7. Case study 2 outlines
a collaboration at the Beatson Institute3 with the research group of Prof. Walter Kolch,
investigating the MAP Kinase signalling pathway. The data from case study 2 were not
available for inclusion in RAPAD but the experimental setup was taken into consideration
during the development of the model presented in Chapter 3.
3 The Beatson Institute for Cancer Research, www.beatson.gla.ac.uk.
1.2.6 Case study 1
This case study is derived from work with researchers in the field of microbial pathogenesis
[61] at the Institute of Biomedical and Life Sciences, University of Glasgow. The researchers
wish to investigate the changes that occur in the proteome of a human cell line (the host)
during invasion with the parasite Toxoplasma gondii compared with non-infected host cells.
A set of replicate samples is obtained and the proteins are extracted from each sample,
solubilised and separated by 2-DE. The gels are scanned and image analysis is performed
to match spots on different gels corresponding to the same protein. Protein spots showing
differential expression are extracted from the gel and prepared for MS. Many proteins are
identified conclusively by MS. The next stage involves characterising the large number of
hits that are obtained. There are a large number of Internet accessible resources about
human proteins which can only be searched manually. This process is very time consuming
for a large data set. If database searches could be automated, many more proteins could be
analysed in one study, and greater insights could be made. After a long period of manual
database searching, a significant amount of information is obtained about each protein, but
there is no simple mechanism for summarising or managing the information.
The researchers also wish to identify post-translational control mechanisms, to determine
if a protein expressed during parasite invasion has been modified, compared with the same
protein in non-invaded cells. Potential modifications can be found by 2-DE if a protein
migrates to a different position on one gel compared with another gel, the result of a slight
change in the charge or molecular weight of the protein caused by the modification. The
modification can be positively identified on an MS trace by discovering a peptide with a mass
that is different from the expected value, and the difference corresponds to the mass of an
additional chemical group, such as a methyl group. However, to discover modifications that
are functionally important, the researcher must have information about how the protein is
modified in other conditions. These efforts are hindered because there are no major databases
of MS traces or modifications available. An annotated database, containing a large number
of MS traces, would greatly improve the identification of modifications in two ways. Firstly,
annotated traces for proteins with confirmed modifications could be mined to improve the
algorithms for the detection of modifications in other proteins. Secondly, if a particular
protein already has an entry in the database, differences in the modification pattern could
be highlighted, and investigated further to determine if the modification is significant for the
function of the protein.
1.2.7 Case study 2
This case study was conducted at the Beatson Institute, in collaboration with Prof. Walter
Kolch. A cell line was obtained in which the protein Raf-1 is knocked out. The protein
is known to be involved in major metabolic processes in the MAP kinase pathway [38],
and researchers wish to discover the downstream effects from the loss of Raf-1. Gels are
run using a difference gel electrophoresis system, labelling proteins from the knockout cell
line with one dye, and from a normal cell line with a different dye. A series of replicates
are run, and the gel images are analysed. The researcher has a number of questions they
wish to pose, for example which spots show significant differential expression between
the samples, and what the identities of these proteins are. After statistical analysis, two
hundred spots showing the greatest difference in expression are highlighted for further study.
The two hundred spots are robotically picked from the gel and prepared for MS. MS traces
are analysed, peak lists are produced and entered into applications that search genome
databases. The searches identify approximately one hundred and fifty proteins that reside in
databases, of which many have only basic functional annotation. The researcher wishes to
further characterise the proteins by searching other relevant databases, of which about ten
exist. The researcher must manually browse Internet sites to assemble information and read
bibliographic references, which takes a number of hours for each protein, or up to days if
extensive literature searches are required. Therefore, to characterise all one hundred and fifty
proteins in detail could take several weeks for a single researcher.
Once the proteins have been characterised, the researcher wishes to build a mathematical
model of the changes that occur in the metabolic pathway, caused by the loss of function of
Raf-1. Data for the model are to be drawn from the 2-DE studies, a microarray experiment
that has been carried out by another research group on the same cell line, and biochemical
studies carried out over several decades by many different research groups. The process
of retrieving data from the biochemical studies is extremely laborious because little of the
data reside in accessible databases, therefore extensive literature searches are required. The
microarray data sets have been published by other research groups, and are available on
the Internet, but do not have any information about how the cell lines were cultured. In
addition, the database identifiers (accession numbers) for the features on the microarray do
not match the identifiers of the proteins identified by MS. Therefore, it is not possible to
make any direct comparison with changes observed in the 2-DE studies. The major problems
highlighted by this case study are lack of tools for the integration of data from distributed
databases, and insufficient information stored with published data for it to be re-used.
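The identifier problem can be made concrete with a small Python sketch. It assumes that a cross-reference table between microarray feature identifiers and protein accession numbers is available, which is precisely what was missing in this case study. All identifiers and values below are hypothetical, as is the function name integrate.

```python
from typing import Dict, List, Tuple


def integrate(
    array_ratios: Dict[str, float],
    protein_ratios: Dict[str, float],
    xref: Dict[str, str],
) -> Tuple[Dict[str, Tuple[float, float]], List[str]]:
    """Pair mRNA and protein expression ratios through a cross-reference table.

    Returns the paired values keyed by protein accession, together with the
    microarray features that could not be matched to an identified protein.
    """
    merged = {}
    unmapped = []
    for feature_id, mrna_ratio in array_ratios.items():
        protein_id = xref.get(feature_id)
        if protein_id is None or protein_id not in protein_ratios:
            unmapped.append(feature_id)
            continue
        merged[protein_id] = (mrna_ratio, protein_ratios[protein_id])
    return merged, unmapped


# Hypothetical data: three array features, two proteins identified by MS.
array_ratios = {"FEATURE_001": 2.1, "FEATURE_002": 0.4, "FEATURE_003": 1.0}
protein_ratios = {"PROT_A": 1.8, "PROT_B": 0.9}
xref = {"FEATURE_001": "PROT_A", "FEATURE_002": "PROT_C"}

merged, unmapped = integrate(array_ratios, protein_ratios, xref)
print(merged)
print(unmapped)
```

Only features with an entry in the cross-reference table and a corresponding protein identification can be compared directly; the remainder are reported as unmapped.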
1.2.8 Case study 3
This study was performed with Prof. Mike Turner at the Institute of Biomedical and Life
Sciences, University of Glasgow, in the context of an investigation to determine the proteome
of the parasite Trypanosoma brucei. The genome sequence of T. brucei is nearing comple-
tion, but many genes have little functional annotation and it is hypothesised that proteome
investigations can aid the annotation process. The data from this investigation form the
basis for Chapter 7.
Proteomics experiments can aid annotation efforts by conclusively identifying proteins
that are expressed under particular conditions. There are many other examples of published
work in which researchers have used proteomics techniques to catalogue the set of proteins
present in a sample of interest, to determine the entire proteome of particular cell types,
organelles or microorganisms (examples include whole yeast cells [115], the human heart
mitochondrion [316] and the plasma membrane of yeast [223]). The organism being studied
may have no genome sequence, or the sequence may be incomplete, therefore there are
significant problems conclusively identifying spots found on a 2-D gel. In some cases, several
2-D gels may be run to separate proteins within different pH ranges. Spots from the gels
are picked, and prepared for MS. Four scenarios for the results of database searches with
peptide masses obtained from an MS trace are possible:
1. A good match to a sequence in the genome database, with functional annotation.
2. A match with no annotation but with homology to sequences from other organisms.
3. A match with no annotation and no homologous sequences.
4. No match in any genome database, for example if the identification has been made
from an expressed sequence tag (EST) database.
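The four scenarios can be expressed as a simple classification rule. The following Python sketch is illustrative only; the record layout (SearchResult) and the function name scenario are assumptions made for the example and do not correspond to the output format of any particular search tool.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    """Outcome of one database search (hypothetical record layout)."""
    spot_id: str
    genome_match: bool            # matched a sequence in the genome database
    annotated: bool = False       # the matched sequence has functional annotation
    has_homologues: bool = False  # homologous sequences exist in other organisms


def scenario(result: SearchResult) -> int:
    """Return the number (1 to 4) of the scenario listed above."""
    if not result.genome_match:
        return 4
    if result.annotated:
        return 1
    if result.has_homologues:
        return 2
    return 3


# Hypothetical results for four gel spots.
results = [
    SearchResult("spot_01", genome_match=True, annotated=True),
    SearchResult("spot_02", genome_match=True, has_homologues=True),
    SearchResult("spot_03", genome_match=True),
    SearchResult("spot_04", genome_match=False),
]
for r in results:
    print(r.spot_id, scenario(r))
```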
Genome sequencing and annotation work is only partially complete for many organisms,
therefore a major problem arises due to the dynamic nature of the sequence databases. After
the release of each new database version, sequences are more likely to be found in groups
1 and 2. However, it is extremely difficult to identify which sequences have been updated
between database versions and the information cannot be accessed without repeating all
the initial searches. The sequence identifiers may also change between database releases,
therefore automating the process of searching for protein records that have been updated is
a major challenge. The sequence of peptides from an MS/MS experiment can also be used
to discover new genes within the genome, or act as an identifier for genes that previously
had not been sequenced, that fall into category 4.
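One possible way to detect changes between database releases, without repeating the original searches, is to compare records by a checksum of the sequence rather than by identifier. The Python sketch below is a minimal illustration under assumptions introduced here: each release is held in memory as a dictionary from identifier to a (sequence, annotation) pair, sequences are unique within a release, and the function names and example records are hypothetical.

```python
import hashlib
from typing import Dict, Tuple

Release = Dict[str, Tuple[str, str]]  # identifier -> (sequence, annotation)


def checksum(sequence: str) -> str:
    """Digest of the sequence, independent of the identifier."""
    return hashlib.sha256(sequence.encode("ascii")).hexdigest()


def diff_releases(old: Release, new: Release) -> Dict[str, list]:
    """Report records that were added, renamed, reannotated or resequenced."""
    old_by_seq = {checksum(seq): ident for ident, (seq, _) in old.items()}
    report = {"added": [], "sequence_changed": [], "renamed": [], "reannotated": []}
    for ident, (seq, annotation) in new.items():
        old_ident = old_by_seq.get(checksum(seq))
        if old_ident is None:
            # Sequence not seen before: either a new record or a revised one.
            key = "sequence_changed" if ident in old else "added"
            report[key].append(ident)
            continue
        if old_ident != ident:
            report["renamed"].append((old_ident, ident))
        if old[old_ident][1] != annotation:
            report["reannotated"].append(ident)
    return report


# Hypothetical releases with very short example sequences.
old = {"P1": ("MKT", ""), "P2": ("MAA", "kinase"), "P3": ("MCC", "")}
new = {
    "P1": ("MKT", "transporter"),  # annotation added
    "Q2": ("MAA", "kinase"),       # identifier changed
    "P3": ("MCCA", ""),            # sequence revised
    "P4": ("MGG", ""),             # new record
}
print(diff_releases(old, new))
```

Records listed as reannotated are those most likely to have moved into scenarios 1 or 2 since the original search.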
1.2.9 Publication of proteomics data
There is a growing body of publications in which researchers have utilised a global approach
to study the proteins in a system. A search of PubMed for the word “proteomics” returns over
3500 articles (July 2004). Articles describing gel based proteomics usually include a printed
image of one or more gels, often with a table containing proteins that have been identified
(example [208]). In some cases, there is a comparative analysis across several conditions and
the ratios of the volume of proteins are displayed in a table (example [129]). Experiments
involving different separation techniques, such as liquid chromatography, coupled with MS
for protein identification often display the chromatograms for the different fractions, and
images of MS traces (examples [369, 361]). The proteins that have been identified are also
usually presented in a table. Most publications reproduce the protocols for MS, and a
reference to the software used for protein identification, but rarely is there any detail about
the input parameters for the software or the version of the database that was searched, and
there is variability in the significance cut-off that was used for protein identifications. It
is therefore often not possible to assess the statistical probability that proteins have been
correctly identified without substantial manual effort.
The data from proteomic studies are usually not open to any kind of automated analysis,
even if publications are reproduced electronically on the Internet. This is because the results
are often embedded within images, which cannot be extracted, or the results are written in
the main body of text, which must be read manually to understand the context. This cannot
be automated using current information retrieval techniques. We focus on the challenges of
making proteome data widely accessible in Chapter 3.
1.3 Gene expression techniques
The techniques described above attempt to assess the status of the proteins within a system.
However, the experiments present technical challenges due to the difficulties of extracting
very low volumes of proteins from the cell. There is also no technique for amplifying the
volume of a protein that is equivalent to PCR (polymerase chain reaction) for amplifying
nucleic acid sequences. Therefore, in the last decade, techniques have been developed for
assessing how strongly genes are expressed by measuring the messenger RNA (mRNA) levels
produced. These techniques are described in this section.
1.3.1 The development of microarrays
Microarrays were first developed in the mid 1990s from two different approaches. One of
the first developments in microarrays was achieved by Shalon and colleagues in 1996, who
developed a protocol for attaching DNA fragments to a glass slide, and hybridising two sets
of yeast chromosomes, labelled with different fluorophores [287]. A paper was published later
that year by DeRisi and colleagues outlining how microarrays, formed by spotting cDNA
(complementary DNA) onto a slide, can be used to assay gene expression in the context of classifying
differences in human tumour cell lines [76]. A different article was published at the same time
outlining the use of microarrays for detecting mutations in a gene implicated in breast cancer
from a number of patients [145]. Each cDNA “feature” corresponds to the complementary
sequence of the mRNA that is produced for each gene to be assayed.
Affymetrix arrays
An alternative approach was pioneered by the Affymetrix company in which very short (10-50
base pairs) stretches of DNA are synthesised on the chip using a technique inherited from
the semi-conductor industry, called photolithography [5]. Short sequences of DNA bases
(oligonucleotides) are synthesised on the chip, one base at a time in specific positions. The
process uses fine masks over the chip that allow light to reach particular positions, which
causes the specific degradation of a “blocking residue” that prevents additional bases be-
ing added to an oligonucleotide chain. The chip is then washed with a solution containing
whichever base (A, C, G or T) is required in the next position at the unmasked oligonu-
cleotide, attached to a new blocking residue. A new mask is applied and the next set of
bases are added (Figure 1.10). In this way, chains of nucleotides can be built up one base at
a time.
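The cycle of masking and base addition can be simulated in a few lines of Python. This is a toy model written for illustration, not a description of the manufacturing process in [5]: the function name synthesise, the grid positions and the target sequences are assumptions, and the model simply applies one mask for each base that is needed at each layer.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]  # (row, column) on the chip


def synthesise(
    targets: Dict[Position, str],
) -> Tuple[Dict[Position, str], List[Tuple[str, List[Position]]]]:
    """Build every oligonucleotide one base at a time.

    For each layer and each base, the mask exposes only the positions whose
    next required base is that base; the chip is then washed with that base.
    Returns the finished chains and the list of (base, unmasked positions) steps.
    """
    chains = {pos: "" for pos in targets}
    steps = []
    longest = max((len(seq) for seq in targets.values()), default=0)
    for layer in range(longest):
        for base in "ACGT":
            unmasked = [
                pos for pos, seq in targets.items()
                if len(seq) > layer and seq[layer] == base
            ]
            if not unmasked:
                continue  # no mask is needed for this base at this layer
            for pos in unmasked:
                chains[pos] += base
            steps.append((base, unmasked))
    return chains, steps


# Hypothetical chip with three positions.
targets = {(0, 0): "ACG", (0, 1): "AGG", (1, 0): "TC"}
chains, steps = synthesise(targets)
print(chains == targets)
for base, positions in steps:
    print(base, positions)
```

In this model the number of masking steps is at most four per layer, so it grows with the length of the oligonucleotides rather than with the number of positions on the chip.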
Measuring expression
Using either of the two approaches outlined above, the result is a chip or slide containing
up to tens of thousands of reporters. Each reporter detects the level of expression for
one gene. When a gene is expressed, mRNA is produced as an intermediate molecule, which
is later translated into a protein, the functional unit in the cell. It is believed that the
relative amount of mRNA in one cell compared with another is indicative of the rate of gene
expression and can give insights into the genes that cause the differences between samples.
Two sets of mRNA from samples produced under different conditions (for example, one normal,
one disease) can be labelled with different fluorescent compounds (one red, one green) and
hybridised to the array. The ratio of red to green for each reporter gives the difference in
expression for each gene between the two samples. For Affymetrix arrays only one sample
is assayed at a time (a one-colour array), and two different samples must be compared in
two separate hybridisations to the chip. Statistical processing is performed to ensure that
values obtained from different assays can be compared. Large changes in expression for a
gene, between a normal and a disease sample, may implicate the gene in the disease process.
Figure 1.10: A summary of the technique involved in the creation of Affymetrix microarrays, image obtained from [5].
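The red to green ratio for a two-colour array is commonly reported on a log2 scale, so that a two-fold increase and a two-fold decrease are symmetric about zero. The following Python sketch illustrates the calculation. The function name log2_ratios, the minimum signal threshold and the intensity values are assumptions made for the example, and no normalisation between assays is attempted.

```python
import math
from typing import Dict


def log2_ratios(
    red: Dict[str, float],
    green: Dict[str, float],
    min_signal: float = 1.0,  # assumed threshold below which a spot is ignored
) -> Dict[str, float]:
    """Return log2(red / green) for every reporter measured in both channels."""
    ratios = {}
    for reporter, r in red.items():
        g = green.get(reporter)
        if g is None or r < min_signal or g < min_signal:
            continue  # skip missing or unreliable measurements
        ratios[reporter] = math.log2(r / g)
    return ratios


# Hypothetical intensities for three reporters.
red = {"gene_a": 800.0, "gene_b": 100.0, "gene_c": 0.5}
green = {"gene_a": 200.0, "gene_b": 400.0, "gene_c": 100.0}
print(log2_ratios(red, green))
```

In this example gene_a shows a four-fold higher signal in the red channel, gene_b a four-fold lower signal, and gene_c is excluded because its red signal falls below the threshold.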
Since the early days of research the use of microarrays has grown at a remarkable rate.
A simple search of PubMed for the word “microarray” reveals almost 6000 articles published
since 1996. Each experiment generates a large amount of data: most studies involve many
parallel assays, with each assay containing thousands of data points. Therefore, as a general
estimate, each published study could generate several hundred thousand data points. In ad-
dition, we should also consider the genes’ annotation, experimental protocols, and statistical
processing. The challenges in database support for microarrays are clearly very large. These
requirements were realised by the MGED (Microarray and Gene Expression Data) society
in the late 1990s [42], which was established to improve support for publishing, querying
and exchanging microarray data sets. The issues of data standardisation, and the creation
of public databases, are discussed in the following chapter.
1.3.2 Serial analysis of gene expression
The technique of serial analysis of gene expression (SAGE) was first reported in 1995 by
Velculescu and colleagues [332] as a method for quantifying the expression of genes, prior
to the invention of microarrays. The basic principle is that short tags (10-14 base pairs),
which uniquely identify the transcript of the gene, are obtained for each gene to be assayed.
A sample is obtained, and the tags are isolated from the transcripts, reverse transcribed
(converting mRNA back into DNA), and concatenated to form a long stretch of DNA. The
newly formed DNA is sequenced, and the number of times each tag appears indicates the
level of expression of each gene. In 1997 the technique was successfully used to assay the
expression of over 4000 genes in yeast, which was one of the first examples of a
technique to perform high-throughput analysis on a whole system [333].
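The counting step at the heart of SAGE can be sketched in Python. The sketch is deliberately simplified and makes several assumptions: tags have a fixed length, they are concatenated directly without the linker and restriction site sequences present in the real protocol, and the tag-to-gene table, gene names and function names are hypothetical.

```python
from collections import Counter
from typing import Dict


def count_tags(concatemer: str, tag_length: int = 10) -> Counter:
    """Split a sequenced concatemer into fixed-length tags and count each tag.

    Any incomplete tag at the end of the read is ignored.
    """
    tags = [
        concatemer[i:i + tag_length]
        for i in range(0, len(concatemer) - tag_length + 1, tag_length)
    ]
    return Counter(tags)


def expression_levels(
    concatemer: str, tag_to_gene: Dict[str, str], tag_length: int = 10
) -> Counter:
    """Sum the tag counts per gene, skipping tags that map to no known gene."""
    levels = Counter()
    for tag, n in count_tags(concatemer, tag_length).items():
        gene = tag_to_gene.get(tag)
        if gene is not None:
            levels[gene] += n
    return levels


# Hypothetical tags and concatemer.
tag_to_gene = {"AAAAAAAAAA": "gene_1", "CCCCCCCCCC": "gene_2"}
read = "AAAAAAAAAA" + "CCCCCCCCCC" + "AAAAAAAAAA" + "GGGGGGGGGG"
print(expression_levels(read, tag_to_gene))
```

The tag for gene_1 appears twice and the tag for gene_2 once, so gene_1 would be estimated to be expressed at twice the level of gene_2 in this sample; the final tag matches no known gene and is not counted.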
1.4 Other techniques used in functional genomics
The main focus of our research is to improve computational support for proteomics, and to
integrate the results of protein abundance experiments with gene expression values. However,
it is also important that technology can be extended to capture and integrate data from all
types of functional genomics experiment. This section contains a brief overview of other
types of large scale experiments which may yield data needed for functional genomics.
1.4.1 RNA interference
RNA interference (RNAi) is a technique first developed in Caenorhabditis elegans [108]. It
is a powerful method for removing the function of a gene without having to develop genetic
crosses, or engineer complex methods for deleting the gene from the genome. In certain
species, simply injecting the organism with double stranded RNA of the same sequence as
the targeted gene prevents the gene being translated into protein. The same effect can also
be achieved to a lesser extent using single stranded anti-sense RNA. The resulting phenotype
of the gene knockout allows researchers to assign a function to a gene, as long as the knockout
is not lethal, and this has proved vital for investigating C. elegans. The vast majority of the
predicted 20000 genes have been tested with RNAi. Similar experiments have been performed
in plants, in Drosophila and in the disease causing parasites, trypanosomes (a review is given
by Hannon [147]). There is some evidence that RNAi may be effective in mammalian cells,
although this has not yet been conclusively demonstrated, and the complete mechanism for
RNAi is not currently understood. However, RNAi is a highly specific technique that allows
researchers to determine the function of genes on a large scale.
1.4.2 Immunohistochemistry
The position of a protein in a cell or tissue can be localised using immunohistochemistry,
which is a widely used technique in molecular biology [69]. A particular protein can be
viewed under a microscope using a specific antibody to which a fluorescent label has been
attached, such as green fluorescent protein (GFP [163]), or a radioactive tag. More generally,
proteins can be visualised in a sample using silver staining.
The position of the protein in the cell can be visualised, and differences in the pattern
of labelled proteins can be used to classify samples. Localisation information may provide
clues to the function of a protein. For example, a protein shown to be highly expressed
in cell membranes may prove to be a transporter or membrane receptor. The technique
can be modified to visualise two proteins concurrently, using two different fluorescent labels
attached to antibodies against the two proteins to be studied. In one study, 75% of the yeast
proteome was analysed by this method, totalling 4156 proteins, allowing researchers to infer
significant functional information [157].
1.4.3 Metabolomics
Proteins and mRNA sequences are not the only molecules that can give information about the
current state of a system. Biological reactions are catalysed by proteins, but the reactants are
in fact small molecules, such as citrate, glucose or NADPH, known as metabolites. Researchers
have developed techniques to analyse the metabolites within one system compared with
another, for example to determine the difference between bacterial strains, or to analyse
the critical changes in metabolite concentration during a disease process. The study of the
entire set of metabolites as a diagnostic tool has become known as metabolomics (current
progress is reviewed by Weckwerth 2003 [346]), and the term metabonomics has also been
used. According to Nicholson 2002 [227], metabonomics is the study of metabolic profiles in
vivo in whole organisms, biofluids or tissues.
In theory, mass spectrometry could be used directly to detect the metabolites present
in a sample, by detecting the mass of all the metabolites and comparing with a reference
database. In practice, an additional stage is used to separate metabolites according to their
molecular mass, prior to MS, to increase the resolution. The additional stage can be liquid
or gas chromatography (GC) [105]. The principle of GC is similar to LC but uses a column
filled with an inert gas, rather than a solution. The mixture undergoes a process that causes
it to become gaseous, and small molecules separate according to a property, such as mass
or charge. There have been several studies that determine the metabolites present in plant
samples using LC/MS or GC/MS, examples include [271, 105, 347].
An alternative approach for determining the metabolome is nuclear magnetic resonance
(NMR) [306]. NMR can detect a fingerprint of the metabolites in a sample that contain 1H,
13C, 15N, or 31P when pulsed with a radio frequency. The atomic nuclei give information
about the chemical environment within a magnetic field. NMR has the advantage over
MS that it is not destructive of the sample, and in some cases can be used in a non-invasive
manner for analysing tissues. This kind of metabolomics is used for diagnostics, to determine
the characteristic fingerprint of the metabolites present in a particular bacterial strain, or a
diseased tissue.
1.4.4 Protein interaction studies
Proteins rarely act as single units in cells, but form complexes with other proteins to create
new functions. It is therefore an essential part of functional genomics research to gain
insights into the interaction partners for proteins. The main experimental techniques for
such studies are summarised here.
One of the main technologies developed in the late 1980s is the Yeast Two-Hybrid system
that works in the following way [107]. The DNA binding domain of a transcription factor A
is fused to protein X, and the activation domain of transcription factor A is separated and
fused to protein Y. Transcription factor A switches on a gene that causes a visible change in
a cell culture, causing cells to grow rapidly, or a particular colour to develop. Transcription
factor A can only switch on the gene if its two domains come into contact, caused by protein
X and protein Y interacting (Figure 1.11). The two-hybrid method has been employed on
a large scale to analyse protein interactions in yeast [323], C. elegans [343] and Helicobacter
pylori [263]. In the study on yeast, researchers plated 192 “bait” proteins, and assayed almost
all of the 6000 predicted proteins as “prey”, revealing 281 protein-protein interactions. The
reverse study was also performed, using all the predicted proteins as bait against a library
of prey proteins, revealing a further 700 protein interactions. This system has proved vital
for determining functionally significant interactions; however, it has disadvantages [56]. It is
based on transcriptional activation, and therefore forces interaction partners to localise together
Chapter 1. Investigations in Functional Genomics 34
Figure 1.11: A summary of Yeast Two-Hybrid experiments, reproduced from [56].
in the nucleus, producing a large number of false positives. Therefore, other methods are
usually required to confirm the interactions identified by Yeast Two-Hybrid analysis. In
addition, the fusion of proteins to the transcription factor domains may block sites required
for interactions, or required for modifications that must occur before interaction, such that
they may be missed.
An alternative method for detecting protein-protein interactions is affinity purification
of multiprotein complexes. In this method a single protein A is fused to a tag that can be
purified using an antibody that is attached to an affinity column. Proteins that bind to A,
forming a complex, can be pulled out. The complex is separated on a one- or two-dimensional
gel and identified by MS. This system has been used in yeast to identify 3617 interactions
with 493 baits [152]. A similar method is tandem-affinity purification (TAP tagging) in
which protein A is fused to a tag that binds IgG beads in a column [120, 270]. Other proteins
interact with protein A forming a complex. The TAP tag contains a highly specific protease
cleavage site to enable the complex containing protein A to be extracted from the column
without disrupting the interactions. The proteins within the complex can subsequently be
identified by gel electrophoresis and mass spectrometry. The affinity based methods have
the advantage over Yeast Two-Hybrid that interactions take place under conditions that are
much closer to natural cellular conditions, although interactions may not be detected if the
interacting proteins are not in high abundance.
Figure 1.12: Affinity methods for assaying protein interactions, reproduced from [56].
A new advance in understanding protein interactions is the development of protein microarrays
(or protein chips) [251]. The basic technique involves immobilising a set of recombinant
proteins to a surface, such as a membrane or slide. The chip can then be assayed with
a protein or antibody attached to a fluorescent molecule. Any protein spot that fluoresces
is likely to be an interaction partner for that protein or antibody. Multiple proteins can be
tested against the chip in sequence, to generate data about protein interactions on a large
scale. There are currently several technical difficulties with the production of protein chips.
However, although protein chips are still at the “proof-of-concept” stage, new techniques
for printing protein spots, immobilising correctly folded proteins and detection should soon
make this technique widely available to researchers, enabling rapid, large scale surveys of
protein interactions.
1.4.5 Three dimensional structures
The three dimensional structure of a protein is one of the most insightful pieces of infor-
mation about its function, particularly if a structure is obtained in which a ligand is bound
to the active site. The resolution of 3-D structures is a major research field and might be
considered outside the scope of functional genomics. However, in recent years an effort has
been initiated to perform high-throughput generation of protein structures, which has been
termed structural genomics, or structural proteomics [360]. Large collections of recombinant
proteins are screened in parallel for the ability to form crystals, each using a range of experimental
conditions. An early success of this approach was demonstrated by Christendat and co-workers
in 2000, who published 10 structures simultaneously [57]. The Protein Data Bank (PDB) held over
26,000 structures in July 2004, and this number is likely to increase exponentially as the
structural proteomics effort gains momentum.
1.5 Investigations across the “omics”
Large scale investigations are being undertaken in many labs, working on a great range of
organisms. The techniques used depend upon the organism; for example, in the nematode
worm C. elegans, RNAi is one of the best methods for investigating the function of genes (a
review is given by Lee and colleagues [191]). However, RNAi is not a viable method for some
other species. In mice, more common techniques include the development of “knock-out”
mice, whereby targeted recombination replaces a specific gene in embryonic stem (ES) cells.
The ES cells are then injected into blastocysts, which can form embryos when implanted in
a pseudo-pregnant mouse. The resulting litter contains certain mice with the gene knocked
out, from which a strain of mice can be developed. The phenotype of the resulting strain
gives information about the function of a gene [308].
A summary of the FG approaches that have been used in yeast is given by Castrillo and
Oliver [50]. Yeast has been a very important model organism, and many of the techniques
described in this chapter were first developed in a yeast model. Current investigations in
yeast focus on finding all the genes in the genome, using bioinformatics approaches [190]. In
addition, various high-throughput approaches have been used to study the transcriptome4
[165], proteome [115] and metabolome [9]. Investigations in parasitology form the basis for
the work in chapters 6 and 7, and FG studies on other organisms are too numerous to cover in
detail. However, in the following section a brief description is given of studies in which more
than one type of approach has been used to study a system, such as genome, transcriptome,
and proteome analysis.
4The transcriptome is the complete mRNA abundance of a sample.
1.5.1 Comparative studies
There are several examples of published work in which researchers have characterised a
biological system by applying more than one type of functional genomics technique, and
in the next few years it will become common for researchers to perform parallel analysis
of the transcriptome, proteome and metabolome. In 1999, two papers reported similar
analyses of yeast to determine global gene expression and to compare this with protein
abundance data, in an attempt to find the correspondence between the rate of transcription
and translation [115, 143]. The paper by Futcher and colleagues [115] compared protein data
from 2-DE, using LC-MS for identification, against mRNA data from SAGE and microarrays.
The results suggested that the correlation between gene and protein expression is high. They
found that approximately one molecule of mRNA gives rise to 4000 molecules of protein. The
study published earlier that year by Steven Gygi and colleagues at the University of Washington
compared data from 2-DE and SAGE, and found a very poor correlation between gene expression
and protein abundance [143]. In their study, certain groups of proteins that had the same
level of abundance had mRNA levels that varied 30-fold. Conversely, genes with similar
levels of mRNA produced proteins that varied up to 20-fold in volume. The difference in
the two studies may result from anomalies in the experimental techniques that produced the
data, or the statistical model used to perform the comparison.
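The kind of comparison made in these studies can be illustrated with a minimal sketch. The Python code below computes a Pearson correlation on log-transformed values and a Spearman rank correlation for paired mRNA and protein measurements. The numbers are invented purely for illustration and are not taken from [115] or [143]; the function names are assumptions, and neither statistic is claimed to be the model used in those papers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n (assumes no tied values, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented example values (arbitrary units), one pair per gene.
mrna = [1.0, 2.5, 4.0, 10.0, 30.0]
protein = [4000.0, 9000.0, 21000.0, 18000.0, 150000.0]

log_r = pearson([math.log10(v) for v in mrna],
                [math.log10(v) for v in protein])
rank_r = spearman(mrna, protein)
print(f"Pearson (log10 values): {log_r:.2f}")
print(f"Spearman (ranks):       {rank_r:.2f}")
```

The two statistics can give different impressions of the same data set, which is one way in which the choice of statistical model may influence the conclusion drawn.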
A study by Lee and colleagues in 2004 performed comparative analysis of gene expression
and protein abundance in yeast, using microarrays and 2-DE, to establish which genes and
proteins were up-regulated in a particular mutant strain [192]. Of the 4290 genes assayed by
microarrays, fifty-four were found to be differentially expressed. Eighteen differentially
expressed proteins were observed by comparative 2-DE analysis, of which 14 were
identified by MS. The study revealed that many of the sequences differentially expressed in
both analyses had similar functions, but the overall data sets were too small to perform any
kind of statistical correlation analysis between the rate of transcription and translation. This
study exemplifies the current problems hindering large scale comparison of microarray and
protein abundance results. There are few studies that make protein abundance data publicly
available, and therefore it is difficult to determine how accurately the level of mRNA predicts
the volume of the corresponding protein. For this to be possible, data must be pooled from
several different studies, which requires the deposition of experiments in a public repository,
where the results are formatted in a standard way. Moreover, it is likely that there will
be significant variation in the relationship between mRNA and protein production. This
might occur both at the protein class level, and at the species level. Thus, the discovery of a
single process to govern transcriptional control of protein production may be unlikely. The
problems of standardisation, and public deposition of data, are addressed in the following
chapter.
There are many large FG studies that are currently being performed on a variety of
organisms, and in the next few years it is likely that studies analysing more than one level of
the central dogma5 will become widespread. It is clear from the studies that microarrays are
a powerful tool for finding genes that have an important role in a process, but single data
points may not be able to predict accurately the abundance of functional protein, if analysed
independently of the entire data set. Protein abundance values may be a more accurate
measure of the amount of functional material but the experiments are less reproducible, and
cannot be performed at the same throughput level as microarrays. Therefore, a combination
of approaches will provide a more complete picture of the status of the system and the data
will feed into models of cellular and physiological processes, allowing the vision of systems
biology (as described in Section 1.1.2) to be realised. The issues involved with integrating
data from microarrays and proteomics are explored in detail in Chapters 5 and 6.
1.6 Summary
The techniques described in this chapter provide insights into gene and protein function,
with new technological developments allowing researchers to generate very large data sets
on a previously unimaginable scale. The monetary cost of such ambitious experiments is
extraordinarily high, since they are dependent on an expanding range of complex machinery
requiring high levels of technical expertise. Therefore, there is an economic requirement
to maximise the amount of information from each experiment, and to provide flexible data
storage capable of repeated interrogation.
An important consideration is how to interpret data from large scale approaches, and how
to place statistical confidence on findings derived from the data. It is critical that more than
one experimental approach is utilised; for example, microarray results are often confirmed
by PCR or Northern analysis, and differential expression of proteins can be confirmed using
antibodies in a Western analysis. The combination of results from more than one level of the
“omics”, for example comparing mRNA and protein level, will enable much higher confidence
5“The Central Dogma of Molecular Biology” was proposed by Francis Crick to explain that the information flow usually ran from DNA to RNA to protein [64].
to be placed on functional assignments. The data sets will ultimately feed into models that
are used to generate an overview at the level of the whole system. Before this can be achieved,
a significant body of work is required to improve public databases for functional genomics
data, and community wide agreement is required on standard formats to which published
experimental data must conform. An overview of the current work in this area is the focus
of the following chapter.
Chapter 2
Databases, standards and
ontologies for the life sciences
2.1 Introduction
In the previous chapter the techniques that comprise functional genomics research were
described, along with the computational challenges they present. In particular, the focus
was on proteomics research, for which we have developed proposals for a data standard, and
a new database system, described later in the thesis. This chapter contains a description of
the major research developments in database technology for functional genomics (FG) and
other life sciences domains. FG experiments require the development of standard formats
for transferring data between research groups and sending datasets to central repositories.
Ontologies are controlled vocabularies of terms describing a particular domain, and are vital
for data interchange and archiving in FG. Current advances in standards and ontologies are
described.
2.1.1 Computational support for the life sciences
In theory, building good databases for life sciences should be no different from building
conventional databases for commerce, banking and industry; however, in practice there are a
number of key differences. Relational database management systems (RDMS) have been
designed to support commercial applications with relatively simple data types: most con-
cepts required for a banking database can be represented by strings, integers and floating
point numbers. In addition, this area is standardised to a large extent, as there are well
designed packaged solutions that can be purchased. The huge growth in life sciences data, to
which massive public access is required, presents new challenges to the database community.
Consider that the human genome sequence, even without the annotation of genes, is a set
of 3 billion characters, which must be queried in a number of different ways. It is not easy
to query DNA code stored in tables in an RDMS; therefore, additional indexes and software
have been designed de novo and run alongside database applications to provide access to
the data. The situation in functional genomics research is even more complex due to the
heterogeneity of data sets produced by different laboratories.
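To illustrate why sequence data needs additional indexes, the following is a minimal sketch in Python of a k-mer index: every substring of length k is mapped to the positions at which it occurs, so that a search only has to verify a few candidate positions rather than scan the whole sequence. This is one simple indexing idea chosen for illustration, not the method used by any particular genome database; the short sequence, the value of k and the function names are all assumptions.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map every substring of length k to the list of its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index


def find_occurrences(sequence, index, k, pattern):
    """Return the start positions of pattern, using the index to find candidates."""
    if len(pattern) < k:
        raise ValueError("pattern must be at least k characters long")
    # Only positions where the first k characters match need to be checked.
    candidates = index.get(pattern[:k], [])
    return [pos for pos in candidates if sequence.startswith(pattern, pos)]


# Invented toy sequence; a real genome would be billions of characters long.
genome = "ACGTACGTTAGCACGT"
K = 4
kmer_index = build_kmer_index(genome, K)
hits = find_occurrences(genome, kmer_index, K, "ACGTT")
print(hits)
```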
In proteomics, high resolution images of 2-D gels are an integral part of a data set, to
which significant information must be attached. RDMS can store images, but do not offer
any facilities for querying data within images, or any image comparison. As the field of
proteomics is developing rapidly, there are frequent changes and improvements in the types
of experiment, in laboratory equipment and new software. The number of different data
formats that a bench researcher must deal with is large, and providing an integrated view of
all the data within even a single experiment is a challenge. Once a study has been performed,
researchers often spend significant periods of time searching online databases to characterise
genes and proteins that have been highlighted by their study. Each year the Nucleic Acids
Research journal (NAR) has a special issue, the Molecular Biology Database Collection,
describing all the databases that are freely available over the Internet [117]. In 2004, the
collection contained 548 different databases, many of which are relevant to functional ge-
nomics. Most databases can be queried via the Internet, but the results of queries are often
embedded in web pages that are very difficult to process automatically. Alternatively, many
databases offer a download of their entire contents in a bespoke text format that requires
specific software for handling. A complete data set assembled by a researcher could contain a
great variety of file formats, high-resolution images with annotation, experimental protocols
written in lab books, and large quantities of raw and statistically analysed data. It is vital
that experimental data is made available to other research groups. The publication of results
only in journals is no longer sufficient because data sets are simply too large to comprehend
by reading alone. Research is required to develop local databases for laboratory manage-
ment, and centralised public repositories [273]. Standardisation of formats must occur to
enable developers to create software that can process results into a single file that can be
used for sending data to centralised repositories, or to other research groups.
2.1.2 The future accessibility of data
The remarkable growth of the World Wide Web in the last decade has changed the face of
business and research, by enabling information to be made globally accessible, in an instant.
The Web has altered the way scientists publish their data, as almost all journals are now
accessible on the Internet, and can be searched very rapidly with an index. Our libraries
are not yet defunct, but are certainly under threat. This model of Web publishing is still
far from ideal because almost all web pages are intended to be read and understood by
people, and not by computer systems. Additional software has been created to allow the
Web to be searched, but the search engines utilise only a fairly simple index of the text in
web pages, and generally ignore the context. For example, it would be desirable to be able
to find automatically all the databases in the NAR Molecular Biology Database Collection
which contain information about proteins, query them for a specific protein, and summarise
the results. Unfortunately this will not be possible in the near future because there is no
standard mechanism for automatically discovering the types of data stored in a particular
system, or how they can be accessed. The solution to these problems may be found in the
Semantic Web [342], the next generation architecture of the Web.
The Semantic Web has been proposed by Tim Berners-Lee, the inventor of the WWW, as
a global network of resources that are machine understandable [31]. The basic premise is that
web sites will be created using technologies that allow them to specify the objects described
in the web pages, the relationships between objects and how the web sites can be accessed.
An essential component will be ontologies, which are controlled vocabularies containing terms
that have a strict definition, and a specified source location, to ensure that a version of a
term is used in different contexts with exactly the same meaning. Ontologies can contain
a set of rules associated with terms, which allow the terms to be processed in computer
systems. Software can discover the relationships between terms, and perform reasoning,
to ask logical questions of a resource described using an ontology [133]. A hypothetical
biological example is as follows. All databases within the NAR Molecular Biology Database
Collection are made accessible through the Semantic Web, using a software package that is
freely available, similar to the HTML editors that are used to produce current web pages.
A database specifies what it contains, such as the three-dimensional structures of proteins,
and that it can be accessed by querying with a URL (Uniform Resource Locator) followed
by the term ?query=PROTEIN NAME. The terms that describe the contents and methods of
accessing the database are obtained from a controlled vocabulary that resides elsewhere on
the Web, to ensure that the same terms are used by different databases. Software can then
be developed that automatically discovers the 3-D structure database, queries it for a protein
name, and processes the results as required by the user.
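The hypothetical scenario above can be sketched in a few lines of Python. The registry, the database names, the example.org addresses and the vocabulary terms below are all invented for illustration; no real database or Semantic Web software is being described. The sketch only shows the principle: because each database declares its contents and its access method using shared terms, software can select the relevant databases and construct the query URLs without human intervention.

```python
from urllib.parse import urlencode

# Hypothetical controlled vocabulary terms shared by all databases.
PROTEIN_STRUCTURE = "protein_3d_structure"
PROTEIN_SEQUENCE = "protein_sequence"

# Hypothetical machine-readable self-descriptions published by each database.
REGISTRY = [
    {
        "name": "ExampleStructureDB",
        "base_url": "http://structures.example.org/search",
        "contains": {PROTEIN_STRUCTURE},
        "query_parameter": "query",
    },
    {
        "name": "ExampleSequenceDB",
        "base_url": "http://sequences.example.org/lookup",
        "contains": {PROTEIN_SEQUENCE},
        "query_parameter": "query",
    },
]


def discover(registry, data_type):
    """Return the databases that declare the requested type of data."""
    return [entry for entry in registry if data_type in entry["contains"]]


def build_query_url(entry, protein_name):
    """Build the query URL from the access method the database declares."""
    parameters = urlencode({entry["query_parameter"]: protein_name})
    return entry["base_url"] + "?" + parameters


urls = [build_query_url(entry, "cytochrome c")
        for entry in discover(REGISTRY, PROTEIN_STRUCTURE)]
print(urls)
```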
Automated discovery and querying of this kind has clear implications for biomedical research,
which is one of the areas that will benefit most from the Semantic Web [173]. The life
sciences, unlike the axiom-based sciences,
rely on knowledge acquisition about a domain, and have been subject to an unavoidable
historical bias caused by the interests of the particular researcher investigating an area. The
advent of functional genomics removes much of the bias because, rather than an experiment
being designed to test a hypothesis, the experiment itself generates hypotheses about the
function of genes, proteins or entire systems. The results presented in a journal publication
could still be focused on a researcher’s particular interests, but the whole data sets will often
contain far more information than is highlighted in the original publication, which could be
valuable to many other research groups. The Semantic Web has the potential to maximise
the knowledge derived from a single experiment, by making it as widely accessible as possible.
For a knowledge-based science, clearly this will be a major advance.
The Semantic Web will be built using a number of technologies, of which several al-
ready exist (described in Section 2.2). Extensible Markup Language (XML) has become the
primary notation for exchanging information over the Web, and most standard formats for
the life sciences are expressed in XML. XML itself cannot express how concepts are related
to each other; this functionality is offered by the Resource Description Framework (RDF),
which can describe the location of objects on the Web, and how objects relate to each other.
Finally, the development of ontologies will be vital for ensuring that terminology is used in
a standard way, and various formats for expressing ontologies have been developed. Current
progress in ontologies for biomedical research is presented in Section 2.5.
The vision of the Semantic Web may be realised in the next decade, but in the nearer
term many of the concepts can be applied now, to improve the facilities for data publishing
and exchange. The results of functional genomics experiments must be made accessible in
public databases. Later in the chapter there is a description of the public databases that
currently exist for functional genomics data (Section 2.4), although neither the problem of
developing standard access methods, nor the challenge of data integration (Section 2.6), have
yet been solved. The development of central repositories is not possible without standard
exchange formats that researchers must use to express their data sets. A description of
current developments in data standardisation is also given (Section 2.3).
2.1.3 Guide to the chapter
The structure of the chapter is as follows. The formats used to express data standards and
ontologies are described first (Section 2.2). Since the development of public repositories is a
major challenge without common data formats, previous work in standardisation is described
in Section 2.3. A summary of databases that have been developed for life sciences is presented
in Section 2.4. There are a number of newly established efforts to design ontologies to capture
biological information, described in Section 2.5. Finally, there are major efforts by a number
of research groups to bring all the diverse parts of related information together in common
systems (data integration), described briefly in Section 2.6.
2.2 Technology required for data standards
2.2.1 Extensible Markup Language: XML
The emergence of data standards has been tied to the rise in usage of Extensible Markup
Language [101] (XML) as a data interchange format in e-commerce, industry and research.
The importance of XML for bioinformatics has been recognised for some time [2]. An XML
document has a hierarchy of tagged elements, in which the name of the tag describes the data
type that follows. XML has been described as semi-structured data because the document
is self-describing [44], unlike the tuples1 in a relational database, which have little meaning
in the absence of the database schema. An example of a partial record in the native format
from the PIR (Protein Information Resource) database [254] is given (Figure 2.1), along with
the same data stored in XML (Figure 2.2) and a representation of how the same data could
be stored in a relational database (Figure 2.3).
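The self-describing nature of XML can be illustrated with a short Python sketch using the standard library parser. The record below is a simplified fragment written for this example: the ProteinEntry, protein and name elements follow the PIR example, while the organism and summary elements are assumptions modelled on the fields of the native record and are not claimed to match the real PIR XML format.

```python
import xml.etree.ElementTree as ET

# Simplified, partly hypothetical record (see the note above).
RECORD = """
<ProteinEntry id="CCHU">
  <protein>
    <name status="validated">cytochrome c</name>
  </protein>
  <organism>
    <formal_name>Homo sapiens</formal_name>
    <common_name>man</common_name>
  </organism>
  <summary>
    <length>105</length>
  </summary>
</ProteinEntry>
"""

root = ET.fromstring(RECORD.strip())

# Each value is located by the name of its tag, not by its position in a line.
entry_id = root.get("id")
protein_name = root.findtext("protein/name")
name_status = root.find("protein/name").get("status")
organism = root.findtext("organism/formal_name")
length = int(root.findtext("summary/length"))

print(entry_id, protein_name, name_status, organism, length)
```

No format-specific parsing code is needed: the same generic parser would continue to work if new elements were added to the record, which is not the case for software written against a bespoke text layout.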
XML has become the most commonly utilised format for expressing data standards and
ontologies because there are a large number of applications that can automatically process
XML documents [279, 82], unlike bespoke text formats that require processing software to
be re-written every time there is a change to the format. Many life sciences databases now
offer a bulk download in XML format that could be used for data integration, as described in
Section 2.6. Data represented in XML can be validated using a document that specifies what
elements and relationships are allowed in the XML. The current specification for validation
documents is XML Schema [341], which has superseded the initial proposal of the Document
Type Definition (DTD) [75].
1A tuple is a term for a row of data in a table of a relational database.
ENTRY CCHU #type complete iProClass View of CCHU
TITLE cytochrome c [validated] - human
ORGANISM #formal_name Homo sapiens #common_name man
...
SUMMARY #length 105 #molecular_weight 11749
SEQUENCE
5 10 15 20 25 30
1 M G D V E K G K K I F I M K C S Q C H T V E K G G K H K T G
31 P N L H G L F G R K T G Q A P G Y S Y T A A N K N K G I I W
61 G E D T L M E Y L E N P K K Y I P G T K M I F V G I K K K E
91 E R A D L I A Y L K K A T N E
Figure 2.1: A partial record from the PIR database, in the native PIR format.
<ProteinEntry id="CCHU">
<protein>
<name status="validated">cytochrome c [validated]</name>
In this example, the excerpt of RDF describes an article on a web page, specifying that the
author is “Tim Bray” and the home page of the web site is http://www.textuality.com. An
RDF description consists of three components: a Resource, a Property, and a Statement. A
resource is any object that has a Uniform Resource Identifier (URI), such as a web page,
or part of an XML document. A property is a resource that has a name, and is a facet
of, or belongs to, another resource. In the example, the author is a property of the article.
A statement is a combination of a resource, property and value, such as The Author OF
http://www.textuality.com/RDF/Why-RDF.html IS Tim Bray.
RDF could be used in the life sciences domain, for instance to describe protein records
in a web accessible database, in which the URI of the record is the resource, and the amino
acid sequence of the protein is a property. The following statement could be deduced auto-
matically:
The Protein Sequence OF www.myProtDB.org/query?myDBId=1A1B IS "MLENT...".
The RDF representation has advantages over a pure XML representation because, while
a person viewing an XML document may be able to deduce that a protein sequence is a
property of a protein record, this could not be done automatically [228]. There are various
biomedical ontologies described below that utilise extensions of RDF. In the field of chemistry
RDF is also used, for example to express the Chemical Markup Language that enables the
interchange of molecular data [220, 131].
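The resource, property and value structure of an RDF statement can be sketched with plain Python tuples. This is only an illustration of the data model and does not use RDF syntax or any RDF library; the Statement type and the values_of function are names invented for this example, and the two statements are the ones given in the text (the protein sequence is left truncated, as in the original).

```python
from collections import namedtuple

# A statement combines a resource, a property of that resource, and a value.
Statement = namedtuple("Statement", ["resource", "property", "value"])

statements = [
    Statement("http://www.textuality.com/RDF/Why-RDF.html", "Author", "Tim Bray"),
    Statement("www.myProtDB.org/query?myDBId=1A1B", "Protein Sequence", "MLENT..."),
]


def values_of(statements, resource, property_name):
    """Return every value stated for the given property of the given resource."""
    return [s.value for s in statements
            if s.resource == resource and s.property == property_name]


for s in statements:
    print(f"The {s.property} OF {s.resource} IS {s.value}")

authors = values_of(statements, "http://www.textuality.com/RDF/Why-RDF.html", "Author")
```

Because the property is stated explicitly rather than implied by the nesting of elements, software can answer a question such as "what is the author of this resource?" without knowing anything about the layout of the document.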
2.2.3 DAML+OIL and the Web Ontology Language
The use of ontologies is a major research area in the life sciences. Several examples drawn
from this area are discussed in Section 2.5. There is a formal language for expressing on-
tologies, which was originally called DAML+OIL because it resulted from the fusion of two
separate efforts [154]. It is now set to become the W3C standard OWL (Web Ontology
Language [238]). OWL is expressed in XML and uses the RDF extension. OWL is a further
extension of RDF because it specifies what the associated objects are, and how they are
related, rather than only specifying a single object with a set of properties. An ontology
expressed in OWL consists of axioms that state the formal relationships between classes and
properties.
For example, an ontology describing genes, transcripts and proteins could be defined as
follows. One relationship could be specified: isTranslated, between the class:mRNA (mod-
elling an RNA sequence record) and the class:Protein (for the protein sequence record).
The class:Protein and class:mRNA both have a textual definition that describes exactly
what is meant by the term. This representation is powerful because it allows reasoning to be
carried out by a computer system, in combination with rules over other objects. The software
could find that the protein sequence is created by translating an mRNA sequence. This kind
of reasoning cannot be done in a purely relational database system, because the semantics
of a relationship are usually only captured by a record having a foreign key that references
another table. The meaning of a relationship in a database can be open to interpretation.
A well designed ontology ensures that every concept and relationship has a clearly indicated
meaning [39].
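The example ontology can be approximated in a few lines of Python to show the kind of reasoning involved. This is a sketch of the idea only, not OWL and not a real reasoner: the dictionaries, the wording of the textual definitions, the record names transcript_1 and protein_1, and the function names are all assumptions made for illustration.

```python
# Classes, each with a textual definition (wording invented for this sketch).
CLASSES = {
    "mRNA": "A record of an RNA sequence.",
    "Protein": "A record of a protein sequence.",
}

# Relationship name -> (class of the subject, class of the object).
RELATIONSHIPS = {"isTranslated": ("mRNA", "Protein")}

# Hypothetical records and the class each one belongs to.
INDIVIDUALS = {"transcript_1": "mRNA", "protein_1": "Protein"}

# Stated facts in the form (subject, relationship, object).
ASSERTIONS = [("transcript_1", "isTranslated", "protein_1")]


def is_consistent(subject, relationship, obj):
    """Check that a stated fact respects the classes the relationship links."""
    subject_class, object_class = RELATIONSHIPS[relationship]
    return (INDIVIDUALS[subject] == subject_class
            and INDIVIDUALS[obj] == object_class)


def translated_from(protein):
    """Find the mRNA records from which the given protein record is translated."""
    return [s for s, r, o in ASSERTIONS
            if r == "isTranslated" and o == protein and is_consistent(s, r, o)]


sources = translated_from("protein_1")
print(sources)
```

In contrast to a foreign key, the relationship here carries its own name and the classes it may link, so both its meaning and its valid use can be checked by software.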
2.2.4 Unified Modeling Language
An important component of a data standard is an object model that describes a system
independently of the technology that is used for its implementation. Object models are
most commonly expressed in Unified Modeling Language [324] (UML), which is a standard
notation designed to improve the process of developing large software systems [274]. UML
includes components that represent the design and visualisation of the architecture of a
system during development. UML supports the definition of “use case” scenarios and work-
flows which could be used to model the biological research process. UML can also be used
for database design.
The most commonly used part of UML for representing a system is the class diagram. A
class diagram represents real world objects as a set of classes with attributes of certain types
(such as strings, integers, or user-defined), and relationships between classes (see Figure 2.4).
The concept of inheritance can also be represented in UML, in which one class inherits all
the attributes and relationships of another class. It is common in class diagrams to see
multiple subclasses inheriting from a single superclass. This design is intended to reduce the
amount of code required to implement the model because the attributes and relationships
only have to be programmed once for the superclass, rather than repeating code for each of
the subclasses. The concept of inheritance is exemplified in the description of MAGE-OM,
[Figure 2.4 appears here. The diagram shows a package for grouping classes; a Person class with the attributes name: String and DOB: date; the subclasses Doctor and Patient, which inherit name and DOB from Person (open arrow); and a Hospital class linked to one or more (1..n) instances of Ward by a containment relationship (diamond), since a Ward cannot exist without a Hospital. An arrow in a relationship indicates the direction in which the relationship should be implemented.]
Figure 2.4: The main components of a UML class diagram for a hospital computer system.
the object model for microarray experiments (Section 2.3.1).
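As an illustration of how a class diagram maps onto code, the hospital example of Figure 2.4 can be sketched in Python. The class and attribute names follow the figure where they are legible; the assignment of telephone to Doctor, admission to Patient, and address and postcode to Hospital is an assumption, as is the add_ward method used to express containment.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Person:
    """Superclass: its attributes are written once and inherited by subclasses."""
    name: str
    dob: date


@dataclass
class Doctor(Person):
    telephone: int


@dataclass
class Patient(Person):
    admission: date


@dataclass
class Ward:
    ward_number: int


@dataclass
class Hospital:
    name: str
    address: str
    postcode: str
    wards: List[Ward] = field(default_factory=list)

    def add_ward(self, ward_number: int) -> Ward:
        """Wards are created through a Hospital, reflecting containment."""
        ward = Ward(ward_number)
        self.wards.append(ward)
        return ward


doctor = Doctor(name="A. Example", dob=date(1970, 1, 1), telephone=5550100)
hospital = Hospital(name="Example Hospital", address="1 Example Road", postcode="EX1 1EX")
hospital.add_ward(1)
print(doctor.name, len(hospital.wards))
```

Doctor and Patient declare only their own attributes; name and dob come from Person, which is the reduction in repeated code that inheritance is intended to provide.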
An object model enables developers to have a shared understanding of the components
of a complex system, but it can also be converted into an XML validation document and a
database schema without significant effort. Another use of UML is to support the design of
code for an entire software system, for instance to provide database connectivity, produce
output in a file format, or describe user interactions with the system.
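As a hedged illustration of how an object model can be converted into a database schema, the sketch below derives a table definition from the attributes of a class. The Ward class, its attributes, the type mapping and the use of SQLite are all assumptions made for the example; this is not the tooling used for any of the standards described in this chapter.

```python
import sqlite3
from dataclasses import dataclass, fields

# Assumed mapping from attribute types to SQL column types.
SQL_TYPES = {str: "TEXT", int: "INTEGER", float: "REAL"}


@dataclass
class Ward:
    ward_number: int
    name: str


def create_table_sql(cls) -> str:
    """Derive a CREATE TABLE statement from the attributes of a class."""
    columns = ", ".join(f"{f.name} {SQL_TYPES[f.type]}" for f in fields(cls))
    return f"CREATE TABLE {cls.__name__} ({columns})"


ddl = create_table_sql(Ward)

# Apply the generated schema to an in-memory database and store one object.
connection = sqlite3.connect(":memory:")
connection.execute(ddl)
connection.execute("INSERT INTO Ward VALUES (?, ?)", (1, "Ward A"))
rows = connection.execute("SELECT ward_number, name FROM Ward").fetchall()
```

A real conversion would also have to map relationships and inheritance onto tables and foreign keys, which this sketch does not attempt.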
2.2.5 The object management group
The Object Management Group [231] (OMG) is a consortium formed to improve the interoperability
of software systems. The standards defined by OMG are expressed in UML and other
notations, such as the Meta-Object Facility (MOF) [231]. The main initiative of OMG is
the Model Driven Architecture (MDA). This is an approach for specifying the components of
large software systems for business, which is independent of the technology that will realise
them. A model is first specified in MDA, and it can then be implemented with any programming
language or platform, such as Java [169], C++ [63], .NET [213], and so on. This approach insulates
companies from the evolution of technologies, and reduces the overhead of re-implementation.
A second benefit of ensuring that a system is described in a platform independent manner
is that it should help the sharing of applications and data across different domains. OMG
is also involved with checking the consistency of object models but it is left to domain ex-
perts to ensure that an object model correctly represents the concepts in the domain. The
OMG has been involved with verifying the object model for the microarray data standard,
described in Section 2.3.1.
2.3 Data standards in the life sciences
The problems of the incompatibility of data from different laboratories have been recognised
by researchers, leading to the development of data interchange formats. In the absence of a
data standard, even if published data is made available from authors’ web sites, the overhead
required to write software to interpret data from a number of different sources is often too
great, and the information is effectively inaccessible. A good data standard should ensure
that sufficient information is stored about the biological samples and experimental protocols
to enable future re-evaluation of the information. This is a major issue for digital archiving
because the volume of data continues to grow very rapidly. It cannot be assumed that it will
be possible to perform manual searches of the literature for all the relevant experiments in
the future, and automated methods will be required. In this section a brief introduction is
given to the established and proposed data standards.
The data format for microarrays, called MAGE-ML (Section 2.3.1), has influenced efforts
in other areas of functional genomics. The draft standard for proteomics, called PEDRo
(Proteomics Experiment Data Repository), is introduced in Section 2.3.2 and is one of the
main focal points of the following chapter. Mass spectrometry (MS) is a crucial part of
proteomic analysis, and was incorporated into the original PEDRo proposals. Data standards
for MS are now under development by a newly formed group, described in Section 2.3.4. In
the rest of the section, there is a description of other data exchange formats that are relevant
to life sciences research.
2.3.1 Microarray standards
Microarray experiments have now become widespread [55] and produce very large amounts
of data that could potentially be useful to researchers in a variety of contexts. The re-
quirements for central repositories of data, and standards for sharing and publishing, were
recognised several years ago [42]. A group of researchers formed the MGED (Microarray
Gene Expression Data) Society for improving the facilities for data sharing [212]. The first
stage of the standardisation process was the release of a checklist of information that should
be made available with a microarray data set to allow future re-evaluation of the data. The
checklist is known as MIAME [41] (Minimum Information About a Microarray Experiment).
[Figure 2.5 image: UML diagram. Packages: ArrayDesign, Array, BioAssay, BioMaterial, BioAssayData, Experiment, HigherLevelAnalysis, AuditAndSecurity, BioEvent, Description, Measurement, Protocol, BioSequence, BQS and DesignElement. Top-level classes: Identifiable (identifier : Str..., name : String), Describable and Extendable. Associated classes: NameValueType (name : String, value : String, type : String), Description (text : String, URI : String), Audit (date : Date, action : enum {creation, modification}) and Security. Associations: PropertySets (Extendable to NameValueType, +propertySets, 0..n, rank 1), Descriptions (Describable to Description, +descriptions, 0..*, rank 1), AuditTrail (Describable to Audit, +auditTrail, 0..*, rank 2) and Security (Describable to Security, +security, 0..1, rank 3).]
Figure 2.5: The top level of MAGE-OM, reproduced from [212]. There are fifteen packages containing classes to capture different parts of a microarray experiment. There are three classes included at the top level: Identifiable, Describable and Extendable that can be used by most other classes in the model for linking to additional attributes.
MIAME specifies the parts of experimental protocols, sample details, raw data and analysis
that must be released for an experiment to be understood and potentially reproduced, if
the same biological samples are available. The MIAME guidelines have been accepted by a
number of journals, and they must be satisfied for a publication to be accepted [23, 24, 25].
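A checklist of this kind lends itself to automated checking. The sketch below is a minimal illustration, assuming a submission is held as a dictionary; the section names paraphrase the four areas listed above and are not official MIAME terms.

```python
# Section names paraphrase the text above; they are not official MIAME terms.
REQUIRED_SECTIONS = ("protocols", "samples", "raw_data", "analysis")


def missing_sections(submission: dict) -> list:
    """Return the required sections that are absent or empty."""
    return [name for name in REQUIRED_SECTIONS if not submission.get(name)]


# A hypothetical submission in which the raw data has not been supplied.
submission = {
    "protocols": ["hybridization protocol"],
    "samples": ["sample 1"],
    "raw_data": [],
    "analysis": ["normalisation"],
}
```

Here missing_sections(submission) reports that only the raw data is missing. A check of this kind can confirm that information is present, but not that it is sufficient for the experiment to be understood.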
A formal specification of the microarray requirements was released as an object model,
MicroArray Gene Expression-Object Model (MAGE-OM), expressed in UML. The object
model serves two purposes. Firstly, the class diagrams allow developers to have a shared
understanding of the concepts and relationships in the standard. Secondly, the object model
has been used to generate a software toolkit, available from the MGED website, which allows
developers to create applications that process data into an exchange format, based on the
model. The data format, MAGE-ML [297] (MAGE-Markup Language), is expressed in XML,
and several major databases now accept MAGE-ML for loading data (Section 2.4.1). An
essential component of the standard is the MGED Ontology that consists of a controlled
vocabulary of terms used in microarray experiments (described in Section 2.5).
Figure 2.6: The BioMaterial package in MAGE-OM, reproduced from [212].
The MAGE object model
The overview of MAGE-OM is displayed in Figure 2.5. There are fifteen packages, each
containing a number of classes to represent part of a microarray workflow. For example,
Array, ArrayDesign and DesignElement describe the features on a microarray, and BioAssay
describes the hybridization of mRNA to the array. MAGE-OM is designed to allow as much
flexibility as possible to ensure that it does not restrict the types of experiment that can be
captured. An example of this is in the BioMaterial package shown in Figure 2.6. The package
is intended to capture the substances that are processed at various stages in the experiment.
A BioMaterial can be one of three types: a BioSource (the source of biological material),
a LabelledExtract (for example the fluorescently labelled mRNA that is hybridized to an
array) or a BioSample (any intermediate between a BioSource and LabelledExtract). This
is an example of inheritance because the three classes inherit relationships from BioMaterial.
The use of inheritance should reduce the amount of programming required to capture this
part of the model because the relationships to other classes only need to be coded a single
time for BioMaterial, rather than once for each of the three more specific classes. One of
the relationships allows the class to reference OntologyEntry, which can be used to specify
a number of characteristics about the material, by obtaining the values from a controlled
vocabulary. Any kind of simple laboratory treatment can be described using a combination
of the class Treatment and the relationship to OntologyEntry, which captures the type of
treatment.
EXAMPLE: The mRNA that is hybridized to an array is captured in LabelledExtract.
LabelledExtract references the set of treatments that have been used to create it, via
Treatment, BioMaterialMeasurement and BioSample. Chemical compounds, such as the
fluorescent labels that are attached, are recorded in Compound. A cycle of treatments can be
described that points back to the original starting material in BioSource.
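The structure described in this example can be sketched in code. The following is a simplified illustration and not the MAGE-OM definition: BioMaterialMeasurement and Compound are omitted, so that a Treatment refers directly to its source material, and the attribute names and the origin method are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OntologyEntry:
    # A term taken from a controlled vocabulary.
    category: str
    value: str


@dataclass
class Treatment:
    action: OntologyEntry      # the type of treatment
    source: "BioMaterial"      # the material the treatment was applied to


@dataclass
class BioMaterial:
    name: str
    characteristics: List[OntologyEntry] = field(default_factory=list)
    treatments: List[Treatment] = field(default_factory=list)

    def origin(self) -> "BioMaterial":
        """Follow the treatments back to the original starting material."""
        material = self
        while material.treatments:
            material = material.treatments[0].source
        return material


# The three subclasses inherit every relationship from BioMaterial.
class BioSource(BioMaterial):
    pass


class BioSample(BioMaterial):
    pass


class LabelledExtract(BioMaterial):
    pass


source = BioSource("liver tissue")
sample = BioSample(
    "extracted mRNA",
    treatments=[Treatment(OntologyEntry("Action", "extraction"), source)],
)
extract = LabelledExtract(
    "labelled mRNA",
    treatments=[Treatment(OntologyEntry("Action", "labelling"), sample)],
)
```

The relationships to OntologyEntry and Treatment are written once in BioMaterial, and calling extract.origin() walks the chain of treatments back to the BioSource.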
This package does not contain any classes that are specific to a microarray experiment, and
therefore could potentially be used to model concepts from other types of functional genomics
experiment. This issue is expanded on in Chapter 4, in which MAGE-OM is combined with
a model of proteomics data to form a proposal for a data standard that we believe can be
extended to cover all functional genomics techniques.
2.3.2 PEDRo
In recent years, the success of MAGE-ML as a microarray standard has encouraged re-
searchers in proteomics to attempt a similar standardisation procedure. The status of pro-
teomics standardisation is the focus of the following chapter but a brief overview is given
here. The Proteomics Experiment Data Repository [315] (PEDRo) object model has been
released to initiate discussion in the community about the requirements for a data stan-
dard. Data standards for proteomics are managed by the Proteomics Standards Initiative
[257] (PSI), which was formed by the Human Proteome Organisation [161] (HUPO). PEDRo
represents a typical proteomics workflow, and consists of four parts:
• Biological sample origin.
• Protein separation techniques.
• Mass spectrometry laboratory protocols.
• Mass spectrometry data analysis.
PEDRo is designed to allow an experiment involving a number of stages of protein separation
to be described, including: 2-DE, affinity columns and chemical treatments. MS data is also
described in the PEDRo model, including support for storage of database searches and the
results of the searches. There are a number of organisations developing standards for MS
to serve different purposes (described below); it is therefore important that a consensus is
reached. A detailed description of PEDRo is given in the following chapter.
2.3.3 PSI-OM
PEDRo was presented to the PSI in 2003 as a proposal for a data standard for proteomics.
A new object model was developed in 2004, loosely based on PEDRo, called PSI-OM (Pro-
teomics Standards Initiative - Object Model) to which the author contributed at the annual
meetings of the PSI. PSI-OM has a similar structure to PEDRo, covering protein separation
techniques and MS. In the following chapter, there is a description of an object model
we developed (Gla-PSI) that preceded the development of PSI-OM; a complete
description of PSI-OM is therefore given after the section on Gla-PSI.
2.3.4 Mass spectrometry
Mass spectrometry is used in proteomics to identify proteins. An experiment generates raw
data, in the form of a trace, and processed data comprising a list of peaks that correspond to
the masses of peptides. There is a major problem preventing re-analysis of MS data, which
is caused by the proprietary data formats generated by mass spectrometer manufacturers.
Instruments are supplied with software for data collection and analysis. The software can
only save analyses in a data format that cannot be interpreted
by any other software. Researchers often manually enter the peak heights into a text editor,
for input into database search programs. Proprietary formats pose a major problem for
research throughput and data archiving. It cannot be assumed that the software needed to
interpret the spectra will still be available in the future. It is also not feasible for researchers
wishing to analyse the spectra deposited in databases, to obtain the software that produced
them. Therefore, there is a great need for a data exchange standard that can be interpreted
without specialist software. The standard must support algorithm development for large
scale database searches.
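To illustrate what it means for a format to be interpreted without specialist software, the sketch below writes a peak list to XML and reads it back using only a general purpose XML parser. The element and attribute names (peakList, peak, mz, intensity) are invented for the example and do not correspond to any of the formats discussed in this section.

```python
import xml.etree.ElementTree as ET


def peaks_to_xml(peaks) -> str:
    """Serialise (m/z, intensity) pairs as a small XML document."""
    root = ET.Element("peakList")
    for mz, intensity in peaks:
        ET.SubElement(root, "peak", mz=repr(mz), intensity=repr(intensity))
    return ET.tostring(root, encoding="unicode")


def xml_to_peaks(text: str):
    """Recover the peak list using only a generic XML parser."""
    root = ET.fromstring(text)
    return [
        (float(peak.get("mz")), float(peak.get("intensity")))
        for peak in root.iter("peak")
    ]


# Hypothetical peak list: (m/z, intensity) pairs.
peaks = [(842.51, 1200.0), (1045.56, 860.5)]
document = peaks_to_xml(peaks)
```

Any program with access to an XML parser can recover the values from such a document, which is the property that proprietary binary formats lack.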
There are several proposals for MS standards including GAML (Generalized Analytical
Markup Language [128]), SpectroML and the Analytical Information Markup Language
(AnIML) [13]. Both SpectroML and AnIML have been developed by the National Institute
of Standards and Technology in the USA [222]. GAML is an industry-led effort to
develop an XML-based data format for analytical instruments. GAML stores values of X/Y
coordinates from a trace, and the parameters entered in the instrument. SpectroML has
similar goals, and was originally developed in collaboration with ASTM, an internationally
recognised standards organisation [18]. SpectroML has now been superseded at the ASTM
by AnIML, which is a wider XML based format for analytical instruments. The PEDRo
model also supports MS data.
A recent project has been initiated at the Institute for Systems Biology, known as
mzXML, which is part of the SASHIMI open source software for downstream analysis of
MS data [278]. The goal of the project is to produce software for processing each of the out-
put formats produced by different instrument vendors, into a single XML file. The mzXML
format can then be analysed with a single piece of software that provides a statistical measure
of the likelihood that a correct match has been made to a protein. This should improve the
comparability of data produced by different types of instrument.
The efforts described above are being coordinated by a sub-group of the Proteomics
Standards Initiative, and meetings of the PSI have been well supported by MS instrument
manufacturers. A single proposal, mzData, has been formulated. It is agreed that vendors
will supply software with their instruments for creating output in mzData format. The
first version of mzData describes the raw data from MS, which is the list of peaks on the
trace, and the format also captures the input parameters that are produced by different
instruments [258]. The next version of the format will capture the input parameters and
results of database searches, in addition to the peak list used to identify proteins.
2.3.5 Protein interaction standards
Protein interaction experiments have become widespread, and there are a number of
databases that offer access to large volumes of data arising from Yeast Two-Hybrid and
affinity column experiments, such as BIND [32], DIP [65], MINT [367] and many others.
There is some overlap in the data coverage between the databases, and therefore it is desir-
able that data can easily be exchanged between different systems. This requirement led to
the development of the PSI interaction standard [150], which is now supported by most of
the publicly available databases. The format is being developed incrementally, and the first
release (level 1) covers the majority of data that is currently available. Level 1 can describe
both binary, and more complex interactions, but the format does not include detailed de-
scriptions of the experimental methodology used to generate the data, or a description of
the mechanism of interaction. This kind of data is not widely available at the present time
but may be supported in future versions of the standard.
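The distinction between binary and more complex interactions, and the overlap between databases, can be sketched with a simple data structure. This is an illustration only and does not reflect the structure of the PSI interaction format; the protein identifiers and database contents are invented.

```python
from typing import FrozenSet

# An interaction is modelled as an unordered set of participants, so the
# same pair reported in a different order is treated as the same record.
Interaction = FrozenSet[str]


def interaction(*proteins: str) -> Interaction:
    return frozenset(proteins)


def is_binary(item: Interaction) -> bool:
    """A binary interaction has exactly two distinct participants."""
    return len(item) == 2


# Two hypothetical databases with one overlapping record.
database_a = {interaction("P1", "P2"), interaction("P3", "P4", "P5")}
database_b = {interaction("P2", "P1"), interaction("P6", "P7")}

# With a shared representation, merging removes the duplicate automatically.
merged = database_a | database_b
```

One limitation of this representation is that a protein interacting with itself collapses to a single participant, so a real format needs an explicit participant list.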
2.3.6 Other data standards in life sciences
Mathematical models of biological data
The data generated by functional genomics, and traditional biochemistry experiments, reveal
information about the role of proteins and metabolites in a cell, and the interactions between
different components. Researchers have begun to create mathematical models of chemical
reactions and biological processes, which can in theory predict what changes would be prop-
agated to the system when part of it is perturbed. Mathematical models are published
in journals, often represented as a series of equations printed with mathematical symbols
that cannot be interpreted by a computer. Models are also represented by software, and
can therefore be released as computer code. However, there are a large number of different
programming languages and different versions of code, so it is not easy to combine
models that have been developed independently. The problem is further complicated because
processes can be modelled at different physiological levels: cellular, tissue, organ and
organ systems can all be represented mathematically. Researchers would ultimately like to
integrate models represented in different formats, and at different levels of detail.
CellML has been created to standardise the format in which mathematical models of
cellular functions are described [196]. CellML is expressed in XML, and uses constructs from
another well-established format known as MathML [340]. MathML describes mathematical
equations and consists of two types of encoding: content encoding, which expresses
what is meant by a mathematical expression, and presentation encoding, which deals with
how the expression should be presented by a web browser or printer.
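The difference between the two encodings can be seen for the expression a + b. In the sketch below, the two MathML fragments use standard MathML elements with the namespace declaration omitted for brevity, while the evaluate and render functions are illustrative helpers written for this example and handle only a small subset of MathML.

```python
import xml.etree.ElementTree as ET

# Content encoding: records the meaning, "apply plus to a and b".
CONTENT = "<apply><plus/><ci>a</ci><ci>b</ci></apply>"
# Presentation encoding: records the layout, "a", an operator "+", then "b".
PRESENTATION = "<mrow><mi>a</mi><mo>+</mo><mi>b</mi></mrow>"


def evaluate(node, values):
    """Evaluate a content-encoded expression (only plus and times)."""
    if node.tag == "ci":
        return values[node.text]
    if node.tag == "cn":
        return float(node.text)
    operator, *operands = list(node)
    numbers = [evaluate(child, values) for child in operands]
    if operator.tag == "plus":
        return sum(numbers)
    if operator.tag == "times":
        result = 1.0
        for number in numbers:
            result *= number
        return result
    raise ValueError(f"unsupported operator: {operator.tag}")


def render(node):
    """Flatten a presentation-encoded expression into display text."""
    return " ".join(child.text for child in node)


total = evaluate(ET.fromstring(CONTENT), {"a": 2.0, "b": 3.0})
text = render(ET.fromstring(PRESENTATION))
```

Only the content encoding can be evaluated, because it states which operation is applied to which operands; the presentation encoding states only what should appear on the page.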
The main constructs of CellML are components and variables, and MathML is used to
specify a mathematical relationship between variables that have been declared by a compo-
nent of the model. CellML also has structures for describing reactions, units, and connections
between different components. The complete specifications for CellML are available through
the web site [51]. It is hoped that researchers wishing to publish a model of a physiological
process will release the model in CellML, allowing future integration with other relevant
models.
The Systems Biology Markup Language (SBML) has been created to model biochemical
networks, such as metabolic pathways or sets of co-regulated genes [155]. Conceptually,
a biochemical reaction can be broken down into a number of components that comprise
the main parts of SBML, including Compartment, Reaction, Rule and several others, each
of which has a textual description, and a number of associated attributes. The format is
expressed in XML and there are various software packages that support the first version
of SBML [311]. The second version of SBML may include MathML support, which could
enable some interchange between models represented in CellML and SBML.
Metabolomics
A new area of functional genomics is the study of the composition of small molecules (metabo-
lites) in different samples, using NMR (Nuclear Magnetic Resonance) and mass spectrometry,
known as metabolomics. The metabolomics community does not have a current data standard;
however, a data model has been created to record a generic NMR experiment. The
work is part of the Collaborative Computing Project for the NMR community (CCPN).
CCPN contains an object model and a programming interface for creating software [113]. It
is possible that CCPN could contribute to a data standard for metabolomics although it is
likely that additional modules will be required to capture the biological focus and intention
of a metabolomics experiment.
An object model has been recently released as part of the Chemical Effects in Biological
Systems (CEBS) database developed by the National Center for Toxicogenomics in the USA
[355], called SysBio-OM. SysBio-OM covers various components of microarray, proteomics
and metabolomics experiments, however, due to its recent release, it is not possible to say
whether the metabolomics component will gain widespread use in the community. The CEBS
proposal is discussed in detail in Chapter 4.
2.4 Databases for life sciences
Databases are often created by small research communities wishing to disseminate their data
to a wider audience. The problem with this model is that no standard protocols exist for
accessing or querying databases, and many databases have their own text formats to allow
researchers to download the data in bulk. This presents several problems to the user. Firstly,
a researcher may not know about all the databases that exist which could be relevant. This
was the motivation for the creation of the NAR Molecular Biology Database Collection
to improve awareness of the databases that exist. Secondly, it is very slow to browse or
query all the relevant web sites manually, and assimilate the information by cutting and
pasting into a word processing document or spreadsheet. This problem is partly remedied
by systems like SRS (Sequence Retrieval System) [99], which present pointers to relevant
data items. However, the onus of data acquisition and assimilation of results is still on the
user. Thirdly, the databases are highly dynamic, and some are updated daily. Database
updates most commonly involve new data being added, but errors are also corrected and ID
numbers change with different database releases. Data that has been found by a researcher
may become out of date fairly rapidly, and there are no standard methods for automatic
repetition of the same searches. There are considerable efforts to alleviate these challenges
by employing data integration methods, described in Section 2.6.
A different aspect of the data integration challenge is the storage of heterogeneous data
types within unified systems that can be queried. Chapter 5 describes a database system for
proteomics, which is built on top of an existing microarray database system, as an extension
into a wider system for functional genomics. In this section, a comparison of the features
offered by different microarray databases is given, and the systems that already exist for
proteomics are described. There are several other databases that are highly relevant to
functional genomics research, outlined in Section 2.4.3.
2.4.1 Microarray databases
Chapter 5 describes the development of a database that is capable of storing both proteomics
and microarray data, built as an extension of the RAD (RNA Abundance Database)
system developed at the University of Pennsylvania. However, there are a large number
of different databases for microarrays that offer various capabilities. A detailed
review of the main features of microarray databases was published by Gardiner-Garden and
Littlejohn in 2001 [119], which is brought up to date in this section (Table 2.1).
ArrayExpress
ArrayExpress at the European Bioinformatics Institute has been developed by researchers
who have been central to the efforts of MGED to standardise microarray data [16]. Ar-
rayExpress accepts public deposition of data, can be queried via a web based interface,
and is MIAME compliant. Data can be sent to ArrayExpress in MAGE-ML format, and
the database can store a significant amount of detail covering experimental protocols and
biological samples.
URL: www.ebi.ac.uk/arrayexpress/
RAD
RAD (RNA Abundance Database) is a system produced at the Center for Bioinformatics,
University of Pennsylvania [302]. RAD is capable of storing single or two channel arrays,
Affymetrix arrays and SAGE experiments (Serial Analysis of Gene Expression). There is
a web based interface for loading data and protocols known as the RAD Study-Annotator
[202]. The database schema for RAD, and the web interface, are freely available. As part
of the GUS (Genomics Unified Schema) system for functional genomics, it supports gene
expression data on several major web sites, such as PlasmoDB [21].
URL: www.cbil.upenn.edu/RAD
Stanford Microarray Database
The Stanford Microarray Database (SMD) [134] is a well established system that stores 160
published array experiments (March 2004), from a number of organisms. The web site can be
queried to retrieve particular studies, and a set of software is available for data visualisation
and statistical analysis, such as graphical output from ANOVA (analysis of variance [88]).
Searches can also be performed for a particular gene or clone across all microarrays. The
software used to generate SMD is freely available, and has been deployed by several other
organisations. SMD researchers are part of the MGED effort; SMD is MIAME compliant,
and there are plans to enable export of MAGE-ML in the future.
URL: genome-www5.stanford.edu
BASE
The BioArray Software Environment (BASE) is freely available for researchers to download and
install locally [275]. BASE includes a database schema that can be deployed in MySQL
[221], and an interface, which runs on a web server, can be created using PHP [246], Java
[169] and JavaScript [171]. Data produced by image processing software can be loaded in
tab-delimited files, and additional software is included for performing statistical analysis.
BASE has several advantages over other similar systems. Firstly, all the software required to
run BASE is freely available: PHP, MySQL and Java. Secondly, all the source code for the
project can be downloaded and altered as required. However, a system based on MySQL is
likely to be less robust than one based on a commercial RDBMS, such as Oracle [235] or DB2
[70]; BASE may therefore be more suited to smaller scale microarray databases.
URL: base.thep.lu.se
GEO
The GEO (Gene Expression Omnibus) database is hosted at the NCBI [85]. GEO has
different goals from the other microarray databases discussed so far. The support of the
MIAME guidelines and the MAGE format are not major goals of GEO. In contrast, GEO
aims to act as a large public repository for as wide a range of data as possible. Each
experiment is stored in a simple, tabular format that is indexed to allow searches. Data
can be submitted by any organisation, using either a web based interface or a bulk loading
facility. GEO has been incorporated in the Entrez system (see footnote 3), and therefore information can
be queried in parallel with bibliographic references, and databases of nucleotide or protein
sequences [123]. GEO does not store substantial information about protocols or biological
samples, and can be viewed as a very large data repository rather than a database of
microarray experiments.
URL: www.ncbi.nlm.nih.gov/geo/
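The idea of storing each experiment as a simple table that is indexed to allow searches can be sketched as follows. The table contents and column names are invented for the example and do not reproduce the GEO submission format.

```python
import csv
import io
from collections import defaultdict

# Invented tab-delimited table; the column names are assumptions.
TABLE = "gene\tsample\tvalue\nACT1\tS1\t2.5\nCDC28\tS1\t0.8\nACT1\tS2\t3.1\n"


def build_index(text: str):
    """Index the rows of a tab-delimited table by gene name."""
    index = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        index[row["gene"]].append((row["sample"], float(row["value"])))
    return index


index = build_index(TABLE)
```

Looking up index["ACT1"] returns every measurement for that gene across samples. The index supports searches over the values, but carries none of the protocol or sample description that a MIAME compliant database would hold.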
Footnote 3: Entrez is the data retrieval system at the NCBI, which performs queries over a large number of different NCBI databases [97], described in Section 2.6.
Yale Microarray Database
Yale Microarray Database (YMD) [54] is in the final stages of testing with a number of data
sources, and is not as well established as ArrayExpress, SMD or RAD. However, YMD offers
certain features not present in other systems. Microarray images are fairly large, and each
experiment can contain hundreds of raw images, each being a TIFF file several megabytes
in size. Most databases choose to store only the processed data, created by software after
analysis of images. YMD includes an image server that enables researchers to obtain raw
images for future re-analysis. It remains to be seen how frequently images will be re-analysed,
but keeping raw data ensures that future evaluation is possible, even if the amount
of data stored grows very rapidly. Experimental protocols can be entered via the Web,
and sample tracking can be performed to link DNA samples to the arrays. Data stored in
YMD can be linked to external resources, and a number of tools are available for performing
statistical analysis. The image server in YMD is both an advantage and a disadvantage of the
system: data can be re-analysed but the system may not scale up to very large data sets.
URL: info.med.yale.edu/microarray/
HugeIndex
HugeIndex is a gene expression database developed at Harvard [148]. The database schema is
very simple, containing only four tables and it is intended for storage of microarray results and
limited information about the experiment. HugeIndex is specialised to store gene expression
data from normal human tissues. The query interface allows particular genes to be specified,
or data can be accessed by the type of organ. The initial release in 2002 contained 59
experiments.
URL: HugeIndex.org
Integration across all databases
A scheme for how data can ultimately be integrated across all the databases has been out-
lined by Stoeckert and colleagues [303]. In essence, all databases have different structures,
reflecting the needs and requirements of the local users that are supported by the system.
If data is to be published, it should be made available via the Web, and conform to the
MIAME guidelines that are essentially a checklist of parts of the analysis that must be made
available. However, this alone is not amenable to large scale automatic analysis. For that
to be possible, researchers must either make data available in MAGE-ML format, or submit
Database      RDBMS                      Web queries  Total expts.                 Source code available  MIAME compliant  MAGE import/export
ArrayExpress  Oracle                     Yes          115                          Yes                    Yes              Import
BASE          MySQL                      N/A          Intended for local setup     Yes                    Yes              Export planned
GEO           Storage of indexed tables  Yes          605                          No or N/K              No               No
SMD           Oracle                     Yes          160                          Yes                    Yes              Export planned
YMD           Oracle                     N/K          N/K                          Not currently          Not currently    N/K
RAD           Oracle                     Yes          16 (RAD), many in GUS sites  Yes                    Yes              Both under dev.
HugeIndex     PostgreSQL                 Yes          59 (2002)                    Yes                    No               Future plans
Table 2.1: Summary table displaying features of microarray databases. Data is correct as of March 2004, except where stated. An N/K symbol (not known) indicates that the information is not readily available.
data to a public database that has an export option for MAGE. Currently, few databases
actually create MAGE-ML, due to the complexity of the format, although almost all, with
the exception of GEO, plan to produce MAGE-ML in the future. When this is realised, it
will be possible to move data seamlessly between public repositories, and for researchers to
download and assemble large datasets, for analysis with locally installed software packages.
2.4.2 Proteomics databases
The following chapter contains a proposal for a standard data format for proteomics, and
covers the current output formats from several databases. There is also a detailed description
of other databases and a comparison with our system in Chapter 5. A brief overview of the
publicly available systems is given here.
There are a number of proteomics databases that can provide access via the Internet.
SWISS-2DPAGE was initially developed in 1993, storing 2-DE images and information about
proteins identified on gels. The proteins often have a link to a record in the annotated
sequence database, Swiss-Prot. SWISS-2DPAGE has an interface containing images of 2-D
gels, which can be used to access information about protein spots [153].
Another proteome database, developed by the Japanese Human Proteome Organisation
(J-HUPO [166]), has an output format known as HUP-ML. HUP-ML is centred on 2-DE data
and experimental protocols, allowing the constituents of solutions and timings to be specified,
similar to sample preparation stages described in MAGE-ML. There are a number of domain
specific proteome databases, storing 2-DE or MS data (a summary of proteomics databases
can be found at WORLD2D-PAGE [351]). In general, the databases store only limited
information about experimental protocols and are not fully integrated with other types of
protein databases. It is a major challenge to integrate distributed proteomics databases
because data is not formatted in a uniform manner, and the databases rarely offer flexible
query facilities.
The GELBANK system has recently been made available over the Internet [20], and
has similar functionality to SWISS-2DPAGE. There is also a 2D-PAGE database at the Max
Planck Institute in Berlin, storing images of 2-D gels that can be annotated with spot
coordinates which link to pages describing proteins that have been identified [255]. Basic
information about protocols is stored, and gels can be browsed by species. The functionality
of these 2-D gel databases is described in more detail in Chapter 5.
There are no major repositories of mass spectrometry data which have query facilities,
possibly due to the size of the output format for MS and the problems of incompatible data
formats, as reported in Section 2.3.4. One effort that attempts to remedy this situation is
RADARS [106], which is a commercial relational database application for managing large
volumes of data from high-throughput studies. Due to the commercial nature of the soft-
ware it is not possible to assess the functionality of RADARS in practice. Another recent
development is the Open Proteomics Database that allows bulk downloads of raw MS data
in various formats, including mzXML [232]. This system allows public access to a large
amount of data (400,000 spectra), but it requires developers to obtain software to interpret
and manage the spectra once downloaded, and the spectra cannot be queried online. This
prevents it from being used by most researchers, who do not have the time or resources to
obtain software for managing this volume of data.
2.4.3 Other Databases for Life Sciences
The databases for microarrays and proteomics rely heavily on the existence of genome
databases for linking to annotation about gene products, and obtaining the original DNA or
protein sequences. The main databases containing nucleotide and protein sequence data are
GenBank at the NCBI [122], EMBL [91] and DNA Data Bank of Japan (DDBJ) [80]. These
databases are generally considered to contain raw sequence data, although they do contain
some basic annotation, including bibliographic references, the data source and the predicted
intron/exon structure of genes. Data are regularly transferred between the databases us-
ing an agreed mapping, called the Feature Table [72]. Certain records in GenBank have a
link to an external database, such as a curated record in Swiss-Prot [310]. Swiss-Prot has
cross-references to many different databases, including all the raw sequence databases, and
repositories of protein motifs and families.
There has been an effort to unify protein sequence databases in the Universal Protein
Resource (UniProt) system [326], which comprises several components. The main component
is a curated, non-redundant source of all the protein sequences that exist in any database.
There is also a separate archive (UniParc) containing all the identifiers with which sequences
have previously been annotated [325]. The archive contains links to the most recent record
of a protein in UniProt. The archive will enable software to be developed which performs
repeated searches, to find changes to identifiers. This will be particularly important for
datasets that have been assembled locally over time in a laboratory, and which contain
sequence identifiers that do not exist in the current version of a database.
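The kind of repeated search described above can be sketched as follows. This is a minimal illustration only: the `archive` dictionary, the identifiers and the function names are hypothetical, and the sketch does not reflect the actual UniParc interface.

```python
# Hypothetical archive: each retired identifier points to the identifier
# that superseded it. Current identifiers do not appear as keys.
archive = {
    "OLD_0001": "OLD_0002",   # renamed once...
    "OLD_0002": "CUR_0001",   # ...and then again
    "OLD_0003": "CUR_0002",
}

def resolve_identifier(identifier, archive):
    """Follow superseded identifiers until a current one is reached."""
    seen = set()
    while identifier in archive:
        if identifier in seen:          # guard against circular records
            raise ValueError(f"circular mapping at {identifier}")
        seen.add(identifier)
        identifier = archive[identifier]
    return identifier

def update_dataset(identifiers, archive):
    """Return a mapping from each local identifier to its current form."""
    return {i: resolve_identifier(i, archive) for i in identifiers}

# A locally assembled dataset containing stale and current identifiers.
local_dataset = ["OLD_0001", "OLD_0003", "CUR_0002"]
updated = update_dataset(local_dataset, archive)
```

Under these assumptions, identifiers that are already current are returned unchanged, so the function can be re-run whenever a new database release appears.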
The protein structure community has initiated a high-throughput approach for obtaining
protein 3-D structures, known as structural genomics. Protein structures are currently stored
in the Protein Data Bank [253] (PDB). Each 3-D structure gives a strong indication of the
function of a protein, particularly if the structure shows a small molecule bound to an active
site. If a structure exists for a protein closely related to those highlighted
in a functional genomics study, it is vital that the structure can be displayed within the context of
the experiment. This will enable protein or gene abundance studies to be correlated with a
detailed functional analysis.
This review covers a very small subset of the most important databases that exist. There
are a great number of resources about genes and proteins which could be relevant to an FG
experiment. The problem of integrating all the diverse databases is highlighted further in
Section 2.6.
2.5 Ontologies
One of the first definitions of an ontology and its potential for data integration was presented
by Gruber in 1993 [138]. The idea of conceptualisation was introduced, expressed as the
following problem: how can we digitally represent objects, concepts and their relationships
that arise from a real world situation? The representation required is, in effect, a simplified
view of the world that is useful for some purpose. The term “ontology” was coined to
describe the exact specifications of the conceptualisation. An ontology usually consists of
a set of terms that represent objects, and their relationships in the real world. The terms
must be associated with definitions that are human readable, describing what the term
means, along with a set of formal rules specifying how the terms can be used in a computer
system. Gruber suggests how ontologies can be used for data integration, using the example
of different bibliographic databases. For example, a rule is specified that describes what an
author is, and how the author relates to their publications. If different databases associate
records with a set of rules, the rules themselves can be used to query the source databases,
without an underlying knowledge of the particular database schema.
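The bibliographic example can be made concrete with a small sketch. The two databases, their field names and the mapping structure are all hypothetical; the point is only that a query phrased in terms of the shared concepts "author" and "publication" does not need to know either schema.

```python
# Two hypothetical bibliographic databases with different schemas.
database_a = [
    {"writer": "A. Author", "title": "Title one"},
    {"writer": "B. Writer", "title": "Title two"},
]
database_b = [
    {"creator_name": "A. Author", "work": "Title three"},
]

# Each source declares how its fields correspond to the shared concepts
# "author" and "publication" (the assumed ontology terms).
mappings = [
    (database_a, {"author": "writer", "publication": "title"}),
    (database_b, {"author": "creator_name", "publication": "work"}),
]

def publications_by(author, mappings):
    """Query every source through the shared concepts, not its schema."""
    results = []
    for records, mapping in mappings:
        for record in records:
            if record[mapping["author"]] == author:
                results.append(record[mapping["publication"]])
    return results

found = publications_by("A. Author", mappings)
```

Adding a third database would require only a new mapping entry, not a change to the query function.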
Ontologies will become widely used in the Semantic Web, as highlighted at the beginning
of the chapter, to describe the contents of a web site, and how it can be accessed. This will
enable software to discover automatically the resources that are relevant to the user. In
the rest of the section, a brief description of the software available for developing ontologies
is provided. The major proposals within the life sciences, and other related areas are also
reviewed.
2.5.1 Software for developing ontologies
A number of tools are available for generating ontologies, and they include Protege [230] and
OilEd [28]. The Protege software is available as open-source Java code, developed around a
‘plug-in’ architecture (Figure 2.7). This enables other research groups to adapt the software
for their own use, and develop new plug-ins. Examples of plug-ins include: software for
visualising ontologies in domain-specific ways, tools for merging ontologies, archiving and
querying. OilEd was developed at the University of Manchester and it is designed for the
development of DAML+OIL ontologies. It includes functionality for reasoning over the
ontologies for knowledge acquisition and inconsistency checking. Both Protege and OilEd
are freely available and can export data in DAML+OIL format, enabling the ontologies to be
transferred between editors, which should improve the accessibility of ontology information.
2.5.2 Gene Ontology
A major advance in computational biology is the development of the Gene Ontology
(GO) [125, 126]. GO includes three ontologies: cellular component, molecular function and
biological process, for a number of model organisms. Entries for genes are sorted according
to the categories defined by the ontology, and the controlled vocabularies ensure that terms
Figure 2.7: A screenshot of the Protege editor displaying the Gene Ontology for Yeast.
are used with the same meaning in different contexts. For example, the protein Raf-1, which is
involved in the MAP Kinase metabolic signalling pathway, has many entries in GO. One
entry in the biological process branch of the ontology is as follows:
A database and a user interface have been developed that enable GO to be queried [126].
GO annotations are being added to Swiss-Prot, TrEMBL4 and Interpro5, in a project known
as GOA [46] (Gene Ontology Annotation). Each entry in Swiss-Prot has several keywords
that describe a protein’s function, which were developed prior to the creation of GO. The
keywords have been manually mapped to GO terms. This now allows for automatic retrieval
of GO annotations, once a protein sequence has been found in Swiss-Prot.
4 TrEMBL is an automatically annotated supplement of Swiss-Prot, which contains all the translations of the EMBL DNA database prior to their manual annotation within Swiss-Prot [37].
5 Interpro is a database of protein families and domains [219].
Figure 2.8: The entry for actin in the Gene Ontology, displayed in the AmiGO browser [12].
There are a number of other projects extending GO, and GO is being used by a number
of organisations to add levels of information to gene and protein products (links can be
found from the GO web site [124]). GO is a major advance in molecular biology because
it enables a high level view of large datasets, allowing researchers to generate functional
classifications very rapidly for all genes or proteins in a data set. However, it is vital that
the Gene Ontology is continuously curated and improved, to reduce the number of incorrect
or inaccurate functional assignments. It is becoming common practice for researchers to
obtain the top set of significant results from their study, say 100 genes or proteins, and
assign functional groupings based on GO. The conclusions drawn from the groupings must
be verified by external means, such as further experiments or literature surveys, because it
is possible that errors have been introduced into GO, which may be propagated into other
systems built on top.
Software for GO
A number of software applications are available for viewing and searching GO, of which
several are summarised here. Access from the Gene Ontology web site is provided by the
AmiGO browser [12]. AmiGO presents a view of the GO tree that can be browsed, allowing
users to move up or down the hierarchy of the ontology. Figure 2.8 displays the GO tree for
the human gene actin. In this example, GO suggests that actin is localised in the cytoplasm,
and more specifically to the cytoskeleton of the cell. A gene can be found at many different
places in GO if the gene has been implicated in several different processes, or possibly if there
is conflicting evidence about function. The AmiGO browser has basic search mechanisms for
retrieving entries by GO ID, ontology term or gene name. There is an alternative graphical
view of GO, and parts of the tree can be downloaded in XML format or as a text file.
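Moving up the hierarchy from an annotated term can be sketched as a simple graph traversal. The fragment below is a toy hierarchy built from the terms mentioned in the actin example; the parent links and the annotation are simplified assumptions and are not copied from GO itself.

```python
# Toy fragment of a hierarchy: each term lists its parent terms.
# A term may list more than one parent, so the structure is a DAG.
parents = {
    "cellular component": [],
    "cell": ["cellular component"],
    "cytoplasm": ["cell"],
    "cytoskeleton": ["cytoplasm"],
}

# Hypothetical annotation: a gene may be attached to several terms.
annotations = {"actin": ["cytoskeleton"]}

def ancestors(term, parents):
    """Return every term reachable by moving up the hierarchy."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found

def implied_terms(gene, annotations, parents):
    """Direct annotations plus all of their ancestors."""
    terms = set()
    for term in annotations[gene]:
        terms.add(term)
        terms |= ancestors(term, parents)
    return terms

actin_terms = implied_terms("actin", annotations, parents)
```

In this sketch an annotation to a specific term implies membership of every more general term above it, which is what allows a browser to show the same gene at several levels.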
GoMiner is a stand-alone application written in Java, which provides a view of GO for
a list of genes that are predicted to be up or down regulated between two conditions [368].
The software displays where gene names are located in the GO tree and provides statistics
to show branches of GO that contain more up-regulated (or down-regulated) genes. A DAG
(Directed Acyclic Graph) viewer is also included that displays graphically where genes appear
in the tree.
FatiGO offers similar functionality to GoMiner but in a web browser interface that
accepts two lists of gene symbols, corresponding to the genes that are up or down regulated in
a study [7]. Summary information is produced outlining where terms appear in the ontology,
for the three different ontology parts. Statistics are provided displaying which parts of the
ontology are matched to genes that are up or down regulated in the study. The software
can display information for a specified level in the ontology, from 2 down to 5 (lowest level)
and links to external databases are provided, such as Swiss-Prot, and the KEGG database
of metabolic pathways [184]. The usage of GoMiner and FatiGO in practice is demonstrated
in the study presented in Chapter 6.
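One common way to decide whether a branch of GO contains more regulated genes than expected by chance is a hypergeometric tail probability. The sketch below illustrates that calculation only; it is an assumption that a test of this general kind is appropriate here, and the function name and parameters are hypothetical rather than taken from GoMiner or FatiGO.

```python
from math import comb

def enrichment_p_value(total_genes, genes_in_branch,
                       selected_genes, selected_in_branch):
    """Probability of seeing at least `selected_in_branch` branch members
    when `selected_genes` genes are drawn at random from `total_genes`."""
    upper = min(genes_in_branch, selected_genes)
    denominator = comb(total_genes, selected_genes)
    tail = sum(
        comb(genes_in_branch, k)
        * comb(total_genes - genes_in_branch, selected_genes - k)
        for k in range(selected_in_branch, upper + 1)
    )
    return tail / denominator

# Example: 4 genes in total, 2 annotated to the branch, 2 selected as
# up-regulated, and both of those fall in the branch.
p = enrichment_p_value(4, 2, 2, 2)
```

A small value suggests that the branch is over-represented among the selected genes, although, as noted earlier, any such conclusion still depends on the accuracy of the underlying annotations.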
GOblet also provides access to GO via the Web. DNA or protein sequences can be
submitted to a BLAST search that returns the best matches to sequences in Swiss-Prot and
TrEMBL, which have been mapped to GO terms [149].
2.5.3 MGED Ontology
The MGED Ontology (MO) is a hierarchical collection of terms used to describe microarray
experiments. Each term has a textual description of its meaning, and a specification of
where it should be used in MAGE-OM. MO contains terms that can be used to describe the
origin and characteristics of biological samples, regardless of the usage of the sample. For
this reason, MO could be utilised to describe samples in a number of functional genomics
investigations. In Chapter 4, a proposal is made for a functional genomics data standard,
and a detailed description of the contents and structure of MO is given there.
2.5.4 Other ontologies in life sciences
Ontologies are being created to model various different domains within the life sciences. The
OBO (Open Biology Ontologies) project aims to bring together related ontologies into a
common structure [233]. A set of rules has been established for inclusion within OBO: the
ontologies have to be open and freely available, described in a common syntax (GO or OWL)
and must have a definition that can be understood by people. An organisation has also been
established for unifying the work in ontologies for functional genomics, known as Standards
and Ontologies for Functional Genomics [298] (SOFG). A brief description of some of the
ontologies within OBO is given below.
Taxonomy
The NCBI taxonomy ontology is an important resource for standardising the taxonomic
naming of organisms [224]. The ontology is accessible via the Web, and the records contain
links to other information about the organism through Entrez, such as nucleotide and protein
sequences, expression data and publications.
Anatomy
There are several ontologies covering the anatomy of organisms such as C. elegans [353],
Drosophila [112], mouse [90, 218] and humans [100]. The SOFG organisation is coordinating
an effort to integrate them to produce a single anatomical ontology. A related project
is XSPAN from the University of Edinburgh, which aims to provide access to anatomical
information from embryos for several model organisms [358].
Sequence data
The Sequence Ontology (SO) project has recently been initiated to capture information
about features on DNA and protein sequences, such as chromosomal variations, gene features
(intron and exon structure) and RNA processing during transcription [286]. It is intended
that genomic databases will be annotated with these terms to facilitate integration across
systems offering different methods of querying.
Metabolic pathways
One of the first major proposals for ontologies for molecular biology was made by Karp in
1995 [183]. Karp presents the idea that knowledge representation could be used to determine
mappings between different databases to aid integration. The architecture proposed by Karp
was influential in a number of data integration projects described in Section 2.6. A database
of E. coli genes and biochemical pathways was later defined, known as EcoCyc [182]. EcoCyc
contains curated descriptions of the function and chromosomal location of all E. coli genes,
and uses an ontology of pathways to allow the knowledge to be formally queried. EcoCyc
presents an integrated view of data derived from a number of sources including genome
databases, bibliographic references, and protein structures.
Summary
In a functional genomics database, many of the ontologies described above could be used for
specifying characteristics of biological samples, genes, proteins or experimental techniques.
Database systems for functional genomics should provide the facility to link out to external
ontologies so that an object can be specified, which is accompanied by an exact definition
that has a meaning outside the scope of the source database. It is hoped that if databases
use ontologies extensively, the vision of the Semantic Web can be realised, and as Gruber
proposed, data integration can become automated. Software can then be developed to recog-
nise objects automatically in different databases which correspond to the same real world
objects.
2.5.5 The Grid and data integration
The Grid is the next generation architecture for high performance computing [132]. The
Grid is a network of computers joined by high bandwidth connections, allowing the creation
of software that assigns a computationally intensive job to the best available resource on the
network. There is a collaborative effort to perform data integration on a large scale via the
Grid, known as OGSA-DAI (Open Grid Services Architecture Data Access and Integration)
[234]. OGSA-DAI comprises many projects aiming to provide access to vast data sets, in
particular in the fields of astronomy, geoscience and biology. One of the major biological
proposals is called myGrid.
myGrid comprises a network of biological web services, such as BLAST and EMBOSS6,
6 EMBOSS is an open source package of software for performing common sequence analysis tasks [92].
which must be registered at a central location [300]. Each resource must contain a standard
description of the type of service it offers, and how it can be accessed. Once this infrastructure
is in place, it will be possible to write software that automatically discovers applications that
are available for performing the task required by the user. Each service has a wrapper7 to
enable standard queries to be submitted, and to convert between different input formats.
Queries can be written in OQL [1] (Object Query Language) and submitted to the source
database over the Grid. The system is specifically tailored to an organisation because a
database is maintained at each location, storing a record of the services that have been used
in the context of a particular workflow, thus facilitating their re-use. The local database also
stores an audit trail of what services have been used at what time, with a system that alerts
the developers if an external data source or service changes, such as a new database release,
which may require a search to be performed again.
2.5.6 Data standards and ontologies in other fields
Ontologies and data standards are becoming widespread in the life sciences but are also
widely used in commercial applications and other fields of research. A related area is the
development of ontologies of language, which could also have uses in the life sciences. The
WordNet project comprises an ontology of the English language, in which nouns, verbs,
adjectives and adverbs are organised into synonymous groupings, similar to a thesaurus [350].
Synonyms in the life sciences present considerable challenges. In particular, many genes and
proteins have been given more than one name over time, and the synonyms often persist. It
is becoming more common to store experimental protocols and descriptions of hypotheses
alongside raw data, to enable data sets to be retrieved. Resources such as WordNet will
be useful for defining particular concepts in a standard way that could be described using
different, synonymous terms.
2.6 Data integration
Data integration is one of the greatest challenges currently facing bioinformatics [299]. The
Molecular Biology Database Collection contains 548 databases at present, and this is likely
to be an underestimate of the total number of different systems that are available. The
integration challenge can be broken down into different parts: firstly, bringing together
7 A wrapper is a piece of code that converts the specific inputs and outputs offered by a single application to a standard set of inputs and outputs.
similar types of data, such as genome, transcriptome and proteome data, into a single system that
can be adequately queried; secondly, discovering and querying
all the resources on the Internet that relate to one particular gene or protein sequence.
The first challenge of data integration is addressed in Chapter 5, in which a framework is
described for storing different types of FG data in one system. A possible solution to the
second challenge has also been addressed, in the context of indexing large collections of XML
data, to generate an integrated query system to a number of databases. An investigation
by the author into XML indexing for biological data is described in Appendix A, which has
been continued in the Xtect project [359] by colleagues at the University of Glasgow and the
University of Strathclyde.
There has been substantial work in the area of data integration in e-commerce and
biomedical fields with the aim of generating single access points to heterogeneous data
sources. In a survey of approaches by Garcia-Molina et al. [118], three general methods are
identified: federation, warehousing and mediation. Federation involves a set of databases
supplying agreed additional information or software for accessing information in a standard way.
Warehousing is a large scale approach of reconstructing local copies of relevant databases by
creating an integrated schema that covers all the constituent databases, and importing data
on a regular basis. Mediation based approaches send queries to diverse databases, in some
cases via the Internet, and convert the results into a single format. Examples of biological
resources that have utilised these approaches are given below.
2.6.1 Federation
The Entrez system provides access to many different databases based at the NCBI [314].
Entrez queries GenBank, PubMed, GEO and many others (the web site has a complete list
[97]), and provides a number of output formats including HTML, XML and a text format.
However, there is no integration of results; instead, a list of the number of hits in each of
the databases is returned, which must be manually browsed by the user. This process is very
time-consuming, especially if a large number of genes are to be queried, such as the top 200
hits from a microarray experiment.
2.6.2 Warehouses
One of the largest efforts to integrate life sciences data has been demonstrated by SRS [99]
(Sequence Retrieval System), which provides access to a large number of databases using
pre-defined hyperlinks. SRS downloads all the source databases at a regular interval and
builds a text index. SRS accepts queries against any type of text in the entry, and allows
users to retrieve a record with a particular ID number. SRS does not post-process the query results
to integrate the information; instead, a list of entries from different databases is returned.
SRS does not support a general query language such as SQL, so complex queries cannot
be made.
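The general idea of a text index built over downloaded entries can be sketched as an inverted index. This is an illustration of the technique only, not of how SRS is implemented; the database names, identifiers and entry text are hypothetical.

```python
from collections import defaultdict

# Hypothetical entries downloaded from two source databases,
# keyed by (database name, record ID).
entries = {
    ("db_one", "ID1"): "kinase involved in signalling",
    ("db_two", "X7"): "structure of a kinase domain",
    ("db_two", "X8"): "membrane transport protein",
}

def build_text_index(entries):
    """Map each lower-cased word to the set of entries containing it."""
    index = defaultdict(set)
    for key, text in entries.items():
        for word in text.lower().split():
            index[word].add(key)
    return index

def search(index, word):
    """Return matching (database, ID) pairs, sorted for stable output."""
    return sorted(index.get(word.lower(), set()))

index = build_text_index(entries)
hits = search(index, "kinase")
```

The result is simply a list of entries from each source, with no attempt to merge records that describe the same object, which mirrors the limitation described above.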
The GUS system from the University of Pennsylvania comprises a large relational schema
that is divided into different namespaces8, which have been developed from separate source
databases. Data from various sources (GenBank, EMBL, DDBJ and others) are downloaded
at intervals, cleaned to remove erroneous annotation, and added to the database. A pro-
gramming layer resides on top of the database to allow queries to be performed. In addition
to genomic data, GUS also stores microarray data, and Chapter 5 describes a proposal for a
proteomics extension to GUS.
2.6.3 Mediator approaches
K2
An approach known as K2 has also been developed at the University of Pennsylvania, which
formulates queries over a number of databases, and presents an integrated view to the user.
K2 originated in a project known as Kleisli that introduced the idea of mediators [68, 348]
and a query language known as Collection Programming Language (CPL). The mediators
describe the data sources in terms of common objects, and provide a mapping from the un-
derlying data source to the objects. CPL can then be used to query the object representation
of the data, even if the underlying data sources do not have query capabilities. The system
has been compared with GUS by Davidson and colleagues [67]. Davidson concludes that the
data warehousing approach is preferable for larger scale, production-strength applications,
and the mediator approach may be favoured for smaller systems for users wishing to browse
data sources via web pages.
TAMBIS and BioDataServer
There are a number of other bioinformatics systems to integrate heterogeneous resources,
including TAMBIS, which has been developed as an interface to a number of databases
8 A namespace is a subdivision of a database schema or object model in which all the names of the components are unique.
and tools frequently used by biologists [241]. TAMBIS is developed from the same software
used to generate K2, and uses mediators to access several databases. TAMBIS is supported
by a description logic known as GRAIL [22] that includes rules to link different concepts
together. For example, a protein is formally linked to motifs found in its sequence by the
rule hasComponentMotif. GRAIL is used to formulate queries, and automatically retrieves
information from the relevant database.
Another mediator based system is BioDataServer [114] that enables information retrieval
over a number of biological databases, with similar goals to K2. BioDataServer generates an
interface that maps the data sources into a relational database, and can be viewed as a cross
between a mediator and a warehousing approach. BioDataServer enables complex queries
to be formulated over the data, even if the underlying query capabilities of the data sources
do not support SQL.
DiscoveryLink
DiscoveryLink from IBM offers access using SQL to a number of databases in distributed
locations [144]. DiscoveryLink processes queries and decides which parts of the query need
to be sent to which database. Each data source has a wrapper that maps the structure
of the source data to the relational model employed by DiscoveryLink. The wrapper also
stores information about the query capabilities of the data source, and maps parts of queries
sent from DiscoveryLink into the format accepted by the data source. For a wrapper to
be developed, the underlying data source must include an interface that
accepts programmable queries, and must return data in the form of a table. The software
can then process the results after they have been returned. DiscoveryLink does not offer
any kind of semantic integration; for example, the problem of synonyms is not solved, and
redundant data may be returned. If data is modelled differently by different databases, all
the results will be presented to the user, but will not be fully integrated.
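The role of a wrapper can be illustrated with a short sketch. The two data sources, their field names and the common table layout are hypothetical, and the code is not based on the DiscoveryLink software; it shows only how wrappers convert source-specific output into rows of a shared form.

```python
# Two hypothetical data sources with different native interfaces.
def source_one_query(gene):
    # Returns dictionaries keyed by this source's own field names.
    data = [{"symbol": "geneA", "desc": "kinase"}]
    return [row for row in data if row["symbol"] == gene]

def source_two_query(gene):
    # Returns tuples of (name, function).
    data = [("geneA", "protein kinase"), ("geneB", "transporter")]
    return [row for row in data if row[0] == gene]

# Each wrapper converts one source's output into a common table layout:
# rows of (source, gene, description).
def wrap_source_one(gene):
    return [("source_one", r["symbol"], r["desc"])
            for r in source_one_query(gene)]

def wrap_source_two(gene):
    return [("source_two", name, function)
            for name, function in source_two_query(gene)]

def distributed_query(gene, wrappers):
    """Send the query to every wrapper and concatenate the rows."""
    rows = []
    for wrapper in wrappers:
        rows.extend(wrapper(gene))
    return rows

rows = distributed_query("geneA", [wrap_source_one, wrap_source_two])
```

Both rows returned for `geneA` describe the same gene in slightly different words, and nothing in the sketch merges them. This is the sense in which results are presented together without being semantically integrated.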
2.6.4 Schema integration
An alternative approach, that could be used to develop a warehouse, is that of schema
integration, which involves matching elements in different database schemas believed to
correspond to the same real world object. Many approaches involve a manual process using
a graphical user interface to match elements from different schemas, which is time-consuming
and error-prone. Attempts have been made to automate schema matching [81], and recent
work has been done to integrate XML data sources. Integration using XML is gaining
popularity in molecular biology, because many databases now offer a bulk download of data
in XML. If a mapping can be produced across different XML Schemas, a warehouse could
be created by importing different databases in XML, and converting data to a standard
representation.
Yang et al. [363] have developed an algorithm that finds matching elements in XML
Schemas and removes differences in the hierarchical structure. The algorithm allows differ-
ent schemas to be weighted according to how representative they are of the system, and
produces an integrated schema. However, it is assumed that elements in different documents
have already been re-named so that real world objects all share the same name in different
schemas. This is not necessarily a trivial task if a great number of different databases are to
be integrated. In molecular biology many concepts have synonyms, and conversely, similar
but non-identical concepts may share the same name. A similar approach relevant to data
integration is XClust [194], which is an algorithm for clustering and integrating DTDs (document
type definitions, the original mechanism for validating XML). The algorithm first searches
for similar DTDs, and then integrates over clusters of related documents. The technique has
been demonstrated for real world data sets derived from e-business, and may be applicable
to biological database integration.
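The renaming step that such algorithms take for granted can be sketched as name matching through a synonym table. This is not the algorithm of Yang et al. or XClust; the synonym table, schema element names and function names are hypothetical, and the sketch ignores hierarchical structure entirely.

```python
# Hypothetical synonym table mapping alternative element names
# to one preferred name.
synonyms = {"polypeptide": "protein", "species": "organism"}

def normalise(name):
    """Lower-case a name and replace it with its preferred synonym."""
    name = name.lower()
    return synonyms.get(name, name)

def match_elements(schema_a, schema_b):
    """Pair element names from two schemas that normalise to the same name."""
    lookup = {normalise(b): b for b in schema_b}
    return {a: lookup[normalise(a)]
            for a in schema_a if normalise(a) in lookup}

schema_a = ["Protein", "Organism", "Sequence"]
schema_b = ["polypeptide", "species", "gel"]
matches = match_elements(schema_a, schema_b)
```

The quality of the matches depends entirely on the synonym table, and the sketch cannot detect the converse case noted above, in which two schemas use the same name for similar but non-identical concepts.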
A recent approach by Hunt and colleagues at Glasgow University aims to alleviate the
data integration challenge by developing indexes of the paths9 found in XML documents. If
more than one identical path is found to the same leaf node, containing the same piece of
data, the additional paths are removed to avoid redundancy. The index is created on top of
the SRS system and is stored in a relational database. It is intended that the system will be
used to retrieve data from a large number of databases, for a set of genes or proteins that
are highlighted for further study from a functional genomics experiment [159].
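The idea of indexing paths, and of discarding repeated occurrences of the same path leading to the same data, can be sketched with the Python standard library. This is an illustrative assumption about the general technique, not the implementation used in the project cited above; the XML document and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

def index_paths(xml_text, index=None):
    """Record each (path, text) pair once, where the path is the chain
    of element names leading down to a piece of textual data."""
    if index is None:
        index = {}
    root = ET.fromstring(xml_text.strip())

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}"
        text = (element.text or "").strip()
        if text:
            # A set discards repeats of the same path and value.
            index.setdefault(path, set()).add(text)
        for child in element:
            walk(child, path)

    walk(root, "")
    return index

document = """
<entries>
  <protein><name>actin</name><organism>human</organism></protein>
  <protein><name>actin</name><organism>mouse</organism></protein>
</entries>
"""
path_index = index_paths(document)
```

Here the value `actin` is stored once under `/entries/protein/name` even though it occurs in two records, while both organism values are retained.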
2.7 Discussion
This chapter describes the current state of the art in database technology for biomedical
research. It is an area that is being driven by both the day-to-day requirements of ex-
perimentalists, and strong theoretical work in computing science. The challenge of data
integration is so great because FG experiments often generate unexpected results that must
be investigated from various perspectives. In the past, a biological investigation required
9 A path is the hierarchy of elements that precede textual data in XML.
a researcher to have a comprehensive knowledge of a particular organism, organ, or set of
genes. The situation now is far more challenging, as the results from a functional genomics
experiment could lead an investigator into a great variety of domains. For example, the
top 200 hits from a microarray investigation on liver samples could contain genes that had
previously not been implicated in liver function at all. This would require the investigator
to determine the function of the genes from a number of different angles: protein structure,
modifications, biochemistry analysis from databases or the literature, and several others.
Many of the new developments presented in this chapter aim to improve the facilities for
automating the retrieval of this information.
The vision of the Semantic Web is one of the driving forces of the work on standards and
ontologies, but its realisation is some way off. The technologies that will be used to create
the Semantic Web can be put into practice now, and will greatly improve the capabilities
of computer systems. It is clear that there are major advantages to the use of ontologies in
databases, for web publishing of data and in exchange formats, which are as follows. Firstly,
the problem of synonyms in the life sciences is significant. The names of genes and proteins
have been assigned in the last few decades, often based on some phenotypic characteristic that
has limited relevance now that comparative genomics can discover the same gene in different
organisms. For example, the “wingless” gene in Drosophila has a number of synonyms
reflecting the range of roles that it has in different parts of the organism. It was named
because when its function is removed, the flies have no wings, which is clearly of limited
relevance for the same gene in humans. It is becoming apparent that a new organised
naming system is required that takes into account the role of a gene in different organisms.
This is one of the areas that will be aided greatly by the Gene Ontology. A second problem
is finding a common description of how experiments have been performed (the methods).
An ontology-based description of experimental protocols will aid the retrieval of experiments
stored in databases, and may allow future reasoning over different experiments to find how
they are different. For example, it may be possible to find automatically the genes with
altered expression that differentiate two strains of an organism, if the description of each
strain is well structured using ontologies. The synonym problem also arises in the description
of protocols. For instance, in the description of microarrays in the previous chapter, I avoided
the term “probe”, which is frequently used in the methods section of microarray publications.
“Probe” is used by some groups to mean the features deposited on the array, and by others
to describe the labelled mRNA that is hybridized to the array. It is hoped that the MGED
Ontology (MO) can remove these kinds of problems because it contains terms with strict
definitions that are not open to confusion, and therefore software can be developed to search
for particular terms, knowing that queries will be answered correctly. If MO can be extended
to describe all functional genomics investigations, and gains widespread usage, we will be
some way towards solving the problem of imprecise language that hinders automated analysis.
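The benefit of strictly defined terms can be illustrated with a minimal sketch. The term
names and definitions below are hypothetical and are not taken from MO; the point is only
that an annotation tool can refuse an ambiguous free-text word such as "probe" and ask for
one of the defined terms instead.

```python
# Hypothetical controlled vocabulary: the term names and definitions are
# illustrative only and are NOT quoted from the MGED Ontology.
CONTROLLED_TERMS = {
    "array_feature": "A spot of material deposited at a known position on the array.",
    "labelled_extract": "Labelled nucleic acid from a sample that is hybridised to the array.",
}

# Free-text words that different groups use with different meanings.
AMBIGUOUS_WORDS = {
    "probe": ("array_feature", "labelled_extract"),
}


def normalise_term(word):
    """Return a controlled term, or raise if the word is ambiguous or unknown."""
    key = word.strip().lower()
    if key in CONTROLLED_TERMS:
        return key
    if key in AMBIGUOUS_WORDS:
        options = ", ".join(AMBIGUOUS_WORDS[key])
        raise ValueError(f"'{word}' is ambiguous; choose one of: {options}")
    raise KeyError(f"'{word}' is not in the controlled vocabulary")


accepted = normalise_term("Labelled_Extract")
try:
    normalise_term("probe")
    probe_rejected = False
except ValueError:
    probe_rejected = True
```

Because every stored annotation is one of the defined terms, a query for a particular term
cannot silently return records that used the same word with a different meaning.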
The work on data standards is essential to allow the creation of public databases that
can be queried, and to allow data sets to be downloaded in bulk for re-analysis with new
statistical techniques. The object models are a vital component that allows developers to have
a shared understanding of large systems. The models also allow software to be developed
for creating standard output in an exchange format, and can act as a bridge between flat
files and database storage. In particular, MAGE-OM has influenced efforts in other parts
of functional genomics because it has gained widespread community acceptance, and it is
forward-thinking in its use of ontologies.
It could be argued that many of the data integration methods currently being developed
will not be required if the Semantic Web is successful. In reality, it is the data integration
efforts that are currently on-going that will evolve into the vision of the Semantic Web. The
development of ontologies to describe biological knowledge as it is now, and to describe the
experimental process that was used to produce the knowledge, will be vital. In addition, the
schema matching techniques, which aim to find the commonality between different databases,
will be a vital intermediate step towards fully interoperable systems. The data integration
methods will help us to learn how different data are structured, and how they can be described
in common terms.
The solution to the data integration challenge is still an open research question. The
majority of research groups performing functional genomics investigations are left with a
laborious, time-consuming task, often involving manual Web browsing, to assimilate information
about genes, proteins and pathways. If systems can be developed that automate this
process, they will free up large amounts of research time that could be better spent elsewhere,
and new knowledge will be derived by discovering the relationships between different
components of biological systems.
2.8 Conclusions
In this chapter, a brief overview has been given of the different databases that exist for storing
functional genomics data, and the data integration challenges that they present. It is vital
that data standards and ontologies are created to allow researchers to exchange and transfer
data sets to central repositories. An overview of the major proposals has been given. In the
following chapter, the current status of proteomics data standards is described. The main
focus is the development of a new object model that supplements the first draft standard,
and there is a discussion of the future of data sharing and publishing in proteomics.
Chapter 3
An object model for proteomics
3.1 Introduction
The first two chapters outlined the computational requirements of functional genomics experiments,
and previous work in databases, standards and ontologies for life sciences. This
chapter comprises two parts. The first focuses on the development of an object model to capture
proteomics data, which we released as a proposal for a standard data format and
published in October 2003 [176]. The first model is referred to as Gla-PSI (Glasgow proposal
for the Proteomics Standards Initiative) and covers studies in which proteins are separated
by two-dimensional gel electrophoresis (2-DE), and identified by mass spectrometry (MS).
Gla-PSI was developed to supplement the draft standard for proteomics, the Proteomics
Experiment Data Repository (PEDRo) originating at the University of Manchester, which
was released in January 2003 [315]. The latter part of the chapter outlines the continued
development of the official standard of the Proteomics Standards Initiative with which we
have been involved. The new object model, PSI-OM (Proteomics Standards Initiative Object
Model), was initiated in 2004 at the annual meeting of the PSI. We have contributed
to the development in collaboration with other members of PSI. PSI-OM has evolved from
PEDRo, and includes parts of the data model from Gla-PSI.
3.1.1 The emergence of proteomics
The challenge in proteomics research is to characterise the expression of all, or as many as
possible, of the proteins in a sample of interest. Comparative analysis may also be carried
out to determine the difference in protein expression between two or more samples, and the
differences may provide clues as to proteins that are critical for the process being studied,
such as a disease. Two-dimensional gel electrophoresis (2-DE) is frequently used to separate
proteins into discrete spots that may be quantifiable, and mass spectrometry (MS) is often
used to determine the observed mass of the peptides in the protein. Observed peptide masses
can be used in a search against a sequence database to identify the protein. There are also
new protein separation techniques, such as multi-dimensional chromatography and affinity
methods, for determining the proteins that are present in a sample. The core techniques of
2-DE and MS have been available to researchers for several decades but it is only in recent
years that large scale analysis has become feasible, forming the field of proteomics. There
have been gradual improvements in the experimental protocols for 2-DE and several new
stains have been designed that improve the linearity of the relationship between visible stain
and the actual amount of protein in a gel spot. Software has been developed for improved
detection and quantification of spots on gels, and matching spots between different gels. The
technology for MS has also moved forward with improved ionisation protocols and detection
mechanisms (described in Chapter 1). However, the main reason for the major shift in
research paradigm towards the global approach has not been related to the improvements
in 2-DE and MS, but can be attributed to the vast increase in the availability of DNA
and protein sequence data in the genome databases. MS is a good method of high-throughput
protein identification only if the protein sequences, or very closely related sequences,
have been deposited in a database. Therefore, without the major sequence databases, it would
not be possible to perform large scale proteomic investigations.
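The dependence of MS identification on sequence databases can be made concrete with a
minimal sketch of peptide mass matching. This is a simplified illustration and not the
algorithm of any particular search engine: the two database sequences are toy examples, the
cleavage rule (after K or R, but not before P) is the usual simplification of trypsin
specificity, and the score is a plain count of matched masses rather than a statistical
measure of match quality.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the terminal H and OH


def tryptic_digest(sequence):
    """Split a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Theoretical monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def count_matches(observed, sequence, tolerance=0.2):
    """Count observed masses lying within the tolerance of any theoretical peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(1 for m in observed if any(abs(m - t) <= tolerance for t in theoretical))


def rank_proteins(observed, database, tolerance=0.2):
    """Rank database entries by the number of observed masses they explain."""
    scores = {name: count_matches(observed, seq, tolerance) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy sequence database; names and sequences are invented for illustration.
TOY_DATABASE = {
    "abc1": "MKTAYIAKQRQISFVKSHFSR",
    "xyz2": "GAVLIPFWKDEHNQR",
}

# Simulated peak list: three peptides of abc1 with a small measurement error.
observed_masses = [peptide_mass(p) + 0.01 for p in ("TAYIAK", "QISFVK", "SHFSR")]
ranking = rank_proteins(observed_masses, TOY_DATABASE)
```

If the sequence of the protein in the spot is absent from the database, no entry explains
the observed masses and the spot remains unidentified, which is why the growth of the
sequence databases was the enabling step for large scale proteomics.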
3.1.2 Publication of data
In other areas of biology, the deposition of data in a central repository is a prerequisite for
publication: DNA sequences must be deposited in GenBank [30] and protein structures in
the PDB [253]. At present, public access to large amounts of proteomics data is limited.
There is a database of 2-D gels, hosted at the Swiss Institute of Bioinformatics, known as
SWISS-2DPAGE [153], and a similar effort at Argonne National Laboratory, called GELBANK [20].
Both databases offer images of 2-D gels that can be browsed, providing access to limited
information about spots identified on gels (described in Section 3.3). However, the general
availability of proteomics data is poor, and most journal publications only display gel images
and a table of proteins that have been identified. Essentially this information is inaccessible
to computational analysis, even if the data is placed on the author’s web site, because there
is no common mechanism for querying or finding the data. The same issue exists for the
related fields of phylogeny and immunohistochemistry where diagrams of trees, or images
of cells, are reproduced in journals but are not open to computational analysis. The rate
of production of data is too great for researchers to have complete awareness of the protein
expression data that could relate to the system they are studying, and if there is no change
in the way proteomics data is represented, the situation will become far worse.
3.1.3 A central repository for proteomics
There is a major requirement for the development of a central proteome database that
includes 2-DE images, their analysis and MS data. For such a plan to be realised, it is vital
that a standard data model is adopted by the research community to enable experiments
from different laboratories to be compared or queried. A central database must contain
sufficient detail about experimental protocols for the context of the experiment to be fully
understood. It is also important that statistical analysis is captured, to ensure that new
results derived from data are electronically accessible, and can be verified. It is only in
the last two years that efforts have been initiated to develop a standard data format for
proteomics, resulting in the PEDRo proposal released in January 2003 [315]. A community
wide proteomics standard is still some way off, even though in the related field of microarray
analysis the data format MAGE-ML [297] has become well established in a relatively short
time frame. There are several reasons for the delay in finding a consensus on a standard.
The most significant challenge is the complexity of proteomics experiments compared with
microarrays. The identity of each feature on a microarray is known in advance, and matching
data points across a set of microarrays is a trivial task. In proteomics, protein spots on a
2-D gel have to be identified by some process, which may be error prone; single spots may
contain multiple proteins; and multiple different forms of the same protein can appear in
several positions on one gel. The reproducibility of 2-DE has improved greatly but is still
far from 100%. There are also various statistical models of the match quality for proteins
identified with MS data, but no single standard that can be compared across experiments.
The result is that a single proteomics data set is complex, and the experimental methodology
is rarely homogeneous across different laboratories. This presents a major challenge because
it is difficult to create a model that captures all the methods used, and data that may arise
in proteomics experiments. In consequence, heterogeneous data formats are used, which are
difficult to load into a central database that supports queries over experiments produced by
different laboratories.
The expression of all the proteins in a disease sample compared with a normal sample
can facilitate understanding of the disease process, but it can also provide an
additional level of information to sequence databases [4]. For example, if experiments to
determine the proteome of human liver cells reveal that a specific protein is abundantly
expressed, the information is functionally significant and should be available to researchers
accessing sequence databases. Additionally, gel spots analysed by MS may reveal peptides
that match a region of genomic sequence that has not been annotated. Therefore, the peptide
sequences can be used to discover new genes, or edit incorrectly annotated genes.
The global protein profile generated by experiments depends upon the conditions under
which the sample was produced and processed, prior to separation by 2-DE. The data may
be valuable to researchers in diverse fields, who could obtain new results from data sets
originally intended for another purpose. Therefore, it is vital that experimental protocols are
rigorously documented, according to a shared standard, and stored in a structured format
that allows searches over biological conditions, such as species, cell or tissue type, or over
experimental conditions, such as gel constituents, stain or MS instrument parameters.
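The kind of search that a structured format makes possible can be sketched as follows. The
field names and records are hypothetical and are not taken from any of the models discussed
in this chapter; the sketch only shows that, once conditions are stored as named fields
rather than free text, biological and experimental criteria can be combined in one query.

```python
# Hypothetical experiment records; field names and values are illustrative only.
EXPERIMENTS = [
    {"id": "E1", "species": "Toxoplasma gondii", "tissue": "whole cell",
     "stain": "silver", "ms_instrument": "MALDI"},
    {"id": "E2", "species": "Homo sapiens", "tissue": "liver",
     "stain": "silver", "ms_instrument": "MS/MS"},
    {"id": "E3", "species": "Homo sapiens", "tissue": "liver",
     "stain": "Coomassie", "ms_instrument": "MALDI"},
]


def find_experiments(records, **criteria):
    """Return the records whose fields equal every supplied criterion."""
    return [
        record for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


# A biological condition and an experimental condition in a single query.
liver_silver = find_experiments(EXPERIMENTS, species="Homo sapiens",
                                tissue="liver", stain="silver")
maldi_ids = [r["id"] for r in find_experiments(EXPERIMENTS, ms_instrument="MALDI")]
```

The same query over free-text methods sections would require text mining, and would be
unreliable wherever laboratories describe the same condition in different words.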
3.1.4 The status of proteomics standards
The Proteomics Standards Initiative (PSI) [257] was formed by the Human Proteome Organisation
(HUPO). So far, there have been two annual meetings at the European Bioinformatics
Institute [236, 237] and one meeting in Nice, France in 2004. The PEDRo proposal for a standard
was released to demonstrate that a universal proteomics data format could be feasible,
and to stimulate discussion from the proteomics community about the requirements for a
standard (described in detail in Section 3.3). PEDRo focuses on the experimental techniques
used by proteomics researchers. Gla-PSI was developed at the same time as PEDRo, but
was modified following the release of PEDRo to model in more detail the protein data that
arises in a proteomics experiment (described in Section 3.4). Gla-PSI models 2-DE data, difference
gel electrophoresis, image analysis and statistical analysis of large data sets (Figure
3.1). These data types are not adequately covered in PEDRo and therefore Gla-PSI acts as
a proposal for additional information that should be captured in the community standard.
Gla-PSI allows researchers to store data from any of the image analysis applications that
are available. Statistical analyses performed on data produced from image processing, such
as software, algorithms and the associated parameters, can also be captured. The model
is further specialised to manage difference gel electrophoresis data. Gla-PSI links spots
visualised on a gel to proteins that have been identified by MS. The model is not a proposal
[Figure 3.1 diagram: Overview of a Proteomics Experiment. Labelled components: Biological Sample (three samples), Experiment Design, Solubilisation, 2D-PAGE, Image Analysis (spot tables with the columns ID, Vol, X, Y and Protein), Mass Spectrometry (MALDI, MS/MS), Search, Sequence Database, Protein Identification, Statistical Analysis and Global Expression Profile. Legend: Sample Flow, Data Flow.]
Figure 3.1: The data flow in a proteomics experiment. The parts of the analysis covered by Gla-PSI are boxed.
for annotation standards for MS; however, there are a number of groups working towards a
standard for MS under the auspices of PSI, described in Chapter 2. PSI will oversee the
development of a complete model for proteomics that encompasses sample origin, 2-DE and
MS. The current status of proteome standards is presented in Section 3.5.
A new model, PSI-OM (PSI object model), is under development following several workshop
meetings. The new model has evolved from PEDRo and includes part of the data
model from Gla-PSI. PSI-OM will ultimately be merged with the microarray data model to
form a single unified standard for functional genomics, as described in the following chapter.
It has been recognised during the development of microarray standards that controlled
vocabularies (ontologies) are critical for the creation of systems that have enough flexibility
to capture a wide range of experiment types, and allow the information to be queried in
complex ways. An ontology for proteomics is under development, as described in Section
3.5.3. A major contribution towards microarray standardisation was the release of a set
of guidelines for researchers wishing to publish, known as MIAME (Minimum Information
About a Microarray Experiment) [41]. A similar effort is underway in proteomics that will
be released in late 2004 or early 2005 (Section 3.5.4).
The rest of the chapter is structured as follows. Section 3.2 describes the methodology
used to develop Gla-PSI, and how requirements capture was carried out. The previous work
in proteomics data formats and standards is given in Section 3.3. A detailed description
of Gla-PSI is given in Section 3.4. The future development of a community wide proteome
standard, an ontology and guidelines for publication are described in Section 3.5. Section 3.6
includes a discussion of the importance of standards for proteomics, and the current status
of public access to proteomics data.
3.2 Methods
The early stages of developing a standard involved the creation of a prototype database for
2-DE and MS data by the author for a Master’s degree by research [174]. The database highlighted
the challenges of integrating heterogeneous data types, and capturing experimental
protocols, in a structured format. The prototype demonstrated that many types of questions
that biologists posed could not be answered using the current technology, a problem that would
be solved by the development of a central repository and appropriate query tools.
Case studies of proteomics investigations have been carried out (Chapter 1), which
demonstrated the requirement for new bioinformatics tools to facilitate the analysis of large
protein data sets. The case studies also highlighted significant challenges in data integration
and systems development, and found several areas in which proteomics techniques are
employed:
• Proteome cataloguing: determine the entire set of proteins expressed in a cell type,
organelle or microorganism.
• Hypothesis generation: discover proteins whose function may be important in the
condition of interest.
• Protein regulation: discover sets of proteins that share patterns of expression across
a range of sample conditions.
• Correlating gene and protein expression.
• Post-translational modifications: these include phosphorylation, glycosylation
and acetylation.
The case studies also revealed that a critical factor required for aiding proteomics research
is the development of a data standard. Therefore, Gla-PSI was initiated to model data
from 2-DE, difference gel electrophoresis, image analysis and statistical processing. The
development of the model was driven by analysis of real data sets, and an understanding
of the types of queries that researchers would like to pose. The experimental basis for Gla-
PSI was established over a significant period in which requirements capture was performed
(Table 3.1). A number of interviews were held with principal investigators in laboratories
performing proteomics investigations. Time was also spent shadowing bench researchers to
gain a better understanding of the techniques involved in the research. Finally, literature
surveys were performed into functional genomics investigations, databases for life sciences,
and data standards in other fields to learn what procedures are commonly used to model
complex domains. During the development of Gla-PSI, regular meetings were held to present
the model to biological researchers, gaining feedback to ensure that a database based on the
model would cover all the data types that are required.
The data flow shown in Figure 3.1 outlines the stages in which information must be
captured in a proteomics experiment, and the boxed area represents the part of the analysis
covered by Gla-PSI. Gla-PSI is expressed in Unified Modeling Language [324] (UML,
described in Chapter 2) and was developed using the UML modelling tool Rational Rose
[266]. Gla-PSI comprises class diagrams in UML to represent the concepts, objects and
relationships in a proteomics experiment.
Name | Position | Meetings | Time span | Description
Dr Jonathan Wastling | Principal investigator | 50 | 2001-2004 | Dr Wastling runs a laboratory that uses proteomic techniques to investigate parasitology. Many meetings were held in which different proteomic technologies were discussed along with the computational challenges they present.
Aude Foucher | PhD student | 5 | 2001 | Miss Foucher supplied data sets for the first prototype database and evaluated the system.
Adrian Cohen | PhD student | 5 | 2001 | Mr Cohen used proteomics to catalogue the expressed proteins in the parasite Toxoplasma gondii and supplied test data for the first prototype database.
Dr Chris Ward | Post-doctoral researcher | 5 | 2002-2003 | Dr Ward presented his work using proteomics to identify the proteome of an organelle from Toxoplasma gondii and supplied data for testing.
Prof. Walter Kolch | Principal investigator | 5 | 2003-2004 | Prof. Kolch is head of a laboratory at the Beatson Institute for Cancer Research. The future developments of proteome databases have been discussed on several occasions.
Alex von Kriegsheim | PhD student | 3 | 2003 | Mr von Kriegsheim is a researcher at the Beatson Institute for Cancer Research and performs DIGE analysis. The coverage of the Gla-PSI model was discussed in a series of meetings.
Morag Nelson | PhD student | 30 | 2002-2004 | Miss Nelson is investigating the differential expression of proteins in host cells when invaded by a parasite, compared with non-invaded cells. Miss Nelson produced the data that is analysed in Chapter 6.
Prof. Mike Turner | Principal investigator | 5 | 2003-2004 | Prof. Turner is head of a laboratory that investigates the mechanism of action of trypanosomes and malaria. One of the techniques employed is proteomics. There have been several discussions of the requirements for the database and the annotation of the genome sequence.
Anne Faldas | Research assistant | 20 | 2003-2004 | Miss Faldas is cataloguing the proteome of the parasite Trypanosoma brucei (Chapter 7).
Table 3.1: A summary of the interviews held with researchers to formulate an understandingof proteomics research.
3.3 Previous work
Gla-PSI was released as a proposal for information that should be captured in a community
standard for proteomics, in addition to what is captured in PEDRo. In this section a
detailed description of PEDRo is given, along with a brief description of other data formats
for proteomics.
3.3.1 SWISS-2DPAGE
The SWISS-2DPAGE system was first established in the early 1990s as a web repository of
2-D gel data [153]. The web interface contains gel images overlaid with a map of spots which
has hyperlinks to other web pages for individual spot records. The spot records can be linked
to corresponding entries in the protein sequence database Swiss-Prot. The functionality of
the database is discussed in more detail in Chapter 5. The system utilises a textual data
format for specifying 2-DE and protein spot data, which is similar to the format of the Swiss-
Prot database, and was considered as a candidate format during the standardisation process
(see the SWISS-2DPAGE website for a sample record [309]). The format contains some
information about how the protein was identified, such as the peaks produced from mass
spectrometry, and can incorporate links to bibliographic references and other databases.
However, there is limited information about the protocols employed to create the gel. The
format does not include the method of scanning to create the gel image, or the software used
to analyse the image. The minimum set of information that must be supplied is also very
limited; therefore, certain entries contain only the protein name, species of origin
and identifiers for the protein and gel. A data standard for proteomics requires a wider and
more complex specification of the minimum information that should be captured for each
protein entry.
3.3.2 GELBANK and HUP-ML
A similar format is produced by the GELBANK database (the data format is displayed in
Babnigg and Giometti 2004 [20]). The GELBANK text format is similar to SWISS-2DPAGE
but contains slightly different information about the gel protocol, and has different formatting.
Protein spots are stored with the following information: gel position, the observed
molecular weight (MW) and isoelectric point (pI) of the protein, the theoretically calculated MW and
pI, the protein name and its accession number. There is no current facility for linking to MS
data that would enable the quality of the protein match to be assessed. The Japanese Human
Proteomics Organisation (J-HUPO) has also produced a proteomics data format, HUP-ML
(HUman Proteome Markup Language) represented in XML. HUP-ML has been presented at
past PSI meetings, and contains more detailed information about sample processing prior to
2-D gel electrophoresis. There is a DTD (document type definition) available for validating
HUP-ML [160]. The developers of HUP-ML are committed to the PSI development process
and will produce a mapping from HUP-ML to the finalised standard of PSI.
Figure 3.2: The complete PEDRo model represented in UML, reproduced from [315].
Figure 3.3: The classes that record biological samples in PEDRo, reproduced from [315].
3.3.3 PEDRo
The Proteomics Experiment Data Repository (PEDRo) from the University of Manchester
was created to address the requirements for a proteomics standard and covered four parts
of the analysis: sample generation, sample processing, MS protocols, and MS data analysis.
The complete PEDRo model is displayed in Figure 3.2; the four parts of the analysis are
represented by different shading in the four sections of the model. The sample generation
part is shown in Figure 3.3. An overview of the experimental hypothesis and citations for
methods and results are captured in the class Experiment. There is a relationship to the classes
Sample and SampleOrigin for recording basic details about the type of material on which the
experiment is being performed, along with genotype information in Organism. PEDRo was
originally designed for capturing data from experiments with yeast; therefore, the description
of a sample is focused on cell cultures and has very limited facilities for recording any detailed
phenotype information about larger organisms.
Protein separation in PEDRo
Figure 3.4 summarises the classes for capturing protein separation techniques. The Sample
class is a subclass of Analyte (Figure 3.3) and separation techniques are modelled as subclasses
of AnalyteProcessingStep. The substance on which a separation technique is performed
(the input) is modelled by a relationship from Analyte to AnalyteProcessingStep.
Sample, a subclass of Analyte, is thus directly related to the first separation technique
(AnalyteProcessingStep) performed on it. The separation techniques are modelled by
Figure 3.4: The part of PEDRo covering protein separation techniques, reproduced from [315].
classes, such as Gel, Column and ChemicalTreatment. The products of separation (outputs)
are modelled by the classes GelItem, Fraction and TreatedAnalyte. The inheritance relationship
enables a series of treatments to be specified where the product (output) of one
treatment becomes the input for another. 2-D gel data is represented by the attributes in
GelItem, and spots matched between gels can be captured in RelatedGelItem. The method
used to perform comparative gel analysis is not recorded in PEDRo.
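The cycle between Analyte and AnalyteProcessingStep can be sketched in a few lines. The
class names follow PEDRo as described above, but the attributes, the produce method and the
processing_history helper are assumptions made for illustration and are not part of the
PEDRo specification.

```python
class Analyte:
    """Anything a separation or treatment can be applied to."""

    def __init__(self, name):
        self.name = name
        self.produced_by = None  # the step that created this analyte, if any


class AnalyteProcessingStep:
    """A technique applied to one input analyte, yielding output analytes."""

    def __init__(self, input_analyte):
        self.input_analyte = input_analyte
        self.outputs = []

    def produce(self, output):
        # Hypothetical helper: record the output and link it back to this step.
        output.produced_by = self
        self.outputs.append(output)
        return output


# Inputs and outputs are all analytes, so an output can feed another step.
class Sample(Analyte): pass
class GelItem(Analyte): pass
class Fraction(Analyte): pass
class TreatedAnalyte(Analyte): pass

# Separation techniques are kinds of processing step.
class Gel(AnalyteProcessingStep): pass
class Column(AnalyteProcessingStep): pass
class ChemicalTreatment(AnalyteProcessingStep): pass


def processing_history(analyte):
    """Walk back from an analyte to the original sample, listing the steps in order."""
    steps = []
    while analyte.produced_by is not None:
        step = analyte.produced_by
        steps.append(type(step).__name__)
        analyte = step.input_analyte
    return list(reversed(steps))


sample = Sample("example sample")
spot = Gel(sample).produce(GelItem("spot 17"))
digest = ChemicalTreatment(spot).produce(TreatedAnalyte("digest of spot 17"))
history = processing_history(digest)
```

Because GelItem, Fraction and TreatedAnalyte inherit from Analyte, any chain of treatments
can be recorded without adding a new class for each combination of techniques.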
Mass spectrometry in PEDRo
The third section models the type of ion source for a mass spectrometer and the machine
parameters (Figure 3.5). The protein sample, and its analysis, are represented by the relationship
from Analyte to MassSpecExperiment, enabling a link to a gel spot, column
fraction or output from another type of treatment.
MS data itself is represented in the fourth part of the model (Figure 3.6). The data in
MS is typically a list of peaks from an MS trace. Database searches that are carried out
to identify proteins from the MS data are captured by DBSearch and DBSearchParameters.
Peptides that are matched by the data are represented by PeptideHit and protein records
Figure 3.5: The model of MS ionisation and protocol in PEDRo, reproduced from [315].
Figure 3.6: MS data and database searches modelled in PEDRo, reproduced from [315].
that have been matched are modelled by ProteinHit and Protein. There is a relationship
between ProteinHit and RelatedGelItem that enables a direct link from gel spots to the
proteins to which they have been matched, without traversing the entire set of MS data and
analysis. There are a large number of attributes in most of the classes that are representative
of the properties of the experiment that researchers may wish to store. However, for certain
concepts it is very difficult to cover all the possible attributes; for example, different database
search programs offer a large range of parameters that cannot all be explicitly specified in
the model. Therefore, the class OntologyEntry is used to specify additional attributes that
can be added where required, by obtaining the relevant term from a controlled vocabulary.
The development of an ontology for proteomics is introduced in Section 3.5.3, and there is a
detailed discussion of ontology usage in the following chapter.
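The role of OntologyEntry as an extension point can be sketched as follows. The class names
DBSearchParameters and OntologyEntry come from PEDRo, but every attribute, the
add_parameter method and the vocabulary of categories shown here are assumptions made for
illustration.

```python
class OntologyEntry:
    """A category and value pair whose category comes from a controlled vocabulary."""

    def __init__(self, category, value, description=""):
        self.category = category
        self.value = value
        self.description = description


class DBSearchParameters:
    """Search settings: a few fixed attributes plus open-ended ontology entries."""

    def __init__(self, program, database):
        self.program = program      # fixed attribute (name assumed)
        self.database = database    # fixed attribute (name assumed)
        self.additional = []        # parameters not anticipated by the model

    def add_parameter(self, entry, vocabulary):
        # Reject categories that are not defined in the controlled vocabulary.
        if entry.category not in vocabulary:
            raise ValueError(f"unknown category: {entry.category}")
        self.additional.append(entry)


# Hypothetical controlled vocabulary of parameter categories.
ALLOWED_CATEGORIES = {"MissedCleavages", "FixedModification", "TaxonomyFilter"}

params = DBSearchParameters(program="example search program", database="example database")
params.add_parameter(OntologyEntry("MissedCleavages", "1"), ALLOWED_CATEGORIES)
params.add_parameter(OntologyEntry("TaxonomyFilter", "Toxoplasma gondii"), ALLOWED_CATEGORIES)
stored = {entry.category: entry.value for entry in params.additional}
```

The model itself stays small, while parameters specific to one search program can still be
stored and queried, provided the vocabulary defines the relevant category.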
3.4 Gla-PSI: A model for 2-D gel electrophoresis and analysis
This section includes a detailed breakdown of the components in Gla-PSI. The UML concepts
of classes and attributes are used to represent objects in a proteomics investigation, and
relationships have been created between classes to model the links between items in an
experiment. The complete model is shown in Figure 3.7, and the following sections describe
each part of the analysis in turn. A case study demonstrating how the model captures data
from a difference gel electrophoresis experiment is given in Appendix D.
3.4.1 Overview of the experiment and protein extraction
Gla-PSI does not contain a complete proposal for describing the overview of an experiment;
however, we believe that there are classes in MAGE-OM that can adequately describe the
hypothesis of a proteomics investigation and the biological samples used. Experimental
protocols for protein extraction and solubilisation can also be described in MAGE-
OM. In our original publication describing Gla-PSI [176], the exact details of how protein
samples and protocols can be recorded in MAGE-OM were not given; however, the following
chapter describes the complete integration.
3.4.2 Two-dimensional gel electrophoresis
A complex mixture of proteins can be separated by a number of techniques, including: two-
dimensional gel electrophoresis (2-DE), chromatography, affinity column and others. Gla-PSI
is focused around 2-DE, which is the most widely used technique for protein separation in
[Diagram labels recovered from the figures in this section. Classes: IDEvidence, MassSpec, Database (version : String, URI : String) and Identifiable (identifier : String, name : String). Legend: new classes in the model; classes derived from MAGE or PEDRo. Notes: The stages preceding image analysis have been presented in the models MAGE (http://www.mged.org) and PEDRo (http://pedro.man.ac.uk). All classes are subclasses of Identifiable and Describable (not shown). Therefore, all classes can have an identifier attached and be linked to annotation classes.]
Figure 3.10: The relationship between spot data (Spot) and identified proteins (Protein).The evidence for a spot being matched to a protein, such as MS data, can be added to therelationship, although Gla-PSI does not have a specification of MS data.
records the software package and a description of image processing that has occurred.
3.4.4 Protein spots
There are separate classes for spots identified on a gel (Spot), and proteins (Protein) to
which spots may be matched (Figure 3.10). The relationship between Spot and Protein
allows one or more spot records to be linked to one or more protein records. The cardinality
is displayed by 0..n to 0..n on the relationship between the two classes. The relationship from
Spot to Protein is modelled in this way because there are known instances where a single
spot contains a number of different proteins. In the opposite direction, it is possible that
a particular protein arises in a number of different positions on one gel. The relationship
Figure 3.13: Several classes are subclasses of Identifiable, enabling a unique identifier and name to be attached. Each class is also a subclass of Describable, enabling links to bibliographic references and external database entries to be specified.
3.5.1 An overview of PSI-OM
An overview of the new model is displayed in Figure 3.14. The main features of
the experimental techniques are similar to PEDRo, with a cycle from Analyte to
AnalyteProcessingStep. No effort has yet been made to specify a detailed description
of a sample within the model; however, there is a relationship from SourceInformation
to OntologyEntry to specify characteristics of a sample. At the top level is the class
MIAPEDataSet for clustering a set of related proteomics experiments, below which is the
top level of one complete analysis (Project). The concept of a StudyGroup has been in-
troduced for comparing one set of samples with another. For example, an experiment is
performed to compare mice with a gene knockout X, against wild-type mice. Ten gels are
performed, of which five are replicates from pooled samples of knockout mice tissue, and five
are replicates from wild-type. An instantiation of PSI-OM would contain one instance of
Project and two instances of StudyGroup (one for wild-type and one for knockout). The
source of biological material is captured in the class Source. The model allows either 10
sources of material to be specified for biological replicates (10 different mice), or two sources
of protein that are subsequently split, using the class Subdivision, to specify technical
replicates.
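The knockout example above can be sketched as a small instantiation. This is a minimal illustration in Python rather than part of the draft standard: the class names Project, StudyGroup, Source, Subdivision and AnalytePortion are taken from PSI-OM, but every attribute name, both helper functions and the direct link from Subdivision to Source are assumptions made for brevity (in the draft model a subdivision operates on an Analyte).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Source:
    """Origin of biological material (class name from PSI-OM; field assumed)."""
    identifier: str


@dataclass
class AnalytePortion:
    """One portion produced by a Subdivision; here, the material run on one gel."""
    identifier: str


@dataclass
class Subdivision:
    """Splits the material from one source into portions (technical replicates)."""
    source: Source
    portions: List[AnalytePortion]


@dataclass
class StudyGroup:
    """A set of samples that is compared against another set."""
    name: str
    sources: List[Source] = field(default_factory=list)
    subdivisions: List[Subdivision] = field(default_factory=list)


@dataclass
class Project:
    """Top level of one complete analysis."""
    hypothesis: str
    study_groups: List[StudyGroup] = field(default_factory=list)


def technical_replicate_group(name: str, n_gels: int) -> StudyGroup:
    """One pooled source, subdivided into n_gels portions (hypothetical helper)."""
    pooled = Source(f"{name}-pool")
    portions = [AnalytePortion(f"{name}-gel{i + 1}") for i in range(n_gels)]
    return StudyGroup(name, [pooled], [Subdivision(pooled, portions)])


def biological_replicate_group(name: str, n_animals: int) -> StudyGroup:
    """One source per animal and no subdivision (hypothetical helper)."""
    return StudyGroup(name, [Source(f"{name}-mouse{i + 1}") for i in range(n_animals)])


# Ten gels: five technical replicates per study group, each from one pooled source.
project = Project(
    hypothesis="Knockout of gene X alters protein expression",
    study_groups=[
        technical_replicate_group("knockout", 5),
        technical_replicate_group("wild-type", 5),
    ],
)
```

Replacing `technical_replicate_group` with `biological_replicate_group` gives the alternative reading of the same experiment, in which each gel is traced to a different mouse.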
Figure 3.14: A draft version of the main components of PSI-OM. Classes shown in the diagram: MIAPEDataSet, Project (hypothesis), StudyGroup, StudyGroupDataSet, StudyDescription (experimentalFactor), Description, RunDetails, Protocol, Source, SourceInformation, OntologyEntry, Analyte, AnalyteProcessingStep, OtherAnalyte, OtherAnalyteProcessingStep, AnalytePortion, Subdivision, CombinedAnalytes, Combination, TaggedAnalyte, TaggingProcess, Column, ColumnRun, SampleLoading, Fraction, MobilePhaseComponent, PercentOfComponent, Timepoint, Gel1D, Gel1DLane, PhysicalBand, Gel2D, PhysicalGelSpot, Analysis and ExpressedProtein. Association role names shown: type, characteristics.
Figure 3.15: Part of PSI-OM showing the relationships between spots identified on a gel and the corresponding protein records. Classes shown in the diagram: Gel1D, Gel2D, Image (URI : Str...), ImageAcquisition, Analysis, StudyGroupDataSet, PhysicalBand, PhysicalGelSpot, Fraction, IdentifiedBand, IdentifiedSpot, DIGECompositeSpot, ExpressedProtein, ProteinModification, ProteinRecord, DatabaseEntry, OntologyEntry, MSDataCapture and MSDataSet. Association role names shown: scans1DGel, scans2DGel, createsImage, type, proteinIdentification, containsProtein. Diagram note: see the DataModel diagram for the link between Image, ImageAnalysis and IdentifiedSpot / Band.
3.5.2 Data model in PSI-OM
The diagram in Figure 3.15 displays the overview of a proteomics data set. A number
of experiments are packaged together using the class StudyGroupDataSet. The core data
point is an ExpressedProtein which can be linked to a set of classes describing the result
of separation techniques (PhysicalGelSpot, PhysicalBand, Fraction and so on). The
class ExpressedProtein will capture a complex concept, as follows. In a 2-DE experiment
particular proteins may appear in multiple positions on a 2-D gel, which may be the result of
differential splicing of gene products or chemical modifications to the protein. These variant
forms of the protein will usually only be identified by a single protein name or accession
number; however, it is vital that the alternative forms are differentiated in the model. An
ExpressedProtein is intended to capture the idea of a single protein form that arises in one
position on a gel, or in one column fraction, resulting from the set of modifications that it
has. If the nature of the modification is known, it can be captured in ProteinModification,
and a reference to a record in a sequence database can be captured in ProteinRecord and
DatabaseEntry. The current model has no detailed specification for MS standards because
these are in development by a separate organisation, and will be added to the model when
finalised.
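The distinction between a protein record and its expressed forms can be illustrated with a short sketch. The class names follow the draft model, but the attributes (including the spot coordinates), the helper function, and the database name and accession are placeholders introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DatabaseEntry:
    """Reference to an entry in an external sequence database."""
    database: str
    accession: str


@dataclass(frozen=True)
class ProteinRecord:
    """The protein name and database reference shared by all variant forms."""
    name: str
    entry: DatabaseEntry


@dataclass(frozen=True)
class ProteinModification:
    """A known modification; the type would be an ontology term in the model."""
    type: str


@dataclass(frozen=True)
class PhysicalGelSpot:
    """A spot on a 2-D gel; the coordinate fields are assumptions."""
    gel: str
    x: float
    y: float


@dataclass
class ExpressedProtein:
    """One protein form arising at one position, with its set of modifications."""
    record: ProteinRecord
    spot: PhysicalGelSpot
    modifications: List[ProteinModification] = field(default_factory=list)


def forms_of(record: ProteinRecord,
             proteins: List[ExpressedProtein]) -> List[ExpressedProtein]:
    """Return every expressed form that refers to the same protein record."""
    return [p for p in proteins if p.record == record]


# Placeholder database name and accession, not real identifiers.
record = ProteinRecord("protein X", DatabaseEntry("ExampleDB", "X00001"))

observed = [
    ExpressedProtein(record, PhysicalGelSpot("gel-1", 120.0, 340.0)),
    ExpressedProtein(
        record,
        PhysicalGelSpot("gel-1", 95.0, 338.0),
        [ProteinModification("phosphorylation")],
    ),
]

for form in forms_of(record, observed):
    labels = [m.type for m in form.modifications] or ["unmodified"]
    print(form.record.name, (form.spot.x, form.spot.y), labels)
```

Both forms share one ProteinRecord, so a query by accession finds them together, while each ExpressedProtein keeps its own position and modifications.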
Figure 3.16: A draft version of the protein data model in PSI-OM. The classes on the left model conventional 2-DE and the classes on the right represent difference gel electrophoresis. Classes shown in the diagram: Image (URI : String), ImageAnalysis, MultipleGelAnalysis, SpotsMatchedAcrossGels, SingleGelSpotSet, IdentifiedSpot, SpotMeasurement (value : Double), OntologyEntry, DIGEAnalysis, DIGESpotSet, DIGESingleSpotSet, DIGESingleSpot and DIGECompositeSpot. Association role names shown: format, compositeImage, unit, type.
The draft model of protein spot data arising from image analysis has been influenced
by Gla-PSI, and is displayed in Figure 3.16. There are two separate sets of classes for
modelling gel electrophoresis data. The classes on the left of Figure 3.16 model standard
gel electrophoresis, in which one sample is applied to one gel, and multiple samples are
compared on different gels. The classes on the right model data resulting from a DIGE
experiment, in which there are two kinds of spot data: spots arising from scanning a gel at
a single wavelength (DIGESingleSpot), and spots arising from a composite image that has
been calculated from the single channel images (DIGECompositeSpot). The attributes that
will be assigned to classes are still to be finalised, but one issue that must be resolved is the
extent to which ontologies will be utilised. It is possible to include many attributes in the
model for describing protein data, or put the types of attributes in a controlled vocabulary
and link many classes to OntologyEntry. This is an area for future discussion but we believe
that there are considerable advantages to using ontologies extensively, because the controlled
vocabularies can be updated at regular intervals, allowing gradual evolution of the coverage
of the model. It is not possible to update an object model at regular intervals without
generating backward compatibility problems.
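The trade-off can be made concrete with a short sketch. It assumes the arrangement suggested by the draft diagram, in which SpotMeasurement has a single value and links to OntologyEntry for its type and unit. The ControlledVocabulary class, its methods and all of the terms are hypothetical and serve only to show that a new kind of measurement requires a vocabulary update rather than a change to the object model.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class OntologyEntry:
    """A term taken from a controlled vocabulary."""
    category: str
    value: str


@dataclass(frozen=True)
class SpotMeasurement:
    """A single measured value whose meaning is given by an ontology term."""
    value: float
    type: OntologyEntry
    unit: Optional[OntologyEntry] = None


class ControlledVocabulary:
    """A set of permitted terms that can grow without changing any class."""

    def __init__(self, category: str, terms: Set[str]) -> None:
        self.category = category
        self._terms = set(terms)

    def add(self, term: str) -> None:
        """Extend the vocabulary, for example at a scheduled release."""
        self._terms.add(term)

    def entry(self, term: str) -> OntologyEntry:
        """Return an OntologyEntry, rejecting terms outside the vocabulary."""
        if term not in self._terms:
            raise ValueError(f"{term!r} is not in the {self.category} vocabulary")
        return OntologyEntry(self.category, term)


# Hypothetical starting vocabulary for measurement types.
measurement_types = ControlledVocabulary("SpotMeasurementType", {"volume", "area"})

volume = SpotMeasurement(1532.0, measurement_types.entry("volume"))

# A new kind of measurement is a vocabulary update, not a new class attribute.
measurement_types.add("normalised volume")
normalised = SpotMeasurement(0.42, measurement_types.entry("normalised volume"))
```

Data written before the vocabulary update remains valid afterwards, which is the backward compatibility property that an attribute-based model would not offer.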
3.5.3 An ontology for proteomics
The original PEDRo model used ontologies sparingly, taking the approach that an initial
model for proteomics should function as a document for specifying the main components of
a typical workflow to stimulate discussion in the community. The Gla-PSI proposal specifies
that ontologies are required to capture certain parts of the analysis, but there are currently no
major ontologies containing proteomic experimental terms. Therefore, it has recently been
proposed that the MGED Ontology should be extended for proteomics. The MGED Ontology
(MO) includes a controlled vocabulary of terms describing microarray experiments, including
the details of biological samples (described in more detail in the following chapter). There
is no difference between the sample prior to mRNA extraction for a microarray assay or
protein extraction for proteomic analysis, hence parts of MO can describe biological samples
for proteomics. A new ontology, PSI-Ont, is in development and will include terms describing
proteomic experimental techniques. PSI-Ont will be developed as an extension to the MGED
Ontology, and will follow the same structure.
3.5.4 Minimum information about a proteomics experiment
An essential stage in improving the process of exchanging and publishing microarray data
was the release of the MIAME guidelines [41]. MIAME is a checklist of the information that
should be made publicly available to allow the data sets to be re-analysed, or to allow the ex-
periment to be reproduced, if identical biological samples are available. An equivalent effort
has been initiated by PSI to develop MIAPE (Minimum Information About a Proteomics
Experiment). The guidelines will be formalised after a series of meetings and discussions
via the mailing list. In overview, we believe that MIAPE should contain the following. It
is vital that sufficient description of the biological samples is given so that the validity of
each study group can be established. Researchers should also publish the protein extraction
protocols, detailed descriptions of the protein separation techniques, and the equipment and
protocols utilised for MS. Any software that is used to analyse data should be reported with
a version number, vendor name and contact details. If database searches have been carried
out to identify proteins, there should be a date stamp of when the search was carried out if
the database is updated daily, or a version stamp if the database is released less frequently.
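The last requirement is a simple rule that software could check automatically. The sketch below is one possible encoding, given that MIAPE has not been finalised: the class, its field names and the database names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DatabaseSearch:
    """Description of one protein identification search (hypothetical fields)."""
    database: str
    updated_daily: bool
    search_date: Optional[date] = None
    database_version: Optional[str] = None


def missing_stamp(search: DatabaseSearch) -> Optional[str]:
    """Name the stamp that is required but absent, or None if the report is complete.

    A database that is updated daily needs the date of the search; a database
    that is released less frequently needs a version stamp instead.
    """
    if search.updated_daily:
        return None if search.search_date is not None else "search_date"
    return None if search.database_version else "database_version"


# Placeholder database names and values.
daily = DatabaseSearch("ExampleDailyDB", updated_daily=True, search_date=date(2004, 3, 1))
released = DatabaseSearch("ExampleReleasedDB", updated_daily=False)

print(missing_stamp(daily))
print(missing_stamp(released))
```

The second search is reported as incomplete because no version stamp was recorded for a database that is released in versions.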
3.6 Discussion
3.6.1 Web access to data
It has been recognised that past funding for large databases of scientific data has not been
sufficient, and as a result, important information is lost [209]. An activity which attempts
to remedy this situation is the effort to develop biochemical pathway databases, such as
KEGG [184]. Information regarding reaction kinetics and functional information has been
published over several decades, but is not generally available in electronic form. Only papers
published in the last decade may be available on the Internet, and data is not presented in any
kind of format that can be mined automatically. Instead, information retrieval techniques
must be used with significant manual intervention. This process is time consuming and will
miss substantial amounts of information. Today, data regarding one biological system is
often too extensive for a single researcher to gain access to by reading published literature,
and automated methods are required. Microarray experts have previously recognised these
needs and efforts are underway to develop large central repositories [42]. In recent years a
parallel effort has been initiated by proteomics researchers; however, there are currently no
major central repositories of proteomics data [252]. A standard data format will facilitate
the creation of a central repository that will allow re-analysis of published data as new
statistical techniques are developed. Microarray and proteomics experiments generate large
amounts of data that is of potential use to researchers in many other fields. In particular,
the studies can improve genome annotation by demonstrating conditions in which genes or
proteins have been shown to be up or down regulated, allowing researchers to improve the
functional annotation.
3.6.2 Status of proteome standards
This chapter documents the development of the Gla-PSI model, which we released in October
2003. Gla-PSI represents data from one section of a proteomics workflow and complements
other work undertaken by various organisations. PSI is overseeing the development of a
standard, and is using PEDRo as an initial framework from which to develop a unified
model. Gla-PSI covers image analysis of 2-DE, multiple gel comparison, DIGE and statistical
analysis of large data sets, and represents additional information that should be included
in the next version of the community standard. Capturing experimental protocols in a
structured format is a major challenge due to the enormous range of possible experiments
that could be performed. The MAGE format for microarrays has been designed with a flexible
structure that allows it to be extended into new technologies by using ontologies. Gla-PSI
utilises parts of MAGE for adding additional annotation and bibliographic references to
the model. In our original publication on Gla-PSI [176], we stated that classes derived
from MAGE should be used for capturing information about experimental protocols and the
biological samples on which experiments are performed but at that time the integration had
not been completed. The following chapter describes later work, which is the integration
of Gla-PSI, PEDRo and MAGE to create a framework for capturing data from a range of
functional genomics techniques.
In Section 3.5, the development of the next version of the official PSI object model (PSI-
OM) was discussed, which incorporates parts of Gla-PSI and PEDRo. The development of
the object model will take place in conjunction with the creation of an ontology for pro-
teomics (PSI-Ont), which will be regulated by PSI. An important first stage will be the
creation of a document that specifies the minimum information set that must be published
alongside proteomics data to allow future re-analysis (MIAPE). The development of all three
components (PSI-OM, PSI-Ont and MIAPE) will continue with discussions at official meet-
ings of PSI, and via an email mailing list. The development of a finalised standard requires
significant contribution from the proteomics community before consensus can be reached.
The complete model should be flexible with regard to new technologies and experimental
protocols. A data standard should not prescribe how researchers carry out experiments,
but should capture enough detail to ensure that useful data archives can be developed. If
a standard is to be accepted, tools must be developed which enable researchers to capture
data conforming to the standard without substantial manual data entry. Laboratory Infor-
mation Management Systems (LIMS) are available from commercial software vendors. They
capture instrument parameters, and track solutions using bar-coding. It is likely that future
versions will be specifically tailored for proteomics applications, and software vendors should
provide an output file conforming to the proposed standard. A data set containing 2-DE
images, MS traces, analysis and annotation is bulky, so the development of a
single public database covering all aspects of proteomics for all species is unlikely. A more
feasible solution is the development of distributed, domain-specific proteome databases, for
example covering a single organism or disease, with data transfer between databases occurring via an XML
data format, created from the object model. It is essential that databases provide wide
ranging query facilities to enable the development of applications that search for data sets
of interest. Data integration applications will be developed to link proteome databases to
other repositories, such as databases of sequences, motifs and structures.
3.7 Conclusions
Gla-PSI has been developed to represent 2-DE, image analysis, difference gel electrophoresis
and statistical processing. It was initially developed at the same time as the PEDRo proposal;
however, it was later modified and released to document additional information that should
be recorded in a community wide standard. The model has influenced the development of
the next version of the standard, PSI-OM.
The microarray field has recognised the need for central data repositories and exchange
standards for some time. The additional complexity of proteomics experiments means that
the efforts are some way behind, and there are still no databases that offer access to protein
separation information, quantification data and mass spectrometry. The development of
a proteomics data standard will enable data to be sent to a public database. Chapter 5
describes a prototype system that could serve as a centralised public database for proteomics.
The database stores protocols and data from 2-DE and MS, and facilities for integration with
microarray results are demonstrated. We believe that the efforts of MGED in the microarray
field can be used directly for proteomics, and in the following chapter there is a description
of the unification of the proteomics proposals with MAGE-OM, to create a proposal for a
standard across the whole range of functional genomics techniques.
Chapter 4
Development of a data standard for
functional genomics
4.1 Introduction
In Chapter 2, the importance of data standards for life sciences was outlined and this was
further exemplified in the previous chapter with a description of the development of a data
model for proteomics. The success of the MAGE-ML format for microarrays demonstrates
the feasibility of a community wide standard for capturing data from a diverse range of
experiment types. This chapter covers the integration of the Gla-PSI model into a wider
proposal for functional genomics, published in July 2004 [175], which includes
substantial detail from MAGE-OM, and the draft standard for proteomics, PEDRo. The
new model is known as FGE-OM (Functional Genomics Experiment - Object Model) and has
been presented to the standards organisations for proteomics and microarrays as a proposal
for the integration of the current efforts in both fields.
URL: www.gusdb.org/fge.html
4.1.1 Requirements for standards
The motivation for integrating the current proposals for microarrays and proteomics is as
follows. It is becoming common for research groups to carry out experiments using multiple
types of technology as the cost of performing experiments has fallen. Several institutions
have semi-automated facilities offering a service for performing parts of experiments that
were previously very labour intensive. The functional genomics facility in Glasgow is one
example, offering a sequencing, microarray and proteomics service to researchers [293]. Re-
searchers now generate large volumes of data from diverse techniques that they wish to
compare, or analyse side by side. There are several facets of experiments that can be de-
scribed using the same terms. An overview of a functional genomics (FG) experiment can
be described with a text description of the hypothesis, and a parameter that is varied be-
tween different samples, such as the different time points in a time course experiment. The
biological samples used in any type of FG experiment should be described using common
terms because this stage precedes the extraction of mRNA, proteins or metabolites and could
potentially be analysed downstream using any of the experimental techniques. Experimental
protocols from microarrays, 2-DE and other separation techniques can be described as a set
of sequential steps involving substances, actions and equipment. It may also be desirable
that all experiments are annotated with an audit trail, capturing when, where and by whom
the experiments were carried out. Data points in an FG experiment are usually genes or
proteins which may be quantified or localised in one sample compared with another. It is
therefore possible to create a framework containing the common parts of FG analysis as
part of an all encompassing data format. A shared format that has wide community accep-
tance would allow developers to create software capable of formatting all locally generated
FG data into one format that can be exchanged with other researchers or sent to public
databases. The format should be suitably designed such that there is no great overhead if
research groups wish to use only a subset of the entire model, for example if they are only
performing proteomics. It is likely that one single model for FG will require significantly less
effort for developers than creating software to manage four or five separate formats. Finally,
if experimental protocols are captured in a common format it will open up new possibilities
for comparing data produced from different methodologies, allowing researchers to have a
view of the biology that is nearer to the whole system level.
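The idea that protocols from different techniques share one structure can be sketched as follows. This is an illustrative simplification rather than the representation used in MAGE-OM or FGE-OM: the class and attribute names are assumptions, and the steps, location and performer are invented for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ProtocolStep:
    """One step of a protocol: an action applied to substances using equipment."""
    action: str
    substances: List[str] = field(default_factory=list)
    equipment: List[str] = field(default_factory=list)


@dataclass
class Audit:
    """When, where and by whom the protocol was carried out."""
    performed_on: date
    location: str
    performed_by: str


@dataclass
class Protocol:
    """A named, ordered sequence of steps with an optional audit trail."""
    name: str
    steps: List[ProtocolStep]
    audit: Optional[Audit] = None


def describe(protocol: Protocol) -> List[str]:
    """Render the steps in order, independently of the technique."""
    lines = []
    for number, step in enumerate(protocol.steps, start=1):
        substances = ", ".join(step.substances) or "none"
        equipment = ", ".join(step.equipment) or "none"
        lines.append(
            f"{number}. {step.action} (substances: {substances}; equipment: {equipment})"
        )
    return lines


# The same classes describe a proteomics and a microarray protocol.
separation = Protocol(
    "2-DE separation",
    [
        ProtocolStep("isoelectric focusing", ["protein extract"], ["IPG strip"]),
        ProtocolStep("SDS-PAGE", ["focused strip"], ["polyacrylamide gel"]),
    ],
    Audit(date(2004, 1, 15), "Example laboratory", "A. Researcher"),
)
hybridisation = Protocol(
    "microarray hybridisation",
    [
        ProtocolStep("hybridise", ["labelled cDNA"], ["microarray slide"]),
        ProtocolStep("wash", ["wash buffer"]),
    ],
)

for line in describe(separation) + describe(hybridisation):
    print(line)
```

Because both protocols are instances of the same classes, software written to display, store or compare one of them handles the other without modification.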
An integrated data format will also facilitate the development of public repositories for
storage and querying of functional genomics data. Microarray experiments are used widely
because a large number of assays can be performed concurrently, producing a large number
of possible leads about the genes that are significantly associated with a particular condition
or disease. However, while it had previously been believed that there is a correlation between
the expression of mRNA and protein [115], more recent studies have indicated that mRNA
level is a poor indicator of protein abundance [178]. Proteomics experiments can determine
the relative level of protein produced, and would therefore be expected to be a better indicator
of the level of protein activity. Proteomics experiments can also give information about
post-translational modifications, which may have important effects on the function of the
protein [240]. It is therefore desirable that microarray and proteomics data can be queried
in parallel to determine the extent of gene expression and the level of encoded protein that
has been observed for a particular gene. Protein and RNA expression data should also
be accessible with genomic data, to allow better annotation of the genome with functional
information derived from FG studies, such as protein X is up regulated under condition Y.
A current example of this functionality is offered by the SOURCE database [78], which can
be queried by gene name, and returns textual annotation about the gene, and the relative
expression values from different microarray studies in which it has been assayed. Single data
points from a microarray experiment may not be sufficiently powerful to determine how much
active protein was present in the sample at that time, but can provide functional evidence
if a gene is strongly expressed in a sample or condition, or conversely not expressed where
it might be expected. These kinds of results can be assayed by further experimentation and
lead to the formation of new hypotheses about the function of genes and systems as a whole.
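A parallel query of this kind amounts to joining two result sets on the gene and the experimental condition. The sketch below assumes each data set has already been reduced to a mapping from gene to condition to relative expression ratio; the function name, the data layout and every value are invented for illustration and do not come from any real study.

```python
from typing import Dict, List, NamedTuple, Optional

# gene -> condition -> relative expression ratio (assumed layout)
ExpressionData = Dict[str, Dict[str, float]]


class Evidence(NamedTuple):
    """mRNA and protein evidence for one gene under one condition."""
    condition: str
    mrna_ratio: Optional[float]
    protein_ratio: Optional[float]


def query_gene(gene: str, mrna: ExpressionData, protein: ExpressionData) -> List[Evidence]:
    """Return microarray and proteomics results for a gene, side by side.

    A condition assayed by only one technique is still reported, with None
    for the missing measurement.
    """
    mrna_results = mrna.get(gene, {})
    protein_results = protein.get(gene, {})
    conditions = sorted(set(mrna_results) | set(protein_results))
    return [
        Evidence(c, mrna_results.get(c), protein_results.get(c)) for c in conditions
    ]


# Invented values for a hypothetical gene.
microarray = {"geneX": {"condition Y": 2.0, "condition Z": 1.0}}
proteomics = {"geneX": {"condition Y": 0.9}}

for row in query_gene("geneX", microarray, proteomics):
    print(row)
```

In the invented data, the mRNA level doubles under condition Y while the protein level is almost unchanged, which is the kind of discrepancy that a combined query would bring to a researcher's attention.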
Functional genomics databases should also incorporate information from other types
of study: immunohistochemistry and protein interaction studies, such as yeast two-hybrid
[107]. Such systems would enable data mining applications to be developed that search for
the factors that affect regulation of transcription and translation, and ultimately, protein
function. Integrated databases will aid the development of mathematical models capturing
the effects of changes at the system level, and could provide source data for the modelling
of metabolic pathways [336]. Data mining algorithms could then be employed to search for
genes that may be important in a condition of interest, such as drug targets for a particular
disease.
4.1.2 Status of standardisation
Data standards for proteomics, and other FG experiments, are at a much earlier stage than
those for microarrays (Figure 4.1). PEDRo was released as a draft proposal to stimulate community
discussion about what was required in a data standard and, aside from the data capture
tool released with PEDRo (PEDRoDC), there have been few implementations of PEML
(Proteomics Experiment Markup Language), the XML-based data exchange language based
on PEDRo. This is because PEML is a complex format, and therefore considerable effort
is required by developers to create software that produces PEML. Furthermore, the benefits
of producing output in PEML at this time are limited, because there are no major public
repositories that accept PEML as input. There are also several parts of PEDRo that do not
adequately capture a proteomics workflow, the most important being insufficient descriptions
of biological samples, and no support for auditing. These two areas are captured in MAGE,
and this part of the object model has been refined over a significant period by a team
of experienced developers. It is vital that the next round of development in proteomics
standards makes extensive use of the experience gained in the development of MAGE. This
process has already begun with several MAGE developers giving oral presentations at the
2004 meeting of the PSI in Nice, France [257].
Figure 4.1: A time line displaying the emergence of microarray and proteomics technology, and the efforts to standardise data formats. The time axis is labelled 1996 and 1999 to 2006, with one track for microarray standards and one for proteomics standards. Events shown: advent of microarrays; first large scale experiments; formation of MGED; MIAME guidelines published; first object model to OMG; release of MAGE-ML v.1; development of MAGE v.2; formation of PSI; release of PEDRo; release of Gla-PSI; development of PSI-OM; release of FGE-OM and SysBio-OM.
FGE-OM offers a possible framework for developing a standard across all FG experiments;
however, an alternative proposal has been released, known as CEBS (Chemical Effects in
Biological Systems) SysBio-OM. SysBio-OM was released after the creation of FGE-OM,
and was therefore not available for analysis at the time of development. A comparison of the
features offered by the two systems is made in Section 4.4. The future development of MAGE-
OM and the PSI data standard should take place jointly, using FGE-OM and SysBio-OM
as a framework around which it can be coordinated.
FGE-OM captures microarray and proteomics data, including separation techniques such
as two-dimensional gel electrophoresis (2-DE), and protein identification by mass spectrom-
etry (MS). The model also stores experimental protocols, raw data and data analysis. FGE-
OM comprises three namespaces that organise the classes in logical subsets: BioOM, Ar-
rayOM and ProteomicsOM (Figure 4.2). Substantial detail from MAGE-OM has been
used to develop BioOM (the part of the model that is generic), and ArrayOM (the parts
of the model specific to microarrays). BioOM contains a set of packages and classes that
describe an experiment using microarrays, proteomics, or potentially other functional ge-
nomics techniques. The ProteomicsOM namespace captures information from proteomic
specific technologies. The object model has been implemented as a relational database,
known as RAPAD (RNA And Protein Abundance Database), which is described in the
following chapter.
Figure 4.2: An overview of the FGE-OM object model. The model is divided into three namespaces: BioOM (components common to all functional genomics experiments), ArrayOM (microarray specific components) and ProteomicsOM (classes modelling proteomics technologies).
The rest of the chapter is structured as follows. Section 4.2 outlines the methodology used
to create FGE-OM. A detailed description of FGE-OM is given in Section 4.3 and Section 4.4
briefly describes the contents of the alternative SysBio-OM proposal, and compares it with
FGE-OM. Finally, a plan for how the development of an integrated standard for functional
genomics can take place is outlined in Section 4.5.
4.2 Methods
FGE-OM was developed using an evolutionary software development model. MAGE-OM
was imported into a UML editing tool, and changes were made to accommodate proteomics
data. The PEDRo object model has not been released in UML format, however a database
schema has been released in SQL, which matches the object model very closely. Therefore,
the PEDRo database schema was reverse engineered and imported into the editing tool.
Additional classes were added manually from Gla-PSI where required. The initial develop-
ment involved the creation of class diagrams to model parts of proteomics experiments, using
components derived from MAGE-OM where possible. This was followed by a phase of dis-
cussion between several developers to test whether hypothetical proteomics workflows were
adequately covered in the object model. In cases where FGE-OM did not correctly model
a possible workflow, refinements were made to the model. The model was further refined
after the objects had been mapped to relations, and deployed as a relational database. At
the time of development there had been no complete implementation of the PEDRo model
or database schema, therefore several classes had to be refined to reflect real data sets.
FGE-OM was developed in UML using the modelling tool Poseidon™ [249], into which
the source models were imported. Poseidon has the advantage over other tools that there
is a version that is freely available, offering sufficient functionality to view and edit UML
class diagrams. It is vital that as many developers as possible have access to the object
model, beyond being able to view images of class diagrams. The main alternative, Rational
Rose [266], is expensive, which precludes many researchers from analysing models.
There is a major compatibility problem between the UML versions specified by different
vendors. UML is intended to be a standard notation but there is currently no robust method
of transferring models between tools. An interchange format for UML, XML Metadata
Interchange (XMI) [356], has now been defined that may improve compatibility in the future,
but the current implementations of XMI only specify the contents of the model, not the
diagrams that have been drawn to represent the model. Therefore, once an object model
has been imported, diagrams must be redrawn by the developers, which is a laborious task.
4.2.1 Ontologies
An ontology can be described as the result of knowledge capture about a particular domain, in
a formal structure [138]. The use of ontologies in life sciences is rapidly increasing, because it
is believed that they can improve facilities for data re-use and integration [300]. The MGED
Ontology (MO) has been created to capture terms used in a microarray experiment [304].
Each entry contains a term, a definition and a specification for where the term should be used
in the model. An example term viewed with the OilEd editor [28] is displayed in Figure 4.3
(OilEd is described in Chapter 2). The ontology contains classes, properties and instances
(individuals in OilEd). A class is the type of information (e.g. Age), the properties of the
class are its attributes (e.g. “has Measurement” and “Initial time point”) and the actual
values are the instances (e.g. years). There is also a definition of the term, in the case of
Age: “The time period elapsed since an identifiable point in the life cycle of an organism. If
a developmental stage is specified, the identifiable point would be the beginning of that stage.
Otherwise the identifiable point must be specified such as planting.”
Figure 4.3: A screenshot of the term “Age” in the MGED Ontology viewed with OilEd.
The class OntologyEntry from MAGE-OM is used widely to store terms obtained from
controlled vocabularies, along with the source of the vocabulary. Ontologies are vital for
capturing the complexity of biological samples used in functional genomics.
EXAMPLE: Two FG experiments are performed. The biological material of the first is a
cell culture grown in a specific medium, and that of the second is a tissue sample from a person
suffering from heart disease.
It would be extremely difficult to engineer a schema to capture this range of information in
a structured way. For example, without an ontology, a model to capture the species of origin
could be designed with a class Species and an attribute scientificName. However, this
can pose major problems for querying because a name can be represented in different ways,
for example through abbreviations, different classification systems and user errors. This problem was
avoided in MAGE-OM by designing classes that have a relationship to OntologyEntry, for
instance one called speciesName. The model would be instantiated by obtaining the value from
a taxonomic database, along with an ID number and a URL pointing to the source data.
In FGE-OM, OntologyEntry is used in this way in all three namespaces, and many of the
terms in the MGED Ontology can be used for both microarrays and proteomics.
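The species case can be illustrated with a minimal Python sketch. The field names used here (category, value, accession, source_url) and the sample names are assumptions made for illustration and are not the attribute names defined in MAGE-OM or FGE-OM; the URL is a placeholder. The sketch only shows why a query on an identifier taken from the source vocabulary is more robust than a query on a free-text name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyEntry:
    """A term taken from a controlled vocabulary, plus where it came from."""
    category: str    # e.g. "speciesName"
    value: str       # the term as given by the source vocabulary
    accession: str   # ID number of the term in the source database
    source_url: str  # URL pointing to the source data (placeholder here)


# Free-text spellings of one species: four distinct strings, so a query on
# the name alone would miss most of them.
free_text_names = ["Mus musculus", "M. musculus", "mus musculus", "Mus muscullus"]

# Entries obtained from a taxonomic database share one accession.
# "10090" is the NCBI Taxonomy identifier for Mus musculus.
samples = {
    "heart_tissue_1": OntologyEntry("speciesName", "Mus musculus", "10090",
                                    "https://example.org/taxonomy/10090"),
    "heart_tissue_2": OntologyEntry("speciesName", "Mus musculus", "10090",
                                    "https://example.org/taxonomy/10090"),
}


def samples_for_species(entries, accession):
    """Return the sample names whose species entry has the given accession."""
    return sorted(name for name, e in entries.items() if e.accession == accession)
```

Calling samples_for_species(samples, "10090") returns both samples regardless of how the name might have been typed by a user.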
EXAMPLE: A comparative 2-D gel analysis is being used on tissue from the hearts of two
samples of mice, one of which has a genetic defect. One characteristic that the researchers
want to capture is the gender of the mice.
The gender is specified by a relationship from the class BioMaterial to OntologyEntry
called Characteristics. OntologyEntry captures the category (Gender), the value (Male)
and the term’s definition. In many cases the usage is more complex because classes in the
ontology can have subclasses to build up a hierarchical structure; in fact, Gender is a subclass
of Sex. The hierarchy is expressed in the object model by a reference from one instance of
OntologyEntry to another. The overall effect of using ontologies is to delegate the
task of describing the domain to a separate process, ontology development, instead of
representing all concepts in the object model. This is advantageous because ontologies can
be easily extended without affecting the core functionality, whereas an object model must stay
fixed for a significant period of time, and cannot gradually evolve.
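A minimal sketch of the gender example follows, assuming a simplified OntologyEntry with a single optional reference to a broader entry. The attribute names (parent, characteristics) and the helper has_characteristic are hypothetical and chosen only to show how a reference between entries lets a query on a superclass find terms recorded under a subclass.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OntologyEntry:
    category: str
    value: str
    definition: str = ""
    parent: Optional["OntologyEntry"] = None  # reference to a broader entry

    def ancestors(self):
        """Yield the categories of all broader entries, nearest first."""
        entry = self.parent
        while entry is not None:
            yield entry.category
            entry = entry.parent


@dataclass
class BioMaterial:
    name: str
    characteristics: List[OntologyEntry] = field(default_factory=list)


# Gender is treated as a subclass of Sex, as stated in the text.
sex = OntologyEntry(category="Sex", value="")
gender = OntologyEntry(category="Gender", value="Male", parent=sex)
mouse_heart = BioMaterial("mouse heart tissue", characteristics=[gender])


def has_characteristic(material, category):
    """True if a characteristic is in `category` or in one of its subclasses."""
    return any(
        c.category == category or category in c.ancestors()
        for c in material.characteristics
    )
```

New terms or new levels in the hierarchy are added as data (further entries), so neither class definition changes when the ontology grows.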
The MGED Ontology will be extended further to incorporate standard terms used in
protein studies. There are examples of how the ontology has been implemented in a relational
database in the following chapter (page 170). Other ontologies, such as the Mouse Anatomy
Ontology [45] and the Plant Ontology [247], can also be used to describe biological samples
where required. The use of other external ontologies will be vital because the MGED
Ontology will never contain all the terms needed to describe every kind of sample on which microarrays
could be performed. However, separate ontologies will become available from specific research
communities and, as long as the source and definition of a term are clearly stated,
structured descriptions of biological samples can be captured. This will greatly improve the
facilities for querying databases in the future to find relevant data sets.
Figure 4.4: A complete listing of the packages within FGE-OM.
[Diagram: packages Experiment, Protocol, BioMaterial, Measurement, BioAssay, BioAssayData, BioEvent, Description, BioSequence, BQS, HigherLevelAnalysis, AuditAndSecurity; classes Identifiable, Extendable, Describable.]
Figure 4.5: The packages and classes in the BioOM namespace of FGE-OM. The boxed packages have been altered from MAGE-OM, others are identical to packages in MAGE-OM. Open arrows indicate inheritance, for example Identifiable is a subclass of Describable (the superclass) and inherits all the attributes from Describable.
4.3 Overview of FGE-OM
FGE-OM models microarray and proteomics data and a complete listing of the packages
and classes is given in Figure 4.4. All classes in BioOM and ArrayOM are derived from
MAGE-OM. In ProteomicsOM, classes in the packages MassSpecData, MassSpecProtocol
and ProteinSeparation have been derived from PEDRo, classes in ProteinData and
ProteomeBioAssay are from Gla-PSI, and ProteinRecord contains newly created classes. In the
rest of this section there is a description of the three namespaces, and the relationships that
exist between classes residing in different namespaces. The use of the model in the context
of a sample biological workflow is also described. A set of detailed diagrams, displaying the
attributes of classes and the cardinality of relationships, is displayed in Appendix B.
4.3.1 BioOM
Figure 4.5 shows the packages in the BioOM namespace. BioOM covers the components
in FGE-OM that are common to all experiment types. The majority of the packages are
identical to packages of the same name in MAGE-OM, as described in Chapter 2, and the
technical documentation that describes MAGE-OM can be obtained via the MGED web site
[212]. There are components of packages BioAssay and BioAssayData (from MAGE-OM)
that contain array specific information, which have been placed in newly created packages
within the ArrayOM namespace. The three abstract classes at the top level, Extendable,
Describable and Identifiable, are unchanged from MAGE-OM, and most classes inherit
their attributes. Identifiable allows a name and an identifier to be added to classes.
Describable enables links to external ontologies, data ownership and an audit trail to be
attached. Extendable enables a triplet of attributes (Name, Value, Type) to be attached to
any class for storage of properties that are not recorded in other parts of the model.
[Diagram: packages Array, ArrayBioAssay, ArrayDesign, ArrayBioAssayData, QuantitationType, DesignElement.]
Figure 4.6: The packages in the ArrayOM namespace. The boxed packages are newly created in FGE-OM but contain a number of classes derived from MAGE-OM. The other packages are identical to packages with the same name in MAGE-OM.
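The role of the three top-level abstract classes can be sketched in Python as a chain of base classes. The text states that Identifiable is a subclass of Describable; making Describable a subclass of Extendable follows the arrangement in MAGE-OM and is stated here as an assumption. The field names and the concrete class ExampleEntity are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Extendable:
    # (name, value, type) triplets for properties the model does not cover
    property_sets: List[Tuple[str, str, str]] = field(default_factory=list)


@dataclass
class Describable(Extendable):
    descriptions: List[str] = field(default_factory=list)  # e.g. ontology links
    audit_trail: List[str] = field(default_factory=list)


@dataclass
class Identifiable(Describable):
    identifier: str = ""
    name: str = ""


@dataclass
class ExampleEntity(Identifiable):
    """Hypothetical concrete class; it inherits all three layers."""


entity = ExampleEntity(identifier="example:001", name="heart tissue gel")
# A property with no dedicated attribute is stored as a generic triplet.
entity.property_sets.append(("stainBatch", "B-17", "string"))
entity.audit_trail.append("created by analyst A")
```

Any class placed under Identifiable therefore gains an identifier, descriptive links and generic name-value-type storage without declaring them itself.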
The BioAssay package in MAGE-OM contains a class describing the hybridization
of mRNA to an array. This class has been relocated in our model to ArrayOM,
and a new package (ArrayBioAssay) has been created in ArrayOM containing the
Hybridization class. The rest of the classes in BioOM:BioAssay are the same as in
MAGE-OM. The BioOM:BioAssayData package contains only five classes: BioAssayData,
BioAssayDimension, MeasuredBioAssayData, BioDataTuples and BioDataValues. The
five classes are identical to those in MAGE-OM. These classes specify the general structure
and location of data from any type of experiment and therefore reside in the BioOM namespace.
BioAssayDimension allows experimental data to be packaged together across a range
of conditions, such as multiple array or multiple gel comparisons.
4.3.2 ArrayOM
Packages unchanged from MAGE-OM
The ArrayOM namespace (Figure 4.6) contains the packages derived from MAGE-OM which
are microarray specific. The packages Array, ArrayDesign and DesignElement describe the
layout of features on a microarray and have not been altered. QuantitationType includes
details of how array data is analysed using any of the available statistical packages, and is
therefore also included in ArrayOM. However, various data types from functional genomics
experiments could be quantified in similar ways, using standard statistical tests. Therefore,
an alternative design would be to include a generic package in BioOM modelling statistical
processing, recording the software used, and the parameters employed. This design was
considered but has not been implemented at this stage. The software for statistical analysis
of microarray data is continuously evolving and, apart from image analysis, there are no
dedicated statistical packages for quantifying proteomics data.
Differences from MAGE-OM
The ArrayBioAssayData package is a modified version of the BioAssayData package in
MAGE-OM. ArrayBioAssayData includes the MAGE-OM derived class BioDataCube that
represents the three dimensions of data: the array features; the parameter that is varied
across a multiple array experiment; and the values calculated for each array feature, such
as the relative fluorescence. BioDataCube captures the order of the three dimensions, and
stores pointers to separate files containing large quantities of numerical data. The three
dimensions of data also exist in a proteomics experiment, and potentially in other functional
genomics experiments, so in theory it should be possible to create a generic data
model in BioOM that models the dimensions of data. However, the BioDataCube is possibly
too simplistic to capture proteomics data, having only an ordering and pointers to lists
of values in files. In proteomics, a multiple 2-DE experiment may detect certain proteins
present on one gel and not another, calculated by image analysis software. The comparison
of multiple gels can be error prone and spots matched across multiple gels may have scores
assigned to the quality of the match. Spots may also be matched based on experimental
evidence, such as MS data. A generic data model covering all types of functional genomics
experiments would have to be more complex and would require major changes to the rela-
tionships between classes derived from PEDRo. The ArrayBioAssay package contains only
Hybridization, which is linked to classes in BioOM:BioAssay.
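The idea behind BioDataCube, an ordering of three dimensions plus a flat list of values held elsewhere, can be sketched as follows. The class name DataCube, its constructor and the use of the letters B, D and Q for the bioassay, feature and quantitation axes are assumptions for illustration; the flat list stands in for the external data file.

```python
class DataCube:
    """Values indexed by (bioassay, feature, quantitation type).

    `order` names the axes from outermost to innermost, using the letters
    B (bioassay), D (design element or feature) and Q (quantitation type).
    """

    def __init__(self, order, bioassays, features, quantitation_types, values):
        if sorted(order) != ["B", "D", "Q"]:
            raise ValueError("order must be a permutation of 'BDQ'")
        self.order = order
        self.axes = {"B": bioassays, "D": features, "Q": quantitation_types}
        self.sizes = [len(self.axes[a]) for a in order]
        if len(values) != self.sizes[0] * self.sizes[1] * self.sizes[2]:
            raise ValueError("wrong number of values for the given axes")
        self.values = values  # flat list, as it might be read from a file

    def get(self, bioassay, feature, quantitation_type):
        """Look up one value, whatever order the flat list was written in."""
        key = {"B": bioassay, "D": feature, "Q": quantitation_type}
        i, j, k = (self.axes[a].index(key[a]) for a in self.order)
        return self.values[(i * self.sizes[1] + j) * self.sizes[2] + k]


cube = DataCube(
    order="BDQ",
    bioassays=["array1", "array2"],
    features=["f1", "f2", "f3"],
    quantitation_types=["signal", "background"],
    values=list(range(12)),
)
```

The sketch also shows the limitation discussed above: every cell must hold a value, so there is no natural place for a spot that is absent from one gel or for a score describing the quality of a match.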
4.3.3 ProteomicsOM
[Diagram: packages ProteinSeparation, MassSpecProtocol, ProteomeBioAssay, MassSpecData, ProteinRecord, ProteinData.]
Figure 4.7: The ProteomicsOM namespace.
The proteomics namespace (Figure 4.7) is a further development of PEDRo and Gla-PSI.
The design of PEDRo was based upon different principles from the design of MAGE-OM.
MAGE-OM was intended to be future proof, by including generic attributes in classes, and allowing
data types to be specified using controlled vocabularies of terms, rather than specifying
explicitly in the model which data types should be stored in which position. PEDRo contains
specific named attributes for all the data types that may need to be recorded. In 2-DE, a
gel is used to separate thousands of proteins into individual spots. An image of the gel is
analysed with specialised software that produces output about gel spots, such as an estimate
of volume, area, the coordinates on the gel and many others. PEDRo aims to explicitly define
all of the data types that are produced by current image analysis software and therefore will
require modification in the future. A model following MAGE design principles would have
a placeholder for the first data type and value, followed by the second data type and value,
and so on. ProteomicsOM includes the classes from PEDRo in new packages; however, the
classes have been linked explicitly to components in BioOM that allow generic protocols and
parameters to be attached, as required. The following sections describe the classes that are
contained within the six packages of ProteomicsOM.
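The contrast between the two design principles can be sketched with two ways of recording gel spot data. Both classes are hypothetical simplifications rather than classes from PEDRo or MAGE-OM, and "circularity" is an invented data type standing in for output from some future image analysis package.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ExplicitSpot:
    """PEDRo-style: one named attribute per data type known today."""
    volume: float
    area: float
    x: float
    y: float


@dataclass
class GenericSpot:
    """MAGE-style: ordered (data type, value) pairs; the data type names
    would come from a controlled vocabulary."""
    measurements: List[Tuple[str, float]] = field(default_factory=list)

    def value_of(self, data_type):
        for name, value in self.measurements:
            if name == data_type:
                return value
        raise KeyError(data_type)


explicit = ExplicitSpot(volume=1520.0, area=48.0, x=10.5, y=22.0)

generic = GenericSpot([("volume", 1520.0), ("area", 48.0), ("x", 10.5), ("y", 22.0)])
# A data type introduced by newer software needs no change to the class ...
generic.measurements.append(("circularity", 0.87))
# ... whereas ExplicitSpot would need a new attribute, i.e. a new model version.
```

The explicit form is easier to read and to validate, while the generic form survives changes in technology; ProteomicsOM keeps the explicit PEDRo attributes and relies on the BioOM links for anything they do not cover.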
ProteinSeparation Package
Figure 4.8: The ProteinSeparation package contains classes that model the relationship between separation techniques and the products of those techniques.
The ProteinSeparation package describes a number of separation techniques, including 2-DE
and liquid chromatography, and is summarised in Figure 4.8. Classes modelling separation
techniques are subclasses of BioAssayTreatment within BioOM. An instance of
BioAssayTreatment can be linked to Protocol, which allows any type of protocol information
from hardware or software to be added, along with a set of parameters. This mechanism
can be used to store additional information about proteome experiments, if the attributes
specified in the part of the model derived from PEDRo do not cover the information that
must be recorded. This mechanism will be particularly important for storing information
about nascent technologies that cannot be covered by PEDRo as it stands. The products of
a separation technique, such as a gel spot or column fraction, are modelled as classes with
a set of attributes capturing the relevant parameters, and are subclasses of BioMaterial.
The classes Gel2D and LCColumn have a large number of attributes that are not displayed in
Figure 4.8 for clarity (Gel2D records the gel dimensions, pI and molecular weight range and
so on). However, more detailed diagrams displaying all the attributes and the cardinality of
relationships are included in Appendix B. A separation product can become the input for
another separation technique; the model therefore utilises a link from BioAssayTreatment
to BioMaterial via BioMaterialMeasurement to specify the source of material. These three
classes are all contained within BioOM.
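The chaining of separation techniques can be sketched as follows. The class names are taken from the text, but the attributes (source, products, produced_by, amount, unit) and the helper provenance are assumptions made for illustration, as is the particular sequence of a column run followed by a gel.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BioMaterial:
    name: str
    # Back-reference to the treatment that produced this material.
    produced_by: Optional["BioAssayTreatment"] = field(
        default=None, repr=False, compare=False)


@dataclass
class BioMaterialMeasurement:
    material: BioMaterial
    amount: float
    unit: str


@dataclass
class BioAssayTreatment:
    name: str
    source: Optional[BioMaterialMeasurement] = None
    products: List[BioMaterial] = field(default_factory=list)

    def produce(self, product):
        product.produced_by = self
        self.products.append(product)
        return product


class LCColumn(BioAssayTreatment):
    """Separation technique."""


class Gel2D(BioAssayTreatment):
    """Separation technique."""


class Fraction(BioMaterial):
    """Separation product."""


class PhysicalGelSpot(BioMaterial):
    """Separation product."""


def provenance(material):
    """Follow product -> treatment -> measured source back to the start."""
    names = [material.name]
    while material.produced_by is not None and material.produced_by.source is not None:
        material = material.produced_by.source.material
        names.append(material.name)
    return names


lysate = BioMaterial("soluble protein mixture")
column = LCColumn("LC run", source=BioMaterialMeasurement(lysate, 100.0, "ug"))
fraction = column.produce(Fraction("fraction 7"))
gel = Gel2D("2-D gel", source=BioMaterialMeasurement(fraction, 20.0, "ug"))
spot = gel.produce(PhysicalGelSpot("spot 42"))
```

Because products are themselves BioMaterial, any number of separation steps can be chained without adding classes for each combination of techniques.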
ProteomeBioAssay package
Figure 4.9: The relationship between the GelImageAnalysis class, in the ProteomeBioAssay package, and classes from the BioAssay package in the BioOM namespace.
The ProteomeBioAssay package contains only one class, GelImageAnalysis; however, new
relationships have been added to enable the re-use of classes in BioOM:BioAssay in the protein
context (Figure 4.9). These relationships have the following semantics. FeatureExtraction
from MAGE-OM models the process by which data is extracted from a scanned microarray.
In ProteomicsOM, GelImageAnalysis is a subclass of FeatureExtraction, and models
the process of analysing a 2-D gel with specialist software. FeatureExtraction is linked
to PhysicalBioAssay, which is linked to the source image (Image), the scanning process
(ImageAcquisition) and information about a specific channel or wavelength at which the
array has been scanned (Channel). These classes can be re-used in proteomics, to refer to the
scanning of a 2-D gel. The Channel class is re-used from MAGE to model the technique of
difference gel electrophoresis, in which a single gel is scanned at a number of different
wavelengths. Data that is obtained from image analysis is stored in classes linked to BioAssayData
in the ProteinData package. There are two relationships from MeasuredBioAssay, one to
the data model in MeasuredBioAssayData, the other to FeatureExtraction. This enables
the raw data, MeasuredBioAssayData, to be linked to the process by which it was
generated (scanning and image analysis are referenced through FeatureExtraction and
PhysicalBioAssay).
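The route from raw spot data back to the process that generated it can be sketched with simplified versions of the classes named above. The attributes (scanner, gel_name, software, spot_volumes) and the function describe_origin are hypothetical; only the class names and the direction of the links follow the text.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ImageAcquisition:
    scanner: str


@dataclass
class PhysicalBioAssay:
    gel_name: str
    acquisition: ImageAcquisition


@dataclass
class FeatureExtraction:
    physical_bioassay: PhysicalBioAssay


@dataclass
class GelImageAnalysis(FeatureExtraction):
    software: str = "unspecified"


@dataclass
class MeasuredBioAssayData:
    spot_volumes: Dict[str, float] = field(default_factory=dict)


@dataclass
class MeasuredBioAssay:
    feature_extraction: FeatureExtraction  # how the data was generated
    data: MeasuredBioAssayData             # the data itself


def describe_origin(measured):
    """Trace raw data back to the gel, the scan and the kind of analysis."""
    extraction = measured.feature_extraction
    assay = extraction.physical_bioassay
    return (assay.gel_name, assay.acquisition.scanner, type(extraction).__name__)


scan = ImageAcquisition("flatbed scanner")
assay = PhysicalBioAssay("gel A", scan)
analysis = GelImageAnalysis(assay, software="spot detection package")
measured = MeasuredBioAssay(analysis, MeasuredBioAssayData({"spot 42": 1520.0}))
```

Since GelImageAnalysis is a FeatureExtraction, code written against the generic class works unchanged for gels and arrays.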
ProteinData Package
Figure 4.10: The ProteinData package.
The ProteinData package models information about gel spots (Figure 4.10). Spot data is
captured in IdentifiedSpot, which has attributes covering data types produced by image
analysis software. The model also captures data from difference gel electrophoresis. Spots
from the single channel image are captured in DIGESingleSpot, and co-migrated spots from
the composite image are stored in IdentifiedSpot. Spot data is linked to the gel from which
it was produced because IdentifiedSpot is a subclass of PhysicalGelSpot, which is directly
linked to Gel2D in the ProteinSeparation package (Figure 4.8). Spot data is linked back to
the image analysis from which it was produced via BioAssayData and MeasuredBioAssay,
as described above (Figure 4.9). The ProteinData package also captures multiple gel
comparisons. BioAssayDimension in BioOM models multiple sample comparisons, and is used
in ProteomicsOM by the addition of a link to MatchedSpots, modelling spots matched across
multiple gels to capture differential expression of proteins. MultipleAnalysis is a subclass
of GelImageAnalysis and records the software used for the multiple gel comparison; it
groups together a set of MatchedSpots in one analysis.
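A multiple gel comparison can be sketched as follows. The three class names come from the text, but the sketch omits their superclasses and all attributes and methods shown (gel, spot_id, volume, match_score, absent_from) are assumptions. It illustrates the two features that a plain data cube handles poorly: a spot that is missing from one gel, and a score attached to a match.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class IdentifiedSpot:
    gel: str
    spot_id: str
    volume: float


@dataclass
class MatchedSpots:
    spots: List[IdentifiedSpot]
    match_score: Optional[float] = None  # hypothetical quality-of-match field

    def volumes_by_gel(self) -> Dict[str, float]:
        return {s.gel: s.volume for s in self.spots}


@dataclass
class MultipleAnalysis:
    software: str
    gels: List[str]
    matches: List[MatchedSpots] = field(default_factory=list)

    def absent_from(self, match):
        """Gels in the comparison on which the matched spot was not detected."""
        present = match.volumes_by_gel()
        return [g for g in self.gels if g not in present]


analysis = MultipleAnalysis("image analysis package", ["wild_type", "mutant"])
both = MatchedSpots(
    [IdentifiedSpot("wild_type", "s1", 100.0), IdentifiedSpot("mutant", "s9", 250.0)],
    match_score=0.93,
)
wild_type_only = MatchedSpots([IdentifiedSpot("wild_type", "s2", 80.0)])
analysis.matches.extend([both, wild_type_only])
```

Here analysis.absent_from(wild_type_only) reports the mutant gel, which is the kind of presence or absence result described for a multiple 2-DE experiment.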
MassSpecProtocol and MassSpecData packages
Figure 4.11: The model of MS data and protocols, adapted from PEDRo.
The packages capturing MS data and protocols contain classes derived from PEDRo (Figure
4.11). MS protocols are modelled by a package called MassSpecProtocol which contains
a class at the top level called MassSpecExperiment. MassSpecExperiment is a subclass
of BioAssayTreatment that can be used to link to the biological substance on which MS
has been performed (in BioMaterial). The substance can be the product of a series of
separation techniques, such as a spot from a 2-D gel. PEDRo-derived classes specify many of
the parameters that are associated with an MS instrument, along with the type of ionisation
employed, such as electrospray or MALDI (described in Chapter 1). Additional text and
parameters not covered in these classes can be attached using the generic Protocol class
in BioOM, linked to BioAssayTreatment. This ensures that the model can be extended
to include protocols from different MS instrument manufacturers, new software, and new
technologies. A new package, MassSpecData, has been defined to capture the list of peaks
from a trace and the database searches that are subsequently carried out. Proteins identified
by MS analysis and database searches are stored in the ProteinRecord package.
Figure 4.12: The ProteinRecord package.
ProteinRecord package
A new package was designed to store details of proteins identified in an investigation (Figure
4.12). The class Protein can be referenced from MS data that has been used for protein
identification. The protein identifier and database URL are captured in DatabaseEntry, and
the species of origin in OntologyEntry (from BioOM:Description). ProteinModification
stores information about modifications that have been observed. The type of modification,
such as glycosylation or phosphorylation, is obtained from a controlled vocabulary and cap-
tured in OntologyEntry. The position of the modification is captured in Location.
4.3.4 A workflow for proteomics
A sample workflow is displayed in Figure 4.13, demonstrating how FGE-OM captures pro-
teomics data. The overview of the experiment is modelled by the class Experiment. If
the experiment includes multiple samples, for example comparing a number of 2-D gels,
the parameter that is varied between samples, such as the different genotypes of groups of
organisms, is attached to classes referencing Experiment. A biological substance must be
processed to extract proteins, and make the proteins soluble in a multi-stage process. This
is modelled by a series of treatments (Treatment) applied to a substance (BioMaterial), to
produce the final soluble mixture of proteins, on which certain separation techniques may be
performed. Protein separation techniques, such as 2-DE or liquid chromatography, are mod-
elled as specialised subclasses of BioAssayTreatment. Each BioAssayTreatment has a mea-
sured source of material, which is captured in BioMaterial and BioMaterialMeasurement.
When data is produced after imaging a 2-D gel, an instance of PhysicalBioAssay is created.
PhysicalBioAssay can be referenced by the class ImageAcquisition, representing the scan-
ning of the gel. 2-DE image analysis is represented by GelImageAnalysis, which is a subclass
of FeatureExtraction. Gel spot data produced by image analysis can be stored in specific
classes in the ProteomicsOM namespace, linked to image acquisition via MeasuredBioAssay.
If MS is performed on a spot excised from a gel, or a fraction from a column, an instance of
BioMaterial is created, modelling the physical entity that is the excised spot or fraction.
MassSpecExperiment is a subclass of BioAssayTreatment, which can be linked to the source
of material. MS data obtained from a particular gel spot is linked directly to data produced
by image analysis of the spot, which is captured in MeasuredBioAssayData.
4.4 Other work: CEBS object model for systems biology data
Subsequent to the development of FGE-OM, a new model covering several functional ge-
nomics techniques has been published [355]. This section reviews the new model, and dis-
cusses how it can contribute to the on-going standards work for FG.
Figure 4.13: A workflow for a proteomics experiment involving 2-DE or liquid chromatography to separate proteins, followed by MS to identify proteins. Diamonds indicate events, rectangles are physical entities and ovals represent data.
Figure 4.14: A subset of classes in the QuantitationType package from SysBio-OM. Darker boxes are newly created classes in the model, lighter boxes represent classes that have not been changed from MAGE-OM.
The CEBS object model, SysBio-OM, has been created with similar goals to FGE-OM
and will support a database for toxicogenomics. Toxicogenomics is the study of the effects
of toxicological compounds on gene and protein expression. The model has been created
by merging MAGE-OM and PEDRo, and adding additional classes to model metabolomics
data. SysBio-OM has been developed with the requirements of toxicogenomics in mind, but
the authors claim that it covers generic types of microarray, proteome or metabolome study.
There is no division of technologies into separate namespaces, as in FGE-OM, but new classes
have been added to the packages in MAGE-OM, and two packages, CommonBioAssayData
and SummaryData, have been newly designed. CommonBioAssayData covers protein ex-
pression, protein-protein interaction and metabolomics data, and SummaryData captures
a textual overview of the data to allow a researcher to decide whether a data set may be
relevant without requiring a full data analysis. At the top level there is very little difference
between SysBio-OM and FGE-OM; both have the classes Identifiable, Describable and
Extendable linked to many of the classes in the model. SysBio-OM is identical to MAGE-OM
(and FGE-OM) in the packages: AuditAndSecurity, Array, ArrayDesign, DesignElement,
BQS, HigherLevelAnalysis and Description. The BioAssayData package in SysBio-OM is identical
to that in MAGE-OM, whereas in FGE-OM it has been split into two new packages.
4.4.1 SysBio-OM data model
The SysBio-OM QuantitationType package contains two superclasses at the top level,
SpecializedQuantitationType and StandardQuantitationType. There are several newly
designed classes in SysBio-OM, including PeakAbundance, Intensity, Percentage, Volume,
all of which are subclasses of SpecializedQuantitationType (Figure 4.14 displays a subset
of classes in the package). These classes capture measurement data for various types of FG
experiment. The MAGE-OM derived classes for quantifying microarray data are subclasses
of StandardQuantitationType. SysBio-OM is not restrictive in the kinds of measurement
that can be used for different technologies, and is therefore more generic than the equivalent
section of FGE-OM. FGE-OM captures measurement data for proteomics in specific classes
in the ProteinData and MassSpecData packages, and microarray data in QuantitationType.
The approach taken in SysBio-OM may be superior for this section, and should be considered
as a possible design for an extension to the QuantitationType package in the next version of
MAGE.
The CommonBioAssayData package is a new feature in SysBio-OM (Figure 4.15) to
model proteomics and metabolomics data. Rows of numerical data are represented by
CommonBioDataTuples, and single data points are subclasses of DataElement (boxed in
Figure 4.15). The raw data values are stored in the class QuantitationDimension in the
CommonBioAssayData package. It is not clear how the model captures information about
spots matched across multiple gels.
The Measurement package in SysBio-OM is an extension of the MAGE-OM package,
incorporating many different types of measurement and units that could be used in functional
genomics. In MAGE-OM, and SysBio-OM, each class has an attribute unitNameCV with
an enumeration of values, e.g. the class TimeUnit has an enumeration containing the values:
years, months, weeks, d, h, s, us, ns, fs, other. The option other is included in almost all
classes in the Measurement package and causes problems for developing applications based
on the model because it is not specified how the type other is controlled or used. The FGE-
OM Measurement package does not have any of the specific classes for units but instead has
two links to the OntologyEntry class to specify the type and name of the unit (Figure 4.16).
This design may be superior because the names of units are not hard coded in the model,
avoiding the problem of the value other, and it is therefore unlimited in what can be
captured. It would be a simple task to incorporate all the measurement types and units into the
MGED Ontology, which already includes most of those added to SysBio-OM.
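The difference between the two designs for units can be sketched as follows. The enumeration values are those listed in the text; mapping an unrecognised unit to other is only one way an application might cope, since the model does not say how other should be handled. The Measurement and OntologyEntry fields, the category names and the unit "decades" are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class TimeUnit(Enum):
    """Enumeration style: the permitted unit names are fixed in the model."""
    YEARS = "years"
    MONTHS = "months"
    WEEKS = "weeks"
    D = "d"
    H = "h"
    S = "s"
    US = "us"
    NS = "ns"
    FS = "fs"
    OTHER = "other"


def parse_enum_unit(name):
    """Map a unit name onto the enumeration; unknown names become OTHER."""
    try:
        return TimeUnit(name)
    except ValueError:
        return TimeUnit.OTHER  # the original unit name is lost here


@dataclass(frozen=True)
class OntologyEntry:
    category: str
    value: str
    source: str


@dataclass(frozen=True)
class Measurement:
    """Ontology style: two links give the type and the name of the unit."""
    value: float
    unit_type: OntologyEntry
    unit_name: OntologyEntry


lost = parse_enum_unit("decades")
kept = Measurement(
    3.0,
    unit_type=OntologyEntry("UnitType", "time", "MGED Ontology"),
    unit_name=OntologyEntry("TimeUnit", "decades", "MGED Ontology"),
)
```

In the enumeration design a new unit requires a new version of the model, whereas in the ontology design it requires only a new term.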
Figure 4.15: The CommonBioAssayData package from SysBio-OM. The boxed classes are discussed in the text.
Figure 4.16: The top image shows a small subset of classes from the Measurement package in SysBio-OM, the lower is the Measurement package in FGE-OM.
Figure 4.17: The Protocol package from SysBio-OM. The boxed classes are newly created.
4.4.2 SysBio-OM Protocol and BioMaterial packages
The Protocol package in SysBio-OM diverges from MAGE-OM (Figure 4.17) by introducing
new classes for different types of protocol (1-D, 2-D gel, MS database search and NMR).
The model does not specify what attributes belong to these classes, which may create
confusion for developers using this part of SysBio-OM. The Protocol package in MAGE-OM
was intended to be independent from technology and can therefore be re-used with no change
for any type of FG experiment. The addition of new classes without attributes does not add
significantly to what can be captured by this part of the model. A new design that can
capture all the information in Protocol of SysBio-OM but remain generic would introduce
a new relationship from the Protocol class to OntologyEntry called protocolType, which
captures the type of protocol, such as 2-D or 1-D gel.
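The proposed protocolType relationship can be sketched as a single generic Protocol class whose type is an ontology term. The attribute names, the example protocols and the helper protocols_of_type are assumptions for illustration; only the idea of a link from Protocol to OntologyEntry comes from the text.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class OntologyEntry:
    category: str
    value: str


@dataclass
class Protocol:
    name: str
    text: str
    protocol_type: OntologyEntry  # the proposed protocolType relationship
    parameters: Dict[str, str] = field(default_factory=dict)


protocols = [
    Protocol("P1", "First-dimension focusing, then SDS-PAGE.",
             OntologyEntry("ProtocolType", "2-D gel")),
    Protocol("P2", "Search the peak list against a sequence database.",
             OntologyEntry("ProtocolType", "MS database search")),
    Protocol("P3", "Acquire a 1-D proton spectrum.",
             OntologyEntry("ProtocolType", "NMR")),
]


def protocols_of_type(items, type_value):
    """Return the names of protocols whose type term has the given value."""
    return [p.name for p in items if p.protocol_type.value == type_value]
```

A query by type is still possible, as it would be with one subclass per protocol type, but a new type of protocol needs a new term rather than a new class.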
The BioMaterial package in SysBio-OM has several new classes, modelling gel spots and
column fractions, derived from PEDRo (Figure 4.18). These classes also exist in FGE-
OM but reside in the ProteomicsOM namespace in order to leave the BioMaterial package
independent of any technology; however, the core functionality of the two models is very
similar for this part. It may be advantageous to put technology specific classes in separate
packages, as in FGE-OM, so that it is easier for developers to understand the intended usage
of the model and focus only on the parts of the model that are required.
4.4.3 SysBio-OM BioAssay and SummaryData packages
The BioAssay package in SysBio-OM is displayed in Figure 4.19. The intended usage of the
package is very similar to a combination of BioAssay, ArrayBioAssay and ProteomeBioAssay
in FGE-OM. A new class, GelFeatureExtraction, models the process of gel image analysis
enabling the classes Image, Channel and ImageAcquisition from MAGE-OM to be re-used
in the proteomics context. Another new class, CommonBioAssayCreation, models techniques
such as a 2-D gel, NMR or a column separation, and links to data acquisition and raw data,
such as images, through the PhysicalBioAssay class. CommonBioAssayCreation functions
in a very similar way to BioAssayTreatment in FGE-OM (although BioAssayTreatment
also exists in SysBio-OM with a different function). CommonBioAssayCreation references
the source material for a treatment through BioMaterialMeasurement in exactly the same
way as in FGE-OM. PhysicalBioAssay has associations with classes modelling column or
NMR data files for metabolomics data (NMROutputFile and ColumnFractionOutputFile).
Figure 4.18: The BioMaterial package from SysBio-OM.
Figure 4.19: The BioAssay package from SysBio-OM.
The SummaryData package is a new development proposed in SysBio-OM (diagram not
shown) which contains only two classes, QualitativeOrSummaryData and
DataInterpretation. These two classes are for adding textual descriptions to the
experiment, and it remains to be seen how this differs from what can be captured in the
Experiment package.
4.5 Discussion
The object model, FGE-OM, was created in UML to represent both proteomics and microar-
ray experiments. FGE-OM is based on MAGE-OM and incorporates additional information
from PEDRo and Gla-PSI. There are three namespaces in the new model: BioOM, ArrayOM,
and ProteomicsOM. The BioOM namespace is suitable for describing a generic functional
genomics experiment, encompassing microarrays, 2-DE, histochemistry and others. The
ProteomicsOM namespace was defined from PEDRo and Gla-PSI, and includes classes with
attributes covering 2-DE, MS and data analysis. ProteomicsOM has been integrated with
BioOM, enabling generic protocols, including details of hardware or software, to be attached
to specific classes. FGE-OM uses inheritance from several key superclasses: experimental
techniques are modelled as subclasses of BioAssayTreatment and the products of treatments
are subclasses of BioMaterial. This framework will allow new models describing other tech-
nologies to be added into FGE-OM without significant difficulty, allowing a unified model
for functional genomics to be created in the future. An important use of FGE-OM will be to
generate an XML Schema, to allow research groups to format data in a consistent manner
into FGE-ML, a markup language based on the model. A software toolkit is also required,
based on the microarray software toolkit (MAGEstk), for creating FGE-ML from the object
model.
FGE-OM has been created by merging models that have slightly different design principles.
MAGE-OM was intended to be “future proof” by including generic classes that could
be used for various technologies. Conversely, PEDRo aimed to describe the current status of
proteomics experiments, recognising that future developments would require changes to the
model. The forthcoming versions of both MAGE-OM and the protein model, PSI-OM, will
undergo changes that may bring about the convergence of the different design principles. In
other words, MAGE-OM will include classes for some parts of the model that capture the
standard cases more simply, and PSI-OM will utilise more generic classes to model exper-
imental protocols and biological samples. This issue is outlined in detail in the following
section. We believe that the design process for the next version of both MAGE-OM and
PSI-OM should be guided by the experience of developers who have attempted to create
software based on the two models. It is our view that ontologies should be used extensively,
to reduce the burden on the developers to create an object model that captures all possible
uses of the technology.
FGE-OM demonstrates that the integration of the two current versions of the object
models is feasible. We believe that even if the next versions of the models are developed
independently, the framework described here can be easily evolved, reflecting the changes to
the new object models, and there are significant benefits to capturing both microarray and
proteomic technology in the same structure.
4.5.1 FGE-OM, SysBio-OM and future standards
The CEBS SysBio-OM model is an alternative proposal for an FG data standard. There are
currently no major proposals specifically for metabolomics; however, CCPN (A Collaborative
Computing Project for the NMR Community) is fairly well established in the NMR com-
munity and contains an object model and programming interface [113]. The metabolomics
part of SysBio-OM comprises a simple model of NMR data, therefore the CCPN proposals
may be able to contribute to the efforts, and both models should stimulate discussion in the
metabolomics field as to the requirements for a data standard.
In overview, SysBio-OM adds new classes to seven MAGE packages to
cover proteomics, and creates two new packages. The object model has been used
for generating code that acts as a bridge between flat data files and the CEBS database,
and it is planned that future functionality will enable import and export of MAGE-ML and
the future proteome standard, PSI-ML. Another function of SysBio-OM is to act as a pro-
posal for the future development of an integrated data standard across several fields. The
design of certain packages, such as the QuantitationType package, serves this purpose well,
because it is generic, and can capture a wide range of quantitation types. The design of
other packages such as BioMaterial and Protocol mixes the generic approach of MAGE with
technology specific classes. It is our view that this may cause problems because the design of
MAGE will change for the next version, and the PEDRo proposals are changing to become
PSI-OM, as reported in the previous chapter. Therefore, it is likely that a large amount of
work will be required to redesign these packages to reflect the changes to MAGE-OM and
PSI-OM, but this should not be the case for FGE-OM. FGE-OM separates different tech-
nologies with only a few key relationships linking classes in different namespaces, and the
original functionality of MAGE-OM packages is maintained in almost all cases. Therefore,
when PSI-OM is finalised it can be easily merged with the next version of MAGE, using
FGE-OM as a guide. The packages CommonBioAssayData and BioAssay in SysBio-OM
function in a similar way to a combination of ProteinData and the three related BioAs-
say packages in FGE-OM. The CommonBioAssayData package (SysBio-OM) appears to be
more generic than ProteinData (FGE-OM) and utilises inheritance from the superclasses
DataElementDimension, DataElement and QuantitationType for the three dimensions of
data. It remains to be seen how this works in practice, but if a successful implementation
of this part of the model is demonstrated in the CEBS database, this may represent a good
framework for developing a generic data model across all FG experiments. It is likely that
the best design of a standard for FG will take parts of both SysBio-OM and FGE-OM and
a potential framework for this integration is described below.
4.5.2 Developments to MAGE-OM
The division of FGE-OM into namespaces is a simple but important concept that should
make a large object model easier to understand, allowing developers to focus more quickly on
the relevant parts. The next version of MAGE is planned to contain a core of components
that are shared across all types of FG experiment, similar to the BioOM namespace. A
structured description of the purpose of the experiment, the biological samples and the
parameter that is varied across samples is the most important part of the core. All types
of FG experiment can be described in this way and the use of the MGED Ontology, and
extensions to it, will be an essential component. This part of the design ensures that the
purpose of the experiment can be determined very easily by manual or automated inspection
of files rather than having to parse all the information in the document and search for the
differences between the samples. For example, the purpose of an experiment may be to
determine the changes in gene expression between two cell lines, one of which had gene X
knocked out. This information must be easy to search for as it is one of the most crucial
parts of the experimental annotation. FGE-OM, MAGE-OM and SysBio-OM have classes
at the top level, ExperimentFactor and ExperimentFactorValue, which allow the critical
characteristics and differences between the samples under comparison to be specified. These
classes are vital for the purpose of the experiment to be easily understood and therefore the
FG data standard should retain them at the top level. A database should ensure that these
attributes are stored in a way that allows rapid querying and programmatic access to this part
of the annotation. I believe that the next version will benefit from the proposed extensions
of SysBio-OM and FGE-OM. The Quantitation and CommonBioAssayData packages from
SysBio-OM offer a generic framework for capturing FG data and could be incorporated into
the core namespace. In FGE-OM, the simplification of the Measurement package may be
advantageous and should be considered.
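The role of ExperimentFactor and ExperimentFactorValue can be illustrated with a minimal sketch. It is not part of FGE-OM, MAGE-OM or SysBio-OM: the class below is a simplified stand-in, and the field names, sample labels and the function varied_factors are assumptions made for illustration only. It shows how storing factor values explicitly allows the parameter that varies between samples to be found without parsing the rest of the annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentFactorValue:
    """Simplified stand-in for the model class (field names are hypothetical)."""
    factor: str   # name of the ExperimentFactor, e.g. "genotype"
    sample: str   # label of the sample the value applies to
    value: str    # ideally a term drawn from an ontology


def varied_factors(values):
    """Return {factor: {sample: value}} for factors whose value differs between samples."""
    by_factor = {}
    for v in values:
        by_factor.setdefault(v.factor, {})[v.sample] = v.value
    return {
        factor: per_sample
        for factor, per_sample in by_factor.items()
        if len(set(per_sample.values())) > 1
    }


# Two cell lines that differ only in genotype (the knock-out example from the text).
annotations = [
    ExperimentFactorValue("genotype", "cell line A", "wild type"),
    ExperimentFactorValue("genotype", "cell line B", "gene X knockout"),
    ExperimentFactorValue("growth medium", "cell line A", "standard"),
    ExperimentFactorValue("growth medium", "cell line B", "standard"),
]
print(varied_factors(annotations))
```

In this sketch only the genotype factor is reported, because the growth medium is identical in both samples.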
The next version of MAGE aims to fix semantic annotation problems with the current
version that have been discovered over several years since its release. PEDRo has been
widely accepted as a draft standard from which the first formal proteomics standard can
be developed. It is vital that PSI-OM, which will supersede PEDRo, utilises the experience
gained from MAGE to avoid the same problems. One general criticism of MAGE-OM is that
for certain concepts it is “over engineered”, in other words, the designers attempted to define
a model that could cover all eventualities but the most common case is captured in a complex
way. Large efforts are required from software developers to create applications that produce
MAGE-ML, and there are still relatively few public databases that offer MAGE-ML input
and output, although this feature is in development for almost all microarray databases.
The next version of MAGE is likely to make greater use of the OntologyEntry class, and
PSI-OM should also utilise ontologies to capture complex concepts. The PSI ontology (PSI-
Ont) will become an extension of the MGED Ontology. PSI-OM will be designed with
the consideration of future integration with MAGE, and the separate mass spectrometry
standards that are under development (as described in the previous chapter).
4.5.3 Integrated standards
The development of an integrated standard requires joint meetings between PSI and MGED.
The two organisations are now committed to co-developing a standard; however, the development
of MAGE will first focus on the creation of a core module, based around similar
principles to BioOM. The last meeting of PSI (Nice, France 2004) was attended by sev-
eral key developers of MAGE, and the previous MGED programming workshop (European
Bioinformatics Institute, Cambridge, UK Dec 2003) had presentations by members of PSI.
FGE-OM was presented at both meetings by the author. It is vital that collaboration con-
tinues between the two organisations. This requires principal investigators to present work
to the wider biological research community to ensure that there is a good awareness and
support for the standard. The Object Management Group (OMG) was involved with the
development of MAGE-OM, providing a framework for checking the consistency of the object
model. The future FG standard should also be vetted through OMG, because while this in-
troduces extra developmental stages, there are likely to be fewer problems that arise once the
model is being used by a large community. Finally, there needs to be a number of workshop
meetings in which developers of MAGE and PSI-OM work together to define a format that
captures everything that is required in the two fields. The format should support functional
genomics, not just microarrays and proteomics, therefore researchers in other parts of FG
research should also be aware of the efforts. It is likely that a data format will only gain
widespread support once several major databases are committed to its development. One
other consideration is that a data format that can encompass a range of functional genomics
techniques may be too bulky for many users who use only a single technique and wish to
utilise a subset of the standard. If the different namespaces are well designed, it will be pos-
sible to derive the single technology data formats from the model, MAGE-ML and PSI-ML,
for transferring results to databases storing only microarray or proteomics experiments.
In the following chapter, the development of an Internet accessible database is described,
which will ultimately form part of a large system for functional genomics. The CEBS
database will also offer access to various types of FG data, and it is likely that several
other systems will come on-line in the next few years. It is important that developers of
different systems collaborate at an early stage to avoid the data incompatibility problems
that have arisen over the last decade in biomedical research, which make the challenge of
data integration so great.
4.6 Conclusion
The chapter has described the development of an object model for functional genomics. FGE-
OM comprises three namespaces that have been created to reflect the different components
in a large biological investigation. BioOM contains twelve packages and ArrayOM contains
six packages that match very closely the structure of MAGE-OM. The third namespace,
ProteomicsOM, comprises six packages that contain classes derived from PEDRo and Gla-
PSI. FGE-OM is intended to demonstrate a potential schema for the integration of microarray
and proteomics data standards, and acts as a proposal from which the next version of MAGE-
OM can be developed. The division into namespaces should allow the model to evolve as
the proteomics and microarray proposals change, and also creates a framework that enables
object models from other types of FG experiment to be integrated. FGE-OM has been
presented to PSI to influence the design of the finalised standard for proteomics, and to
MGED to generate discussion about the next version of MAGE-ML. The model has been
verified against real data by the development of a database implementation that matches
the structure of the object model very closely, described in the following chapter.
Chapter 5
A prototype public database for
proteomics
5.1 Introduction
The main aim of the research presented in this thesis is to improve the facilities for data
sharing and querying in functional genomics (FG). In the previous chapter, the definition of
a functional genomics object model was given, which acts as a proposal for a data standard.
In this chapter, a database implementation is discussed which is capable of storing data from
both microarrays and proteomics. The RAPAD (RNA And Protein Abundance Database)
system is an extension of the RAD microarray database from the University of Pennsylvania,
into which a proteomics component has been incorporated. There are many database systems
for storing microarray data (ArrayExpress, GEO and SMD, summarised in Chapter 2) and
several initial attempts to capture proteomics data, of which SWISS-2DPAGE is the best
established. However, there is no major public repository that covers both protein
separation experiments and mass spectrometry, and an integration of data from microarrays
and proteomics has not previously been demonstrated in a database.
5.1.1 Extending existing technology
A description of different experiment types used in FG was given in Chapter 1. In overview,
a typical proteomics experiment involves obtaining a set of samples produced under different
conditions and attempting to separate, identify and (possibly) quantify the proteins present
in the different samples. Where proteomics differs from microarray analysis is the range of
different methods that could be used at each stage to get the final result, including: multiple
separation stages, novel techniques for quantifying protein abundance, and identification of-
ten through mass spectrometry (MS) accompanied by database searches. A significant part
of the challenge of formally describing this information occurred during the development of
the object models described in the previous two chapters. Therefore, the major implementa-
tion challenges involved creating interfaces for capturing data and protocols, development of
complex query facilities, visualisation of results and data integration. Section 5.2
describes databases that exist for capturing proteomics data; however, none offers
a complete solution storing 2-DE, MS, and experimental protocols. Therefore, a system is
required that can capture a complete proteomics workflow in a structured format that can be
queried. The decision to extend the RAD system into proteomics rather than develop a new
database from scratch was based on several criteria. Firstly, it is important that microarray
and proteomics data can be queried side by side, and in conjunction with other functional
genomics data. This will be facilitated by having a shared database schema and user inter-
face, and it will be easier to produce a mapping from an object model, such as FGE-OM,
to a database if the general structure is similar. There is already a close correspondence
between RAD and MAGE-OM, therefore a large part of FGE-OM is already mapped to
a relational representation. Secondly, RAD is a part of the GUS system which is a major
public repository, providing access to genomic sequence data, ESTs, RNA, SAGE [332] and
gene expression data. One of the long term goals of GUS is to incorporate proteomics, im-
munohistochemistry, and cell anatomy components, creating a single access point to many
types of functional genomics data (Figure 5.1). Therefore, RAPAD also serves as a prototype
for developing a proteomics namespace in GUS which, when complete, will provide access to
2-DE, MS and other proteomics data for major web sites such as PlasmoDB [21], ToxoDB
[187] and GeneDB [127]. Thirdly, the time required to develop a large system is significantly
reduced if developing on top of established software, compared with developing de novo. In
summary, RAPAD was developed with several major goals that are explored in the rest of
the chapter:
• RAPAD functions as a prototype for a major public repository for proteomics data,
and ultimately will form part of GUS.
• The implementation was created to provide a framework for developing tools for in-
vestigating the correlation between gene expression and protein abundance, stored in
the same database.
• The current implementation, while serving as a prototype for the future development
of a public resource, also acts as a platform for supporting on-going proteomics re-
A foreseeable problem is that RAPAD does not currently have a permanent
home and the web address is likely to change. However, this problem can be avoided as
long as the Bioinformatics Research Centre in Glasgow (brc.gla.ac.uk) does not develop
an alternative database called RAPAD, which is unlikely. The LSID project can be easily
implemented if databases adhere to the guidelines and provide programmatic access to the
database, accepting the LSID of an object as a query string.
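As an illustration of accepting an LSID as a query string, the following minimal sketch splits an identifier into its components, assuming the general LSID layout urn:lsid:authority:namespace:object with an optional revision. The function name parse_lsid and the example object identifier Gel2D.42 are hypothetical and do not describe RAPAD's actual interface; only the authority (brc.gla.ac.uk) and the database name (RAPAD) are taken from the text above.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str            # e.g. the domain of the hosting institution
    namespace: str            # e.g. the database name
    object_id: str            # identifier of the object within the namespace
    revision: Optional[str]   # optional version of the object


def parse_lsid(text: str) -> Lsid:
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"LSID has an empty component: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


# Hypothetical identifier for a gel record; the object part is invented.
print(parse_lsid("urn:lsid:brc.gla.ac.uk:RAPAD:Gel2D.42"))
```

A query interface could use the namespace and object components to locate the record, which is why the identifier remains stable even if the web address of the database changes.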
5.4 Implementation
RAPAD has been deployed in Oracle 9i [235] as part of a standard three tier architecture
(Figure 5.3). A web interface has been created, written in the PHP language [246]. The
database schema is large (174 tables), and has the capability to store information from a
wide range of technologies. Therefore, web pages have been developed for data capture as
they are required by the users. In this section, an overview of each part of the database is
given, using examples of data capture in the Study-Annotator to illustrate graphically
how the database has been implemented.
A workflow is displayed in Figure 5.4 summarising the stages at which data is entered
by the user. There are several stages at which queries are made of the database to retrieve
terms from an ontology to populate drop-down boxes in the user interface, described in more
detail below.
Figure 5.3: The architecture of RAPAD. [Diagram labels: tiers User Interface, Middleware and Data Storage; Querier, Study-Annotator, Gel Viewer; PHP (interface generation); Java (batch queries for specific investigations and Gel Viewer code); Perl (scripts supplied with MASCOT); MASCOT results; Oracle database, image server, MASCOT server.]
Figure 5.4: The user interaction with RAPAD for entering a 2-DE experiment. [Workflow diagram with three columns: User interaction, RAPAD Study-Annotator, Oracle database. Labels: Login page (check username and password); Study, contacts and references (add details to DB); BioSource and solubilisation protocol (1: query DB for ontology terms, 2: add details to DB); 2-DE, image, scanning and image analysis (1: query DB for ontology terms, 2: add details to DB and bulk load in two files); Visualise 2-DE in Gel Viewer (query for gel and spot details); data entry by the user at each stage.]
5.4.1 Data privacy
The first entry point for the RAPAD Study-Annotator requires users to log in, and select
their data privacy preferences (Figure 5.4). Essentially, this requires selecting the Project,
Group and Study settings. The Project setting specifies the database namespace in which the
data will be stored, which will be required when RAPAD is integrated with GUS. The value
of Project is set to “RAPAD” in the current implementation. The Group setting is the top
level for dividing researchers into different classifications. It is envisaged that each laboratory
will have its own Group value. The Study value is a further specification, and captures a
complete investigation, consisting of many different 2-DE experiments. For example, the
entire Trypanosoma brucei proteome investigation is currently captured as one study. All
tables in RAPAD have the following attributes, to ensure data integrity:
MODIFICATION_DATE      NOT NULL  DATE
USER_READ              NOT NULL  NUMBER(1)
USER_WRITE             NOT NULL  NUMBER(1)
GROUP_READ             NOT NULL  NUMBER(1)
GROUP_WRITE            NOT NULL  NUMBER(1)
OTHER_READ             NOT NULL  NUMBER(1)
OTHER_WRITE            NOT NULL  NUMBER(1)
ROW_USER_ID            NOT NULL  NUMBER(12)
ROW_GROUP_ID           NOT NULL  NUMBER(3)
ROW_PROJECT_ID         NOT NULL  NUMBER(3)
ROW_ALG_INVOCATION_ID  NOT NULL  NUMBER(12)
The attributes ROW_USER_ID, ROW_GROUP_ID and ROW_PROJECT_ID are assigned the foreign key
linking to the corresponding record for each user, group and project for every record that is
entered in the database. Additional tables exist for linking information to the Study in which
it belongs. Data security issues are discussed in more detail in Section 5.4.7.
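The way in which these attributes could restrict access can be sketched as follows. This is an illustration only: it assumes that the READ flags behave like Unix-style owner, group and other permissions, which is an interpretation rather than a statement of how RAPAD or GUS enforces them. The example uses an in-memory SQLite table in place of Oracle, and the column gel_name and the function readable_gels are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Gel2D (
        gel_name     TEXT    NOT NULL,   -- hypothetical payload column
        USER_READ    INTEGER NOT NULL,
        GROUP_READ   INTEGER NOT NULL,
        OTHER_READ   INTEGER NOT NULL,
        ROW_USER_ID  INTEGER NOT NULL,
        ROW_GROUP_ID INTEGER NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO Gel2D VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("private gel", 1, 0, 0, 10, 1),   # readable by its owner only
        ("lab gel",     1, 1, 0, 10, 1),   # readable by the owner's group
        ("public gel",  1, 1, 1, 10, 1),   # readable by everyone
    ],
)


def readable_gels(conn, user_id, group_id):
    """Return the gels a user may read under the assumed owner/group/other rule."""
    rows = conn.execute(
        """
        SELECT gel_name FROM Gel2D
        WHERE (ROW_USER_ID = ? AND USER_READ = 1)
           OR (ROW_GROUP_ID = ? AND GROUP_READ = 1)
           OR OTHER_READ = 1
        ORDER BY rowid
        """,
        (user_id, group_id),
    )
    return [name for (name,) in rows]


print(readable_gels(conn, user_id=11, group_id=1))   # a colleague in the same group
```

Because every table carries the same attributes, a filter of this kind could in principle be applied uniformly to any query issued through the interface.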
5.4.2 Studies, protocols and contact details
Bibliographic references, experimental protocols and contact details can be entered in RA-
PAD, and are not linked to any particular study, allowing their re-use in many different
contexts. The web page for entering Protocol data (Figure 5.5) has drop-down menus for
selecting the type of protocol; options include nucleic acid extraction, protein solubilisation,
gel stain, and so on. These options are populated from the OntologyEntry table, and are
used for linking the protocol to the correct page in the Study-Annotator. For example, any
protocols entered with the option gel stain will appear as options for linking to a staining
protocol in the 2-DE Assay page of the interface.
Figure 5.5: The interface for entering protocol information into RAPAD.
A set of web pages exists for capturing the intention of the study as a textual description,
and a set of parameters can also be entered, with a different parameter value for each
experiment in the study. For example, in a time course experiment, samples from 1, 2, 4,
6, and 24 hours post infection are each analysed by 2-DE. This information can be cap-
tured in RAPAD, linking the parameter to the 2-DE details, and in turn the 2-DE details
can be linked to a description of the protein sample (BioMaterial). The source of mate-
rial can be entered in RAPAD, linked to contact details for the provider of material, the
species of origin, type of material (e.g. DNA, protein, cells, generated from entries in the
OntologyEntry table), and a general description (stored in the table BioSource, Figure
5.6). A series of treatments can be applied to convert a source of material (BioSource) to
a substance (BioMaterial), such as a protein mixture, which can be linked to a 2-D gel
record. Alternatively, BioMaterial could store labelled mRNA that has been hybridised to
a microarray. Treatments correspond to basic laboratory procedures such as additions of
solutions, washes, incubations and many more, allowing a researcher to store a structured
definition of lab protocols, such as the extraction and solubilisation of proteins from cells.
These features have been inherited from RAD; however, additional tables have been added to
the database schema (StudyAssayProt, StudyDesignAssayProt and so on) for linking study
and biomaterial details to the corresponding proteomics experiment (table ProteomeAssay)
rather than a microarray (table Array, Figure 5.7).
Figure 5.6: A web page for specifying sources of biological materials.
5.4.3 Protein separations
RAPAD has capabilities to store information describing a series of protein separation treat-
ments (Figure 5.8), although the focus of the current implementation is 2-DE. Every experi-
ment type has an entry in a specific table (e.g. Gel2D, Gel1D or LCColumn) and an entry in a
generic table, BioAssayTreatment. BioAssayTreatment can be linked to a measured input
of a biological material, captured in AnalyteMeasurement and a view² (BioMaterial) on
the table BioMaterialImp. The output of each treatment produces a set of entries in specific
tables, such as PhysicalGelItem and Fraction, which are linked to BioMaterialImp,
enabling a series of treatments with specified inputs and outputs to be captured in a structured
format. The BioAssayTreatment table has a relationship to Protocol, which enables
additional protocol information to be attached to a technique, if the attributes specified in
the table specific to the technique do not cover what is required.
² A view in SQL is a single table that is derived from other tables. A view may not be physically stored in the relational schema but is a notation representing certain information that is frequently required [89].
Figure 5.7: A summary of the database schema for storing information about the design of a study. Three RAD derived tables have been replicated in the RAPAD schema with changes to one relationship, referencing ProteomeAssay rather than Array. Each box represents a database relation (table) and arrows represent a relationship between two tables, such as Gel2D has a foreign key from BioAssayTreatment. [Diagram tables, Proteomics side: Study, StudyDesign, StudyFactor, StudyDesignAssayProt, StudyAssayProt, StudyFactorValueProt, ProteomeAssay, Gel2D, BioAssayTreatment. Microarrays side: Study, StudyDesign, StudyFactor, StudyDesignAssay, StudyAssay, StudyFactorValue, Assay, Array.]
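The pairing of a generic treatment table with technique-specific tables can be sketched as below. This is a heavily simplified illustration using in-memory SQLite rather than the Oracle schema: all column names, the example rows and the function history are assumptions, and the AnalyteMeasurement table that sits between a treatment and its input in RAPAD is collapsed into a single input_material column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BioMaterialImp (
        id          INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        produced_by INTEGER              -- treatment that produced it; NULL for a source
    );
    CREATE TABLE BioAssayTreatment (     -- generic entry, one per treatment
        id             INTEGER PRIMARY KEY,
        technique      TEXT NOT NULL,
        input_material INTEGER NOT NULL REFERENCES BioMaterialImp(id)
    );
    CREATE TABLE Gel2D (                 -- technique-specific entry
        treatment_id INTEGER PRIMARY KEY REFERENCES BioAssayTreatment(id),
        ph_range     TEXT
    );
    INSERT INTO BioMaterialImp    VALUES (1, 'solubilised protein extract', NULL);
    INSERT INTO BioAssayTreatment VALUES (1, 'Gel2D', 1);
    INSERT INTO Gel2D             VALUES (1, '4-7');
    INSERT INTO BioMaterialImp    VALUES (2, 'gel spot 17', 1);
    INSERT INTO BioAssayTreatment VALUES (2, 'LCColumn', 2);
    INSERT INTO BioMaterialImp    VALUES (3, 'fraction 3', 2);
""")


def history(conn, material_id):
    """Walk back from a material through the treatments that produced it."""
    steps = []
    while True:
        name, produced_by = conn.execute(
            "SELECT name, produced_by FROM BioMaterialImp WHERE id = ?",
            (material_id,),
        ).fetchone()
        if produced_by is None:
            steps.append(name)           # reached the original source material
            break
        technique, material_id = conn.execute(
            "SELECT technique, input_material FROM BioAssayTreatment WHERE id = ?",
            (produced_by,),
        ).fetchone()
        steps.append(f"{name} <- {technique}")
    return list(reversed(steps))


print(history(conn, 3))
```

The generic table is what makes the chain traversable without knowing the technique in advance; the specific table is consulted only when technique parameters (here a hypothetical pH range) are needed.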
5.4.4 2-D gel data
The details about a 2-D gel are entered on the 2-DE Assay page (Figure 5.9). The parameters
of the gel are entered in the Gel2D table, and the table ProteomeAssay stores the name of
the experiment and a link to the experiment’s operator. ProteomeAssay is used to link
indirectly to protocols for the first and second dimension separation, protein solubilisation
and staining (all stored in the table Protocol). Following input of 2-DE data, scanning
information can be entered into the table ImageAcquistion, capturing: the type of scanner
used, the operator, the date, a protocol if required and any associated parameters with values.
Multiple scans can be entered, each associated with a particular channel or wavelength, which
can also be used to store a difference gel electrophoresis experiment, in which a single gel
is fluorescently labelled and scanned at two or three wavelengths. Each scan is assigned
a unique name that appears on the Gel Image Analysis page. On this page, the user can
enter a protocol and name of the software used to analyse the gel image (inserted in the
GelImageAnalysis table). The image scan must also be associated with a gel image that is
stored on the file system, and the URI (Uniform Resource Identifier) of the file is updated
in the ImageAcquistion table. Two further pages exist for bulk loading data: gel spot files
and protein files. Spot data files contain lists of spot ID numbers, coordinates and volume
values (calculated by image analysis), which are stored in the tables IdentifiedSpot and
PhysicalGelItem. Each IdentifiedSpot record links to the image analysis that produced
it (in GelImageAnalysis). Data files can also be loaded that contain tab delimited data
about the proteins to which spots have been matched, including: the protein name, species,
MW (molecular weight), pI (isoelectric point), links to external databases, and a link to MS data on a
separate file server. The data is loaded in batches, linked to the correct spot using the table
AnalyteMeasurement, linked to BioMaterial and PhysicalGelItem (Figure 5.11).
Figure 5.8: The database schema for protein separation techniques and the relationships to the BioAssayTreatment table. [Diagram labels: Gel2D, Gel1D, LCColumn, Fraction, PhysicalGelItem (link to image analysis data), BioAssayTreatment (Source, Product), AnalyteMeasurement, Protocol, BioMaterialImp.]
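The first step of bulk loading a spot data file might be sketched as follows. The layout is assumed: the column headings spot_id, x, y and volume are hypothetical, as are the class SpotRecord and the function read_spot_file, and the real files exported by image analysis software may differ. The sketch only parses and validates the rows; insertion into IdentifiedSpot and PhysicalGelItem is not shown.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class SpotRecord:
    spot_id: int
    x: float        # spot coordinates on the gel image
    y: float
    volume: float   # spot volume calculated by image analysis


def read_spot_file(handle):
    """Parse a tab delimited spot file with a header row into SpotRecord objects."""
    reader = csv.DictReader(handle, delimiter="\t")
    spots = []
    for line_no, row in enumerate(reader, start=2):   # line 1 is the header
        try:
            spots.append(
                SpotRecord(
                    int(row["spot_id"]),
                    float(row["x"]),
                    float(row["y"]),
                    float(row["volume"]),
                )
            )
        except (KeyError, TypeError, ValueError) as exc:
            # Reject the whole batch rather than loading a partial file.
            raise ValueError(f"line {line_no}: cannot parse spot row {row!r}") from exc
    return spots


example = io.StringIO(
    "spot_id\tx\ty\tvolume\n"
    "1\t102.5\t340.0\t15873.2\n"
    "2\t411.0\t87.5\t920.4\n"
)
spots = read_spot_file(example)
print(len(spots), spots[0])
```

Validating the complete file before any record is written is one way of keeping a batch load consistent, since a spot record spans more than one table.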
The schema design for this section is fairly complex (Figure 5.11); however, this reflects
the nature of a proteomics experiment: a spot may be excised from a gel and could be used in
a number of different experiment types: MS, chromatography, or additional gel separations.
Therefore, an entry exists to model a gel spot as a physical entity (a BioMaterial), to enable
further treatments on the spot to be captured. A gel spot does not have an identifier until
the gel image has been analysed and spot data has been input. Therefore, to correctly specify
a gel spot, a record is required in the IdentifiedSpot table (from image analysis), in the
PhysicalGelItem table (referring to the actual spot on the gel) and in the BioMaterial view
when required, to enable the gel spot to be linked to additional treatments in the database
(via BioAssayTreatment). If spots corresponding to the same protein have been matched
across gels, this information can be captured in the table MatchedSpots, and spots appear
with a different symbol in the Gel Viewer.
Figure 5.9: Screenshots for loading 2-DE, scanning and image analysis data into RAPAD. The scanner image is obtained from http://biology.berkeley.edu/EML/scanner.jpg; the image analysis software is a screenshot of DeCyder™ [74]. [Panels: 2-DE assay; Image acquisition; Image analysis.]
Figure 5.10: The tables present in the database schema store data from gel spots, image analysis and the scanning of a 2-D gel. The database also records information about spots matched across a number of gels in MatchedSpots and MultipleAnalysis, and difference gel electrophoresis data in the table DIGESingleSpot. [Diagram labels: BioAssayTreatment, Gel2D, ProteomeAssay, ImageAcquisition, Channel, GelImageAnalysis, IdentifiedSpot, DIGESingleSpot, PhysicalGelItem, MatchedSpots, MultipleAnalysis.]
Figure 5.11: The database schema for linking protein records to gel spots. A protein record is linked to the gel spot via the raw MS data and database searches that have been performed for identification. A direct link from the gel spot (PhysicalGelItem) to the protein record has also been implemented to enable fast queries. [Diagram labels: ProteinRecord, MassSpecExperiment, IdentifiedSpot, PeakList, BioAssayTreatment, PhysicalGelItem, BioMaterial, DBSearch, ProteinHit, AnalyteMeasurement; direct link to top protein hit.]
5.4.5 Mass spectrometry and external databases
Mass spectrometry data can be stored in tables derived from the PEDRo database schema.
The tables are linked to the rest of the schema via the BioAssayTreatment table (Figure 5.12).
BioAssayTreatment references a source of biological material, enabling MS data to be linked
to a protein sample arising from a series of separation techniques, which could be a gel spot.
However, in the current implementation only a URI is stored in the DBSearch table, linking
to the results of searches with MS data, generated using the MASCOT software. Certain
data are extracted automatically from the MS results using a script developed by Karl
Burgess (IBLS, University of Glasgow), and stored in the ProteinHit table, such as the
match score, e-value, the number of peptides hit in a sequence, and the sequence coverage³.
These factors enable the quality of match to be determined, allowing researchers to exclude
data from certain views in the interface, if the MS data does not conclusively identify a
protein. The table ProteinRecord stores properties of each protein in the database, such as
MW, pI, the protein’s name and a reference to the species of origin, stored in the SRes Taxon
table. The table ProteinRecordEntry links a record to external database entries, stored in
DatabaseEntry. DatabaseEntry captures the database accession number, and has three
external links to OntologyEntry in which the database name, database URI and database
version are captured. In this way, a protein record can be linked to any external database
required, as long as it is Internet accessible.
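The linkage between protein records and external databases described above can be sketched as a small relational schema. This is a minimal illustration using Python's built-in sqlite3 module, not the actual RAPAD schema: only the table names come from the text, while every column name, the example protein, the accession number and the URI are assumptions invented for the example.

```python
import sqlite3

# Table names follow the text; all column names are hypothetical simplifications.
SCHEMA = """
CREATE TABLE OntologyEntry (id INTEGER PRIMARY KEY, category TEXT, value TEXT);
CREATE TABLE ProteinRecord (id INTEGER PRIMARY KEY, name TEXT, mw REAL, pi REAL);
CREATE TABLE DatabaseEntry (
    id INTEGER PRIMARY KEY,
    accession TEXT NOT NULL,
    db_name_id INTEGER REFERENCES OntologyEntry(id),
    db_uri_id INTEGER REFERENCES OntologyEntry(id),
    db_version_id INTEGER REFERENCES OntologyEntry(id)
);
CREATE TABLE ProteinRecordEntry (
    protein_record_id INTEGER REFERENCES ProteinRecord(id),
    database_entry_id INTEGER REFERENCES DatabaseEntry(id)
);
"""


def build_example():
    """Create an in-memory database holding one invented protein record."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO OntologyEntry VALUES (?, ?, ?)", [
        (1, "DatabaseName", "ExampleDB"),
        (2, "DatabaseURI", "http://example.org/db"),
        (3, "DatabaseVersion", "1.0"),
    ])
    conn.execute("INSERT INTO ProteinRecord VALUES (1, 'example protein', 41.7, 5.3)")
    conn.execute("INSERT INTO DatabaseEntry VALUES (1, 'ACC0001', 1, 2, 3)")
    conn.execute("INSERT INTO ProteinRecordEntry VALUES (1, 1)")
    return conn


def external_links(conn, protein_id):
    """Return (accession, database name, URI, version) rows for one protein record."""
    return conn.execute("""
        SELECT d.accession, n.value, u.value, v.value
        FROM ProteinRecordEntry pre
        JOIN DatabaseEntry d ON d.id = pre.database_entry_id
        JOIN OntologyEntry n ON n.id = d.db_name_id
        JOIN OntologyEntry u ON u.id = d.db_uri_id
        JOIN OntologyEntry v ON v.id = d.db_version_id
        WHERE pre.protein_record_id = ?
    """, (protein_id,)).fetchall()
```

Calling `external_links(build_example(), 1)` follows the three references from DatabaseEntry into OntologyEntry, which is one plausible reading of the "three external links" mentioned in the text.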
5.4.6 RAPAD Querier
An important feature of a database system for functional genomics is the ability to perform
complex queries. The current RAPAD implementation includes a set of tools that enable
data to be visualised and queried, to support biological research. The use of the query
interface, the RAPAD Querier, is outlined in Chapters 6 and 7 with regard to two biological
investigations: the proteome of host cells when invaded with the parasite Toxoplasma gondii
3 Sequence coverage is the percentage of the protein sequence that is covered by the peptides that have been matched.
Figure 5.12: The database schema for mass spectrometry, adapted from PEDRo.
and the determination of the proteome of Trypanosoma brucei. Specific features of the
interface have been geared towards providing the queries required by the two projects, to
solve specific goals. An overview of the main features of the RAPAD Querier is given in the
rest of this section.
There are several different methods for accessing data in RAPAD. Firstly, for researchers
annotating data in a particular study there is an option to load any of the gels in that study in
the Gel Viewer (Figure 5.13). Researchers can also perform a search to find all 2-D gels within
their Project-Group preference settings, within a particular study, performed by specific
operators, or containing a certain protein name. The Gel Viewer has been implemented as
a Java Applet [168], an application that runs within a web browser, thereby enabling any
user to view data without needing to install new software (except Java). The Gel Viewer
is capable of loading multiple gels simultaneously in different tab windows. Within the Gel
Viewer basic searches can be performed to find particular protein names, label a specific spot
by ID number, or highlight a set of proteins with a range of molecular weights or pI values.
Controls exist for moving around the gel and zooming on particular regions for highlighting
subtle differences in spot patterns between two or more gels. Once the Gel Viewer has been
loaded, there is a set of options for viewing data about a single gel: 1. Display All Spots,
2. Display All Proteins, 3. Search This Data, 4. Display Gel Details, 4. Show Gel Info, 5.
Show Microarray Data, and if two gels have been loaded 6. Show Matched Spots.
1. There is an option to view a table, created dynamically in HTML, showing all the discrete
Figure 5.13: A screen shot of the 2-D Gel Viewer that provides search capabilities over protein data and links to MS results. There is a feature for loading multiple gels in different tabbed windows, for example for comparing gels for samples for different conditions.
Figure 5.14: A form for entering annotation about a gel spot and linking to protein records. Links are provided for adding data about protein modifications and updating the protein details.
Figure 5.15: A table displaying all the proteins identified on a single gel.
spots that have been identified on a gel. Each spot ID number is a hyperlink that loads the
record for that spot (Figure 5.14), where additional annotation can be entered and the gel
spot can be linked to protein data, such as MS information. There is also
a page for entering the type, location and description of post-translational modifications.
2. Similar output is provided displaying only the gel spots that have been matched to protein
records (Figure 5.15).
3. An option is given for loading an HTML form that enables searches to be performed over
a data set arising from a single gel. Search criteria include approximate matches to multiple
protein names entered, ranges of values for molecular weight, pI, and statistics from MS
data about the quality of a match (Figure 5.16). Boolean “AND” or “OR” searches can be
performed, and the resulting data can be ordered by any of the above criteria. The results
of a search are displayed in a table on a web page, with links to the source data, and an
option exists for highlighting the spots found by a search in the Gel Viewer.
4. Clicking the Display Gel Details button loads a page displaying the parameters and
protocol employed for the gel. There are links to separate protocols for the first and second
dimension separation, staining, and protein solubilisation. If the gel has been linked to
information about a biological sample (BioSample) or source of material (BioSource) in
Figure 5.16: The query interface for searching for specific protein records.
RAPAD, this information is displayed, along with the protocol for gel image scanning and
gel image analysis.
5. If the gel has been associated with a microarray study, a table can be loaded displaying
all the proteins on the gel, alongside the microarray expression values for the corresponding
gene. This feature is illustrated in the following chapter.
6. RAPAD has options for loading information about spots on different gels that correspond
to the same protein. Clicking the Show Matched Spots button loads an HTML page display-
ing all the proteins on the two gels. Spots that match across the two gels are highlighted
in bold, and if spot volume information has been entered, the ratio of volumes is displayed,
corresponding to an approximation of the change in expression of the protein between the
two conditions.
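The volume ratio shown for matched spots (option 6 above) can be sketched as follows. This is a minimal illustration rather than the RAPAD implementation: the function name, the data layout and the direction of the ratio (second gel divided by first gel) are all assumptions.

```python
def volume_ratios(gel_a, gel_b, matches):
    """Compute volume ratios for spots matched across two gels.

    gel_a, gel_b: dict mapping spot ID -> spot volume (None if not entered).
    matches: iterable of (spot_id_a, spot_id_b) pairs for matched spots.
    Returns a dict mapping each matched pair to volume_b / volume_a.
    Pairs with missing or zero volumes are skipped (an assumed policy).
    """
    ratios = {}
    for spot_a, spot_b in matches:
        vol_a = gel_a.get(spot_a)
        vol_b = gel_b.get(spot_b)
        if vol_a is None or vol_b is None or vol_a == 0:
            continue
        ratios[(spot_a, spot_b)] = vol_b / vol_a
    return ratios
```

A ratio above 1 would then suggest higher abundance in the second condition, subject to the accuracy of the spot volumes reported by the image analysis software.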
An important feature is the ability to summarise all the data within a study, especially if it
results from proteins identified on a number of different gels. An option exists to classify gels
within a study into two groups, for example one set of gels from “disease” samples, versus a
set of “normal” samples. The proteins identified in the two groups appear in separate tables,
with links to the source protein records, and an option to load the Gel Viewer highlighting
selected spots on the gel.
5.4.7 Public data access
The standard interface contains pages displaying protein spot records that can be updated,
intended for researchers to modify and insert new data as required. Clearly, this system
is not suitable for external access, even if updates could only be performed by researchers
with a specific login, because it would be difficult to ensure that data was always secure.
Therefore, a separate interface has been created allowing anyone to view publicly acces-
sible data in RAPAD, which only has views of the data, with no facilities for updating.
Data can be accessed in this interface through a page that displays all the public studies
in RAPAD, giving the option to load particular gels. A query page is also available to
search for particular proteins identified on any gel in the public system. The page displaying
protein records in RAPAD can be queried by a web link, thus providing basic program-
matic access. The following URL can be used to link to any record on the public system
The attribute CATEGORY captures the type of term: ProtocolType, DevelopmentalStage,
DataType and so on. An example would be:
• CATEGORY = ProtocolType
• VALUE = nucleic acid extraction
• DEFINITION = "The procedure of extracting nucleic acid from the
biomaterial"
In RAPAD, additional entries have been included in the OntologyEntry table to cover properties
of proteins, such as types of chemical modifications. The storage of post-translational
modification (PTM) data is an important feature of RAPAD; such data may, for instance, be
generated from tandem MS or from a phosphate labelling experiment. The type of PTM,
such as glycosylation, phosphorylation or biotinylation, is obtained from the OntologyEntry
table. This has two clear advantages: firstly to reduce manual entry, as terms do not have
to be typed in each time, but are selected from a drop-down menu; secondly, errors and
imprecision should be reduced if the term is presented to the user with a clear definition,
ensuring that there is a shared understanding of exactly what is being specified. It would
not be possible to design an ontology capable of capturing all terms used in any type of
study. The approach taken in RAD is that users can enter new terms when required, once the
terms have been checked by a member of MGED. A similar feature has been implemented in RAPAD,
whereby new terms can be added to the OntologyEntry table by contacting the author.
Terms are annotated as “user defined” along with a URI specifying the source of the term
and a definition to ensure that the origin of the term is clear.
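The controlled vocabulary described above can be illustrated with a small in-memory sketch. The attributes CATEGORY, VALUE and DEFINITION and the "user defined" annotation with a source URI come from the text; the class names, the method names and the rule that rejects user-defined terms lacking a URI are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OntologyEntry:
    category: str                      # e.g. "ProtocolType"
    value: str                         # the controlled term itself
    definition: str                    # shown to the user alongside the term
    user_defined: bool = False         # True for terms added on request
    source_uri: Optional[str] = None   # origin of a user-defined term


class Vocabulary:
    """A hypothetical in-memory stand-in for the OntologyEntry table."""

    def __init__(self):
        self._entries = {}

    def add(self, entry):
        # Assumed rule: user-defined terms must record where they came from.
        if entry.user_defined and not entry.source_uri:
            raise ValueError("user-defined terms need a source URI")
        self._entries[(entry.category, entry.value)] = entry

    def choices(self, category):
        """Sorted terms for one category, e.g. to populate a drop-down menu."""
        return sorted(v for (c, v) in self._entries if c == category)

    def lookup(self, category, value):
        """Return the full entry so that its definition can be displayed."""
        return self._entries[(category, value)]
```

Presenting `choices(...)` in a menu, with the definition from `lookup(...)` beside each term, corresponds to the two advantages noted in the text: less manual entry and a shared understanding of each term.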
A number of terms describing proteins and proteomics experiments have been added
to the OntologyEntry table during the development of RAPAD. It is important that this
controlled vocabulary is made available to others developing similar systems. The PSI is
developing an ontology (PSI-Ont) as an extension to MO, covering protein terms, and ul-
timately will provide a repository where developers can obtain and add new terms used in
proteomics studies. The vocabulary developed for RAPAD will contribute to PSI-Ont.
A separate part of GUS, known as SRes, stores phenotype information such as disease
states, bibliographic references and taxonomy information. SRes has been installed alongside
RAPAD, and stores a flat representation of parts of the NCBI taxonomy [224], which is in
effect an ontology of species. This means that the names of species are captured in a
controlled way, which facilitates database queries.
5.5 Discussion
The RAPAD system was developed with several main aims: to support the local proteomic
research requirement, to test the extension of RAD into proteomics as a prototype of a future
public repository for proteomics, to assess if FGE-OM correctly models the data semantics
and to test facilities for correlating changes in protein abundance with gene expression values.
In this section, the progress towards these goals is discussed.
5.5.1 A prototype of a central repository
RAPAD has been developed on top of RAD, which is a well established system grounded in
a significant amount of database research. RAD has robust facilities for storing structured
descriptions of biological samples and experimental protocols, and uses ontologies to create a
standard representation of certain concepts. Protocols stored in this way can be queried more
easily than a free text description, and this opens the possibility for data mining in the future.
RAPAD makes use of the features from RAD that ensure data integrity and security, with
facilities for tracking which individuals have entered data, and restricting access to certain
information where necessary. The successful implementation of a proteomics database using
core RAD tables also demonstrates that parts of the schema could be used for other types
of functional genomics study, such as immunohistochemistry.
RAPAD has been tested by the developers of GUS. The developers have taken the
database schema and interface code, and work is underway to add the proteomics component
to GUS. The addition of proteomics support in GUS will be a major advance for
web sites such as PlasmoDB, which provides access to FG data for Plasmodium falciparum,
the causative agent of malaria. Large volumes of proteome data are being produced for P.
falciparum [110, 229], but there is currently no method for publicly releasing the material in
a format that can be queried, and it cannot be integrated with microarray or genomic data.
One of the goals of developing RAPAD was to build a prototype of a public proteomics
repository. The proteome extension of GUS is underway, utilising the RAPAD database
schema and interface code, demonstrating that the prototyping stage has been successful.
5.5.2 The relationship between FGE-OM and RAPAD
The object model specified in the previous chapter is a proposal for a data standard. However,
a specification expressed solely in UML cannot be used to test if the concepts of the domain
have been correctly modelled, or if real data can be captured in practice. One of the functions
of RAPAD is to demonstrate that real data can be captured by our proposal. We must first
establish the correspondence between FGE-OM and RAPAD, because the database schema
was not created automatically from the object model. Figure 5.2 displays the names of
classes in FGE-OM and tables in RAPAD that cover the same parts of the domain. The
attributes for the majority of tables are identical or very similar to those belonging to classes
in FGE-OM (the database schema and additional diagrams of FGE-OM are displayed in the
appendices). The BioOM and ArrayOM namespaces in FGE-OM contain classes derived
from MAGE-OM. The relationship between these classes and tables in RAD (now inherited
in RAPAD) has been established previously, and software is in development for automatically
converting between MAGE-OM and RAD [202]. Many of the tables in RAPAD that store
proteome data are derived from the PEDRo database schema, and the PEDRo schema and
object model are virtually identical. Therefore, the parts of ProteomicsOM that are derived
from PEDRo are highly similar to the corresponding part of RAPAD. Finally, tables have
been created in RAPAD that exactly correspond with the parts of FGE-OM that are derived
from Gla-PSI. The overall result is that FGE-OM and RAPAD have a very similar structure.
It is therefore reasonable to state that illustrating the use of RAPAD in a real research
environment demonstrates that FGE-OM correctly models proteome workflows. The
integration of gene and protein expression results was one of the main goals of developing
FGE-OM, and this functionality is demonstrated in RAPAD in the following chapter.
5.5.3 Support for current proteome studies
A second goal of developing RAPAD was to produce a system capable of supporting on-going
proteomics research, because the currently available databases do not offer all the facilities
that are required. The following two chapters describe projects that are supported by the
current implementation; in this section, a brief description of the main advantages
of RAPAD is given.
The database allows a structured description of experimental protocols and biological
samples to be specified using ontologies. This should improve the capabilities for querying
in the future as data sets become large. This feature is not included in SWISS-2DPAGE
and GELBANK, which only offer fairly simple descriptions of protocols. The data security
features inherited from RAD also provide a simple mechanism for allowing particular re-
searchers or groups to access or modify information in the database. This feature is vital for
large organisations in which many different levels of security could be required.
Data security models
In modern database management systems (DBMS) there are two broad approaches used to
ensure data security: discretionary and mandatory [66]. The security policy can be enforced
at various levels, such as over the entire database, on particular relations, or down to the level
of a single attribute of one row of data. The discretionary approach gives particular rights
to a specific user on different objects in the database, and different users may have different
rights on the same object. Therefore, this model is very flexible but it has a large overhead
if security settings have to be checked for many different objects and users. The alternative
security approach is the mandatory scheme in which certain database objects are assigned
a particular classification, and users are given a clearance level that specifies which data are
accessible or can be modified. The mandatory approach is used in situations where data
fall into particular levels of accessibility, such as government or military databases where
controlling data access is of utmost importance. Security settings can be managed by the
security subsystem of the DBMS, and encoded as a set of rules that must be checked every
time an object is accessed or modified.
In RAPAD, a security system closer to the discretionary approach is employed at the level
of individual rows of data (tuples). However, there are currently no formal rules specified in
the DBMS; instead, checks are made by the user interface to ensure that data have been
specified as publicly accessible, can be modified by a certain user, and so on. The attributes
that specify the security setting exist for every tuple in the database. This approach is possibly
not as robust as having security rules set in the DBMS, but DBMS-level rules would require a
permanent database administrator to update them for every new user or group that utilises the
database. Because the settings are stored with each tuple, the approach taken in RAPAD should
in theory be more robust than ensuring data security only at the level of the user interface.
Additionally, the security settings can
be updated automatically without requiring a permanent database administrator.
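The tuple-level checks described above can be sketched as follows. This is a hedged illustration of a discretionary-style check made in application code, not the RAD or RAPAD implementation: the attribute names (owner, group, public_read, group_write) and the exact rules are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One tuple together with hypothetical per-row security attributes."""
    data: dict
    owner: str
    group: str
    public_read: bool = False
    group_write: bool = False


@dataclass
class User:
    name: str
    groups: frozenset = frozenset()


def can_read(user, row):
    """Public rows are readable by anyone, including anonymous users (user=None)."""
    if row.public_read:
        return True
    return user is not None and (user.name == row.owner or row.group in user.groups)


def can_write(user, row):
    """Only the owner, or group members when group_write is set, may modify a row."""
    if user is None:
        return False
    if user.name == row.owner:
        return True
    return row.group_write and row.group in user.groups
```

Because the settings travel with each row, adding a new user or group only requires new attribute values and not a change to rules held in the DBMS, which is the trade-off the text describes.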
Query capabilities
RAPAD has a query system that enables users to generate fairly complex queries to find
particular proteins in a study. The details of MS search results are stored, which enables
the quality of a match to a protein to be determined, for example allowing a researcher
to exclude particular proteins that are only weakly identified. The results of a search over
different gels can be displayed in the Gel Viewer, which can load several gels simultaneously
for comparing the proteomes of different samples. The Gel Viewer has other features that are
advantageous compared with other databases, such as the facilities to zoom to an unlimited
depth to visualise small spots. The same region can be highlighted on a different gel to find
differences in the pattern of spots. The Gel Viewer can also display the names of proteins,
and their predicted pI and MW; these labels can be toggled on or off. There are capabilities that
enable researchers to search for possible post-translational modifications. These features are
exemplified in Chapters 6 and 7.
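The kind of search described in this chapter (protein names, ranges of MW and pI, MS match quality, Boolean "AND" or "OR" combination, and ordering of results) can be sketched in a few lines. This is an illustration only: the record layout and parameter names are hypothetical, and "approximate" name matching is simplified here to a case-insensitive substring test.

```python
def search_spots(spots, name_terms=(), mw_range=None, pi_range=None,
                 min_score=None, mode="AND", order_by="score"):
    """Filter spot records (dicts with name, mw, pi, score keys).

    Criteria that are not supplied are ignored. mode selects whether all
    supplied criteria ("AND") or at least one of them ("OR") must hold.
    """
    def checks(spot):
        results = []
        if name_terms:
            name = spot["name"].lower()
            results.append(any(term.lower() in name for term in name_terms))
        if mw_range is not None:
            results.append(mw_range[0] <= spot["mw"] <= mw_range[1])
        if pi_range is not None:
            results.append(pi_range[0] <= spot["pi"] <= pi_range[1])
        if min_score is not None:
            results.append(spot["score"] >= min_score)
        return results

    combine = all if mode == "AND" else any
    hits = []
    for spot in spots:
        results = checks(spot)
        # With no criteria at all, every spot is returned.
        if not results or combine(results):
            hits.append(spot)
    return sorted(hits, key=lambda spot: spot[order_by])
```

Setting `min_score` corresponds to excluding weakly identified proteins, as discussed above; in a database-backed system the same criteria would more likely be translated into a parameterised SQL query.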
Integration of gene and protein experiments
In the introduction, it was hypothesised that extending a database schema and graphical
user interface intended for microarray experiments into proteomics would facilitate the
integration of data across the two domains. In the following chapter, there is a description of
how the results can be integrated by matching the identifiers associated with gene expression
values to the identifiers for protein abundance. However, this is only part of the process.
An advantage of our approach is that biological samples and experimental protocols can be
entered into RAPAD, and are not linked to a particular experiment, but can be used in
any context. For this reason, a sample could be described a single time, using ontologies to
record the type of material, the source (company, organisation, contact details and so on),
and the species of origin. The sample description could then be associated with a microarray
hybridization, 2-DE, or an LC-MS analysis. When a large number of studies of this type
have been entered in the system, the RAPAD Querier will be capable of retrieving all the
experiments that have been performed on a particular sample. Therefore, integration occurs
at the level of results, as described in the following chapter, and at the level of the biological
samples and experimental protocols.
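Integration at the level of results, by matching identifiers, can be sketched as a simple join. This is a minimal illustration under stated assumptions: the function name, the dictionary layout and the existence of an explicit protein-to-gene identifier mapping are all hypothetical, and the identifiers in any usage would be placeholders.

```python
def integrate(protein_abundance, gene_expression, protein_to_gene):
    """Join protein abundance values to gene expression values.

    protein_abundance: dict protein ID -> abundance (e.g. spot volume ratio).
    gene_expression: dict gene ID -> expression value.
    protein_to_gene: dict protein ID -> gene ID (the identifier mapping).
    Returns a sorted list of (protein ID, gene ID, abundance, expression);
    proteins without a mapped, measured gene are left out.
    """
    rows = []
    for protein_id, abundance in protein_abundance.items():
        gene_id = protein_to_gene.get(protein_id)
        if gene_id is None or gene_id not in gene_expression:
            continue
        rows.append((protein_id, gene_id, abundance, gene_expression[gene_id]))
    return sorted(rows)
```

The resulting rows are the sort of table that could be displayed alongside a gel, with each identified protein next to the expression value of its corresponding gene.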
Availability of RAPAD
The database schema, the RAPAD Study-Annotator and the code for the Gel Viewer are
all freely available for download on the web site. Therefore, other developers can install
RAPAD locally to manage their own proteomics data, and there should not be a significant
overhead installing the current version. However, the current version has not undergone
several rounds of testing and therefore may require some modification or bug fixes once
implemented elsewhere.
Features of RAPAD demonstrate the feasibility of integrating proteomics and microar-
ray data in a single system (a specific example of this facility is described in the following
chapter). At present there are no well publicised systems offering this facility. The CEBS
SysBio system [355] hopes to offer similar capabilities in the future for data mining across
a range of experiment types, but a working prototype is not currently available. An inte-
grated database enables researchers to begin asking questions about the correlation between
gene expression and protein abundance at the global level. It is also thought that post-
translational modifications are important for protein function, and their relationship with
gene expression and protein abundance values has not previously been investigated. It is also
likely that a proteome database could discover instances where proteins display modulated
regulation, which would not be observed at the transcriptome level.
5.5.4 Future developments
The current implementation of RAPAD supports proteomics research and can store microar-
ray data. It has been demonstrated that experimental hypotheses, biological samples and
protocols can be stored in common tables, regardless of whether a microarray or proteome
experiment has been performed. RAPAD could therefore be extended to cover metabolomics
experiments, given that metabolome data comprise column separations and mass spectrom-
etry. This would allow for integration across the transcriptome, proteome and metabolome,
giving a broad view of the biological system to the researcher. A future version of the
database could also incorporate a number of features that will improve facilities for data
mining. A number of links to external databases are already provided but this could be ex-
tended. For example, proteins that have a 3-D structure could be displayed using structure
visualisation software, such as RasMol [280] or Chime [216]. For certain studies it would also
be useful to correlate protein abundance with chromosomal location; this could be achieved
using the Expressionview software, which can display a microarray data set, and visualise
the position of genes on the chromosomes [109]. The relationship between chromosomal loca-
tion and gene expression is particularly important for bacterial studies because sets of genes
are often co-expressed from operons, and the genes within one operon often have related
functions.
Functional classification of genes and proteins in RAPAD is provided through dynamic
links to the Gene Ontology (GO); however, a great variety of new software is currently in
development by a number of groups for summarising and correlating functional categories
with expression values. Additional software for querying and summarising GO, such as
GoMiner [368] (described in Chapter 2), will be installed alongside RAPAD when it becomes
available.
The current RAPAD implementation does not provide support for any detailed statisti-
cal analysis of data sets. The R software has a programmable interface that allows direct
connection to relational databases [261]. Therefore, pre-defined packages can be used to
search for significant differences in protein volumes, and correlations between gene and
protein abundances. New packages can also be written in R for normalising across mRNA and
protein volume data, and for mining data to search for patterns of co-regulation. These
features would enable protein abundance data to be queried in parallel with gene expression
studies, functional classifications and 3-D structures to improve the facilities for knowledge
acquisition. This kind of statistical analysis requires large data sets from gel electrophoresis,
and more research is required into the accuracy of relative protein volume between two or
more gels, detected by image analysis applications.
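As an indication of the kind of analysis envisaged, the sketch below computes a Pearson correlation between log-transformed protein volume ratios and gene expression ratios. The text proposes R for this purpose; Python is used here purely for illustration, and the choice of Pearson correlation on log2 ratios is an assumption rather than a method taken from the thesis.

```python
import math


def log2_ratios(ratios):
    """Log-transform ratios so that up- and down-regulation are symmetric."""
    return [math.log2(r) for r in ratios]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

Any coefficient obtained this way would only be as reliable as the underlying spot volumes, which is the caveat raised above about the accuracy of relative protein volumes between gels.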
5.6 Conclusions
RAPAD supports proteomics research, and comprises a relational database with a web based
interface, which has been created by extending existing technologies. The system uses on-
tologies to capture knowledge in a standardised, controlled manner. This demonstrates that
re-using and integrating existing systems can facilitate integration of different types of data,
and that the time to develop a large system is significantly reduced, compared with develop-
ing de novo. The implementation also acts as a prototype for a major, public repository for
proteomics, which is currently in development. In the following two chapters, two specific
projects are described that allow the core features of RAPAD to be evaluated. The results
will illustrate how the software has enabled researchers to improve annotation of their data,
and formulate queries that facilitate new biological discoveries.
Chapter 6
Database support for proteomic
studies of host-parasite interactions
6.1 Introduction
The RAPAD system was created to test the feasibility of extending an established microarray
database into proteomics, as a step towards creating a single, integrated database for func-
tional genomics. In this chapter, an example is given of a project that is supported by the
current implementation of RAPAD, including an outline of how facilities of the database have
been specifically tailored for making new discoveries in this area. The biological investigation
aims to characterise changes in the proteome of host cells when invaded with the intracellular
protozoan parasite Toxoplasma gondii, from an in vitro culture. This chapter outlines how
the data from this investigation allows the core facilities of RAPAD to be evaluated. A de-
scription is given of additional software that has been developed for: (i) the visualisation of
proteins with modulated expression, (ii) the integration of the proteomics data with previ-
ously published microarray studies and (iii) the discovery of post-translational modifications.
The results enable researchers to formulate hypotheses about the biological processes that
occur during parasite invasion, and gain a better understanding of host-parasite relationships
in general.
6.1.1 Host-parasite interactions
Toxoplasma gondii and the closely related parasites Plasmodium,
Cryptosporidium and Eimeria pose major global problems to human and animal health.
Genome projects are well underway, and functional genomics investigations are being used to
elucidate the biological processes involved in the infectivity of the parasites [6]. Toxoplasma
is used as a model organism for studying related parasites because it is relatively easy and
safe to culture in vitro, can invade animal models or host cell cultures, and possesses many
of the characteristics of its phylum, Apicomplexa [186]. T. gondii can infect a remarkably
wide range of hosts, including birds, livestock, humans and even oceanic mammals, such as
whales. The parasite is found in almost all geographic regions and infects 10-30% of human
populations [49]. Infection occurs after ingestion of oocysts from the faeces of cats, the
definitive host, or from tissue cysts in infected, undercooked meat. In the majority of cases,
T. gondii forms cysts in the deep tissues, including the brain, where it maintains a life-long
chronic infection. Toxoplasma induces disease in certain cases: (i) the parasite can cross the
placenta to the foetus, causing congenital defects or abortion; (ii) T. gondii can also be fatal
in immuno-compromised patients, for example in individuals with AIDS. It is believed that
the tissue cysts rupture, enabling the parasite to switch from the latent form (bradyzoite)
to a rapidly dividing form (tachyzoite), killing host cells. Therefore, one of the areas for
further investigation is to discover the factors that cause a switch between the two forms
[198]. Substantial work has also been carried out to identify the parasite and host proteins
that are critical for infectivity, and to elucidate the pathways in which they function. It
is believed that the method of invasion is conserved across the Apicomplexa; therefore, any
discoveries made in T. gondii could have far reaching consequences.
The parasite invades by the following mechanism (reviewed by Sibley 2004 [291]). Toxo-
plasma releases molecules (adhesins) that attach to surface receptors on host cells. The para-
site actively penetrates the membrane, enclosing itself within a vacuole (the parasitophorous
vacuole) that is primarily formed from the host’s cell membrane, thereby reducing the ability
of the host to recognise and reject the parasite. The parasite releases the contents of a set of
organelles into the cytosol, including rhoptries that are crucial for parasite infectivity [29].
Rhoptries release a set of proteins that cause the parasitophorous vacuole to interact with
host cell mitochondria and endoplasmic reticulum, allowing the parasite to scavenge glucose
and cholesterol. An understanding of the proteins and pathways involved in infectivity has
been developed over several decades using classical techniques, such as gene knockout experi-
ments [185], but developments in technology have now opened up the possibility of analysing
the systems on a much larger scale.
6.1.2 Genomic investigation of Toxoplasma
The genome of T. gondii is currently being sequenced [187] and access to large EST databases
has been available for several years [321]. Therefore, T. gondii can now be investigated using
functional genomics techniques, allowing researchers to gain a wider view of the systems
involved with infectivity than previously possible. The genome is 80 Mb (megabases) in
size and contains 11 chromosomes; as of early 2004, there is ten-fold coverage of the
sequence [320], created using the “shotgun” approach [52]. Many genes have little or no
functional assignment; therefore, any studies that provide insights into gene function will aid
the annotation efforts. A previous study investigated the constituents of the proteome of the
tachyzoite (rapidly dividing stage) of T. gondii by two-dimensional gel electrophoresis (2-DE)
[61]. The study discovered that the same proteins appear in a number of positions on a single
gel, indicating that differential splicing of gene products, or post-translational modifications
are common. A separate investigation into the proteomics of Toxoplasma demonstrated a
protocol for a 2-D gel map of the tachyzoite stage [79]. Microarray studies have also been
carried out by Gail [116] and Blader [35], discussed below.
6.1.3 Microarray analysis
A more detailed understanding of the function of proteins from Toxoplasma, and the com-
plex networks of interacting proteins, will greatly facilitate the search for new drug targets.
However, researchers also wish to focus on how the parasite interacts with host cells, and
what changes occur in the functioning of the host cells. Microarray studies by Blader and
colleagues [35] determined the genes that are significantly up or down-regulated in a host
cell culture (Human Foreskin Fibroblasts, HFF) when invaded by the parasite, compared
with non-infected host cells, at a number of time points after invasion (1-24 hours post in-
fection). Several groups of genes displaying modulated expression were defined, leading to
hypotheses about the mechanisms of parasite invasion, and the recruitment of host processes
for its own survival. It is believed that the parasites arrest the cell cycle to enable them to
continue utilising host resources as long as possible. An important mechanism for host cell
defence against parasites and viruses is the apoptosis cascade, which causes host cells to die,
thus preventing further development of the intracellular pathogen. Evidence suggests that
Toxoplasma switches on a number of host genes that inhibit and prevent the propagation
of the apoptosis cascade [292]. The microarray results revealed down regulation of genes
implicated in mitosis and meiosis (cell cycle processes), apoptosis genes, and cytoskeletal
proteins. The role of calcium dependent signalling during parasite invasion has also been
studied in detail (reviewed by Arrizabalaga and Boothroyd [17]). Some evidence suggests
that Toxoplasma utilises its own calcium dependent pathways, unlike other parasites that
sequester host pathways, therefore one area of study is to determine if there are also changes
in the host genes implicated in these processes. Blader also discovered up-regulation of
genes involved in glycolysis and cholesterol synthesis for energy generation. Infection by
the parasite is resisted by host cells, and therefore it is expected that an up-regulation of
genes involved in the immune response would be observed. In the microarray study, an early
up-regulation of these genes was observed, at one hour post-infection.
A later study by de Avalos, Blader and colleagues [73] performed a microarray experi-
ment, similar to Blader 2001, on the related organism Trypanosoma cruzi. T. cruzi is also
an intracellular pathogen believed to invade by a similar mechanism. The results indicated
that very few host genes were up-regulated early in infection, unlike the T. gondii data,
and across the whole data set, the correspondence in up-regulated genes between T. gondii
and T. cruzi was very low. This has important implications for general understanding of
host-parasite interactions. It has previously been thought that the response of host cells
to invasion by a parasite would be the same, or similar, regardless of the type of parasite.
However, the comparison of the T. gondii and T. cruzi data suggests that there may be
different mechanisms used by host cells to respond to invasion by different parasites. The
consequence of this finding is that drug development should be targeted towards disrupting
very specific processes for specific parasites, rather than targeting a single set of processes
to prevent invasion by any kind of parasite. It is important that the host responses to a
number of parasites are studied in more detail to elucidate the mechanisms involved.
6.1.4 Support for proteome studies
RAPAD is supporting a project from the laboratory of Jonathan Wastling in the Institute
of Biomedical and Life Sciences at the University of Glasgow. The investigations were
performed by Morag Nelson, a PhD student, as part of a project to investigate the changes
in the proteome of mammalian host cells when invaded with T. gondii, compared with non-
infected host cells, at 24 hours post infection. The investigation uses 2-DE for protein
separation, coupled with MS (mass spectrometry) for protein identification. The specific
aims of the biological investigation are as follows. Firstly, to verify if changes observed at
the transcriptional level (by microarray analysis) are confirmed by changes in the amount of
protein produced. Secondly, it is believed that because proteins are the functional unit in the
cell, protein abundance is a better indicator of functional significance than gene expression
values. Therefore, new groups of proteins could be discovered with modulated expression,
which were not found by microarray analysis, leading to the formation of novel hypotheses.
A third aim is to investigate what role post-translational modifications (PTMs) might play
in parasite infectivity.
The experiments present considerable computational challenges that enable the evalu-
ation of the core facilities of RAPAD in three key areas: managing large volumes of data
across replicates, enabling complex queries, and visualisation of results to allow new findings
to be derived. In this chapter, we report on additional work by the author to develop specific
queries and visualisation software, in order to enable differential expression of proteins to
be detected across two conditions from a number of replicate gels. Facilities have also been
developed to integrate microarray data points with the corresponding proteins identified by
MS, and to support the storage and querying of PTMs, in conjunction with gene expres-
sion and protein abundance data. The integration of transcriptome and proteome data may
answer several questions:
• The interval between changes in gene expression and protein abundance. If
genes are up-regulated immediately after infection, when are changes observed in the
level of protein?
• Translational control: are there groups of proteins with modulated expression that
were not associated with a change in gene expression?
• Post-translational modification: do groups of proteins undergo changes in modifi-
cation status that are functionally significant, where there is no change in the rate of
transcription?
6.1.5 Project status
The current status of the biological study is as follows. 14 gels produced by 2-DE, from seven
infections with T. gondii and seven non-infected cell lines, have been loaded into RAPAD.
From the gels, approximately 350 differentially expressed spots have been identified. There
are 130 distinct proteins out of the 350, because some proteins appear in multiple copies in
different places on the 2-D gels, and in some cases the same protein has been identified on
replicate gels. Currently, about 40 protein spots (14 distinct) have been matched to the
corresponding microarray clone, although it is expected that this number will increase as the
number of protein records in RAPAD increases (discussed in Section 6.4).
The rest of the chapter is structured as follows. Section 6.2 briefly describes the biological
methodology, how differential protein expression data is visualised, how microarray data
points are matched to protein records, and the techniques employed to assign, display and
summarise functional classification of proteins. An overview of the results is given in Section
6.3, focusing on how RAPAD has supported the generation of new hypotheses. Discussion
is provided in Section 6.4.
6.2 Methods
The source of biological material for the investigations was a human foreskin fibroblast
(HFF) cell line, which was prepared and infected with Toxoplasma gondii, using a protocol
reproduced from the microarray study by Blader et al. 2001 [35]. This should ensure that
the proteome data from these studies are, as far as possible, comparable with the earlier
microarray analysis. The experimental protocols for protein solubilisation, the IPG strip
(first dimension separation), the gel electrophoresis stage (second dimension) and staining
are stored in RAPAD. Eleven biological replicates (infected versus non-infected, 22 gels)
were performed but most of the examples given are from a single replicate (replicate 11).
Coomassie blue stain was used to visualise proteins, gels were scanned with a standard
laboratory scanner and images were analysed using the ImageMaster 2D Elite software [162].
The matching of spots on two different gels (pairwise between replicates) was performed by
the 2D Elite software, which also measured the spot volumes. Differential expression of
proteins was determined as follows. Spots with a volume difference of greater than 30% were
picked for MS analysis, as were spots that were present on one gel and absent from the other, as
determined by manual inspection. The gels were normalised to background on a per spot basis, after
background subtraction had taken place, using the method “normalisation at lowest on
boundary”. The spot coordinates and volumes determined by 2D Elite were imported into
RAPAD. Samples were sent for MALDI-TOF (Matrix Assisted Laser Desorption Ionisation
- Time of Flight) analysis and identifications were made using the MASCOT software [207].
The samples that did not produce a significant protein identification were analysed using a
tandem MS system (AB Q-Star Pulsar).
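The selection rule for differentially expressed spots can be illustrated with a minimal sketch. It assumes that the 30% difference is measured relative to the smaller of the two spot volumes (the text does not state the reference volume), and the function name and spot volumes are hypothetical.

```python
from typing import Optional


def is_candidate(vol_infected: Optional[float],
                 vol_control: Optional[float],
                 threshold: float = 0.30) -> bool:
    """Return True if a matched spot pair would be picked for MS analysis."""
    # A spot present on only one gel of the pair is always a candidate.
    if vol_infected is None or vol_control is None:
        return (vol_infected is None) != (vol_control is None)
    # Assumption: the difference is expressed relative to the smaller volume.
    smaller = min(vol_infected, vol_control)
    if smaller <= 0:
        return vol_infected != vol_control
    return abs(vol_infected - vol_control) / smaller > threshold


# Invented spot volumes: (infected gel, non-infected gel).
pairs = {
    "spot_A": (1200.0, 1000.0),  # 20% difference, not picked
    "spot_B": (1400.0, 1000.0),  # 40% difference, picked
    "spot_C": (None, 850.0),     # absent from the infected gel, picked
    "spot_D": (1000.0, 1050.0),  # 5% difference, not picked
}
picked = [name for name, (inf, ctl) in pairs.items() if is_candidate(inf, ctl)]
```

In the study itself the presence or absence of a spot was judged by manual inspection; the sketch only encodes the outcome of that judgement as a missing volume.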
The contribution of the author was: (i) to develop the core RAPAD system, as described
in the previous chapter; (ii) to create additional displays of differential expression (Section
6.2.1); (iii) to write software for matching gene expression values to protein abundance data
(Section 6.2.2); and (iv) to develop scripts to retrieve identifiers that enable hyperlinks to be
created from RAPAD to external software, in order to provide a summary of the functional
classification of each protein in this specific study (Section 6.2.3).
6.2.1 Display of protein data from different gels
The previous chapter described facilities in the RAPAD Gel Viewer. The Gel Viewer enables
multiple gels to be loaded simultaneously, the results of searches to be viewed, and offers the
display of the predicted charge and molecular weight of the proteins. These features allow
a researcher to search for PTMs and analyse the proteins that have been identified in the
study. For the Toxoplasma investigation an additional interface was created to improve the
visualisation of proteins that are differentially expressed on 2-D gels. The interface addresses
the problem of spot interpretation where certain proteins appear in several different positions
on individual gels, corresponding to particular PTMs or differentially spliced forms of the
protein. A series of gels has been run from replicate samples, in which there may be
supporting or contradictory evidence, and this information must be assimilated. The first
goal was to develop additional software to aid researchers to define the spots on different
gels that correspond to the same protein.
EXAMPLE: Spots matching protein XYZ1 appear ten times on gels from infected samples
(across replicates), and three times from non-infected samples (exemplified in the Results
section, Figure 6.4). A visualisation has been created that shows the exact regions in which
XYZ1 appears on the different gels, to enable the researcher to say how many different
forms of XYZ1 exist in total, and which different forms are up or down-regulated. It may
be the case that the three spots containing XYZ1 from the non-infected sample correspond
to a particularly modified form of the protein that has the same abundance in infected and
non-infected samples. However, the additional spots from infected samples correspond to a
different form of XYZ1, which is produced in greater abundance, and is crucial for parasite
infectivity. The visualisation system displays which forms of the protein are up-regulated,
down-regulated or have stable abundance.
A query has been developed in RAPAD that returns a page that lists the proteins that
have been identified across all replicates. The researcher selects the proteins they wish to
investigate and the Gel Viewer opens, highlighting the proteins selected on all replicates,
with each replicate gel loaded in a separate tabbed window of the viewer. The researcher
can zoom on the proteins and note the ID numbers of spots in the same position. The ID
numbers of spots in the same positions are manually entered into a text file by the researcher,
and it is loaded into the database (into the tables MatchedSpots and MultipleAnalysis).
This allows a spot set to be defined that corresponds to the same form of the same protein on
different gels. After a spot set has been defined, a second interface displays the spot sets and
the volume of individual spots on different gels, if these values have been entered in RAPAD,
to display which spots appear in greater or lesser volume. This should allow the researcher
to define a particular variant of a protein (one spot set) as up or down-regulated during
infection. Section 6.3.1 gives an example of how the software has been used in practice to
identify differentially expressed proteins in the biological investigation.
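The way a spot set can be summarised as up-regulated, down-regulated or stable is sketched below. This is not the RAPAD implementation: the row layout only loosely mirrors the idea of the MatchedSpots table, the 30% threshold is borrowed from the spot-picking rule as an assumption, and all volumes and the spot IDs on infected gels are invented.

```python
from collections import defaultdict
from statistics import mean

# (spot_set, condition, gel, spot_id, volume); hypothetical rows.
matched_spots = [
    ("ACTB_set1", "non-infected", "rep1", 42, 1000.0),
    ("ACTB_set1", "non-infected", "rep3", 29, 1100.0),
    ("ACTB_set1", "infected", "rep1", 17, 1600.0),
    ("ACTB_set1", "infected", "rep3", 12, 1500.0),
    ("ACTB_set2", "non-infected", "rep1", 41, 800.0),
    ("ACTB_set2", "non-infected", "rep3", 27, 820.0),
    ("ACTB_set2", "infected", "rep1", 16, 790.0),
]


def classify_spot_sets(rows, threshold=0.30):
    """Label each spot set by the ratio of mean infected to mean control volume."""
    volumes = defaultdict(lambda: defaultdict(list))
    for spot_set, condition, _gel, _spot_id, volume in rows:
        volumes[spot_set][condition].append(volume)

    labels = {}
    for spot_set, by_condition in volumes.items():
        infected = by_condition.get("infected")
        control = by_condition.get("non-infected")
        if not infected or not control:
            # No volumes recorded for one condition, so no ratio can be formed.
            labels[spot_set] = "unmatched"
            continue
        ratio = mean(infected) / mean(control)
        if ratio > 1 + threshold:
            labels[spot_set] = "up-regulated"
        elif ratio < 1 / (1 + threshold):
            labels[spot_set] = "down-regulated"
        else:
            labels[spot_set] = "stable"
    return labels


labels = classify_spot_sets(matched_spots)
```

Averaging across replicates is one simple design choice; where replicates disagree, the researcher would still need to inspect the individual spot volumes, as the second interface allows.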
6.2.2 Comparison of protein and gene expression data
The experimental protocol for infecting an HFF cell line with T. gondii for the proteome
study was reproduced from Blader’s study, as detailed above, therefore it should be possible
to make comparisons between the expression of a gene measured by a microarray, with the
protein abundance value obtained in this study. The microarrays of Blader were created
according to a standard protocol, and are supplied with an identifier of the cDNA clone
(example IMAGE:123456) and of the GenBank cDNA record. The cDNA record does not
share an identifier with either the protein record returned by MASCOT [207] (the software
used to identify proteins following MS), or with the corresponding nucleotide record found by
following a link from the protein record. Therefore, performing matching between microarray
clones and protein sequences is not a trivial task.
An initial attempt to find corresponding gene and protein records used pattern matching
over the names of the microarray features and the protein names, expressed as an SQL query,
in the following way. A query is deployed to match the first word of both the microarray
clone name, and the protein name. A list of exceptions is generated where the first word
occurs frequently and is not informative, such as “hypothetical” or “protein”. In these cases,
other words in the protein name are analysed to find matches. A list of potential matches is
supplied to the user, and sensible matches are returned in only approximately 50% of cases
because the following problems arise:
• If synonyms exist for gene names, one name may be used for the cDNA clone and
another for the protein.
• Certain words occur frequently in gene names which cause incorrect matches to be
found, such as “alpha” or “beta”.
[Figure 6.1 (flowchart): cDNA clone IDs from the microarrays and protein IDs from the 2-D gel and mass spectrometry data are resolved to GenBank nucleotide IDs (retrieved using BioJava), then to DoTS IDs and DoTS gene records (DoTS at AllGenes.org); a join query in RAPAD outputs microarray gene name, gene expression value, protein name and protein volume.]
Figure 6.1: The process of matching microarray data to protein abundance data.
• Gene families exist with a number of closely related entries, such as Tropomyosin 2,3
and 4 which have closely related sequences, therefore microarray clones or protein
records may have been annotated incorrectly, or a search with the MS data may return
the incorrect entry.
• More generally, annotation in the databases is prone to inaccuracy and is being con-
stantly refined.
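The name-based heuristic and one of its failure modes can be illustrated with a short sketch. The original was expressed as an SQL query; the Python below is an approximation written for illustration only, the function names are hypothetical, and the exception list contains just the two words named in the text.

```python
# Words that are too frequent to be informative as the first word of a name.
UNINFORMATIVE = {"hypothetical", "protein"}


def key_word(name):
    """Return the first informative word of a name, lower-cased, or None."""
    for word in name.lower().replace(",", " ").split():
        if word not in UNINFORMATIVE:
            return word
    return None


def candidate_matches(clone_names, protein_names):
    """Pair every clone and protein whose first informative words agree."""
    by_key = {}
    for clone in clone_names:
        by_key.setdefault(key_word(clone), []).append(clone)

    matches = []
    for protein in protein_names:
        key = key_word(protein)
        if key is None:
            continue
        for clone in by_key.get(key, []):
            matches.append((clone, protein))
    return matches


clones = ["tropomyosin 2 (beta)", "cathepsin B", "hypothetical protein XYZ1"]
proteins = ["Tropomyosin 4", "Cathepsin B preproprotein", "actin, beta"]
matches = candidate_matches(clones, proteins)
```

The example returns the sensible pairing for Cathepsin B, but also pairs Tropomyosin 2 with Tropomyosin 4, which is the gene family problem described in the list above.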
Further improvements to the algorithm for matching names would improve specificity but
it would be very difficult to engineer a robust method that would succeed in all situations.
Therefore, a different approach has been implemented in RAPAD using AllGenes [10]. All-
Genes is a web site that provides access to the Database of Transcribed Sequences (DoTS)
that collects all the different identifiers that a particular sequence (cDNA, mRNA, DNA)
could be assigned, which correspond to the same underlying gene. For example, the gene:
“heterogeneous nuclear ribonucleoprotein F” has a GenBank record for the protein sequence
(gi|4826760), nucleotide record (NM_004966), a microarray specific ID (IMAGE:345833), and
the corresponding cDNA GenBank ID (W72693). The DNA, cDNA and microarray identi-
fiers are each assigned a DoTS number, and collections of DoTS entries that correspond to
the same underlying object (gene) are assigned a single DoTS gene number (DG.36388269).
DoTS entries have been created by performing sequence similarity searches, and assembling
clusters of sequences that correspond to the same object. A significant number of DoTS
entries have been manually curated.
The following series of actions is used to match protein records to microarray clones
(summarised in Figure 6.1).
1. RAPAD stores a URL referencing a web page on an external server for visualising
MASCOT results. A script retrieves GenBank protein IDs from the web page.
2. Protein records are retrieved from GenBank using the API (Application Programming
Interface) provided by BioJava [34]. Many GenBank protein records have a link to
the corresponding nucleotide record under the data type: DB Source, except for cases
where the protein sequence originated from a 3-D structure, or a database other than
GenBank, such as Swiss-Prot. In these cases, the nucleotide record must be found by
following a series of links manually (approximately 10% of proteins), or performing a
sequence similarity search on the GenBank nucleotide database.
3. The DoTS web site allows programmatic access for single entries, and has batch capa-
bilities, but does not currently scale up for accepting very large numbers of identifiers.
Therefore, the DoTS database has been downloaded in flat files, and the UNIX grep
utility was used to search the files for the DoTS identifiers for GenBank nucleotide
records (found automatically from MASCOT or found manually) and cDNA records
(from the microarrays).
4. DoTS identifiers for microarray clones or proteins are stored in a newly created table
in RAPAD. A mapping from all DoTS identifiers to the corresponding DoTS genes is
stored in a table in RAPAD that can be queried when required.
5. An SQL query finds DoTS numbers for every protein, and retrieves the corresponding
DoTS gene number. A search is performed to find any microarray features that have
a DoTS number that has been mapped to the same DoTS gene ID.
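The final join in step 5 can be sketched as follows, using an in-memory SQLite database rather than the RAPAD schema. The table and column names are hypothetical, and the DoTS numbers (DT.1, DT.2, DT.3) and the second gene number are invented; only the protein name, the IMAGE clone ID and the DoTS gene number DG.36388269 come from the example in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein_dots (protein_name TEXT, dots_id TEXT);
CREATE TABLE clone_dots   (clone_id TEXT, dots_id TEXT);
CREATE TABLE dots_gene    (dots_id TEXT PRIMARY KEY, dots_gene_id TEXT);
""")

# DoTS numbers found for proteins (via GenBank nucleotide IDs) and for clones.
conn.executemany("INSERT INTO protein_dots VALUES (?, ?)", [
    ("heterogeneous nuclear ribonucleoprotein F", "DT.1"),
    ("protein without a matching clone", "DT.3"),
])
conn.executemany("INSERT INTO clone_dots VALUES (?, ?)", [
    ("IMAGE:345833", "DT.2"),
])
# Mapping from every DoTS number to its DoTS gene number.
conn.executemany("INSERT INTO dots_gene VALUES (?, ?)", [
    ("DT.1", "DG.36388269"),
    ("DT.2", "DG.36388269"),
    ("DT.3", "DG.2"),
])

# Proteins and clones match when their DoTS numbers map to the same DoTS gene.
query = """
SELECT p.protein_name, c.clone_id, pg.dots_gene_id
FROM protein_dots AS p
JOIN dots_gene    AS pg ON pg.dots_id = p.dots_id
JOIN dots_gene    AS cg ON cg.dots_gene_id = pg.dots_gene_id
JOIN clone_dots   AS c  ON c.dots_id = cg.dots_id
ORDER BY p.protein_name, c.clone_id
"""
rows = conn.execute(query).fetchall()
conn.close()
```

Joining through the gene number rather than directly on DoTS numbers is what allows a protein-derived nucleotide record and a cDNA clone, which carry different DoTS numbers, to be recognised as the same gene.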
The results of matching protein data to microarray results are displayed in the RAPAD
interface in a table, showing properties of the protein with links to the full protein record.
The microarray results from the different time points are displayed alongside. If the protein
has been matched across the two gels (infected and non-infected in this case), and volume
measures have been found for the two gel spots, the ratio in protein volume is displayed
alongside the microarray results. When large datasets are assembled it should be possible
to determine the correlation between gene expression and protein abundance for a series of
time points. This will enable the lag between the up-regulation of a gene and the production
of new protein to be calculated on a large scale.
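One way such a correlation could be computed once enough matched pairs exist is sketched below. It assumes that protein abundance is expressed as the log ratio of infected to non-infected spot volume and that the microarray values are expression ratios at each time point; the record layout, the function names and all numbers are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def correlation_by_time_point(records):
    """Correlate log2 protein volume ratios with log2 expression ratios."""
    protein = [math.log2(r["infected_volume"] / r["control_volume"])
               for r in records]
    time_points = sorted(records[0]["expression"])
    return {
        t: pearson(protein, [math.log2(r["expression"][t]) for r in records])
        for t in time_points
    }


# Invented records: spot volumes at 24 hours, expression ratios at 1 and 24 hours.
records = [
    {"infected_volume": 2000.0, "control_volume": 1000.0,
     "expression": {1: 1.0, 24: 2.0}},
    {"infected_volume": 500.0, "control_volume": 1000.0,
     "expression": {1: 1.0, 24: 0.5}},
    {"infected_volume": 4000.0, "control_volume": 1000.0,
     "expression": {1: 2.0, 24: 4.0}},
    {"infected_volume": 1000.0, "control_volume": 1000.0,
     "expression": {1: 0.5, 24: 1.0}},
]
correlations = correlation_by_time_point(records)
```

With real data, the time point at which the correlation peaks would give a rough indication of the lag between transcription and the appearance of new protein, although the proteome data in this study are available only at 24 hours post infection.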
6.2.3 Functional classification of proteins
Proteomics experiments generate large quantities of complex data, therefore analysis is re-
quired that can provide summaries, to generate a better understanding of the whole system.
The biological investigation reported in this chapter is analysing the changes that occur in
the human proteome, and there are a great number of resources available for characterising
human proteins. One example is the Gene Ontology (GO) project [126] (described in Chap-
ter 2), which has assembled a large amount of information about the function of proteins. In
RAPAD, GO ID numbers are stored for all proteins identified in this study and hyperlinks
have been created to the AmiGO browser [12]. AmiGO graphically highlights the position of
the term, and has controls for traversing up and down the GO tree, enabling the researcher
Figure 6.2: Output from GoMiner, displaying the GO tree browser open for the gene Tropomyosin 1.
to view the hierarchical classification of a gene (or protein). However, this system is not ideal
for a large collection of proteins because the knowledge about function must be manually
assembled by browsing, and is hard to summarise because it is unclear from which depth of
the tree functional information should be stored. For certain proteins, the lowest
depth may provide useful annotation, but in other cases a more general classification (higher
up the hierarchy) may be more informative. Therefore, additional tools have been used that
summarise GO classifications: GoMiner [368] and FatiGO [7].
GoMiner accepts a list of gene symbols¹ from one or two experiments, and displays
summaries of where genes have been found in the hierarchy. GoMiner also displays which
branches of GO are linked to genes that are up or down-regulated with statistics (described in
more detail in Chapter 2). For example, if three genes involved with cytoskeletal development
are up-regulated and one is down-regulated, this result would be displayed graphically, with
a statistic indicating that, for this set of conditions, cytoskeletal proteins tend to be up-
regulated (Figure 6.2).
FatiGO provides access to GO over the Internet, and has similar goals to GoMiner.
¹ A gene symbol is an official annotation for every human gene from the Human Genome Organisation (HUGO) [156]. Example: the gene actin beta has the gene symbol ACTB.
Figure 6.3: Output from FatiGO showing the classification of up and down-regulated proteins in the Biological Process branch of GO at a depth of 3, the third lowest (Query = infected cells, Reference = non-infected cells).
FatiGO provides summaries of where up and down-regulated genes appear in GO. FatiGO
accepts lists of gene symbols that have been highlighted from two experiments, and allows
the user to select the depth of the hierarchy and which branch of the three classifications
in GO to display. A visual summary of results is displayed with p-values to indicate the
significance of the association between one of the two conditions in the experiment and a
particular branch of GO (Figure 6.3).
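To make the meaning of such a p-value concrete, the sketch below applies a one-sided Fisher's exact test to a 2x2 table of gene counts. This is a common choice for this kind of comparison, but it is an assumption here that it corresponds to the procedure used by either tool. The counts extend the cytoskeletal example given earlier (three genes in one list and one in the other) with invented list sizes of ten genes each.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P-value for over-representation of a GO term in the query list.

    a: query genes annotated with the term
    b: query genes not annotated with the term
    c: reference genes annotated with the term
    d: reference genes not annotated with the term
    """
    n_query = a + b
    n_term = a + c
    total = a + b + c + d
    denominator = comb(total, n_query)
    # Sum the hypergeometric probabilities of observing a or more annotated
    # genes in the query list, with the table margins held fixed.
    upper = min(n_query, n_term)
    numerator = sum(
        comb(n_term, k) * comb(total - n_term, n_query - k)
        for k in range(a, upper + 1)
    )
    return numerator / denominator


# Query list: 3 of 10 genes carry the term; reference list: 1 of 10 genes.
p_value = fisher_one_sided(3, 7, 1, 9)
```

For these invented counts the p-value is approximately 0.29, so a three-to-one split in lists of this size would not by itself be strong evidence of an association; larger lists are needed before such summaries become informative.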
FatiGO and GoMiner can also be used to classify proteins instead of genes, but both tools
require gene symbols as input rather than GO identifiers or GenBank accession numbers.
A set of scripts was developed by the author to retrieve the gene symbols from GenBank
nucleotide records for all the proteins highlighted in this investigation. The gene symbols are
stored in RAPAD, and are also used to create web links to the Ensembl genome browser [58]
for visualising the chromosomal location of the gene, as well as linking to GenAtlas [121] and
GeneCards [268]. GenAtlas and GeneCards summarise information about the function of
genes, display the intron/exon structure, provide physical maps showing other genes in the
localised region, give expression values in different human tissues, and display the domains
of the protein.
6.3 Results
The introduction outlined several key changes that are thought to occur in host cells when
invaded with Toxoplasma gondii. The proteome project had several major hypotheses to test,
which required significant database support. In this section, an outline of the results of the
analysis is given in four areas: the display of differentially expressed proteins, software that
aids the functional annotation of proteins, the integration with microarray results and the
search for post-translational modifications. The purpose of this chapter is to focus on how
RAPAD has facilitated these processes for the experiments with Toxoplasma, using several
examples of proteins highlighted by the study which may have a role in the infectivity of
the parasite. The proteome investigation is still continuing, and a complete report of the
biological results is beyond the scope of this work.
6.3.1 Visualisation of differential expression
The development of software for the visualisation of spots on different gels corresponding to
the same protein was described in Section 6.2.1. In this section an example is given of the
usage of the software, in the context of the T. gondii infection data.
[Figure 6.4 image labels: Spot IDs 42 and 41; Spots 29 and 27; Spots 25 and 24]
Figure 6.4: The interface for viewing spots across replicate gels. A table displays proteins ordered by name, allowing the researcher to select entries that have been identified as the same protein across different replicates, in this case ACTB. The Gel Viewer opens, highlighting the proteins in different windows to allow the researcher to assess which spots correspond to each other on different gels. A polygon has been overlaid to demonstrate that spots 42 and 41 from non-infected replicate 1 appear to correspond with spots 29 and 27 from non-infected replicate 3. Gel images courtesy of M. Nelson.
The process is demonstrated in Figure 6.4 for six spots containing the protein ACTB,
which appears in 26 spots in total. In this example, there is a cluster of four spots matched
to ACTB on one gel, and two spots on a replicate gel. The corresponding region has been
highlighted for the two gels, and a polygon has been drawn² to demonstrate that spots 41
and 42 from non-infected replicate 1 correspond with spots 27 and 29 from non-infected
replicate 3. In this example, spot 42 (replicate 1) and spot 29 (rep. 3) form one spot set³
and spot 41 (rep. 1) and spot 27 (rep. 3) form a different spot set. The region can then be
compared on gels from infected samples to see if this particular form of the protein, in this
exact position, is up or down-regulated.
The Gel Viewer, in combination with the RAPAD query system, allows differential ex-
pression of proteins to be visualised. An additional view of the data has been created which
will allow the results to be made public when the study is published in a journal. A total of
130 differentially expressed proteins have been identified by the researcher, which are stored
in RAPAD. Figure 6.5 displays the interface for viewing data that is combined across repli-
cates. The data can be viewed in a table that provides links to the individual protein records,
and enables any number of proteins to be selected and opened within the Gel Viewer. There
are facilities for investigating the function of the proteins, addressed in the following section.
6.3.2 Functional annotation of proteins
The software described in Section 6.2.1 facilitates the determination of a set of proteins that
show changed expression between infected and non-infected host cells. Each protein record
has links to a number of external databases: GenBank displays the nucleotide and protein
sequence; Harvester [33], GenAtlas, and GeneCards summarise a large amount of information
that has previously been assembled for each entry; and Ensembl enables a researcher to
visualise the chromosomal location of a gene. A link to the Gene Ontology record for the
protein is also provided, allowing the researcher to build a complex picture of the function
of each protein. RAPAD includes an option for annotating a protein spot with a textual
description, thereby allowing new findings, that have been derived from external sources, to
be recorded in the database.
Proteins with modulated expression in this study could potentially fall into three cate-
gories:
² The polygon was created manually by the author to clarify which spots correspond to each other across the two gels.
³ A spot set is defined as a group of spots in the same position on different gels, corresponding to a specific isoform of the protein.
Figure 6.5: The interface for displaying data combined across replicates. The top image displays the option for assigning groups of gels to two different conditions (infected versus non-infected). The lower image shows the table of proteins that have been identified in each group of gels.
• Host proteins actively up or down-regulated by the parasite, required for invasion or
maintaining infection.
• Proteins expressed by host cells in an attempt to resist parasite infectivity.
• Proteins with altered expression, caused indirectly, as a result of other proteins being
up or down-regulated.
It is therefore important to consider, when analysing changes to the host proteome, whether
or not the change is facilitating parasite infectivity, as this has major consequences for the
interpretation placed on the result.
Example: Differential expression of Cathepsin B
One of the proteins found to be differentially expressed by the researchers was the protein
Cathepsin B, which cleaves proteins, transforming them from their initially translated form
(the prepro protein) into the functional form. Previous studies have suggested that Cathepsin
B from T. gondii is required for infectivity and rhoptry protein processing [260]. The pro-
teome studies described here, along with the previous microarray experiments, suggest that
human Cathepsin B is down-regulated during infection. A study by Que et al. in 2002 [260]
demonstrated that inhibition of Toxoplasma Cathepsin B prevented the parasite from infect-
ing cells, and was therefore a potential drug target. The study by Que also demonstrated
a significant sequence and structural similarity between human and Toxoplasma Cathepsin
proteins. Therefore, the finding that human Cathepsin B is down-regulated during infection
raises the possibility that human Cathepsin interferes with correct processing of Toxoplasma
proteins. If this proved to be correct, induction of expression of human Cathepsin could
prove to be an inhibitor of Toxoplasma infectivity. However, the situation is more complex
because human Cathepsin has also been implicated in the apoptosis pathway [139], and
one of the critical factors enabling a parasite to maintain infection is inhibition of apop-
tosis. Therefore, Toxoplasma may cause the down-regulation of Cathepsin to prevent the
cell entering apoptosis. This demonstrates that there is a significant information retrieval
task required to understand the results after particular proteins have been highlighted. The
interface provided by RAPAD allows the researcher to assimilate the results from past experi-
ments rapidly, via other Internet accessible resources (Figure 6.6), and record the information
within the database.
Chapter 6. Database support for proteomic studies of host-parasite interactions 196
Figure 6.6: The protein record for Cathepsin B in RAPAD has external links to AmiGO [12], GenBank [30] and GeneCards [268].
Summary of biological results
Since the results of the investigation with T. gondii will be published by Dr Wastling and
Morag Nelson at a later date, a complete description of the results of the biological inves-
tigation is outside the scope of this work. When the results are ready for publication, the
RAPAD interface will provide public access to the data, as described in Section 6.3.5.
Cathepsin B, described above, is one of many proteins found to have modulated expres-
sion during parasite infectivity, which demonstrates the effectiveness of the experimental
approach adopted by Dr Wastling and the software developed in this investigation. Initial
results from the proteomics investigation have discovered down-regulation of proteins in-
volved in the formation of the cytoskeleton, as expected due to the ability of the parasite
to halt new cell growth and cell division. Other proteins implicated in apoptosis, such as
cytochrome c, are also down-regulated, and there is an up-regulation of proteins involved in
the host’s response to stress. The following section describes work by the author to match
the protein abundance values from this study to gene expression values from the previously
published microarray experiments. Several examples are given of proteins that have been
shown to be differentially expressed, which have been highlighted for further investigation.
6.3.3 Comparison with microarray data
We have developed software to match proteins identified by MS to the corresponding clones
from the microarray study by Blader and colleagues, in order to discover the correlation be-
tween gene expression and protein abundance. The Blader experiment contains two relevant
datasets. The first is a time course experiment to highlight genes with altered expression
at 1, 2, 4, 6 and 24 hours post infection with T. gondii. The second data set from Blader’s
microarray experiment contains an analysis, from two independent infections, of the genes
that were most strongly up or down-regulated at 24 hours post-infection. The proteomics
experiment carried out at Glasgow determines the abundance of proteins at 24 hours post-
infection. It is likely that there is a lag between an up-regulation in gene expression, and
the production of new protein, although the length of time is not known exactly.
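The matching step described above can be sketched as a join on a shared gene symbol. This is a minimal illustration only: the record types, the `match_proteins_to_clones` function, the spot IDs and the clone name are hypothetical and are not taken from RAPAD. The Annexin-1 ratios are the time course values listed in Table 6.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinSpot:
    spot_id: int          # hypothetical spot identifier
    protein_name: str
    gene_symbol: str


@dataclass(frozen=True)
class CloneResult:
    clone_name: str       # hypothetical clone identifier
    gene_symbol: str
    ratios: tuple         # infected / non-infected expression ratios


def match_proteins_to_clones(spots, clones):
    """Pair each spot with every clone that shares its gene symbol (case-insensitive)."""
    by_symbol = {}
    for clone in clones:
        by_symbol.setdefault(clone.gene_symbol.upper(), []).append(clone)
    matches = []
    for spot in spots:
        for clone in by_symbol.get(spot.gene_symbol.upper(), []):
            matches.append((spot, clone))
    return matches


spots = [
    ProteinSpot(101, "Annexin-1", "ANXA1"),
    ProteinSpot(102, "Protein without a matching clone", "XYZ1"),
]
clones = [CloneResult("clone-A", "anxa1", (2.02, 0.65, 1.18, 0.95, 0.79))]
matches = match_proteins_to_clones(spots, clones)
```

Indexing the clones by symbol first keeps the join linear in the size of both data sets. A real implementation would also need to resolve synonyms and outdated identifiers, which this sketch ignores.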
The technique to match data points present in both data sets performed correct matching
between gene and protein identifiers. However, due to the limited coverage of both exper-
iments, the datasets are not currently large enough to infer global information about the
rate of translational control for Toxoplasma proteins. The results of the matching, displayed
in Table 6.1, provide qualitative information about the correspondence between the rate of
Figure 6.7: The table in RAPAD displaying protein abundance and gene expression values. The column headings are as follows: 1 = Spot ID, 2 = Protein name, 3 = cDNA clone name, columns 4 to 8 are relative gene expression values from a time course experiment, and columns 9 and 10 are relative expression values from a separate microarray hybridization (24 hour time point, see Section 6.3.3). Column 10 = spot ID of matching spot on a second gel and column 11 is the ratio of protein volume between the two gels.
Protein Name                                            1h     2h     4h     6h     24h    24h(i)  24h(ii)
Up-regulated
Annexin-1                                               2.02   0.65   1.18   0.95   0.79   —       —
Heterogeneous ribonucleoprotein F                       —      —      —      —      —      2.77    2.30
HS70kDa protein 8 isoform 1                             —      —      —      —      —      2.58    2.22
Nucleoside diphosphate kinase 1                         1.82   0.75   1.37   1.12   2.55   —       —
Phospholipase C alpha or Protein disulphide isomerase
Table 6.1: The correspondence between gene and protein abundance for HFF cells infected with T. gondii. Column 1 contains the names of proteins identified in the proteome study, which are up or down-regulated during parasite infection. The numerical values are the corresponding gene expression values from the study by Blader [35] from a time course experiment (columns 2-6) and two independent infections at the 24h time point (columns 7 and 8). The values are the ratio of the expression of the gene that corresponds to the protein in column 1, from infected versus non-infected samples. A value greater than 1 indicates the gene is up-regulated during infection, less than 1 indicates that the gene is down-regulated. The — symbol indicates that the value was not present in the Blader study.
Figure 6.8: The top image displays a part of the gel from the infected sample at a higher magnification, and the bottom image is the non-infected sample. Spots matched to vimentin are highlighted. The cluster of spots marked 2 is present on both gels. The cluster of spots marked 1 is only present in non-infected samples. Gel images courtesy of M. Nelson.
transcription and translation. The first column in Table 6.1 displays the proteins that have
been found to be up or down-regulated during infection in the proteomics investigation, and
have been matched to a gene in the Blader study. Proteins are identified as up-regulated
in infected samples if they appear in a larger volume on gels from infected samples, or the
spot is present in the infected sample and absent in the non-infected sample. A protein is
defined as down-regulated if it appears in a larger volume on gels from non-infected samples,
or is present only on those gels. Columns 2-6 display the expression values at the five time points post-
infection from the Blader study for the gene that corresponds to the protein in column 1.
Columns 7 and 8 display the expression values for genes that have been matched to proteins
in this investigation, from two further independent infections at 24 hours post-infection in
the Blader study. Table 6.1 summarises fairly complex data; for example, vimentin and
actin both appear in multiple copies on gels from infected and non-infected samples. Both
vimentin and actin are defined as down-regulated because there are spots clearly present
across replicates on non-infected samples, which are not present on infected gels. Figure 6.8
displays the spots matched to vimentin from infected and non-infected samples. The spot
cluster 2 is present on both gels in roughly similar volumes. Cluster 1 is only present in non-
infected samples. This indicates that several forms of vimentin with particular modifications
are down-regulated during infection.
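The classification rules given above can be expressed as a short sketch. The function names and the `min_ratio` parameter are assumptions introduced for illustration; the text does not state a numerical threshold, so the default treats any difference in volume as a change. Volumes are assumed to be positive, and `None` stands for a spot that is absent from a condition.

```python
from typing import Optional


def classify_spot(infected_volume: Optional[float],
                  control_volume: Optional[float],
                  min_ratio: float = 1.0) -> str:
    """Classify one matched spot; None means the spot is absent in that condition."""
    if infected_volume is None and control_volume is None:
        return "not detected"
    if control_volume is None:
        return "up-regulated"        # present only on infected gels
    if infected_volume is None:
        return "down-regulated"      # present only on non-infected gels
    ratio = infected_volume / control_volume
    if ratio > min_ratio:
        return "up-regulated"
    if ratio < 1.0 / min_ratio:
        return "down-regulated"
    return "unchanged"


def classify_protein(spot_volumes):
    """Collect the directions of change seen across all spots of one protein."""
    labels = {classify_spot(infected, control) for infected, control in spot_volumes}
    return labels & {"up-regulated", "down-regulated"}


# A protein with one unchanged cluster and one cluster seen only without infection
vimentin_like = classify_protein([(2.0, 2.0), (None, 1.5)])
# A protein whose forms move in opposite directions
mixed = classify_protein([(4.0, 2.0), (1.0, 3.0)])
```

Returning a set at the protein level mirrors the situation described for proteins with several forms: a result containing both labels cannot be summarised as simply up or down-regulated.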
The spots matched to actin beta are displayed in Figure 6.9. The pattern of spots indi-
cates that particular forms of actin beta are less abundant during infection, or it may reflect
the fact that the total volume of all spots is reduced, and certain spots cannot be viewed at
very low volumes. Both vimentin and actin are implicated in cytoskeletal development, and
may be down-regulated because Toxoplasma arrests the host’s cell cycle. Tubulin beta and
heterogeneous ribonucleoprotein F appear in both halves of the table because some forms of
the proteins appear in greater volumes in infected samples, and other forms in non-infected
samples. Therefore, there may be a different type of modification that causes spots to shift
positions on the 2-D gel, and it is not possible to state simply whether the proteins are up
or down-regulated.
Up-regulated proteins
Three genes (HS70kDa protein, protein disulphide isomerase and thioredoxin peroxidase)
are strongly up-regulated in the Blader study at 24 hours, and the corresponding proteins
are also up-regulated in this investigation. HS70kDa is a heat shock protein that is released
Figure 6.9: Spots matched to actin beta from infected (top) and non-infected (bottom) samples. Gel images courtesy of M. Nelson.
when the cell is placed under stress, and it may therefore represent a host cell response to infection.
Thioredoxin peroxidase is implicated in oxidative stress and regulation of transcription
factors, and its up-regulation may also be a sign of a host cell response.
The comparison data reveals that both phospholipase C alpha (PCA) and protein disul-
phide isomerase (PDI) are predicted to match the same microarray clone, annotated as a
“glucose regulated protein” (accession R33030). The 2-D gel data also reveals that spots
containing phospholipase C alpha are also predicted to contain PDI, based on MS results.
Further analysis reveals that GenBank contains exactly the same protein sequence for both
PCA (BAA03759) and PDI (JC5704). The Harvester database contains a different, unrelated
protein sequence for PCA (Harvester ID Q15111), but PDI has the same protein sequence in
Harvester and GenBank. This indicates that the PCA record in GenBank contains an incor-
rect protein sequence. It appears that both the proteomics and microarray data agree that
PDI is up-regulated in response to parasite infection. PDI functions to rearrange disulphide
bonds in proteins, and the up-regulation may be due to a general increase in proteins that
must be produced during infection. PCA may not be implicated in this study at all, and
if it is incorrectly annotated in GenBank, the record should be updated. The public access
part of RAPAD, described in Chapter 5, will allow other databases to connect to RAPAD
when the proteome data has been published.
The proteome studies reveal that Annexin-1 is up-regulated during infection at the 24
hour time point. It is interesting to note that the gene expression studies suggest that
Annexin is up-regulated early, and then down-regulated later. This would suggest that
there is a large lag between changes in gene expression and the production of new protein;
however, much larger data sets would be required to confirm and quantify this hypothesis.
The record in the SOURCE database [78] for Annexin suggests that it is involved with
exocytosis, membrane fusion and an anti-inflammatory response. The Swiss-Prot database
specifies that Annexin can be phosphorylated, leading to inactivation. The 2-DE data reveals
two adjacent spots that may be the result of differentially phosphorylated forms, which have
been further investigated (Section 6.3.4).
Down-regulated proteins
There are eight proteins that have been classified as down-regulated in the protein investiga-
tion and which have been matched to microarray data points. The apparent down-regulation
of the proteins actin beta and vimentin has been discussed above. The gene expression data
suggest that vimentin is down-regulated as expected, but the results for actin beta are less
clear, although on average the gene for actin beta seems to be down-regulated. The function
of Cathepsin B was discussed in Section 6.3.2, and it appears that the microarray data sug-
gest the gene is slightly down-regulated early in infection, and very strongly down-regulated
late in infection. AHNAK (Desmoyokin) appears to be down-regulated in both the proteome
and microarray investigation. It is believed to have various roles, including signal transduc-
tion and regulation. The GenAtlas entry suggests that AHNAK plays “a regulatory role of
the actin-bound cytoskeleton to the l-type Ca2+ channel”, which would suggest that it may
be down-regulated as part of the inhibition of cell cycle and cytoskeletal development, caused
by the parasite.
There are several forms of the protein heterogeneous ribonucleoprotein F that appear in
higher volumes in infected samples, but other forms appear in lower quantities. The mi-
croarray experiments suggest that the gene is strongly up-regulated. The protein is involved
in RNA processing. It would be expected that more genes are expressed when the cell is
under stress, such as during infection. A general increase in gene expression should correlate
with higher RNA processing, and we might expect that heterogeneous ribonucleoprotein F
would be up-regulated. The finding that there are different variants of this protein may
suggest that an activated form of the protein is present in much higher volumes in infected
samples, and spots that are larger in non-infected cells correspond with a de-activated form
of the protein. Additional investigations into the PTMs of the protein would be required to
confirm this hypothesis.
The protein for dimethyl arginine dimethyl aminohydrolase appears in lower abundance
during infection in the proteomic study but the microarray data suggest that the gene has
fairly stable expression until the 24 hour time point, at which it is strongly up-regulated.
This protein has a catalytic role associated with the generation of nitric oxide.
While nitric oxide is used by macrophage cells to kill engulfed pathogens, nitric oxide is
unlikely to be used in this way in an HFF cell line. It is therefore difficult to hypothesise as
to why the protein appears in lower abundance. The protein superoxide dismutase exhibits
unusual results in this study, and is discussed below.
Superoxide dismutase
In general, there appears to be a reasonable correspondence between gene expression and
protein abundance, because most proteins that are found to be up-regulated in the proteome
Figure 6.10: The top images display the spot identified as superoxide dismutase chain A from the non-infected sample, replicate 11 (left) versus infected (right). The lower image displays superoxide dismutase. A polygon has been drawn on top of the image to display the likely position of the protein in the second gel. Gel images courtesy of M. Nelson.
study have a corresponding gene expression value that is greater than one. In addition, most
of the proteins that are down-regulated have a corresponding gene expression value of less
than one. The one clear exception is superoxide dismutase, which is down-regulated in the
proteome, but strongly up-regulated in the microarray study. The Gene Ontology classifies
the protein as released in response to oxidative stress, which we would predict to be greater
during parasite invasion, therefore the result from 2-DE is surprising.
There are two spots on 2-D gels from non-infected samples, one predicted to match “su-
peroxide dismutase chain A” and another matching “superoxide dismutase”. The automated
comparison predicts that only the latter protein matches the microarray result in the Blader
study. A local alignment of the two protein sequences reveals that they have very low ho-
mology (35% similarity, alignment not shown), indicating that they are not highly related
proteins, even though they have similar names (GenBank accessions gi|515251 and gi|34711).
The diagram in Figure 6.10 displays the positions of the spots on the gels from infected and
non-infected samples. The top image displays superoxide dismutase chain A and the lower
image shows the position of superoxide dismutase, from infected (right) and non-infected
(left) samples. The microarray results demonstrate very strong up-regulation of superoxide
dismutase during infection. It is likely that in the proteome study “superoxide dismutase
chain A” is a different protein, and is not strongly down-regulated. Therefore, considering
only the lower image on Figure 6.10 (superoxide dismutase), there is no clear spot, or only
a spot with a far lower volume, in the infected sample. This result is surprising given the
suggested role of the protein, therefore further analysis is required to verify that superoxide
dismutase is down-regulated during infection in the proteome but up-regulated in the tran-
scriptome. If this proved to be correct, this would demonstrate strong post-transcriptional
control regulating protein abundance, because a large increase in gene expression does not
appear to produce a corresponding change in protein abundance.
In summary, the results of the comparison between microarray and proteomics highlight
the potential for discovery of the relationship between gene expression and protein abun-
dance when larger data sets are assembled. The study reveals several proteins that correlate
well with gene expression values. The examples presented in this section demonstrate that
information about the proteins’ functions can be assimilated easily within RAPAD, due to
the number of links to external databases which are provided.
Figure 6.11: Four spots containing protein disulphide isomerase. The pattern of spots is indicative of different phosphorylated forms of the protein. Gel image courtesy of M. Nelson.
6.3.4 Post-translational modifications
The database query facility and the Gel Viewer enable researchers to find proteins that lo-
calise to the same region on the gel, and share the same name. This can highlight potential
post-translational modifications for further enquiry. An example is shown in Figure 6.11
of four spots matched to protein disulphide isomerase, a protein that catalyses the rearrangement
of disulphide bonds in proteins. The pattern of several spots in a horizontal line
is characteristic of different phosphorylation states, although other types of variable modifi-
cations can produce clusters of spots. It is also possible that differential splicing occurs to
produce various different protein sequences from a single gene.
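A query for same-name spots lying in a horizontal line could be sketched as follows. This is an assumption-laden illustration and not the RAPAD query itself: the tuple layout, the `find_spot_trains` function, the `y_tolerance` value and the pixel coordinates are all hypothetical.

```python
from collections import defaultdict


def find_spot_trains(spots, y_tolerance=5.0, min_spots=2):
    """Return {protein_name: [spot_ids ordered left to right]} for same-name
    spots whose vertical positions all fall within y_tolerance of each other."""
    by_name = defaultdict(list)
    for spot_id, name, x, y in spots:
        by_name[name].append((x, y, spot_id))
    trains = {}
    for name, members in by_name.items():
        if len(members) < min_spots:
            continue
        ys = [y for _, y, _ in members]
        if max(ys) - min(ys) <= y_tolerance:
            # sorting the (x, y, spot_id) tuples orders the spots by x position
            trains[name] = [spot_id for _, _, spot_id in sorted(members)]
    return trains


# (spot_id, protein_name, x, y) with made-up coordinates
spots = [
    (1, "PDI", 100.0, 200.0),
    (2, "PDI", 120.0, 201.5),
    (3, "PDI", 140.0, 199.0),
    (4, "PDI", 160.0, 200.5),
    (5, "Vimentin", 300.0, 50.0),
    (6, "Vimentin", 310.0, 400.0),
]
trains = find_spot_trains(spots)
```

The sketch requires every spot of a protein to sit in the same horizontal band, so a protein with one train plus an unrelated outlying spot would be missed. A clustering step on the vertical coordinate would be needed to handle that case.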
Mass spectrometry data was used primarily to identify proteins. However, the MS data was
also searched a second time to find variable modifications on the proteins.
The MASCOT software has an option to search for different types of modifications, such
as phosphorylation and acetylation, which tests whether the mass of each detected peptide
matches a peptide sequence more closely if one of the residues carries a particular modification.
The search was implemented for clusters of spots that match the same protein (Vimentin,
PDI and Annexin). However, the searches revealed little information about modifications.
Figure 6.12: The result of a search for potential post-translational modification of protein disulphide isomerase, revealing a peptide that may be acetylated and phosphorylated. The oxidations are caused experimentally and are not biologically relevant. (Result table columns: Start - End, Observed, Mr(expt), Mr(calc), Delta, Miss, Sequence.)
There are several possible reasons. Firstly, MS usually detects only a proportion (10-40%)
of the total number of peptides in a protein. Therefore, in many cases the modification is
to a peptide that is not detected by MS. Secondly, it is believed that peptides with certain
modifications do not ionise well, and are therefore less likely to be detected than peptides
without additional modifications. Finally, it is possible that the cluster of spots is the result
of several different translations of the same gene, to produce a set of proteins that contain
peptides that still match the sequence entry in the database. The searches revealed a single
possible modification to the PDI spot at the furthest right position in Figure 6.11, which is
predicted to have been acetylated and phosphorylated (Figure 6.12). A phosphorylation to
a protein could be confirmed by a labelling experiment to quantify the number of phosphate
residues per protein, in each spot. There are facilities in RAPAD for the storage and querying
of PTMs after they have been confirmed, as described in the previous chapter.
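The principle behind a variable modification search can be illustrated with a small mass-shift check: a peptide is a candidate for modification if the difference between its observed mass and the mass calculated for the unmodified sequence equals the sum of one or more known modification masses. The sketch below is not how MASCOT is implemented, which scores candidate peptides rather than testing a single mass difference. The function name, the tolerance and the example masses are assumptions; the mass shifts are the standard monoisotopic values.

```python
from itertools import combinations

# Monoisotopic mass shifts in Da
MOD_DELTAS = {
    "acetylation": 42.01057,
    "oxidation": 15.99491,
    "phosphorylation": 79.96633,
}


def explain_mass(observed_mr, unmodified_mr, tolerance=0.05, max_mods=2):
    """Return every combination of distinct modifications whose summed mass
    shift accounts for observed_mr - unmodified_mr within the tolerance.
    An empty tuple in the result means the unmodified peptide already fits."""
    delta = observed_mr - unmodified_mr
    hits = []
    for n in range(max_mods + 1):
        for combo in combinations(sorted(MOD_DELTAS), n):
            shift = sum(MOD_DELTAS[name] for name in combo)
            if abs(delta - shift) <= tolerance:
                hits.append(combo)
    return hits


# Hypothetical peptide of mass 1500.0 Da carrying both modifications
observed = 1500.0 + MOD_DELTAS["acetylation"] + MOD_DELTAS["phosphorylation"]
hits = explain_mass(observed, 1500.0)
```

Each modification is considered at most once per peptide here, which keeps the search small but would miss doubly phosphorylated peptides.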
6.3.5 Public access to data
An interface has been created that will allow public access to the proteomic data to ac-
company a future journal publication. The opening page loads a general description of the
experiment, a summary of all the gels, a listing of the number of proteins identified on each
gel, and links to the protocols for the protein solubilisation, first and second dimension sep-
aration, staining and scanning (Figure 6.13). There is an option to select particular gels,
and view a table containing the proteins that have been identified. The second page allows
users to select particular proteins, and open the Gel Viewer highlighting the proteins, with
different gels appearing in separate tabbed windows. The security of data is ensured because
a check is made before loading each page that data has been specified as publicly accessible
for every gel and protein (every database table has the attribute OTHER READ which is set
from 0 to 1 for public data). At the time of writing, the researchers do not wish to release
the data until it has been published elsewhere; the URL for this part of the interface
will therefore accompany the publication of the data. The query interface that forms part of the
RAPAD Study-Annotator, described in the previous chapter, will be linked to the publicly
accessible data sets. This will allow other researchers to verify the findings, and opens the
possibility for new discoveries by allowing complex queries of the data.
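The per-page access check described above can be sketched with an in-memory SQLite database. The table layout, the column spelling `other_read` and the `public_rows` function are assumptions made for illustration; only the idea of a flag set to 1 for public data comes from the text.

```python
import sqlite3

ALLOWED_TABLES = {"gel", "protein"}   # hypothetical table names


def public_rows(conn, table, ids):
    """Return the subset of ids in `table` that are flagged as publicly readable."""
    if table not in ALLOWED_TABLES:
        # table names cannot be bound as SQL parameters, so they are whitelisted
        raise ValueError(f"unexpected table: {table}")
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    query = (f"SELECT id FROM {table} "
             f"WHERE other_read = 1 AND id IN ({placeholders}) ORDER BY id")
    return [row[0] for row in conn.execute(query, list(ids))]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gel (id INTEGER PRIMARY KEY, name TEXT, "
    "other_read INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO gel (id, name, other_read) VALUES (?, ?, ?)",
    [(1, "infected replicate", 1), (2, "non-infected replicate", 0)],
)
visible = public_rows(conn, "gel", [1, 2])
```

Defaulting the flag to 0 means that newly loaded data stays private until it is explicitly released, which matches the requirement that data is withheld until publication.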
6.4 Discussion
The experiments described in this chapter present challenges due to the size and complexity
of the data. One of the major challenges is the requirement for summarising data across
replicates, and determining if proteins are differentially expressed during parasite invasion.
The biological goals were to investigate if proteomics experiments confirmed or conflicted
with previous hypotheses regarding the mechanism of parasite invasion, and the continued
survival of the parasites in host cell culture. This has been facilitated by the development of
software for matching spots between gels, and visualising differentially expressed proteins.
The Gel Viewer enables multiple gels to be loaded concurrently, with controls for zooming,
search facilities for highlighting particular spots and links to more detailed information in
the database. There are also query facilities for finding particular proteins in the database
and for summarising all the data across replicates. Software has been written to connect
RAPAD to a number of external databases and analysis applications that summarise func-
tional classifications, to find which classes of proteins change in expression during parasite
invasion. The ability to connect to external software demonstrates the flexibility of RAPAD
which is due to the extensive use of ontologies. External database entries are stored in a
generic table (DatabaseEntry), with an associated record in the OntologyEntry table that
holds sufficient information to describe how the link to the external database should be
implemented, including the database's URL and version. Therefore, external links to any web
accessible database can be provided.
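A minimal sketch of this generic linking approach is given below. The class names follow the two tables named in the text, but the fields, the `url_template` convention, the `external_link` function and the example.org address are assumptions for illustration and do not reproduce the RAPAD schema.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class OntologyEntry:
    database_name: str
    version: str
    url_template: str     # assumed convention: contains "{accession}"


@dataclass(frozen=True)
class DatabaseEntry:
    accession: str
    ontology_entry: OntologyEntry


def external_link(entry: DatabaseEntry) -> str:
    """Build a hyperlink by substituting the URL-encoded accession into the template."""
    encoded = quote(entry.accession, safe="")
    return entry.ontology_entry.url_template.format(accession=encoded)


example_db = OntologyEntry(
    database_name="ExampleDB",
    version="1.0",
    url_template="https://example.org/record?id={accession}",
)
link = external_link(DatabaseEntry("BAA03759", example_db))
encoded_link = external_link(DatabaseEntry("gi|515251", example_db))
```

Because the link pattern is stored as data, supporting a further external database requires a new row and no change to the interface code.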
This investigation allows the general functionality of RAPAD to be assessed in a genuine
research environment. It is common for proteomics investigations to require the display of
differential expression of proteins, and links to external Internet accessible databases. The
data set described in this chapter is fairly large (14 gels, 350 identified proteins), and in the
following chapter there is a description of a different study in which a further 1000 protein
identifications are stored in RAPAD. This demonstrates that RAPAD can scale up to manage
Figure 6.13: A summary page displays all the gels present in the experiment, and a link exists to display the experimental protocols used for each gel.
substantial data sets, and allows them to be queried. The interface code and database schema
are freely available, therefore other developers can re-create their own version of RAPAD to
support a variety of proteome studies.
The integration between the locally generated proteomics data and the previously published
microarray studies was a critical requirement of the project. The results demonstrate
the viability of the approach; however, there are currently few protein records that
are matched to microarray data points. This appears to be a reflection of the proportion of
records present in both studies, rather than a flaw in the methodology. The microarrays used
by Blader contained from 18,000 to 27,000 clones. However, the results were only reported
for those genes that showed a 2-fold difference in fluorescence between scans generated from
infected and non-infected samples, corresponding to approximately 1800 microarray results.
In the proteome study, 130 distinct proteins displaying differential expression have been identified,
of which 14 have been matched to a clone in the microarray study. This is about 1 in 9
proteins (14/130). It would be expected by chance that a minimum
of 1 in 15 proteins identified by 2-DE should match a clone in the microarray results (1800
results from 27000 clones = 1/15). In reality, we would expect a far higher number to corre-
spond in the two studies because protein spots have been selected for analysis if they appear
in different volumes across the two conditions. It is assumed that if a protein is produced in
much greater abundance, there would be a corresponding increase in the mRNA levels that
would be detected by microarray analysis. Therefore, it might be predicted that most of the
proteins identified in the proteome study should appear in the Blader results.
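The arithmetic behind this comparison can be checked directly. The figures are those quoted above; the variable names and the derived enrichment ratio are additions made for illustration.

```python
from fractions import Fraction

observed = Fraction(14, 130)              # proteins matched to a clone
chance_27k = Fraction(1800, 27000)        # reported results over the largest array
chance_18k = Fraction(1800, 18000)        # same, over the smallest array

print(f"observed: 1 in {float(1 / observed):.1f}")    # 1 in 9.3
print(f"chance:   1 in {float(1 / chance_27k):.0f}")  # 1 in 15

# How many times more often proteins match than the 27,000-clone baseline predicts
enrichment = observed / chance_27k
print(f"enrichment over chance: {float(enrichment):.2f}")
```

The 27,000-clone figure gives the lowest chance rate, which is why 1 in 15 is a minimum; with 18,000 clones the same calculation gives 1 in 10. On the 27,000-clone baseline the observed rate is roughly 1.6 times the chance rate, well short of the near-complete overlap that might be predicted.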
A large number of the 130 proteins found in this study do not match any
differentially expressed genes in the microarray study. It is possible that the Blader study
did not have complete coverage of all genes, but the majority of genes were assayed and were
found to have stable gene expression between infected and non-infected samples. Therefore,
the differentially expressed proteins that do not match anything in the Blader study are of
interest, because they demonstrate that there may be post-transcriptional control in response
to parasite invasion. In other words, many proteins are produced in greater or lesser volume
during infection that do not have a measurable difference in their mRNA levels. It is possible
that certain proteins that are required for infectivity would not be highlighted from a gene
expression experiment, and in that case the mechanisms for infectivity cannot be studied
using microarrays only. This finding demonstrates the viability of the 2-DE and MS approach
for hypothesis formation, and it is likely that it will continue to grow as a technology for
functional genomics analysis.
6.5 Summary and conclusions
In this chapter software has been described that enables clustering and visualising spots on
replicate gels that contain the same form of a protein, and the spots that contain variant
forms. This has enabled potential post-translational modifications to be identified for fur-
ther study. When PTMs have been confirmed, RAPAD has facilities for their storage and
querying. The results suggest that different forms of proteins exist in infected and non-
infected samples, although the exact types of the modifications have yet to be confirmed.
The data sets will continue to grow rapidly, and it will be vital to combine information about
modifications with the relative expression values measured by microarrays, 2-DE and other
technologies. RAPAD provides a framework in which this kind of data integration can take
place on a large scale, and it will serve as a repository for the publication of data to accom-
pany journal articles. It is planned that the data from the experiments with Toxoplasma
will be published at some point in the future. RAPAD will provide public access to the
data, using the interface described in the previous chapter, to allow researchers accessing the
article to query the proteome data.
A common type of proteome investigation is the search for differentially expressed pro-
teins, using 2-DE, image analysis and mass spectrometry. The RAPAD system has been
extended to support the experiments presented in this chapter, which compare a human cell
line, invaded with Toxoplasma gondii, with non-invaded cells. RAPAD specifically facilitates
the identification of differential expression by providing a visualisation of clusters of spots
that have been matched to the same protein across a series of replicates. Following the
identification of proteins, a large amount of information must be assimilated from diverse
databases to characterise the proteins. Every protein record in RAPAD has hyperlinks to sev-
eral other databases, using the GenBank identifier or the corresponding gene symbol, which
were obtained for each protein using scripts written by the author. Additional tools were
used to summarise the functions of proteins from the Gene Ontology. An approach has been
presented for matching differentially expressed proteins to the corresponding results from a
previously published microarray experiment. The results of the matching demonstrate some
correspondence between genes that are up-regulated during infection and increased protein
abundance on 2-D gels, but the data sets are not currently large enough to quantify the cor-
relation. The software can be re-used when data sets are larger for determining the global
rate of transcription and translation.
The following chapter outlines a project with a different parasite, Trypanosoma brucei.
RAPAD assists an investigation to catalogue all the proteins that can be found using a
gel-based approach, to improve the functional annotation of the genome, and determine the
dynamic nature of the proteome.
Chapter 7
Software support for a proteome map of Trypanosoma brucei
7.1 Introduction
The previous chapter focused on the use of proteomics techniques to find differentially ex-
pressed proteins that allow for the formation of new hypotheses about the function of a
system. This chapter outlines the use of proteomics in a different context, where it is used
for cataloguing information about protein expression, to improve the functional annotation
of genes and the search for post-translational modifications. The RAPAD database sup-
ports a proteome map of the parasite Trypanosoma brucei that causes sleeping sickness in
Africa. The genome sequence of T. brucei is nearing completion, and many open
reading frames have been accurately predicted from it, but the functional annotation of the genes
is generally poor. There are many genes that have only been tentatively identified and have
no functional assignment. The proteome data is able to confirm the existence of genes that
encode proteins expressed in the cell line and provide insights into the dynamic nature of
proteins in terms of modifications, and different isoforms that exist. Additional software
has been written to provide a novel visualisation of proteins identified by mass spectrome-
try, and to summarise information within a substantial data set. The analysis presented in
this chapter will improve the naming of certain genes, and provides a potential functional
assignment for several proteins.
7.1.1 The biology of trypanosomes
Trypanosoma brucei is a eukaryotic parasite that causes sleeping sickness in sub-Saharan
Africa, and there have been a number of recent epidemics [294]. Trypanosomes live in
the bloodstream and tissue fluids of mammals, causing a variety of diseases in livestock and
Chapter 7. Software support for a proteome map of Trypanosoma brucei 215
Figure 7.1: The life cycle of Trypanosoma brucei, from DPDx - CDC Parasitology Diagnostic web site, http://www.dpd.cdc.gov/dpdx/HTML/TrypanosomiasisAfrican.asp
mortality in humans. They are transmitted by tsetse flies, and it is estimated that more than
half a billion people live in affected areas, with hundreds of thousands of new cases per year
[26]. The expected outcome, in the absence of chemotherapy, is death. Anti-trypanosomal
drugs have been developed, although drugs are not 100% effective, and resistant strains are
now arising [301].
The prospects for the development of a vaccine are very slim because the parasite evades
the immune response through the process of antigenic variation, first reported by Vickerman
in the 1960s [335]. A set of proteins, known as variant surface glycoproteins (VSG), form
a dense outer layer around the parasite, protecting against recognition by the immune
system. There is one locus from which a single VSG gene is activated at any one time,
with approximately 1000 other VSG genes distributed in different, silenced positions. At
intervals, a rearrangement of the genes occurs, switching the gene that is positioned in the
activated locus. A different protein becomes expressed, forming a new surface coat that will
not be recognised by the immune system (the mechanisms of gene switching are reviewed by
Barry 1997 [27]).
Trypanosomes undergo a complex developmental cycle that is simplified in Figure 7.1.
Figure 7.2: An electron micrograph of the bloodstream form of Trypanosoma brucei, from http://www.ulb.ac.be/sciences/biodic/ImProto0003.html
The regulation of the life-cycle is poorly understood despite its obvious importance to the
parasite. When a fly takes a blood meal from an infected mammalian host, bloodstream
forms (Figure 7.2) differentiate to the procyclic stage of the life cycle in the gut of the fly,
accompanied by alterations in metabolism and morphology caused by changes in expression
of an unknown number of proteins. It is vital these proteins are identified given the severity
of the disease and the unusual biology of trypanosomes, which is discussed in more detail
below. It is also possible that proteins involved in regulating the life-cycle may prove to be
viable drug targets.
7.1.2 Annotating the genome
The genome sequence of T. brucei is nearing completion and the sequence of chromosomes I
and II was reported in 2003 [146, 87]. The genome contains 11 chromosomes in total, and is
27 Megabases in length. Currently, 5500 coding sequences have been conclusively identified
(March 2004) [127], and it is expected that the total gene number will be about 8000. Efforts
are now underway to determine the function of all genes, with particular focus on genes that
cause drug resistance, genes that enable the parasite to evade the immune response and the
proteins that are up-regulated during infection of mammals. Trypanosoma brucei belongs
to a small class of unicellular organisms, the kinetoplastids, which exhibit highly unusual
regulation of gene expression. It seems that these organisms do not regulate transcription
by RNA polymerase II, and large numbers of genes appear to be regulated from a single
transcriptional initiation point. The genes lie adjacent to each other in long runs, interspersed
with almost no introns, similar to bacterial operons [60]. However, unlike operons, the
genes do not encode similar proteins that would be expected to be under a single control
mechanism, but instead contain seemingly unrelated genes. It will therefore be interesting
to discover what functional genomics (FG) experiments can demonstrate about how genetic
regulation is performed in these parasites. Microarray analysis would be expected to reveal
unusual results because transcriptional control may occur only through regulation of the
rate of degradation of mRNA, or the rate of splicing. Therefore, the abundance of mRNA
may have different patterns from organisms with conventional gene regulation. Proteomics
studies aim to determine the level of expressed proteins and therefore may prove vital in
elucidating how post-transcriptional control is exerted.
It is essential that the functional annotation of the T. brucei genome is improved rapidly,
and made widely available, to facilitate the search for new drugs to control sleeping sickness.
There are also several related species that cause serious diseases. One of the closest relatives is
Trypanosoma cruzi that causes Chagas disease in South America. The parasite is transmitted
by triatomine bugs, infects mainly cardiovascular and autonomic nervous tissues, and is fatal
in about a third of all cases [53]. There are several members of the genus Leishmania, which
cause a variety of life-threatening diseases in the third world. Genome sequencing is under
way for T. cruzi and Leishmania major. Comparative genome studies must be performed
to ensure that any gene annotations for closely related species can be related back to newly
sequenced genes in other organisms.
7.1.3 Database support
RAPAD is supporting a project to generate a catalogue of all the expressed proteins from T.
brucei, which can be separated by two dimensional gel electrophoresis (2-DE) and identified
by mass spectrometry (MS). The experiments are being performed by Anne Faldas and Prof.
Mike Turner in the Institute of Biomedical and Life Sciences at the University of Glasgow,
and the biological data in this chapter is reproduced with their permission.
Many of the 8000 genes in the genome are annotated as “hypothetical proteins” because
they have been identified solely by gene prediction algorithms. A naïve search of the genome
database, GeneDB [127], for the annotation “hypothetical AND protein” in T. brucei produces
a list of 11,999 entries, for which there is little or no further annotation. Because this
number is far larger than the expected total number of genes, several entries must refer to
the same underlying gene but appear more than once in the database. Clearly, if a
protein is identified conclusively by mass spectrometry, the protein is a real sequence, and
is expressed under the conditions used to generate the sample. This information must feed
back to the genome curators to allow the annotation to change from “hypothetical protein”
to “confirmed protein”. If homologous sequences from other organisms have been found by
similarity searches, the functional assignment of the homologous sequence should also be
added as annotation (described in Section 7.3.2).
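The feedback loop described above can be illustrated with a minimal sketch. It is not the RAPAD or GeneDB implementation: the record structure, the gene identifiers and the function name are hypothetical, and the only rule taken from the text is that a gene annotated as “hypothetical protein” becomes a “confirmed protein” once it has been identified by mass spectrometry at P < 0.05.

```python
from dataclasses import dataclass


@dataclass
class GeneRecord:
    gene_id: str      # hypothetical identifier, not a real GeneDB accession
    annotation: str   # free-text annotation held by the genome database


def confirm_annotations(genes, ms_hits, p_threshold=0.05):
    """Upgrade 'hypothetical protein' records that have a significant MS hit.

    ms_hits maps gene_id -> best P value reported for that gene (assumed input).
    Returns the list of gene IDs whose annotation was changed.
    """
    updated = []
    for gene in genes:
        p_value = ms_hits.get(gene.gene_id)
        if p_value is None or p_value >= p_threshold:
            continue  # not identified, or not significant
        if gene.annotation == "hypothetical protein":
            gene.annotation = "confirmed protein"
            updated.append(gene.gene_id)
    return updated
```

Records that already carry a functional annotation are left untouched in this sketch, so that existing curation is never overwritten by the proteome data.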
RAPAD supports searching and filtering of proteome data, allowing complex Boolean
queries to mine specific information from large data sets. It is also important that protein
data arising from gels with different pH ranges is combined in an intuitive manner, requir-
ing the development of good visualisation tools. This facility in RAPAD was described in
the previous chapter. One of the most important parts of the analysis is to discover the
frequency of post-translational modifications, or other events, that cause multiple spots
matched to the same protein to appear on a gel. Many proteins appear in multiple copies at
different positions on the gel, indicating that some processing or alteration of proteins must
be occurring to change either the charge or mass. For example, 92 distinct spots contain a
tubulin protein (α or β), many of which appear near the base of the gel, indicating small
molecular weight proteins, and the spots are reproducible across replicate gels. This would
suggest that the spots contain only fragments of proteins, the result of degradation. Software
has been developed alongside RAPAD to investigate this phenomenon (Section 7.3.1).
7.1.4 Project status
The current status of the T. brucei data deposited in RAPAD is as follows (June 2004).
There are 955 proteins identified in total, which arise from 619 spots on three gels. The
number of proteins is higher than the number of spots because several different proteins
are frequently identified from a single spot. A database query reveals that 260 proteins
have distinct molecular weights, indicating that this is the approximate number of different
proteins that have been identified. The rest of the analysis has been performed on one single
master gel (pH range 4-7), which contains 879 distinct spots. On the master gel 753 protein
identifications have been made from 460 spots.
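The summary figures above come from simple counting queries. The sketch below shows the kind of query involved, using an in-memory SQLite database. The table name, column names and the four example rows are assumptions made purely for illustration (the molecular weights are invented values); the actual RAPAD schema is different.

```python
import sqlite3


def demo_connection():
    """Build a toy table of protein identifications (invented example rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE identification "
        "(spot_id INTEGER, protein_name TEXT, mol_weight REAL)"
    )
    rows = [
        (1, "beta-tubulin", 49700.0),
        (2, "beta-tubulin", 49700.0),
        (2, "HSP70", 71400.0),      # two proteins identified in one spot
        (3, "EF 1-alpha", 49100.0),
    ]
    conn.executemany("INSERT INTO identification VALUES (?, ?, ?)", rows)
    return conn


def summarise(conn):
    """Return (identifications, spots with a hit, distinct molecular weights)."""
    cur = conn.cursor()
    n_ids = cur.execute("SELECT COUNT(*) FROM identification").fetchone()[0]
    n_spots = cur.execute(
        "SELECT COUNT(DISTINCT spot_id) FROM identification"
    ).fetchone()[0]
    n_weights = cur.execute(
        "SELECT COUNT(DISTINCT mol_weight) FROM identification"
    ).fetchone()[0]
    return n_ids, n_spots, n_weights
```

Counting distinct molecular weights is only an approximation to the number of distinct proteins, since two different proteins may share a predicted weight and a single protein may be stored under more than one database entry.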
The rest of the chapter is structured as follows: the methods used to capture the project
requirements and to develop the software are discussed in Section 7.2. Section 7.3 describes
the results, in terms of how RAPAD supports the discovery of modifications and aids genome
annotation. An investigation into the causes of multiple spots arising for a single protein is
also described. Discussion is provided in Section 7.4.
7.2 Methods
7.2.1 Generation of samples for proteome analysis
One of the major problems of performing functional genomics analysis on trypanosomes is
the speed with which they evolve, and it has been reported that trypanosome lines can spon-
taneously change their phenotype as a result of laboratory manipulation (see for example
van Deursen et al. 2001 [329]). If researchers perform investigations to characterise the
gene or protein expression of trypanosomes, the results may only have relevance to the exact
laboratory strain on which the experiments were performed. To alleviate these problems, a
reference strain of T. brucei has been generated (TREU 927), which has been used for gen-
erating the genome sequence [329]. The strain has several properties that are representative
of trypanosomes in the wild, and it can be cultured in vitro. Proteins have been extracted
from procyclic forms of the TREU 927 line grown as an in vitro culture for the proteome
study in Glasgow. This is vital because DNA has also been extracted from this line for
microarrays that are being created. Therefore, it will be possible to compare data from
the genome, transcriptome and proteome in the future, and the proteome experiments can
directly contribute to improving the annotation of the genome. The proteomics experiments
in the database comprise three main gels which have been run over different pH ranges (4-7,
6-11, and 4.5-5.5) to achieve a high resolution of proteins. The experimental protocols for
protein solubilisation, the two dimensions of gel separation and staining are all stored in
RAPAD. The details of the experimental procedure are given below.
Procyclic forms of the genome reference strain TREU 927/4 were grown in SDM-79
with 10% foetal calf serum according to [330]. Parasites were purified by washing in PSG
buffer and centrifuged at 13,000g. Approximately 2 × 10^8 trypanosomes (650 µg protein) were
and 0.5% IPG buffer pH4-7, trace bromophenol blue). A protease inhibitor cocktail (5µl,
Roche), at a concentration of 25µg/ml, and 10µl nucleases (2000 units/ml DNase, 1750
units/ml RNase A, 50mM MgCl2) were added to limit proteolysis and digest nucleic acids.
The sample was incubated at room temperature for 1 hour, vortexing every 10 minutes, then
freeze/thawed in liquid nitrogen. The sample (450µl) was loaded on to a 24 cm IPG strip
(Amersham) and isoelectric focusing was performed, reaching more than 70,000Vhrs.
The strips were equilibrated in 100mM DTT for 15 minutes followed by 15 minutes in
250mM α-iodoacetamide before being applied to a 12.5% precast SDS polyacrylamide gel.
Electrophoresis ran overnight at 15°C using the Amersham buffer kit. The gels were stained
using colloidal Coomassie dye and scanned using Image Master (Amersham). Replicate gels
were performed (ten replicates of pH 4-7, five replicates of 4.5-5.5 and 6-11) of which one was
selected for protein identification. The 2D Elite software (Amersham) was used to generate
a picklist, and the gel was transferred to the Amersham robotic workstation, each gel plug
digested with trypsin and mixed with a CHCA (α-cyano-4-hydroxy cinnamic acid) matrix,
and spotted on to a MALDI (Matrix Assisted Laser Desorption Ionisation) target plate. A
peptide sample and a gel plug were collected for each sample and stored at −20°C. Analysis
of the peptides was performed using MALDI-TOF (Time Of Flight) with a Voyager system
(Perseptive Biosystems) and tandem MS (AB Q-Star Pulsar). Tandem MS was used for the
majority of protein identifications (approximately 95%). Genome sequence information was
downloaded from GeneDB (Release 3) to a local database that was searched using MASCOT
software [207]. Proteins were positively identified at a significance value of P < 0.05 as
calculated by the software.
7.2.2 Project requirements capture
The first phase of developing an understanding of the problem area involved meetings with
the project leader and researchers working on trypanosomes. The current practice of managing
data was observed: data from the project were stored in Excel spreadsheets, and
were entered by manually copying and pasting from mass
spectrometry results and database searches that had been performed to characterise pro-
teins. Protein data in the Excel spreadsheet was related back to the spot on the 2-D gel
from which it arose, using the numerical identifier assigned to the spot by the image analysis
application, which was entered in the corresponding row of the table.
The project leader, Prof. Mike Turner, outlined a set of six questions that could poten-
tially be solved by improvements in software:
1. Can the time and labour to identify proteins be reduced?
2. How many different proteins can be identified from 2500 spots?
[Figure 7.3 diagram labels: Protein unfolds during 2-DE; Digested into peptides; Peptides detected by MS; Peptide span of whole sequence; Folded protein]
Figure 7.3: The span of peptides that have been matched within a protein sequence is represented by the shaded section of the block, for a cluster of four spots, explained in Section 7.2.3.
3. How widespread and common are post-translational modifications?
4. How can we improve the T. brucei genome annotation?
5. Can we build a “point and click” virtual 2D gel?
6. Can we build pages that give original MS data interpretations?
The issues of genome annotation and data integration were discussed in meetings with the
curators of the T. brucei genome database at the Sanger centre, Cambridge UK (December
2003). The web site providing access to the genome is GeneDB, which is supported by the
GUS database system. One of the main goals of the proteome project is to improve genome
annotation. Once the proteome namespace has been added to GUS, as discussed in Chapter
5, the proteomics data can be stored directly within GeneDB. However, it is important that
data produced from the experiments can be linked up with GeneDB in the near future, prior
to the full deployment of a new version of GUS that supports proteomics. Towards this
goal, a new interface has been developed as part of RAPAD for publishing data, with unique
identifiers that can be linked up with GeneDB, when the proteomics data is made public.
7.2.3 Visualisation
There are many different spots that have been identified as the same protein on the master
gel in this investigation. The database can be queried for a particular protein name, and
the results of the query can be visualised in the Gel Viewer. The Gel Viewer provides a link
from each spot to the record for the mass spectrometry results that were used to identify
the protein. However, there are limited facilities for investigating why so many different
spots arise that appear to match the same protein. Therefore, additional software has been
implemented alongside RAPAD for visualising the peptide sequences that have been matched
by MS data, to investigate why certain proteins appear in multiple positions on a 2-D gel.
A piece of text processing software has been written to extract the peptide sequences from
mass spectrometry results. The full length sequence has also been obtained for each protein,
and linked up to the Gel Viewer to provide a visualisation for every spot, displaying the
proportion of the protein sequence that has been matched: the span of peptide hits (Figure
7.3). Each spot is labelled with a white block representing the entire protein sequence, filled
with a shaded section. The left end of the shaded block represents the position of the first
peptide hit in the protein sequence, the right end of the shaded block represents the last
position of the last hit to the protein sequence. From this information, it is possible to say
that at least this proportion of the protein sequence was present in the spot, assuming correct
identification from MS data. Peptides may not be detected by MS for several reasons: (i)
during MS/MS only a proportion of the peptides most strongly detected in the first stage
are subjected to the second stage of MS, (ii) ionisation is dependent on various properties of
a peptide, such as its charge and (iii) there is technical variability in the efficiency of peptide
ionisation.
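The span calculation described above can be sketched as follows. This is an illustrative reconstruction rather than the code used alongside RAPAD: the function names are hypothetical, each peptide is located by exact string matching at its first occurrence only, and the text rendering merely stands in for the white and shaded blocks drawn by the Gel Viewer.

```python
import math


def peptide_span(protein_seq, peptides):
    """Return (start, end, fraction) for the region spanned by matched peptides.

    start is the 0-based position of the first residue of the first hit and
    end is one past the last residue of the last hit. Peptides that do not
    occur in the sequence are ignored; None is returned if nothing matches.
    """
    starts, ends = [], []
    for peptide in peptides:
        if not peptide:
            continue
        pos = protein_seq.find(peptide)  # first occurrence only (assumption)
        if pos != -1:
            starts.append(pos)
            ends.append(pos + len(peptide))
    if not starts:
        return None
    start, end = min(starts), max(ends)
    return start, end, (end - start) / len(protein_seq)


def render_span(start, end, length, width=20):
    """Draw the span as text: '.' for the white block, '#' for the shaded part."""
    left = math.floor(start * width / length)
    right = math.ceil(end * width / length)
    return "." * left + "#" * (right - left) + "." * (width - right)
```

For a made-up 20-residue sequence `MKTAYIAKQRQISFVKSHFS` with matched peptides `TAYI` and `SFVKSH`, the span runs from position 2 to position 18, that is 80% of the sequence. The span is a lower bound on how much of the protein was present in the spot, because residues between and beyond the detected peptides may simply not have been observed.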
The genome database contains several different genes that share the same name. An
additional visualisation has been created to summarise where these different forms of the
same protein arise on the master gel. A different colour is used to shade spots that have
been matched to peptides within a specific protein sequence in the database. In this way,
groups of proteins that have the same name but are in fact different, can easily be visualised
on the gel. This allows researchers to verify that clusters of proteins with the same name
have been identified correctly, because it is expected that proteins located in the same region
of the gel will arise from the same gene.
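A minimal sketch of the colour assignment is given below. The palette, the function name and the gene identifiers used in the usage note are assumptions for illustration; the only idea taken from the text is that every distinct database sequence receives its own colour, and each spot is shaded with the colours of the sequences matched in it.

```python
from itertools import cycle

# Hypothetical palette; colours are reused if there are more genes than entries.
PALETTE = ["blue", "red", "yellow", "orange", "white"]


def assign_gene_colours(spot_hits, palette=PALETTE):
    """Map each gene to a colour and each spot to the colours of its genes.

    spot_hits is a list of (spot_id, gene_id) pairs, one per identification.
    Returns (gene_colour, spot_colours).
    """
    colours = cycle(palette)
    gene_colour = {}
    spot_colours = {}
    for spot_id, gene_id in spot_hits:
        if gene_id not in gene_colour:
            gene_colour[gene_id] = next(colours)
        spot_colours.setdefault(spot_id, []).append(gene_colour[gene_id])
    return gene_colour, spot_colours
```

Given the pairs `(101, "geneA")`, `(102, "geneA")`, `(103, "geneB")` and `(103, "geneA")`, spots 101 and 102 share one colour, while spot 103 carries two colours and would be flagged visually as containing peptides from both sequences.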
7.3 Results
7.3.1 Investigation into multiple protein forms
The proteomics experiments on T. brucei reveal several proteins that appear in multiple
positions on a single gel (the pH 4-7 gel); examples include Heat Shock Protein 70 (62 spots),
α-tubulin (50 spots), β-tubulin (40 spots), Elongation factors (EF 1-α, EF 1-β, EF 1-γ, EF
2; creating 37 spots in total) and Heat Shock Protein 60 (19 proteins). There are several
reasons why proteins may appear in multiple positions. Firstly, chemical modifications, such
as the gain or loss of phosphate groups on the protein, can cause multiple spots to appear in
a localised region. Secondly, a protein may also be fragmented at some point, either in vivo
or during the experimental procedure, therefore peptides measured by mass spectrometry
may not have arisen from the full protein sequence. Protein spots that arise near the bottom
of the gel, indicating low molecular weight proteins (described on page 7 in Chapter 1),
are more likely to contain only fragments of proteins. Thirdly, it is formally possible that
differential splicing causes different proteins to be produced from the same gene, which still
have peptides that match the protein entry in a sequence database, even if the full length
sequence of the protein is different from the predicted form. However, while differential
splicing in higher eukaryotes seems to be a very common phenomenon [215], it has never
been reported in T. brucei because almost all genes comprise a single exon and therefore
are not spliced at all. It is also possible that the proteins which seem to appear in multiple
copies are false positives, arising because the sequences have some characteristic that causes
many incorrect database matches.
Tubulin proteins
α and β-tubulin produce many spots on the 2-D gels for T. brucei, which could be the
result of protein modifications. α and β-tubulin form a heterodimer and are one of the main
components of microtubules that form a layer around the cytoplasm, just beneath the outer
cell membrane [140]. A study by Lubega and colleagues demonstrated that mice can be
immunised against African trypanosomosis by injection with tubulin proteins, raising the
possibility that tubulins could form part of a successful vaccine [197].
It has previously been demonstrated that post-translational modifications (PTMs) of
tubulin are associated with the construction of the cytoskeleton and fall into two categories:
general protein modifications, such as phosphorylation or acetylation, and tubulin-specific,
[Gel image with boxed regions labelled 1 to 4]
Figure 7.4: Protein spots matched to β-tubulin, overlaid with a graphic displaying the span of peptide hits (shaded block) as a proportion of the full length sequence (white block). The boxed regions are discussed further in the text. Gel image courtesy of A. Faldas.
such as tyrosination. The acetylation of tubulin has previously been identified by 2-DE,
therefore many of the spots observed in this study are likely to correspond to differentially
modified forms of the protein (original experiments are reviewed by Gull [140]).
β-tubulin
The results from the peptide alignment analysis with β-tubulin are displayed in Figure 7.4.
The main cluster of proteins (1 on Figure 7.4) towards the top of the gel is in the position
that would be predicted from the molecular weight of β-tubulin (50 kDa). It is likely that there
are several different types of chemical modifications that occur to β-tubulin, causing the 16
different spots to appear in this region. The spots at the bottom left of the gel (4) have fairly
short spans of peptide hits (less than 10% of the full sequence), and are therefore more likely to
be caused by peptide fragments. In the bottom middle range of the gel there are two spots
(3) both with peptides matching a range in the middle of the protein sequence, indicating
these two are caused by two similar protein fragments, possibly with a single modification
causing a localised shift in position.
There is a cluster of several spots in the middle/left of the image (2 on Figure 7.4),
which appear to have very long peptide spans (up to 80%). This result is surprising because
it would not be expected that the full length protein sequence for tubulin would migrate this
far into the gel. Therefore, it is theoretically possible that this protein arises from differential
splicing of gene products to produce a protein that has peptide sequences from the two ends
of the original sequence. It is also possible that the spots contain a different protein that
has peptides that closely match parts of the β-tubulin sequence. However, a BLAST [11]
search of GeneDB with the peptides from these regions reveals that there are no similar
sequences except the other tubulin proteins (BLAST results not shown). The MS data for
the close grouping of three spots (spot IDs 677, 664 and 641) have very high MASCOT scores,
indicating that the matches are probably correct, with strong hits to peptides near the start
of the sequence, and other matches to peptides near the end of the protein sequence. GeneDB
contains a cluster of identical genes on chromosome 1, annotated as β-tubulin, although the
exact number of genes is not known because it varies in different cell lines. It is also very
difficult to assemble regions of the genome that contain repetitive identical sequences. There
are no gene sequences deposited in GeneDB that could explain the long span of peptide hits
of this spot cluster.
A further observation on Figure 7.4 is that the peptides matched tend to cluster at the
N-terminus (left end) and there are no peptides matched to the C-terminus (right end) of
protein sequences. This raises the possibility that there is cleavage of a peptide at the C-
terminus. Alternatively, it is possible that there are modifications that prevent peptides
being ionised in a mass spectrometer. In particular, it is known that the C-terminus of
β-tubulin is extensively glutamylated, which is the addition of up to 20 extra glutamate
residues to a defined glutamate near the C-terminus [283]. This may prevent the peptides
at the C-terminus from being detected by MS.
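The observation about the termini can be checked systematically once the peptide positions are known. The sketch below tests whether any matched peptide falls within a fixed window of either end of the sequence. The window of 30 residues and the function name are assumptions chosen for illustration, not values used in this study.

```python
def terminal_coverage(protein_length, peptide_positions, window=30):
    """Report whether matched peptides reach the N- and C-terminal regions.

    peptide_positions is a list of (start, end) pairs, 0-based with end
    exclusive, one per matched peptide. A terminus counts as covered if any
    peptide overlaps the first or last `window` residues respectively.
    Returns (n_terminus_covered, c_terminus_covered).
    """
    n_covered = any(start < window for start, _ in peptide_positions)
    c_covered = any(end > protein_length - window for _, end in peptide_positions)
    return n_covered, c_covered
```

Applied to every spot matched to a given protein, a consistent absence of C-terminal coverage would be compatible with either cleavage or a modification that prevents detection; the check alone cannot distinguish between the two explanations.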
[Gel image with regions labelled 1 to 5]
Figure 7.5: Protein spots matched to α-tubulin, overlaid with a graphic displaying the span of peptide hits. There is a correlation between the span of peptide hits and the position of a spot on the gel. Gel image courtesy of A. Faldas.
α-tubulin
Figure 7.5 displays the peptide spans for α-tubulin. The cluster of six spots towards the top
of the gel (1) is in the position that would be expected for a protein with the molecular
mass of α-tubulin (50 kDa) and therefore probably contains the full length sequence. There
is a cluster of spots presumably caused by various small modifications to the protein, which
account for the localised shifts in positions. The genome contains a cluster of identical α-
tubulin sequences on chromosome 1, therefore the different spot positions are not due to
differences in gene sequence.
At the bottom of the gel there are a large number of possible fragments, and there appears
to be a fairly strong correlation between spots located in the same region and the span of
peptide hits (see for example 2, 3 and 4 on Figure 7.5). This would suggest that a fragment is
being produced reproducibly with one or two different modifications on the peptides present
in the fragment. The volume of spots in the small molecular weight range also appears to
be reproducible across replicate gels by manual inspection. However, it is not possible to
investigate the peptide spans of all spots from replicate gels by MS due to the cost involved.
It remains to be investigated if these fragments have any biological significance or if they
are experimental artifacts. The correlation between peptide span and spot position may be
related to protein modifications. Modification status affects the ability of a peptide to be
ionised, therefore peptides that have the same set of modifications should have the same
probability of being detected by mass spectrometry. Proteins located in similar regions are
likely to contain many peptides that have been modified in the same way, and these peptides
will share the same likelihood of being detected by mass spectrometry.
There are two spots towards the bottom left that have very long spans of matched
peptides (5). This is similar to the results for β-tubulin, and it is unlikely that a full length
protein could migrate this distance in the gel, therefore these may be the result of differential
splicing. An alternative, although unlikely, possibility is that tubulin fragments from the two
ends of the protein have independently co-migrated and appear as a single spot. It is also
possible that the protein fragmented but the 3-D structure did not completely disassociate
as expected, leaving different parts of the protein bound together, with a small overall mass.
The spots (IDs 741 and 734) both have strong hits to the α-tubulin protein record, matching
peptides near the beginning and end of the protein sequence. Additional experiments could
be performed to further characterise this protein spot, for example performing tandem mass
spectrometry on as many peptides as possible to determine what parts of the protein are
present in the spot.
The same observation about the lack of peptides matched at the C-terminus can be made
for α-tubulin, as well as β-tubulin. This may be due to glutamylation, which has also been
reported for α-tubulin [84], or tyrosination of C-terminal peptides [289]. Modifications of
these kinds are thought to be common on α-tubulin, and may prevent peptides becoming
ionised during MS. The peptide spans on Figure 7.5 also demonstrate that there are no
N-terminal peptides that have been matched. This raises the possibility that PTMs also
occur on N-terminal peptides, which, as far as we are aware, has not been previously reported.
This demonstrates that the peptide visualisation software has the capacity for hypothesis
generation, which can be confirmed by further experimentation.
Figure 7.6: Protein spots matched to five different Elongation Factors. EF-α (blue); EF-β (red); EF-2 (yellow); EF-γ (orange and boxed); EF (putative) (white). Gel image courtesy of A. Faldas.
Elongation factor proteins
The peptide alignment analysis has also been performed to classify Elongation Factor (EF)
protein spots. Elongation factors function during protein translation, for example controlling
the addition of new amino acids onto a growing peptide chain. It has been suggested that T.
brucei protein abundance is controlled at the level of translation rather than transcription,
therefore any insights into EF proteins could prove important in understanding regulation.
There are at least five different elongation factor genes, with many spots appearing on the
2-D gel (Figure 7.6). Functional annotation for these genes in T. brucei is still at an early
stage, therefore any information from proteomics that can aid annotation will be useful.
An analysis was carried out to determine the peptide spans of EF 1-α, EF 1-β, EF-2, a
sequence annotated in the database solely as EF (putative), and EF-γ (one protein spot -
peptide alignment not shown), to test whether spots have been correctly identified on gels,
and to determine whether sequences have been correctly predicted in the genome database.
Figure 7.7: Protein spots matched to Elongation factor 1-α. Gel image courtesy of A. Faldas.
Elongation factor 1-α
The graphic for Elongation factor 1-α (Figure 7.7) displays a large cluster of spots to the
right of the gel, likely to be caused by multiple differentially modified forms of the proteins.
The post-translational modification of EF 1-α is a common phenomenon in other organisms,
such as plants [265], but as far as we are aware, it has not been investigated in detail for
trypanosomes. The evidence presented here suggests that PTMs to EF 1-α from T. brucei
are also very common. The spots towards the bottom of the gel are likely to be protein
fragments, shown by the very short spans of peptides (less than 5% of the sequence length).
Elongation factor 1-β and EF (putative)
The left gel in Figure 7.8 displays Elongation factor 1-β (EF-β) protein spots. There are
three spots in the middle of the gel, which are likely to result from different modifications,
such as different phosphorylations to the protein. A single spot towards the bottom of the
gel is probably a fragment of the full length sequence. The right image on Figure 7.8 displays
the spots matched to EF (putative). There is probably one match to the full protein, in the
centre of the gel, and two possible fragments at the bottom of the gel. A multiple alignment
has been performed, using ClustalW [318], of the sequences of EF-β and EF (putative) from
Figure 7.8: Protein spots matched to EF-β and EF (putative) are displayed with the corresponding span of peptide hits. The boxed regions mark a spot that contains peptides that match both EF-β and EF (putative). A multiple alignment is also displayed of EF-β from T. brucei and T. cruzi, with EF (putative) from T. brucei and EF T. cruzi. The boxed region of the alignment shows that the starting codon of EF-β from T. brucei may have been wrongly predicted. Gel images courtesy of A. Faldas.
T. brucei and from T. cruzi (in the lower part of Figure 7.8). There is a very high degree of
sequence similarity between EF-β and EF (putative), with long stretches of identical residues.
In T. brucei the sequences lie on chromosome 4 and chromosome 10, so there is a low
chance that this is an annotation error and that they are in fact the same sequence. However,
contamination has been detected in sequences derived from the chromosome 10 project,
and therefore it is not possible to say definitively that the two sequences arise
from different genes.
The alignment shows that the N-terminus of EF-β may have been incorrectly predicted
because the first 30 or 40 residues align poorly, and there is a region 37 residues downstream,
which matches the start of the other EF sequences. It is also worth noting that the first
residue of the T. cruzi EF sequence is not a methionine and may also have been incorrectly
predicted. There is a methionine nine residues downstream that aligns very well with the
start codon of EF (putative) from T. brucei, which is more likely to be the correct start
position.
The alignment of peptide sequences against proteins reveals a single spot that contains a
peptide that exactly matches the protein sequence of both EF-β and EF (putative), towards
the bottom left corner of the gel (boxed in Figure 7.8). This finding, and the high sequence
similarity on the multiple alignment, demonstrate that mass spectrometry results cannot
always conclusively distinguish between EF (putative) and EF-β. However,
the spots in the middle of the gel have long peptide spans that cover the N-terminus of
the protein sequence, which is more divergent than the C-terminus of the sequence between
EF-β and EF (putative). Therefore, these spots are likely to have been correctly identified.
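The reasoning above, that a spot is ambiguous only when all of its peptides occur in both sequences, can be expressed as a short check. The sketch below is illustrative only: the function name and the toy sequences are assumptions and do not reproduce the EF-β or EF (putative) sequences.

```python
def classify_peptides(peptides, seq_a, seq_b):
    """Split peptides into those found in both sequences, only A, or only B."""
    both, only_a, only_b = [], [], []
    for pep in peptides:
        in_a = pep in seq_a
        in_b = pep in seq_b
        if in_a and in_b:
            both.append(pep)
        elif in_a:
            only_a.append(pep)
        elif in_b:
            only_b.append(pep)
    return both, only_a, only_b


# Toy sequences with a divergent N-terminus and an identical C-terminus.
seq_a = "MSDLKVVTGAAPLEELR"
seq_b = "MTTQSDGKVVTGAAPLEELR"
both, only_a, only_b = classify_peptides(
    ["VVTGAAPLEELR", "MSDLK", "MTTQ"], seq_a, seq_b
)
print(both, only_a, only_b)
```

A spot whose peptides all fall in the first list cannot be assigned to either protein, whereas any peptide in the second or third list supports one identification over the other.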
Elongation factor 2
The image in Figure 7.9 displays the peptide spans of proteins matched to EF-2. There are
eight spots near the top of the gel which are probably differentially modified forms of the
complete protein, and the spots at the bottom of the gel are likely to be protein fragments.
There is no T. brucei EF-2 sequence deposited in GenBank as of May 2004, but there is an
EF-2 gene in GeneDB. The closest match in GenBank is Elongation Factor 2 from T. cruzi.
A sequence alignment reveals that Elongation Factor 2 is almost identical between T. brucei
and T. cruzi, indicating that the sequence has been correctly named. The last part of the
alignment is displayed in the lower part of Figure 7.9, and it appears that the end point of
the T. brucei sequence may have been incorrectly predicted.
Figure 7.9: The span of peptide hits for protein spots matched to Elongation Factor 2. The alignment shows the 150 residues at the C-terminus of the EF-2 sequences from T. brucei (TRYP_x−70a06.p2kb545_355) and T. cruzi (gi|1800107|dbj|BAA09433.1|). The boxed region shows that the end point of one of the sequences may not have been predicted correctly, given that the overall similarity between the two sequences is so high. Gel image courtesy of A. Faldas.
Elongation factor γ
There is a single protein spot matched to EF-γ, near the bottom of the gel (orange and
boxed on Figure 7.6). A BLAST search of the EF-γ gene sequence hits only other EF-γ
sequences, and not the other EF genes, so this match is probably correct. However,
the spot is positioned near the base of the gel, indicating that this may only be a protein
fragment; it is therefore not certain that the full EF-γ protein is present on the gel.
Summary of elongation factor results
In summary, the results demonstrate that there are at least five genes encoding elongation
factors in T. brucei and many different protein spots appear on the 2-D gel, raising the
possibility that protein modifications are common. Modifications could regulate the activity
of elongation factor proteins, to achieve control over the translation of proteins. This is an
interesting area for further research because T. brucei does not modulate the rate of tran-
scriptional initiation, and it is likely that control over protein expression occurs downstream,
perhaps by regulating the rate of translation.
Heat shock proteins
The heat shock proteins (Hsp) are conserved across virtually all organisms, and are often
expressed in response to environmental stress. It has been shown that Hsps are up-regulated
when the temperature of the parasite’s environment is rapidly increased, for example during
transfer from the tsetse fly (25°C) to the mammalian host (37°C). At this time there are
extensive changes in morphology and metabolism of the parasite as it switches from the
procyclic form to the bloodstream form. It is thought that the expression of Hsp genes at
this time is crucial. It has been demonstrated that post-transcriptional control is exerted
to regulate the expression of Hsp70, and this control may be exerted at the level of mRNA
stability [193]. The large number of distinct spots matched to Hsp70 on the proteome map
of T. brucei suggests that many different protein forms exist, and therefore
post-translational modifications may also be common.
The current level of annotation for T. brucei heat shock proteins is fairly poor.
Many spots on a single gel match Hsp70, but it is possible that the 62 distinct protein
spots arise from several closely related genes rather than from one gene. An
analysis was carried out to identify how many distinct genes coded for the 62 protein spots
observed. Five distinct protein sequences were obtained that had been matched by mass
Figure 7.10: A multiple alignment of five Hsp70 protein sequences from T. brucei: a) = TRYPtp2h24gd03.q1k 1, b) = TRYPtp30n4hh05.p1k 3, c) = TRYP xi-1015g04.q1k 13, d) = 125.m00218, e) = 92.m00252.
spectrometry data, all predicted to be Hsp70 by BLAST searches. A multiple alignment of
the five sequences has been performed (Figure 7.10). All five sequences are highly related
but no two appear similar enough to have arisen from an incorrect prediction of a single
gene, therefore there appear to be at least five distinct Hsp70 genes that exist in T. brucei.
The first sequence in the alignment is significantly shorter than the other four, possibly
indicating that the start of this gene has been incorrectly predicted, or it is a pseudogene.
The similarity between all sequences raises the possibility that mass spectrometry matches
to these proteins could be incorrect. However, there are few long stretches in any sequence
that are identical to a different sequence, so it is likely that most peptide matches
will be made correctly. A study by Lee in 1998 suggested that there is an Hsp70 locus in
T. brucei containing 6 identical genes [193]. A search of GeneDB (May 2004) with the text
queries “*heat shock protein*” and “*hsp*” finds seven proteins that are predicted to be
Hsp70. Four are clustered on chromosome 11, which may be the locus reported by
Lee, one lies on chromosome 9 and two on chromosome 7.
A multiple alignment has been performed of the sequences retrieved from the current
release of GeneDB against those from the MS analysis, which come from GenBank and older
Figure 7.11: Protein spots matched to five different Hsp70 protein sequences. 125.m00218 = blue; 92.m00252 = red; TRYP xi-1015g04.q1k 13 = yellow; TRYPtp30n4hh05.p1k 3 = white; all spots marked as cyan contain peptides that hit both TRYPtp2h24gd03.q1k 1 (cyan) and TRYPtp30n4hh05.p1k 3 (white). Gel image courtesy of A. Faldas.
downloads of GeneDB that were used for the original MS analysis over the last year. The
alignment is displayed at the end of the chapter. Ten out of the twelve sequences appear
to be distinct, and two of the sequences from the MS analysis are identical to sequences
on chromosome 11. It is possible that these are the same genes; however, it is not possible
to verify this, as there is no correspondence between different versions of sequence identifiers in
GeneDB. Two sequences in GeneDB, Tb09.160.3090 and Tb07.29K4.60, are annotated as
Hsp70 and contain several motifs that are highly similar to other Hsp70 sequences. Over
the full length, however, they are more divergent and are 25% longer than the other Hsp70
sequences. They would therefore be predicted to have a higher molecular weight and may in fact
be a closer match to a different heat shock protein.
Figure 7.11 displays which protein spot matches which sequence in the genome database.
There are distinct clusters of spots that match the sequence 92.m00252 (red) and 125.m00218
(blue). Only one spot matches TRYP xi-1015g04.q1k 13 (yellow), at the bottom of the
gel, so this may only be a protein fragment. There is a cluster of spots predicted
to match TRYPtp30n4hh05.p1k 3 (white) and TRYPtp2h24gd03.q1k 1 (cyan); however, all
those coloured cyan are matched to peptides that also exactly match TRYPtp30n4hh05.p1k 3
(white), so it is not possible to say from this analysis which is the correct protein
identification. It is possible that the MS results have incorrectly predicted the identity of
the proteins coloured cyan or white. It is not possible to say definitively that the protein
TRYPtp2h24gd03.q1k 1 (cyan), which has a very short sequence, is expressed in this sample,
and it may be a pseudogene.
7.3.2 Using data in RAPAD to improve genome annotation
An interface has been developed which allows external databases to link to protein records
in RAPAD. Unique ID numbers have been assigned to proteins that identify the database
version (v. 1) so that in future database versions, a link can be provided to the most recent
records. A record displays the protein name, has a link to the corresponding gel with the
spot highlighted, and provides evidence about the quality of the match to MS data. When
the data is released to the public, the web page for each protein can be referenced from other
databases. Alternatively, a more robust approach would be for other databases to store the
unique ID number that has been assigned to each protein, and maintain a single URL to
where the current implementation of the database is located. This feature will be used by
the genome database, when the existence of a protein has been verified by the proteome
map, as discussed in Chapter 5. The interface that allows public access to T. brucei data in
RAPAD is displayed in Figure 7.12.
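The more robust linking approach described above, in which an external database stores only the unique protein ID and a single base URL, could look something like the following sketch. The base URL, the query parameter name and the ID format are hypothetical placeholders and are not taken from the RAPAD implementation.

```python
from urllib.parse import urlencode

# Hypothetical location of the current RAPAD release. Only this one
# value would need to change if the database moved.
RAPAD_BASE_URL = "https://rapad.example.org/protein"


def protein_record_url(protein_id, base_url=RAPAD_BASE_URL):
    """Build a link to a protein record from its stored unique ID."""
    query = urlencode({"id": protein_id})
    return f"{base_url}?{query}"


# The ID below is an invented example of a version-prefixed identifier.
print(protein_record_url("v1-000123"))
```

Storing the ID rather than the full page address means that links held by other databases remain valid when the database is relocated, provided the ID itself is kept stable between versions.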
Hypothetical proteins
An analysis has been performed to find the number of distinct proteins stored in RAPAD
which are named as a “hypothetical protein”. A simple search of RAPAD for the word
hypothetical in the protein name reveals 100 matching entries that arise from 47 distinct
spots on the master gel. It is therefore likely that the actual number of proteins that are
annotated as hypothetical on the master gel, is somewhere between 47 and 100 because it is
possible that there is more than one distinct protein annotated as hypothetical in a single
spot. However, given that many sequences have not been manually curated, the genome
database may contain a large number of open reading frames that have been incorrectly
predicted, and the sequence may have been hit by chance. A further database search reveals
that 24 out of the 100 proteins are matched with a sequence coverage of less than 5%, so
these may not be true matches.
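The three counts used in this analysis (matching entries, distinct spots, and entries with low sequence coverage) can be derived from a flat list of identification records. The sketch below assumes a simple record layout with hypothetical field names; it does not reflect the RAPAD schema, and the data are invented for illustration.

```python
def summarise_hits(records, name_term="hypothetical", min_coverage=5.0):
    """Count entries whose protein name contains name_term.

    Returns a dict with the number of matching entries, the number of
    distinct spots they come from, and how many of the entries have a
    sequence coverage (in percent) below min_coverage.
    """
    hits = [r for r in records if name_term in r["protein_name"].lower()]
    return {
        "entries": len(hits),
        "distinct_spots": len({r["spot_id"] for r in hits}),
        "low_coverage": sum(1 for r in hits if r["coverage_pct"] < min_coverage),
    }


# Invented records; field names are assumptions.
records = [
    {"spot_id": 1, "protein_name": "hypothetical protein", "coverage_pct": 3.0},
    {"spot_id": 1, "protein_name": "Conserved hypothetical protein", "coverage_pct": 22.0},
    {"spot_id": 2, "protein_name": "Hypothetical protein", "coverage_pct": 4.9},
    {"spot_id": 3, "protein_name": "arginine kinase", "coverage_pct": 40.0},
]
print(summarise_hits(records))
```

The true number of distinct hypothetical proteins then lies between the distinct-spot count and the entry count, as argued above.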
Figure 7.12: The interface for publishing T. brucei proteome data. The initial page displays images of gels that are stored and the number of identified proteins on each gel. A list of proteins can be generated and individual records can be displayed.
Figure 7.13: A search using the Gel Viewer reveals 100 proteins, annotated as “hypothetical”.Gel image courtesy of A. Faldas.
The protein sequences hit by MS data were obtained for 88 out of the 100 sequences. The
other 12 sequences could not be obtained because the ID numbers that are stored in RAPAD
have changed in GeneDB, and there is no link to the current record. This is a major problem
for researchers working with genome sequences before they are complete, because the temporary
ID numbers assigned to proteins are later deleted, and databases often do not
maintain an archive of previous identifiers. For the spots that do not link to a current
GeneDB record, the MS searches must be repeated, which is time consuming. There will
be an option in the next release of RAPAD to perform repeated MS searches automatically.
Of the 88 sequences, there were 57 distinct protein sequences that have been matched. A
piece of software was written that matches the peptide sequences hit by mass spectrometry
results to the protein sequences, to determine which spots on the gel matched which protein
sequence. It was discovered that ten proteins have been matched to more than
one spot, accounting for 33 spots in total. The diagram in Figure 7.13 displays all the spots
that have been annotated as matching a protein whose name contains the word hypothetical.
Many of the proteins lie at the bottom of the gel, indicating possible protein fragments that
may have been matched to short protein sequences in the database. Several of the database
sequences annotated as hypothetical are very short, of which the shortest contains only 39
Figure 7.14: The protein spots that have been matched to different hypothetical proteins. The spots with the same colour label have been matched to the same database sequence. The three boxed regions (labelled 1, 2 and 3) are discussed further in the text. Gel image courtesy of A. Faldas.
amino acids, and is very unlikely to be a correctly predicted protein.
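The matching step described above, which determines which spots on the gel matched which protein sequence, can be sketched as follows. This is an assumed reconstruction for illustration, not the software written for the study: the data layout, function names, spot IDs and sequences are all hypothetical, and a spot is linked to a protein if any of its peptides occurs exactly in that sequence.

```python
from collections import defaultdict


def spots_per_protein(spot_peptides, proteins):
    """Map each protein ID to the sorted list of spots whose peptides hit it.

    spot_peptides: {spot_id: [peptide, ...]}
    proteins:      {protein_id: sequence}
    """
    matches = defaultdict(set)
    for spot_id, peptides in spot_peptides.items():
        for prot_id, seq in proteins.items():
            if any(pep in seq for pep in peptides):
                matches[prot_id].add(spot_id)
    return {prot_id: sorted(spots) for prot_id, spots in matches.items()}


def multi_spot_proteins(matches):
    """Keep only the proteins that were matched to more than one spot."""
    return {p: s for p, s in matches.items() if len(s) > 1}


# Toy data only.
proteins = {"hypA": "MKTAYIAKQRQISFVK", "hypB": "MSDLKVVTGAAPLEELR"}
spot_peptides = {1: ["AYIAK"], 2: ["QISFVK", "VVTGAAPLEELR"], 3: ["WWWW"]}
matches = spots_per_protein(spot_peptides, proteins)
print(matches)
print(multi_spot_proteins(matches))
```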
There are ten hypothetical proteins that have been matched to more than one gel spot.
Groups of spots matched to the same protein are displayed in a particular colour on Figure
7.14. Three pairs of spots that have been matched to three different hypothetical proteins
have been highlighted for further study because they reside in the middle of the gel, and are
therefore unlikely to be protein fragments. Furthermore, two spots matched to one protein, located
next to each other, are unlikely to be incorrect matches because the probability of two
adjacent spots independently matching the same sequence is low. However, it is still possible
that an incorrect match could be made to a short “hypothetical” protein sequence in the
database if there were two spots containing the same protein that had a peptide that matched
the hypothetical protein by chance.
Spot group 1
The spots marked 1 in Figure 7.14 (Spot IDs 313 and 275) are both fairly strongly matched
to a 438 amino acid protein, annotated as “Conserved hypothetical protein”. The left spot
contains only this protein; the right spot is predicted to match five different proteins: ATPase
Figure 7.15: Four spots containing arginine kinase. The MS results for spots 575 and 535 reveal possible modifications. Gel image courtesy of A. Faldas. [The figure includes MS search result tables with the columns Query, Observed, Mr(expt), Mr(calc), Delta, Miss, Score, Expect, Rank and Peptide.]
Initiation factor
There are four spots that have been strongly matched to eukaryotic initiation factor 5 (Fig-
ure 7.16). Of these, Spot 575 contains both initiation factor and arginine kinase by chance.
A deamidation has been observed for the match to initiation factor protein. Spot 554 has
also been predicted to have undergone deamidation. The spots are all likely to have slight
differences in the chemical sidechains, causing the four different spots to appear. A deami-
dation causes a slight change in mass and an alteration in the charge of the protein but it is
likely that there are other modifications that are not observed in the MS data, which cause
the different spots to appear.
7.3.4 Results Summary
The investigations into multiple protein products demonstrate the core functionality of RA-
PAD. RAPAD supports the finding and visualisation of spots that have been identified as the
same protein. Additional software was developed alongside RAPAD to determine the range
of peptides that were matched in mass spectrometry results, and to provide a visualisation
of the clusters. The visualisation software highlighted some unusual results for the tubulin
Figure 7.16: There are four spots that match initiation factor 5, of which possible modifications were found for spots 554 and 575. Gel image courtesy of A. Faldas.
proteins and, coupled with the multiple alignments, should improve annotation of Elongation
Factor sequences. The visualisation of heat shock protein 70 results indicates that there are
at least five different gene sequences from which Hsp70 proteins are expressed in the sample.
The visualisation makes it clear that only very short spans of peptides are present in spots
at the base of the gel, indicating that they are protein fragments. It is an area for future
investigation to determine if these are biologically meaningful, or experimental artifacts.
The analysis demonstrates a strong correlation between spots that are proximally located
and the span of peptide hits, even for spots that are not fragments but probably contain full
length proteins. An investigation was also carried out to verify that sequences annotated as
hypothetical proteins in the genome database were real proteins identified in the proteome
study. Three proteins were analysed in detail, of which two of the sequences are likely to be
real proteins, but a definitive function cannot be assigned at this time. The other protein
appears to be an ATPase in T. brucei, and the next version of the genome database should
update this annotation. Finally, a search for PTMs within MS data was undertaken and
several potential sites were found. There are major limitations with the method of searching
for PTMs and therefore other experiments are required to confirm modifications. The issues
raised by the results are discussed in Section 7.4.1.
7.4 Discussion
The annotation of an organism’s genome is a major challenge once sequencing is nearing
completion. The usual method to assign functions to newly sequenced genes is to apply
computational methods to find proteins in other organisms that are homologous and have
a functional assignment. After this initial stage, the slow process begins of performing
laboratory investigations to determine the mechanism of action of proteins, and to search
for the proteins that are important in disease. The biological goals of the trypanosome
project are to catalogue all the expressed proteins that can be found by various methods.
In particular, the proteome study is able to verify that genes annotated as “hypothetical
proteins” are expressed in the particular cell line. The project also aims to shed light on the
number of different forms of proteins that are found.
The core functionality of RAPAD has aided the management of large volumes of data
for the proteome investigation. This has been facilitated by the feature that allows bulk
uploads of data, enabling the protein identifications to be moved easily from the previously
used method (spreadsheets). This reduces the overhead of manual data entry, which is time
consuming and error prone. The database query facilities allow the data set to be searched
and filtered, which is important for large data sets. In Section 7.2, a series of questions was
outlined that RAPAD may be able to solve, which are answered here.
Q. Can the time and labour to identify proteins be reduced?
The RAPAD Querier allows researchers to verify which proteins have been strongly or weakly
matched, and there is a facility for loading very large amounts of protein data in bulk, into the
system. However, in the current implementation there is no automated pipeline for moving
raw mass spectrometry data to the MASCOT server, and placing the results of searches in
RAPAD. This feature will be considered for the next version of the database (Section 7.4.4).
Q. How many different proteins can be identified from 2500 spots?
In the system at the present time there are almost 1000 identified proteins for 650 spots,
across three gels. In the previous chapter, the combination of data across replicate gels was
discussed, therefore the system should easily scale up to 2500 spots and many more.
Q. How widespread and common are post-translational modifications?
The additional investigations into the causes of multiple spots that match the same protein
demonstrate that post-translational modifications (PTMs) are very common for some pro-
teins. The software was also able to demonstrate that many of the spots near the base of the
gel are almost certainly fragments of proteins, and are not caused by PTMs. A search of MS
results to confirm types of modification did not reveal any significant results, demonstrating
that more biological investigations are required.
Q. How can we improve the T. brucei genome annotation?
The interface for publishing data allows the genome database to connect to records in RA-
PAD, which verify the existence of proteins. The analysis reported in this chapter will aid
the annotation of several groups of genes, summarised in Section 7.4.1.
Q. Can we build a “point and click” virtual 2D gel?
The Gel Viewer provides this facility by dynamically linking the spots on the gel to individual
records for each protein. The results of complex queries in RAPAD can also be visualised in
the Gel Viewer, providing a system for data analysis and management that is more powerful
than the facilities offered by commercial image analysis applications.
Q. Can we build pages that give original MS data interpretations?
One feature that has not been employed at this time in RAPAD is to automate repeated
searching of MS data, for example to search for different types of PTM that could be found
within the data. A number of searches have been performed manually in MASCOT to find
modifications on peptides; however, very few positive results have been obtained. Therefore,
there may be limited benefits in implementing an automated search at this time. The graphic
showing peptide hits within protein sequences is a novel visualisation of MS results, and is
discussed further in Section 7.4.2.
The database does not store raw MS data in the present implementation due to the size of
the files and the fact that the raw data is in a proprietary format that can only be interpreted
with software that is installed on a few terminals, which is a major drawback for re-analysis
of data. The next version of RAPAD may include an automated system for analysing MS
data, similar to the SASHIMI software [278] developed at the Institute for Systems Biology,
in the proteomics group headed by Ruedi Aebersold. SASHIMI is open source software that
aims to improve the downstream analysis of MS data. It comprises an application that
converts raw MS data, from any of the instruments that are available, into a single XML-
based format that can be analysed with a number of software packages to standardise the
identification of proteins.
7.4.1 Improving the annotation of genes
The additional investigations identified several sequences in the genome database, which may
have been incorrectly predicted. The study also discovered that there are several different
proteins with highly related, but not identical sequences, which have the same protein name.
It is likely that the protein families were formed by relatively recent gene duplication events,
and the function of these protein families may be redundant. However, it is also possible that
different members of the family perform slightly different roles. For example, the finding that
up to five different proteins, annotated as heat shock protein 70, are strongly expressed raises
the possibility that all of the different forms are functionally significant. It is believed that
Hsps may be important when trypanosomes infect the mammalian host, so the current
naming strategy for these proteins is clearly inadequate. Gene annotation is not yet
finalised for T. brucei, and we therefore believe that the Hsp genes should have a suffix
on the name that uniquely identifies each one, for example the chromosome position of each
gene, plus a letter if more than one sequence resides on the same arm of a chromosome, e.g.
Hsp70 (11 p a).
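The proposed naming scheme can be made concrete with a short sketch. The gene identifiers and arm assignments below are invented for illustration, and the function name is an assumption; letters are assigned in input order and the sketch supports at most 26 genes per chromosome arm.

```python
from collections import defaultdict
import string


def suffixed_names(base_name, genes):
    """Assign names of the form 'Hsp70 (11 p a)'.

    genes: list of (gene_id, chromosome, arm) tuples.
    A letter is appended only when more than one gene lies on the
    same arm of the same chromosome.
    """
    groups = defaultdict(list)
    for gene_id, chromosome, arm in genes:
        groups[(chromosome, arm)].append(gene_id)

    names = {}
    for (chromosome, arm), gene_ids in groups.items():
        for i, gene_id in enumerate(gene_ids):
            letter = f" {string.ascii_lowercase[i]}" if len(gene_ids) > 1 else ""
            names[gene_id] = f"{base_name} ({chromosome} {arm}{letter})"
    return names


# Hypothetical genes and positions.
genes = [("geneA", 11, "p"), ("geneB", 11, "p"), ("geneC", 9, "q")]
print(suffixed_names("Hsp70", genes))
```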
The analysis reveals that most proteins near the bottom of the gel, in the small molecular
weight range, have very short spans of peptides matched, indicating that these proteins are
likely to be fragments caused either experimentally, or in vivo. It is possible that these
spots do not have great biological significance. The visualisation software highlighted an
unexpected result for both β-tubulin and α-tubulin. It was observed that several spots near
the base of the gel, which would be predicted to have a low molecular weight, matched
peptides from the two ends of the protein sequence. There are several possible explanations
for this result, one of which is that splicing occurs at the level of mRNA, resulting in a
protein that is formed from the two ends of the gene sequence. The evidence presented
here is far from sufficient to confirm this hypothesis but it is still open for discussion how
these spots arose. Additional experiments are required to investigate the result, for example
by performing MS/MS to sequence as many peptides as possible to determine the exact
constituents of the protein spot.
An interesting finding from the visualisation is that proteins of the same name, in the
same region of the gel, tend to have a similar span of peptide hits. It might be expected that
the distribution of peptides matched from the same protein would be fairly similar for all
spots regardless of their position on the gel, only subject to random variation in ionisation
and detection of peptides. It is known that certain chemical modifications, such as
phosphorylation, cause peptides to ionise less well in MS. It could therefore be expected that
spots that have the same span of peptide hits have a shared set of modifications to the peptides
that are detected by MS. Small differences in the range of peptides matched between spots
located near each other could indicate the loss or gain of a modification. For example, if pro-
tein A matches peptides covering the range 50-80 amino acids in the sequence, and protein
B matches peptides covering 50-95, this may indicate that protein A has an additional phos-
phate group on the peptide from position 81-95, preventing its detection by MS. However,
there is also a technical explanation for the correlation between peptide span and spot posi-
tion. Spots closely co-located are more likely to have been included in the same MS run and,
as ionisation efficiency is highly variable, spots on the same MALDI plate may be subject to
more similar ionisation conditions. This is an area that requires further investigation, such as
performing experiments with radioactive or fluorescently labelled phosphates, coupled with
the visualisation software, to determine the phosphorylation status of protein spots and the
span of peptide hits to verify if the peptides matched are related to modification status. Our
results indicate that there is a high correlation between the peptides detected by MS and a
protein’s position on a gel, and as far as we are aware, this has not been previously reported.
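The worked example given above (protein A covering residues 50-80 and protein B covering 50-95) amounts to taking the difference between two residue ranges. A minimal sketch, with a hypothetical function name and spans given as 1-based inclusive pairs:

```python
def uncovered_ranges(span_a, span_b):
    """Return the residue ranges covered by span_b but not by span_a.

    Spans are (start, end) tuples of 1-based inclusive positions.
    """
    a_start, a_end = span_a
    b_start, b_end = span_b
    ranges = []
    if b_start < a_start:
        ranges.append((b_start, min(b_end, a_start - 1)))
    if b_end > a_end:
        ranges.append((max(b_start, a_end + 1), b_end))
    return ranges


# Protein A covers 50-80, protein B covers 50-95.
print(uncovered_ranges((50, 80), (50, 95)))
```

The returned region is only a candidate location for a modification; as discussed above, differences in ionisation conditions between MS runs could produce the same pattern.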
7.4.2 Visualisation issues in the life sciences
In general, the visualisation of life sciences data requires significant further research, and
there are few examples of published work concerning investigations into best practice for
visualising large data sets. Software for biomedical applications is often created without
developers applying standard guidelines for graphical user interface design, leading to the
generation of systems that are not intuitive for users.
The visualisation of the span of peptide hits is a new method for viewing mass spectrom-
etry results on a 2-D gel. A similar approach could be adopted to view microarray results,
such as displaying the extent of the hybridization signal for different probes within each
feature on the array. The other use of the Gel Viewer reported in Section 7.3.2, in which
different colours are used to display the clusters of spots that match the same protein, is a
standard method for summarising complex data, and could potentially be used to display a
variety of functional genomics data.
The visualisation software displaying the span of peptide hits will be included in the next
release of RAPAD and could be adapted to show other facets of the mass spectrometry
results. The height of the bar could be used to indicate the e-value or score assigned by the
search software. The software could also be adapted to display different proteins that have
been matched to the same gel spot, using different shading of bands on the spot label.
7.4.3 Analysis of modifications
On the 2-D gel, there are several clusters of spots that match the same protein, which are
likely to be the result of different PTMs causing slight changes in the mass or charge of the
protein. A search was performed on MS data to confirm the modifications but this only
revealed a few results that are not highly significant. The main problems are that many
proteins are identified by only a small proportion of the total peptides in the sequence and
the majority of the modifications will not be observed. This issue was discussed in more
detail in the previous chapter but the additional visualisations presented in this chapter
could ultimately help to find and display modifications. The graphic showing the peptide
that had been matched could be modified to display more detailed information. For example,
the labelling bars next to spots could display the peptides that have been matched along the
length of the protein, with a graphic showing possible modification sites along the protein
sequence, obtaining the sites from an in silico analysis of the protein sequence, or a database
of known modifications. If a particular peptide, that was detected in one spot and not
another, had a known modification site, this could provide evidence that the peptide had
been modified in one of the spots. The major hindrance to this effort is that there are
no major databases of modifications, even though there is a very large amount of research
that has been performed over several decades identifying modifications. It is hoped that the
future integration of RAPAD into GUS will allow researchers to publish and distribute data
about modifications to the wider research community.
7.4.4 Future work
The proteome map of T. brucei comprises 2-DE and MS derived data. It is planned that
other techniques, such as LC-MS (reported in Chapter 1), will be used to generate even
greater volumes of protein data. The RAPAD database schema has capabilities for storing
this kind of information but the web pages have not yet been created for data entry or the
visualisation of results. A major issue will be the integration of this data with the gel based
studies. In the near future the data must also be made available to the curators of GeneDB
to enable improvements to the annotation of genes. The long term goals are to integrate
the proteome part of RAPAD into GUS, which will enable the proteome data to be stored
directly within GeneDB.
7.5 Conclusions
The core functionality of RAPAD has greatly improved the data management facilities for
the Trypanosoma brucei proteome project by enabling queries over the large data set to
find proteins of interest. Additional investigations have been performed on several groups
of proteins that appear abundantly on 2-D gels, for which the genome annotation is poor.
The results demonstrate one way in which experimental data, coupled with bioinformatics
analysis, can find protein sequences that have been incorrectly predicted. The visualisation
of results in new ways could be applied to proteome data from any organism and would aid
the annotation of newly sequenced genomes. The large data set generated by the T. brucei
investigation also demonstrates the scalability of the current implementation of RAPAD.
Appendix: Alignment of Hsp70 sequences
A multiple alignment has been performed with ClustalW on twelve sequences predicted to
match heat shock protein 70. Five sequences are from the MS results matched by proteins
in the T. brucei proteome map, and seven sequences are from the current version of the T.
brucei genome database (Section 7.3.1). Tb09.160.3090 and Tb07.29K4.60 are considerably
longer than the other sequences, and align poorly. Therefore, they may have been incorrectly
predicted or if correctly predicted, should be named as a different heat shock protein e.g.
references TREATMENT (TREATMENT_ID) not deferrable
/
Appendix D
Modelling and database storage of
difference gel data
D.1 Introduction
The focus of the thesis is to improve technology for the management and sharing of proteome
data arising from 2-DE and MS. Chapter 1 reports three case studies: (i) a host-parasite
interaction study, (ii) the study of changes in the proteome of cell culture with a knock-
out of the gene Raf-1, and (iii) the determination of the proteome of Trypanosoma brucei.
The three case studies were used to inform the development of Gla-PSI and the data from
case studies (i) and (iii) are stored in RAPAD, as reported in Chapters 6 and 7. Case
study (ii), performed at the Beatson Institute, focused on a difference gel electrophoresis
(DIGE) experiment to find differentially expressed proteins. However, the data from the
DIGE study did not become available for inclusion in RAPAD due to technical difficulties
with the experimental setup. DIGE is becoming a major technique in proteomic analysis
because it allows more accurate determination of relative protein volume between two or
more study groups than standard gel electrophoresis. As case studies (i) and (iii) utilised
standard 2-DE analysis, the main data sets used for testing the technology we developed
did not include DIGE data. The purpose of this Appendix is to demonstrate that Gla-PSI,
FGE-OM and RAPAD are capable of representing DIGE data.
Chapter 6 describes a study of the proteome of host cells when invaded with a para-
site compared with non-invaded host cells, measured using standard 2-DE. The experiments
have recently been extended to study the proteome using the DIGE technique. The follow-
ing section (Section D.1.1) briefly describes the experimental methodology and Section D.2
illustrates how such DIGE data can be represented in Gla-PSI. Section D.3 describes how
the same experiment can be captured in FGE-OM. The data has recently been added to
RAPAD, as described in Section D.4, and can be viewed within the Gel Viewer.

Replicate   Cy2   Cy3    Cy5
1           S     Inf1   Non1
2           S     Inf2   Non2
3           S     Non3   Inf3
4           S     Non4   Inf4

Table D.1: Experimental plan for Cy labelling of proteins in the DIGE experiment with Toxoplasma gondii. S = pooled sample from all eight replicates, Inf1 = infected sample replicate 1, Non1 = non-infected sample replicate 1.
D.1.1 Host-parasite responses
This section briefly outlines a study to elucidate the changes in the proteome
of a human cell culture when invaded with a parasite, compared with non-invaded cells.
The study was performed by Morag Nelson, a PhD student, in the laboratory of Dr Jonathan
Wastling at the Institute of Biomedical and Life Sciences, University of Glasgow.
The DIGE investigation accompanies the standard 2-DE studies described
in Chapter 6. RAPAD aids the information retrieval task, the combination of data across
replicate gels and the comparison with microarray results. The hypothesis of the
investigation and the generation of samples are detailed in Chapter 6; the
following experimental procedure was used for the DIGE analysis.
Four biological replicates were performed (four infected HFF samples versus four non-
infected). The samples were labelled with Cy dyes as shown in Table D.1. A fifth gel was
run with pooled material from the non-infected and infected samples. The fifth gel was used
for generating samples for mass spectrometry (MS) to identify the proteins. The gels were
scanned and the gel images were loaded into DeCyder™ [74] software. The software performs spot
matching across a series of gels and quantifies the difference in fluorescence, corresponding to
the relative abundance of a particular protein between the infected and non-infected samples.
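The labelling plan in Table D.1 can be written down as a small data structure, which makes the dye swap between replicates 1-2 and replicates 3-4 easy to check. The sketch below is illustrative only; the helper names are hypothetical and are not part of DeCyder or RAPAD.

```python
from collections import Counter

# Labelling plan transcribed from Table D.1 (S = pooled sample).
LABELLING_PLAN = {
    1: {"Cy2": "S", "Cy3": "Inf1", "Cy5": "Non1"},
    2: {"Cy2": "S", "Cy3": "Inf2", "Cy5": "Non2"},
    3: {"Cy2": "S", "Cy3": "Non3", "Cy5": "Inf3"},
    4: {"Cy2": "S", "Cy3": "Non4", "Cy5": "Inf4"},
}


def condition(sample):
    """Map a sample label from Table D.1 to its study condition."""
    if sample == "S":
        return "pooled"
    return "infected" if sample.startswith("Inf") else "non-infected"


def dye_usage(plan):
    """Count how often each condition is labelled with each dye."""
    counts = Counter()
    for dyes in plan.values():
        for dye, sample in dyes.items():
            counts[(dye, condition(sample))] += 1
    return counts


usage = dye_usage(LABELLING_PLAN)
for key in sorted(usage):
    print(key, usage[key])
```

Under this plan each condition is labelled twice with Cy3 and twice with Cy5, while the pooled sample is always labelled with Cy2.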
D.2 Gla-PSI
Gla-PSI is shown in Figure D.1 and the main classes that are used to store DIGE data are
boxed. Figure D.2 demonstrates how classes in Gla-PSI capture the parasitology experiment
described above. ExperimentDesign describes the purpose of the experiment (infected ver-
sus non-infected samples) and ExperimentParameters captures the replicates described in
[Figure D.1: Gla-PSI class diagram. Legend: new classes in the model; classes derived from MAGE or PEDRo. Class labels include IDEvidence, MassSpec, Database (version : String, URI : String) and Identifiable (identifier : String, name : String). Notes on the diagram: the stages preceding image analysis have been presented in the models MAGE (http://www.mged.org) and PEDRo (http://pedro.man.ac.uk). All classes are subclasses of Identifiable and Describable (not shown); therefore, all classes can have an identifier attached and be linked to annotation classes.]

Figure D.2: A DIGE experiment represented in Gla-PSI. The boxes represent classes in the model and the text in each box is a comment to describe the purpose of the class or example attributes and values. The lines indicate relationships between classes and the numbers represent the relative number of classes (cardinality) that participate in the relationship.
gel, which are represented in BioMaterial, and each BioMaterial is associated with an in-
stance of BioAssayTreatment. BioAssayTreatment is a superclass from which specific types
of treatment can inherit relationships. In this case, an instance of Gel2D is used to capture
the details of the two-dimensional separation. BioAssayTreatment is associated with the
PhysicalBioAssay class that is used for linking various classes together. ImageAcquisition
(scanning) is linked to a source PhysicalBioAssay and an output PhysicalBioAssay that
is related to the four images: three from scanning at the three fluorescence wavelengths and
one composite image (captured in Image, Channel and the image format in OntologyEntry).
The gel that is used for spot picking is modelled in the same way but the Channel class is
not required as the gel does not contain fluorescent labels and has not been scanned at a
particular wavelength. Gel image analysis is modelled by the general MAGE-OM derived
class FeatureExtraction and a more specific class GelImageAnalysis. The five different
gels are related to each other through the class MultipleAnalysis. MeasuredBioAssay re-
lates the image analysis event to the data via MeasuredBioAssayData (not shown). There
is a relationship to the class BioDataTuples that stores rows of gel spot data, with rela-
tionships to IdentifiedSpot and DIGESingleSpot. IdentifiedSpot represents composite
spot information (from the image combined across the three channels) and DIGESingleSpot
stores attributes of spots in the single channel images. Spots that are matched across more
than one gel, for example matched between the standard gel used for MS analysis and the
DIGE gels, are stored in MatchedSpots which is related to the class MultipleAnalysis.
IdentifiedSpot and DIGESingleSpot have various attributes that are measured by image
analysis such as relative volumes, ratios between the different channels and a spot’s co-
ordinates. Complete class diagrams showing all the attributes are displayed in Appendix
B.
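The relationship between composite spots, single-channel spots and spots matched across gels can be illustrated with a few Python data classes. This is a simplified sketch, not the FGE-OM definition: the attribute names and example values are assumptions, and the complete attribute lists are those given in Appendix B. The channel ratio is computed here for illustration, whereas in the model it is an attribute measured by the image analysis software.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DIGESingleSpot:
    """Spot measured in one single-channel image (attribute names are assumed)."""
    channel: str   # e.g. "Cy2", "Cy3" or "Cy5"
    volume: float


@dataclass
class IdentifiedSpot:
    """Composite spot from the image combined across the three channels."""
    spot_id: str
    x: float
    y: float
    single_spots: List[DIGESingleSpot] = field(default_factory=list)

    def ratio(self, numerator: str, denominator: str) -> float:
        """Ratio of the spot volumes measured in two channels."""
        volumes = {s.channel: s.volume for s in self.single_spots}
        return volumes[numerator] / volumes[denominator]


@dataclass
class MatchedSpots:
    """Spots judged to be the same protein spot on different gels."""
    members: List[IdentifiedSpot] = field(default_factory=list)


# Hypothetical spot on a DIGE gel, matched to a spot on the gel used for MS.
spot = IdentifiedSpot("gel1-0042", x=312.5, y=148.0, single_spots=[
    DIGESingleSpot("Cy2", 1000.0),
    DIGESingleSpot("Cy3", 1500.0),
    DIGESingleSpot("Cy5", 750.0),
])
match = MatchedSpots(members=[spot, IdentifiedSpot("prep-0007", x=310.0, y=150.2)])
print(spot.ratio("Cy3", "Cy5"))
```

The spot from the non-DIGE gel carries no single-channel spots, which mirrors the statement above that the Channel class is not required for the gel used for spot picking.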
D.4 RAPAD
The parasitology study described above has recently been entered into RAPAD. There are
17 gel images in total in the study, corresponding to four images from each of the four
DIGE gels (three different wavelengths and a composite image) plus a single gel image from
the prep gel used for MS identification. Figure D.4 displays screenshots of the prep gel
visualised in the Gel Viewer. The DeCyder™ software calculates the relative volume ratio
between the two study conditions (infected versus non-infected) across all four DIGE gels.
The ratio of volumes is stored in the IdentifiedSpot table, linked to the prep gel (stored in
Gel2D, ProteomeAssay, ImageAcquisition and GelImageAnalysis, described in Chapter
5). Organising the data in this way allows a simple visualisation of the proteins that are up
or down-regulated without the need to examine the entire series of images, because the Gel
Viewer allows the user to search on the spot volume. All the DIGE gel images
are stored in RAPAD within the same study and can also be loaded concurrently with the
prep gel in the Gel Viewer. Proteins with a volume ratio greater than zero are present in higher
abundance in non-infected cells, and those with a ratio less than zero are in higher abundance
in infected cells.

Figure D.3: A DIGE study represented in FGE-OM. The boxes represent classes in the model and the text in each box is a comment to describe the purpose of the class. The lines indicate relationships between classes and the numbers represent the relative number of classes (cardinality) that participate in the relationship for this experiment. Notes on the diagram: the picklist gel (non-DIGE) is stored in the same structures but does not require the Channel class; treatments produce 5 BioMaterials (4 for DIGE, 1 for the standard gel).
RAPAD contains microarray, standard 2-DE and DIGE data for HFF cells invaded with
T. gondii. This means that comparisons can be made between the level of gene expression
and protein abundance as measured by more than one technique. This allows for validation
of the experimental methodology, and the derivation of significant biological information
about the proteins modulated in response to parasite invasion of host cells.
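A query of the kind issued through the Gel Viewer can be sketched as follows. This is a minimal illustration using an in-memory SQLite database; the table layout, column names, example rows and the threshold of 1.5 are assumptions made for the example and do not reproduce the RAPAD schema. The sign convention follows the text above: a positive ratio indicates higher abundance in non-infected cells and a negative ratio indicates higher abundance in infected cells.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified stand-in for the IdentifiedSpot table.
conn.execute(
    "CREATE TABLE identified_spot ("
    "spot_id TEXT PRIMARY KEY, protein TEXT, volume_ratio REAL)"
)
conn.executemany(
    "INSERT INTO identified_spot VALUES (?, ?, ?)",
    [
        ("s1", "protein A", 2.1),
        ("s2", "protein B", -1.8),
        ("s3", "protein C", 1.1),
        ("s4", "protein D", -3.4),
    ],
)


def spots_by_direction(conn, higher_in, threshold=1.5):
    """Return spot ids whose ratio exceeds the threshold in the given direction."""
    if higher_in == "non-infected":
        sql = ("SELECT spot_id FROM identified_spot "
               "WHERE volume_ratio >= ? ORDER BY spot_id")
        params = (threshold,)
    elif higher_in == "infected":
        sql = ("SELECT spot_id FROM identified_spot "
               "WHERE volume_ratio <= ? ORDER BY spot_id")
        params = (-threshold,)
    else:
        raise ValueError("higher_in must be 'infected' or 'non-infected'")
    return [row[0] for row in conn.execute(sql, params)]


print(spots_by_direction(conn, "non-infected"))
print(spots_by_direction(conn, "infected"))
```

Because the ratio is stored against the spots of the prep gel, a single query of this form is enough to select the spots of interest without loading every DIGE image.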
Figure D.4: Relative protein abundance data calculated from DIGE can be viewed in the Gel Viewer via the gel used for protein identification by MS. The user can query for proteins down-regulated (panel A) or proteins up-regulated (panel B) in the Gel Viewer.
Bibliography
[1] S. Abiteboul, S. Cluet, V. Christophides, T. Milo, G. Moerkotte, and J. Simeon. Querying Documents in Object Databases. Int. J. on Digital Libraries, 1:5–19, 1997.
[2] F. Achard, G. Vaysseix, and E. Barillot. XML, bioinformatics and data integration. Bioinformatics, 17:115–125, 2001.
[3] C. Adessi, C. Miege, C. Albrieux, and T. Rabilloud. Two-dimensional electrophoresis of membrane proteins: A current challenge for immobilized pH gradients. Electrophoresis, 18:127–135, 1997.
[4] R. Aebersold and M. Mann. Mass spectrometry-based proteomics. Nature, 422:198–207, 2003.
[5] Affymetrix. http://www.affymetrix.com/.
[6] J. W. Ajioka, J. M. Fitzpatrick, and C. P. Reitter. Toxoplasma gondii genomics: shedding light on pathogenesis and chemotherapy. Expert Rev Mol Med., 2001:1–19, 2001.
[7] F. Al-Shahrour, R. Diaz-Uriarte, and J. Dopazo. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 20:578–580, 2004.
[8] A. Alban, S. O. David, L. Bjorkesten, C. Andersson, E. Sloge, S. Lewis, and I. Currie. A novel experimental design for comparative two-dimensional gel analysis: two-dimensional difference gel electrophoresis incorporating a pooled internal standard. Proteomics, 3:36–44, 2003.
[9] J. Allen, H. M. Davey, D. Broadhurst, J. K. Heald, J. J. Rowland, S. G. Oliver, and D. B. Kell. High-throughput classification of yeast mutants for functional genomics using metabolic footprinting. Nat Biotechnol., 21:692–696, 2003.
[10] AllGenes: a web site providing access to an integrated database of known and predicted human and mouse genes. (version 6.0, 2003) Center for Bioinformatics, University of Pennsylvania. http://www.allgenes.org.
[11] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25:3389–3402, 1997.
[12] AmiGO. http://www.godatabase.org/.
[13] Analytical Information Markup Language (AnIML). http://animl.sourceforge.net/.
[16] ArrayExpress at the EBI. http://www.ebi.ac.uk/arrayexpress/.
[17] G. Arrizabalaga and J. C. Boothroyd. Role of calcium during Toxoplasma gondii invasion and egress. Int J Parasitol., 34:361–368, 2004.
[18] ASTM International. http://www.astm.org.
[19] M. P. Atkinson, L. Daynes, M. J. Jordan, T. Printezis, and S. Spence. An Orthogonally Persistent Java. SIGMOD Record, 25(4):68–75, 1996.
[20] G. Babnigg and C. S. Giometti. GELBANK: a database of annotated two-dimensional gel electrophoresis patterns of biological systems with completed genomes. Nucleic Acids Res., 32:D582–D585, 2004.
[21] A. Bahl, B. Brunk, R. L. Coppel, J. Crabtree, S. J. Diskin, M. J. Fraunholz, et al. PlasmoDB: The Plasmodium Genome Resource. An integrated database providing tools for accessing and analyzing mapping, expression, and sequence data (both finished and unfinished). Nucleic Acids Res., 30:87–90, 2002.
[22] P. G. Baker, C. A. Goble, S. Bechhofer, N. W. Paton, R. Stevens, and A. Brass. An Ontology for Bioinformatics Applications. Bioinformatics, 15:510–520, 1999.
[23] C. A. Ball, G. Sherlock, and H. Parkinson. An open letter to the scientific journals. Science, 298:539, 2002.
[24] C. A. Ball, G. Sherlock, and H. Parkinson. An open letter to the scientific journals. Bioinformatics, 18:1409, 2002.
[25] C. A. Ball, G. Sherlock, and H. Parkinson. An open letter to the scientific journals. The Lancet, 360:1019, 2002.
[26] M. P. Barrett. The fall and rise of sleeping sickness. The Lancet, 353:1113–1114, 1999.
[27] J. D. Barry. The relative significance of mechanisms of antigenic variation in African trypanosomes. Parasitology Today, 13:203–244, 1997.
[28] S. Bechhofer, I. Horrocks, C. Goble, and R. Stevens. OilEd: a reason-able ontology editor for the semantic web. In Proceedings of KI2001, Joint German/Austrian conference on Artificial Intelligence, pages 396–408, 2001.
[29] C. J. Beckers, J. F. Dubremetz, O. Mercereau-Puijalon, and K. A. Joiner. The Toxoplasma gondii rhoptry protein ROP 2 is inserted into the parasitophorous vacuole membrane, surrounding the intracellular parasite, and is exposed to the host cell cytoplasm. J Cell Biol., 127:947–961, 1994.
[30] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler. GenBank. Nucleic Acids Res., 31:23–27, 2003.
[31] T. Berners-Lee and J. Hendler. Nature Debates: Scientific publishing on the ‘semantic web’. http://www.nature.com/nature/debates/e-access/Articles/bernerslee.htm.
[32] BIND at Blueprint. http://www.blueprint.org/bind/bind.php.
[33] Bioinformatic Harvester, Collection of all human (non fragmented) SWALL proteins and their cross references to the major bioinformatic databases. http://harvester.embl.de/.
[34] BioJava. http://www.biojava.org.
[35] I. J. Blader, I. D. Manger, and J. C. Boothroyd. Microarray analysis reveals previously unknown changes in Toxoplasma gondii-infected human cells. J Biol Chem., 276:24223–24231, 2001.
[36] B. Blagoev, I. Kratchmarova, S. E. Ong, M. Nielsen, L. J. Foster, and M. Mann. A proteomics strategy to elucidate functional protein-protein interactions applied to EGF signaling. Nat Biotechnol., 21:315–318, 2003.
[37] B. Boeckmann, A. Bairoch, R. Apweiler, M. C. Blatter, A. Estreicher, E. Gasteiger, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31:365–370, 2003.
[38] S. Boldt, U. H. Weidle, and W. Kolch. The role of MAPK pathways in the action of chemotherapeutic drugs. Carcinogenesis, 23:1831–1838, 2002.
[39] S. Bowers and B. Ludascher. An Ontology-Driven Framework for Data Transformation in Scientific Workflows. In Proceedings of the International Workshop on Data Integration in Life Sciences, Lecture Notes in Computer Science, volume 2994, pages 1–16, 2004.
[40] Tim Bray. What is RDF? http://www.xml.com/pub/a/2001/01/24/rdf.html.
[41] A. Brazma, P. Hingamp, J. Quackenbush, G. Sherlock, P. Spellman, C. Stoeckert, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Genet., 29:365–71, 2001.
[42] A. Brazma, A. Robinson, G. Cameron, and M. Ashburner. One-stop shop for microarray data - Is a universal, public DNA-microarray database a realistic goal? Nature, 403:699–700, 2000.
[43] P. Buneman, M. Grohe, and C. Koch. Path Queries on Compressed XML. In Proceedings of 29th International Conference on Very Large Data Bases, Berlin, Germany, pages 141–152, 2003.
[44] Peter Buneman. Semistructured data. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 117–121, 1997.
[45] A. Burger, D. Davidson, and R. Baldock. Formalization of Mouse Embryo Anatomy. Bioinformatics, 20:259–267, 2004.
[46] E. Camon, M. Magrane, D. Barrell, D. Binns, W. Fleischmann, P. Kersey, et al. The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res., 13:662–672, 2003.
[47] D. Carlson. Modeling XML Applications with UML: Practical e-Business Applications. Addison-Wesley, 2001.
[48] S. Carr, R. Aebersold, M. Baldwin, A. Burlingame, K. Clauser, and A. Nesvizhskii. The need for guidelines in publication of peptide and protein identification data: Working Group on Publication Guidelines for Peptide and Protein Identification Data. Mol Cell Proteomics., 3:531–533, 2004.
[49] V. B. Carruthers. Host cell invasion by the opportunistic pathogen Toxoplasma gondii. Acta Trop., 81:111–122, 2002.
[50] J. I. Castrillo and S. G. Oliver. Yeast as a Touchstone in Post-genomic Research: Strategies for Integrative Analysis in Functional Genomics. J Biochem Mol Biol., 37:93–106, 2004.
[51] CellML. http://www.cellml.org/.
[52] S. Celniker, D. Wheeler, B. Kronmiller, J. Carlson, A. Halpern, S. Patel, et al. Finishing a whole-genome shotgun: Release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biol., 3:research0079.1–0079.14, 2002.
[53] Chagas disease information. The UNICEF-UNDP-World Bank-WHO Special Programme for Research and Training in Tropical Diseases. http://www.who.int/tdr/diseases/chagas/diseaseinfo.htm.
[54] K. H. Cheung, K. White, and J. Hager. YMD: A microarray database for large-scale gene expression analysis. In Proceedings of the American Medical Informatics Association Annual Symposium, pages 140–144, 2002.
[55] The Chipping Forecast. Supplement to Nat Genet., 21:1–60, 1999.
[56] S. Cho, S. G. Park, D. H. Lee, and B. Chul. Protein-protein Interaction Networks: from Interactions to Networks. J Biochem Mol Biol., 37:45–52, 2004.
[57] D. Christendat, A. Yee, A. Dharamsi, Y. Kluger, A. Savchenko, J. R. Cort, et al. Structural proteomics of an archaeon. Nat Struct Biol., 7:903–909, 2000.
[58] M. Clamp, D. Andrews, D. Barker, P. Bevan, G. Cameron, Y. Chen, et al. Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res., 31:38–42, 2003.
[59] J-M. Claverie. What If There Are Only 30,000 Human Genes? Science, 291:1255–1257, 2001.
[60] C. E. Clayton. Life without transcriptional control? from fly to man and back again. EMBO J., 21:1881–1888, 2002.
[61] A. M. Cohen, K. Rumpel, G. H. Coombs, and J. M. Wastling. Characterisation of global protein expression by two-dimensional electrophoresis and mass spectrometry: proteomics of Toxoplasma gondii. Int J Parasitol., 32:39–51, 2002.
[62] B. Cooper, N. Sample, M. J. Franklin, G. R. Hjaltason, and M. Shadmon. A Fast Index for Semistructured Data. In Proceedings of 27th International Conference on Very Large Data Bases, pages 341–350, 2001.
[63] Cprogramming.com - Your Resource for C++ Programming. http://www.cprogramming.com/.
[64] F. Crick. Central Dogma of Molecular Biology. Nature, 227:561–563, 1970.
[65] Database of Interacting Proteins (DIP). http://dip.doe-mbi.ucla.edu/.
[66] C. J. Date. An Introduction to Database Systems - Volume 1, 6th Edition. Addison-Wesley, 1995. DAT c 95:1 1.Ex.
[67] S. Davidson, J. Crabtree, B. Brunk, J. Schug, V. Tannen, G. C. Overton, and C. J. Stoeckert Jr. K2/Kleisli and GUS: Experiments in integrated access to genomic data sources. IBM Systems Journal, 40(2):512–531, 2001.
[68] S. B. Davidson, G. C. Overton, V. Tannen, and L. Wong. BioKleisli: A Digital Library for Biomedical Researchers. Int. J. on Digital Libraries, 1:36–53, 1997.
[69] T. N. Davis. Protein localization in proteomics. Curr Opin Chem Biol., 8:49–53, 2004.
[72] The DDBJ/EMBL/GenBank Feature Table: Definition. http://www.ebi.ac.uk/embl/Documentation/FT definitions/feature table.html.
[73] S. V. de Avalos, I. J. Blader, M. Fisher, J. C. Boothroyd, and B. A. Burleigh. Immediate/Early Response to Trypanosoma cruzi Infection Involves Minimal Modulation of Host Cell Transcription. J. Biol. Chem., 277:639–644, 2002.
[74] DeCyder™ published by Amersham Biosciences. http://www.apbiotech.com/.
[75] The definition of Document Type Definition (DTD). http://www.w3.org/TR/REC-html40/sgml/dtd.html.
[76] J. DeRisi, L. Penland, P. O. Brown, M. L. Bittner, P. S. Meltzer, M. Ray, Y. Chen, Y. A. Su, and J. M. Trent. Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat Genet., 14:457–460, 1996.
[77] A. Deutsch, M. Fernandez, and D. Suciu. Storing semistructured data with STORED. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 431–442, 1999.
[78] M. Diehn, G. Sherlock, G. Binkley, H. Jin, J. C. Matese, and T. Hernandez-Boussard. SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res., 31:219–223, 2003.
[79] H. Dlugonska, K. Dytnerska, G. Reichmann, S. Stachelhaus, and H. G. Fischer. Towards the Toxoplasma gondii proteome: position of 13 parasite excretory antigens on a standardized map of two-dimensionally separated tachyzoite proteins. Parasitol Res., 87:634–637, 2001.
[80] DNA Data Bank of Japan. http://www.ddbj.nig.ac.jp/.
[81] A. Doan, P. Domingos, and A. Levy. Learning Source Descriptions for Data Integration. In Proceedings of the International Workshop on The Web and Databases (WebDB), 2000.
[82] Document Object Model (DOM). http://www.w3.org/DOM/.
[83] A. W. Dowsey, M. J. Dunn, and G. Z. Yang. The role of bioinformatics in two-dimensional gel electrophoresis. Proteomics, 3:1567–1596, 2003.
[84] B. Edde, J. Rossier, J-P. LeCaer, F. Desbruyeres, F. Gros, and P. Denoulet. Post-translational glutamylation of alpha-tubulin. Science, 247:83–85, 1990.
[85] R. Edgar, M. Domrachev, and A. E. Lash. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30:207–210, 2002.
[86] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A., 95:14863–14868, 1998.
[87] N. M. El-Sayed, E. Ghedin, J. Song, A. MacLeod, F. Bringaud, C. Larkin, et al. The sequence and analysis of Trypanosoma brucei chromosome II. Nucleic Acids Res., 31:4856–4863, 2003.
[88] The Electronic Statistics Textbook. http://www.statsoftinc.com/textbook/stathome.html.
[89] R. A. Elmasri and S. B. Navathe. Fundamentals of Database Systems, 3rd edition. Addison-Wesley, 2000.
[90] EMAP: The Edinburgh Mouse Atlas Project. http://genex.hgu.mrc.ac.uk/.
[91] The EMBL Nucleotide Sequence Database. http://www.ebi.ac.uk/embl/.
[96] Enterprise Architect v 4.1, published by Sparx Systems. http://www.sparxsystems.com.au/.
[97] Entrez, The Life Sciences Search Engine. http://www.ncbi.nih.gov/Entrez/.
[98] Ettan DIGE: Fluorescence 2D Difference Gel Electrophoresis. http://www.amershambiosciences.com/proteomics/dige/.
[99] T. Etzold, A. Ulyanow, and P. Argos. SRS: Information Retrieval System for Molecular Biology Data Banks. Methods Enzymol., 266:114–128, 1996.
[100] eVOC: The Human Gene Expression VOCabulary. http://www.sanbi.ac.za/evoc/.
[101] Extensible Markup Language (XML). http://www.w3c.org/XML/.
[102] J. B. Fenn, M. Mann, C. K. Meng, S. F. Wong, and C. M. Whitehouse. Electrospray ionization for mass spectrometry of large biomolecules. Science, 246:64–71, 1989.
[103] S. B. Ficarro, M. L. McCleland, P. T. Stukenberg, D. J. Burke, M. M. Ross, J. Shabanowitz, D. F. Hunt, and F. M. White. Phosphoproteome analysis by mass spectrometry and its application to Saccharomyces cerevisiae. Nat Biotechnol., 20:301–305, 2002.
[104] T. Fiebig, S. Helmer, C-C. Kanne, G. Moerkotte, J. Neumann, R. Schiele, and T. Westmann. Anatomy of a native XML base management system. VLDB J., 11:292–314, 2002.
[105] O. Fiehn, J. Kopka, R. N. Trethewey, and L. Willmitzer. Identification of uncommon plant metabolites based on calculation of elemental compositions using gas chromatography and quadrupole mass spectrometry. Anal Chem., 72:3573–3580, 2000.
[106] H. I. Field, D. Fenyo, and R. C. Beavis. RADARS, a bioinformatics solution that automates proteome mass spectral analysis, optimises protein identification, and archives data in a relational database. Proteomics, 2:36–47, 2002.
[107] S. Fields and O. Song. A novel genetic system to detect protein-protein interactions. Nature, 340:245–246, 1989.
[108] A. Fire, S. Xu, M. K. Montgomery, S. A. Kostas, S. E. Driver, and C. C. Mello. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature, 391:806–811, 1998.
[109] G. Fischer, S. M. Ibrahim, G. A. Brockmann, J. Pahnke, E. Bartocci, H-J. Thiesen, P. Serrano-Fernandez, and S. Moller. Expressionview: visualization of quantitative trait loci and gene-expression data in Ensembl. Genome Biol., 4:R77, 2003.
[110] L. Florens, M. P. Washburn, J. D. Raine, R. M. Anthony, M. Grainger, J. D. Haynes, et al. A proteomic view of the Plasmodium falciparum life cycle. Nature, 419:520–526, 2002.
[111] D. Florescu and D. Kossmann. Storing and Querying XML Data using an RDMBS. IEEE Data Engineering Bulletin, 22:27–34, 1999.
[112] FlyBase: A Database of the Drosophila Genome. http://www.flybase.org.
[113] R. Fogh, J. Ionides, E. Ulrich, W. Boucher, W. Vranken, J. P. Linge, et al. The CCPN project: an interim report on a data model for the NMR community. Nat Struct Biol., 9:416–418, 2002.
[114] A. Freier, R. Hofestadt, M. Lange, U. Scholz, and A. Stephanik. BioDataServer: A SQL-based service for the online integration of life science data. In Silico Biol., 2:37–57, 2002.
[115] B. Futcher, G. I. Latter, P. Monardo, C. S. McLaughlin, and J. I. Garrels. A sampling of the yeast proteome. Mol Cell Biol., 19:7357–7368, 1999.
[116] M. Gail, U. Gross, and W. Bohne. Transcriptional profile of Toxoplasma gondii-infected human fibroblasts as revealed by gene-array hybridization. Mol Genet Genomics., 265:905–912, 2001.
[117] M. Y. Galperin. The Molecular Biology Database Collection: 2004 update. Nucleic Acids Res., 32, Database issue:D3–D22, 2004.
[118] H. Garcia-Molina, J. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice Hall, 2002.
[119] M. Gardiner-Garden and T. G. Littlejohn. A comparison of microarray databases. Brief. Bioinformatics, 2:143–158, 2001.
[120] A. C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, and A. Bauer. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415:141–147, 2002.
[129] S. Gharbi, P. Gaffney, A. Yang, M. J. Zvelebil, R. Cramer, M. D. Waterfield, and J. F. Timms. Evaluation of two-dimensional differential gel electrophoresis for proteomic expression analysis of a model breast cancer cell system. Mol Cell Proteomics., 1:91–98, 2002.
[130] M. Girolami and R. Breitling. Biologically valid linear factor models of gene expression. Bioinformatics, 20:3021–3033, 2004.
[131] G. V. Gkoutos, P. Murray-Rust, H. S. Rzepa, and M. Wright. Chemical markup, XML and the World-Wide Web. 3. Toward a signed semantic chemical web of trust. J Chem Inf Comput Sci., 41:1124–1130, 2001.
[132] The Global Grid Forum (GGF). http://www.gridforum.org/.
[133] C. A. Goble. The Semantic Web: A Killer App for AI? In Artificial Intelligence: Methodology, Systems, and Applications, 10th International Conference, AIMSA 2002, Varna, Bulgaria, pages 274–278, 2002.
[134] J. Gollub, C. A. Ball, G. Binkley, J. Demeter, D. B. Finkelstein, J. M. Hebert, et al. The Stanford Microarray Database: data access and quality assessment tools. Nucleic Acids Res., 31:94–96, 2003.
[135] A. Gorg, C. Obermaier, G. Boguth, A. Harder, B. Scheibe, R. Wildgruber, and W. Weiss. The current state of two-dimensional electrophoresis with immobilized pH gradients. Electrophoresis, 21:1037–1053, 2000.
[136] A. Gorg, W. Postel, and S. Gunther. The current state of two-dimensional electrophoresis with immobilized pH gradients. Electrophoresis, 9:531–546, 1988.
[137] P. R. Graves and T. A. Haystead. Molecular biologist’s guide to proteomics. Microbiol Mol Biol Rev., 66:39–63, 2002.
[138] T. R. Gruber. A translation approach to portable ontologies. Knowledge Acquisition, 5:199–220, 1993.
[139] M. E. Guicciardi, J. Deussing, H. Miyoshi, S. F. Bronk, P. A. Svingen, C. Peters,S. H. Kaufmann, and G. J. Gores. Cathepsin B contributes to TNF-alpha-mediatedhepatocyte apoptosis by promoting mitochondrial release of cytochrome c. J ClinInvest., 106:1127–1137, 2000.
[140] K. Gull. The cytoskeleton of trypanosomatid parasites. Annu Rev Microbiol., 53:629–655, 1999.
[141] The GUS 3.0 schema. http://www.gusdb.org/cgi-bin/schemaBrowser.
[142] S. P. Gygi, B. Rist, S. A. Gerber, F. Turecek, M. H. Gelb, and R. Aebersold. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol., 17:994–999, 1999.
[143] S. P. Gygi, Y. Rochon, B. R. Franza, and R. Aebersold. Correlation between protein and mRNA abundance in yeast. Mol Cell Biol., 19:1720–30, 1999.
[144] L. M. Haas, P. M. Schwarz, P. Kodali, E. Kotlar, J. E. Rice, and W. C. Swope. DiscoveryLink: A system for integrated access to life sciences data sources. IBM Systems Journal, 40:489–511, 2001.
[145] J. G. Hacia, L. C. Brody, M. S. Chee, S. P. Fodor, and F. S. Collins. Detection of heterozygous mutations in BRCA1 using high density oligonucleotide arrays and two-colour fluorescence analysis. Nat Genet., 14:441–447, 1996.
[146] N. Hall, M. Berriman, N. J. Lennard, B. R. Harris, C. Hertz-Fowler, E. N. Bart-Delabesse, et al. The DNA sequence of chromosome I of an African trypanosome: gene content, chromosome organisation, recombination and polymorphism. Nucleic Acids Res., 31:4864–4873, 2003.
[147] G. J. Hannon. RNA interference. Nature, 418:244–251, 2002.
[148] P. M. Haverty, Z. Weng, N. L. Best, K. R. Auerbach, L. L. Hsiao, R. V. Jensen, and S. R. Gullans. Hugeindex: a database with visualization tools for high-density oligonucleotide array data from normal human tissues. Nucleic Acids Res., 30:214–217, 2002.
[149] S. Hennig, D. Groth, and H. Lehrach. Automated gene ontology annotation for anonymous sequence data. Nucleic Acids Res., 31:3712–3715, 2003.
[150] H. Hermjakob, L. Montecchi-Palazzi, G. Bader, J. Wojcik, L. Salwinski, A. Ceol, et al. The HUPO PSI’s molecular interaction format–a community standard for the representation of protein interaction data. Nat Biotechnol., 22:177–183, 2004.
[151] F. Hillenkamp and M. Karas. Mass spectrometry of peptides and proteins by matrix-assisted ultraviolet laser desorption/ionization. Methods Enzymol., 193:280–95, 1990.
[152] Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S. L. Adams, et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415:180–183, 2002.
[153] C. Hoogland, J. C. Sanchez, L. Tonella, P. A. Binz, A. Bairoch, D. F. Hochstrasser, and R. D. Appel. The 1999 SWISS-2DPAGE database update. Nucleic Acids Res., 28:286–288, 2000.
[154] I. Horrocks. DAML+OIL: a reason-able web ontology language. In Proceedings of EDBT 2002, number 2287 in Lecture Notes in Computer Science, pages 2–13, March 2002.
[155] M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, et al. The Systems Biology Markup Language (SBML): A Medium for Representation and Exchange of Biochemical Network Models. Bioinformatics, 19:524–531, 2003.
[156] HUGO Gene Nomenclature Committee (HGNC). http://www.gene.ucl.ac.uk/nomenclature/.
[157] W. K. Huh, J. V. Falvo, L. C. Gerke, A. S. Carroll, R. W. Howson, J. S. Weissman, and E. K. O’Shea. Global analysis of protein localization in budding yeast. Nature, 425:686–691, 2003.
[159] E. Hunt, E. Pafilis, I. Tulloch, and J. Wilson. Index-Driven XML Data Integration to Support Functional Genomics. In Proceedings of the International Workshop on Data Integration in Life Sciences, Lecture Notes in Computer Science, volume 2994, pages 95–109, 2004.
[160] HUP-ML format is available as a DTD (Document Type Definition). http://www1.biz.biglobe.ne.jp/~jhupo/HUP-ML/hup-ml.dtd.
[161] HUPO - The Human Proteome Organisation. http://www.hupo.org/.
[162] ImageMaster published by Amersham Biosciences. http://www.apbiotech.com/.
[163] Immunohistochemistry - In Situ Hybridization. http://home.no.net/immuno/.
[164] The International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409:860–921, 2001.
[165] R. Jansen and M. Gerstein. Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res., 28:1481–1488, 2000.
[166] Japanese Human Proteome Organisation (J-HUPO). http://www.jhupo.org/.
[167] Java 2 Platform, Standard Edition (J2SE), v1.4 Overview. http://java.sun.com/j2se/1.4/.
[168] Java Applet. http://java.sun.com/applets/.
[169] Java Technology. http://java.sun.com/.
[170] Java Web Start Technology. http://java.sun.com/products/javawebstart/.
[171] JavaScript.com™ - The Definitive JavaScript Resource. http://www.javascript.com/.
[172] O. N. Jensen. Modification-specific proteomics: characterization of post-translational modifications by mass spectrometry. Curr Opin Chem Biol., 8:33–41, 2004.
[173] T. K. Jenssen and E. Hovig. The semantic web and biology. Drug Discov Today., 7:992, 2002.
[174] A. Jones. A database for storing the results of 2D-PAGE experiments. Master’s thesis, University of Glasgow, 2001.
[175] A. Jones, E. Hunt, J. M. Wastling, A. Pizarro, and C. J. Stoeckert Jr. An object model and database for functional genomics. Bioinformatics, 20:1583–1590, 2004.
[176] A. Jones, J. Wastling, and E. Hunt. Proposal for a standard representation of two-dimensional gel electrophoresis data. Comp. Funct. Genom., 4:492–501, 2003.
[177] K. R. Jonscher and J. R. Yates 3rd. The quadrupole ion trap mass spectrometer–a small solution to a big challenge. Anal Biochem., 244:1–15, 1997.
[178] K. Kadota, D. Tominaga, R. Asai, and K. Takahashi. Correlation Analysis of mRNA and Protein Abundances in Human Tissues. Genome Lett., 2:139–148, 2003.
[179] D. E. Kalume, H. Molina, and A. Pandey. Tackling the phosphoproteome: tools and strategies. Curr Opin Chem Biol., 7:64–9, 2003.
[180] M. Karas and F. Hillenkamp. Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons. Anal Chem., 60:2299–2301, 1988.
[181] N. A. Karp, D. P. Kreil, and K. S. Lilley. Determining a significant change in protein expression with DeCyder™ during a pair-wise comparison using two-dimensional difference gel electrophoresis. Proteomics, 4:1421–1432, 2004.
[182] P. Karp, M. Riley, S. Paley, A. Pellegrini-Toole, and M. Krummenacker. EcoCyc: Electronic Encyclopedia of E. coli Genes and Metabolism. Nucleic Acids Res., 27:55–58, 1999.
[183] P. D. Karp. A strategy for database interoperation. J. Comput. Biol., 2:573–586, 1995.
[184] KEGG: Kyoto Encyclopedia of Genes and Genomes. http://www.genome.ad.jp/kegg/.
[185] K. Kim, D. Soldati, and J. C. Boothroyd. Gene replacement in Toxoplasma gondii with chloramphenicol acetyltransferase as selectable marker. Science, 262:911–914, 1993.
[186] K. Kim and L. M. Weiss. Toxoplasma gondii: the model apicomplexan. Int J Parasitol., 34:423–432, 2004.
[187] J. C. Kissinger, B. Gajria, L. Li, I. T. Paulsen, and D. S. Roos. ToxoDB: accessing the Toxoplasma gondii genome. Nucleic Acids Res., 31:234–236, 2003.
[188] H. Kitano. Systems Biology: A Brief Overview. Science, 295:1662–1664, 2002.
[189] T. G. Kleno, C. M. Andreasen, H. O. Kjeldal, L. R. Leonardsen, T. N. Krogh, P. F. Nielsen, M. V. Sorensen, and O. N. Jensen. MALDI MS peptide mapping performance by in-gel digestion on a probe with prestructured sample supports. Anal Chem., 76:3576–3583, 2004.
[190] A. Kumar, P. M. Harrison, K. H. Cheung, N. Lan, N. Echols, P. Bertone, P. Miller, M. B. Gerstein, and M. Snyder. An integrated approach for finding overlooked genes in yeast. Nat Biotechnol., 20:58–63, 2002.
[191] J. Lee, S. Nam, S. B. Hwang, M. Hong, J. Y. Kwon, K. S. Joeng, S. H. Im, J. Shim, and M. C. Park. Functional genomic approaches using the nematode Caenorhabditis elegans as a model system. J Biochem Mol Biol., 37:107–113, 2004.
[192] J-H. Lee, D-E. Lee, B-U. Lee, and H-S. Kim. Global Analyses of Transcriptomes and Proteomes of a Parent Strain and an L-Threonine-Overproducing Mutant Strain. J Bacteriol., 185:5442–5451, 2003.
[193] M. G. Lee. The 3’ untranslated region of the hsp 70 genes maintains the level of steady state mRNA in Trypanosoma brucei upon heat shock. Nucleic Acids Res., 26:4025–4033, 1998.
[194] M. L. Lee, L. H. Yang, W. Hsu, and X. Yang. XClust: clustering XML schemas for effective integration. In Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, pages 292–299, 2002.
[195] A. J. Link, J. Eng, D. M. Schieltz, E. Carmack, G. J. Mize, D. R. Morris, B. M. Garvik, and J. R. Yates 3rd. Direct analysis of protein complexes using mass spectrometry. Nat Biotechnol., 17:676–682, 1999.
[196] C. M. Lloyd, M. D. B. Halstead, and P. F. Nielsen. CellML: its future, present and past. Prog. Biophys. Mol. Biol., 85:433–450, 2004.
[197] G. W. Lubega, D. K. Byarugaba, and R. K. Prichard. Immunization with a tubulin-rich preparation from Trypanosoma brucei confers broad protection against African trypanosomosis. Exp Parasitol., 102:9–22, 2002.
[198] R. E. Lyons, R. McLeod, and C. W. Roberts. Toxoplasma gondii: tachyzoite to bradyzoite interconversion. Trends Parasitol., 18:198–201, 2002.
[199] Macromedia. http://www.macromedia.com/.
[200] P. Mahon and P. Dupree. Quantitative and reproducible two-dimensional gel analysis using Phoretix 2D Full. Electrophoresis, 22:2075–2085, 2001.
[201] H. Mamitsuka, Y. Okuno, and A. Yamaguchi. Mining biologically active patterns in metabolic pathways using microarray expression profiles. ACM SIGKDD Explorations Newsletter, 5:113–121, 2003.
[202] E. Manduchi, G. R. Grant, H. He, J. Liu, M. D. Mailman, A. D. Pizarro, P. L. Whetzel, and C. J. Stoeckert Jr. RAD and the RAD Study-Annotator: an approach to collection, organization and exchange of all relevant information for high-throughput gene expression studies. Bioinformatics, 20:452–459, 2004.
[203] M. Mann, R. C. Hendrickson, and A. Pandey. Analysis of proteins and proteomes by mass spectrometry. Annu. Rev. Biochem., 70:437–473, 2001.
[204] M. Mann and O. N. Jensen. Proteomic analysis of post-translational modifications. Nat Biotechnol., 21:255–261, 2003.
[205] A. G. Marshall, C. L. Hendrickson, and G. S. Jackson. Fourier transform ion cyclotron resonance mass spectrometry: a primer. Mass Spectrom Rev., 17:1–35, 1998.
[206] C. J. Marshall. Specificity of receptor tyrosine kinase signaling: transient versus sustained extracellular signal-regulated kinase activation. Cell, 80:179–185, 1995.
[207] MASCOT, published by Matrix Science. http://www.matrixscience.com.
[208] M. H. Maurer, C. Berger, M. Wolf, C. D. Futterer, R. E. Feldmann Jr., S. Schwab, and W. Kuschinsky. The proteome of human brain microdialysate. Proteome Sci., 1(7), 2003.
[209] S. M. Maurer, R. B. Firestone, and C. R. Scriver. Science’s neglected legacy. Nature, 405:117–120, 2000.
[210] Melanie3 published by GeneBio. http://www.GeneBio.com/Melanie.html.
[211] The MGED Ontology. http://mged.sourceforge.net/ontologies/MGEDontology.php.
[212] Microarray Gene Expression Data Society (MGED). http://www.mged.org/.
[213] Microsoft .NET Information. http://www.microsoft.com/net/.
[214] O. A. Mirgorodskaya, Y. P. Kozmin, M. I. Titov, R. Korner, C. P. Sonksen, and P. Roepstorff. Quantitation of peptides and proteins by matrix-assisted laser desorption/ionization mass spectrometry using (18)O-labeled internal standards. Rapid Commun Mass Spectrom., 14:1226–1232, 2000.
[215] B. Modrek, A. Resch, C. Grasso, and C. Lee. Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res., 29:2850–2859, 2001.
[217] M. P. Molloy. Two-Dimensional Electrophoresis of Membrane Proteins Using Immobilized pH Gradients. Anal Biochem., 280:1–10, 2000.
[218] The Mouse Anatomical Dictionary Browser. http://www.informatics.jax.org/searches/anatdict_form.shtml.
[219] N. J. Mulder, R. Apweiler, T. K. Attwood, A. Bairoch, D. Barrell, A. Bateman, et al. The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31:315–318, 2003.
[220] P. Murray-Rust, H. S. Rzepa, M. J. Williamson, and E. L. Willighagen. Chemical markup, XML, and the World Wide Web. 5. Applications of chemical metadata in RSS aggregators. J Chem Inf Comput Sci., 44:462–469, 2004.
[221] MySQL. http://www.mysql.com/.
[222] National Institute for Standards and Technology. http://www.nist.gov.
[223] C. Navarre, H. Degand, K. L. Bennett, J. S. Crawford, E. Mortz, and M. Boutry. Subproteomics: Identification of plasma membrane proteins from the yeast Saccharomyces cerevisiae. Proteomics, 12:1706–1714, 2002.
[224] The NCBI Taxonomy Homepage. http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/.
[226] W. Ni and T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. In Eighth International Conference on Database Systems for Advanced Applications (DASFAA), pages 363–370, 2003.
[227] J. K. Nicholson, J. Connelly, J. C. Lindon, and E. Holmes. Metabonomics: a platform for studying drug toxicity and gene function. Nat Rev Drug Discov., 1:153–161, 2002.
[228] M. Nilsson. The semantic web: How RDF will change learning technology standards, 2001. http://www.cetis.ac.uk/content/20010927172953.
[229] N. Nirmalan, P. F. G. Sims, and J. E. Hyde. Quantitative proteomics of the human malaria parasite Plasmodium falciparum and its application to studies of development and inhibition. Mol Microbiol., 52:1187–1199, 2004.
[230] N. F. Noy, R. W. Fergerson, and M. A. Musen. The knowledge model of Protege-2000: Combining interoperability and flexibility. In 12th International Conference on Knowledge Engineering and Knowledge Management, pages 17–32, 2001.
[231] The Object Management Group. http://www.omg.org/.
[232] OPD: Open Proteomics Database. http://bioinformatics.icmb.utexas.edu/OPD/.
[233] Open Biological Ontologies (OBO). http://obo.sourceforge.net/.
[234] Open Grid Services Architecture Data Access and Integration (OGSA-DAI). http://www.ogsadai.org.uk/.
[235] Oracle 9i. http://www.oracle.com/.
[236] S. Orchard, P. Kersey, H. Hermjakob, and R. Apweiler. The HUPO Proteomics Standards Initiative meeting: towards common standards for exchanging proteomics data. Comp Funct Genom, 4:16–19, 2003.
[237] S. Orchard, P. Kersey, W. Zhu, L. Montecchi-Palazzi, H. Hermjakob, and R. Apweiler. Progress in establishing common standards for exchanging proteomics data: The second meeting of the HUPO Proteomics Standards Initiative. Comp Funct Genom, 4:203–206, 2003.
[238] OWL Web Ontology Language. http://www.w3.org/TR/owl-features/.
[239] H. Papageorgiou, F. Pentaris, E. Theodoruou, M. Vardaki, and M. Petrakos. Modeling statistical metadata. In Proceedings of the 13th International Conference on Scientific and Statistical Database Management, pages 25–35, 2001.
[240] G. M. Pasinetti and L. Ho. From cDNA microarrays to high-throughput proteomics. Implications in the search for preventive initiatives to slow the clinical progression of Alzheimer’s disease dementia. Restor Neurol Neurosci., 18:137–142, 2001.
[241] N. W. Paton, R. Stevens, P. G. Baker, C. A. Goble, S. Bechhofer, and A. Brass. Query Processing in the TAMBIS Bioinformatics Source Integration System. In Proceedings 11th Int. Conf. on Scientific and Statistical Databases (SSDBM), pages 138–147, 1999.
[242] PEDRo (Proteomics Experiment Data Repository). http://pedro.man.ac.uk/.
[243] J. Peng, J. E. Elias, C. C. Thoreen, L. J. Licklider, and S. P. Gygi. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J Proteome Res., 2:43–50, 2003.
[244] C. A. Pereira, G. D. Alonso, H. N. Torres, and M. M. Flawia. Arginine kinase: a common feature for management of energy reserves in African and American flagellated trypanosomatids. J Eukaryot Microbiol., 49:82–85, 2002.
[245] M. Perrot, F. Sagliocco, T. Mini, C. Monribot, U. Schneider, A. Shevchenko, M. Mann, P. Jeno, and H. Boucherie. Two-dimensional gel protein database of Saccharomyces cerevisiae (update 1999). Electrophoresis, 20:2280–2298, 1999.
[257] The Proteomics Standards Initiative. http://psidev.sourceforge.net/.
[258] PSI-MS XML Data Format. http://psidev.sourceforge.net/ms/.
[259] S. Purvine, A. F. Picone, and E. Kolker. Standard mixtures for proteome studies. OMICS, 8:79–92, 2004.
[260] X. Que, H. Ngo, J. Lawton, M. Gray, Q. Liu, J. Engel, et al. The cathepsin B of Toxoplasma gondii, toxopain-1, is critical for parasite invasion and rhoptry protein processing. J Biol Chem., 277:25791–25797, 2002.
[261] The R Project for Statistical Computing. http://www.r-project.org/.
[262] RAD (RNA Abundance Database). http://www.cbil.upenn.edu/RAD/.
[263] J. C. Rain, L. Selig, H. De Reuse, V. Battaglia, C. Reverdy, S. Simon, et al. The protein-protein interaction map of Helicobacter pylori. Nature, 409:211–215, 2001.
[264] B. Raman, A. Cheung, and M. R. Marten. Quantitative comparison and evaluation of two commercially available, two-dimensional electrophoresis image analysis software packages, Z3 and Melanie. Electrophoresis, 23:2194–2202, 2002.
[265] W. D. Ransom, P-C. Lao, D. A. Gage, and W. F. Boss. Phosphoglycerylethanolamine Posttranslational Modification of Plant Eukaryotic Elongation Factor 1 α. Plant Physiol., 117:949–960, 1998.
[266] Rational Rose 2000e, published by Rational Software. http://www.rational.com/.
[267] S. Raychaudhuri, J. Stuart, and R. Altman. Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput., 5:455–66, 2000.
[268] M. Rebhan, V. Chalifa-Caspi, J. Prilusky, and D. Lancet. GeneCards: encyclopedia for genes, proteins and diseases. Weizmann Institute of Science, Bioinformatics Unit and Genome Center (Rehovot, Israel). http://bioinformatics.weizmann.ac.il/cards.
[270] G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, M. Mann, and B. Seraphin. A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol., 17:1030–1032, 1999.
[271] U. Roessner, C. Wagner, J. Kopka, R. N. Trethewey, and L. Willmitzer. Technical advance: simultaneous analysis of metabolites in potato tuber by gas chromatography-mass spectrometry. Plant J., 23:131–142, 2000.
[272] M. Rogers, J. Graham, and R. P. Tonge. Using statistical image models for objective evaluation of spot detection in two-dimensional gels. Proteomics, 3:879–86, 2003.
[273] D. S. Roos. Bioinformatics–trying to swim in a sea of data. Science, 291:1260–1261, 2001.
[274] J. Rumbaugh, I. Jacobson, and G. Booch. The Unified Modeling Language Reference Manual. Addison Wesley, 1999.
[275] L. H. Saal, C. Troein, J. Vallon-Christersson, S. Gruvberger, A. Borg, and C. Peterson. BioArray Software Environment: A Platform for Comprehensive Management and Analysis of Microarray Data. Genome Biol., 3:software0003.1–0003.6, 2002.
[276] F. Sanger, G. M. Air, B. G. Barrell, N. L. Brown, A. R. Coulson, C. A. Fiddes, C. A. Hutchison, P. M. Slocombe, and M. Smith. Nucleotide sequence of bacteriophage phiX174 DNA. Nature, 265:687–695, 1977.
[277] V. Santoni, S. Kieffer, D. Desclaux, F. Masson, and T. Rabilloud. Membrane proteomics: use of additive main effects with multiplicative interaction model to classify plasma membrane proteins according to their solubility and electrophoretic properties. Electrophoresis, 21:3329–3344, 2000.
[278] SASHIMI. http://sashimi.sourceforge.net/.
[279] SAX (Simple API for XML). http://sax.sourceforge.net/.
[280] R. A. Sayle and E. J. Milner-White. RasMol: Biomolecular graphics for all. Trends Biochem Sci., 20:374–376, 1995.
[282] D. G. Schmid, F. D. von der Mulbe, B. Fleckenstein, T. Weinschenk, and G. Jung. Broadband detection electrospray ionization Fourier transform ion cyclotron resonance mass spectrometry to reveal enzymatically and chemically induced deamidation reactions within peptides. Anal Chem., 73:6008–6013, 2001.
[283] A. Schneider, U. Plessmann, and K. Weber. Subpellicular and flagellar microtubules of Trypanosoma brucei are extensively glutamylated. J Cell Sci., 110:431–437, 1997.
[284] L. V. Schneider and M. P. Hall. Stable Isotope Methods for High-Precision Proteomics. Drug Discov Today., in press, 2005.
[285] J. Seo and K-J. Lee. Post-translational modifications and their biological functions: Proteomic analysis and systematic approaches. J Biochem Mol Biol., 37:35–44, 2004.
[286] The Sequence Ontology Project. http://song.sourceforge.net/.
[287] D. Shalon, S. J. Smith, and P. O. Brown. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Res., 6:639–645, 1996.
[288] J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton. Relational Databases for Querying XML Documents: Limitations and Opportunities. In Proceedings of 25th International Conference on Very Large Data Bases, pages 302–314, 1999.
[289] T. Sherwin, A. Schneider, R. Sasse, T. Seebeck, and K. Gull. Distinct localization and cell cycle dependence of COOH terminally tyrosinolated alpha-tubulin in the microtubules of Trypanosoma brucei brucei. J Cell Biol., 104:439–446, 1987.
[290] Y. Shi, R. Xiang, C. Horvath, and J. A. Wilkins. The role of liquid chromatography in proteomics. J Chromatogr A., 1053:27–36, 2004.
[291] L. D. Sibley. Intracellular Parasite Invasion Strategies. Science, 304:248–253, 2004.
[292] A. P. Sinai, T. M. Payne, J. C. Carmen, L. Hardi, S. J. Watson, and R. E. Molestina. Mechanisms underlying the manipulation of host apoptotic pathways by Toxoplasma gondii. Int J Parasitol., 34:381–391, 2004.
[293] Sir Henry Wellcome Functional Genomics Facility (SHWFGF), based in the University of Glasgow. http://www.gla.ac.uk/functionalgenomics/.
[294] D. H. Smith, J. Pepin, and A. H. Stich. Human African trypanosomiasis: an emerging public health crisis. Br Med Bull., 54:341–355, 1998.
[295] W. Smyth. Computing Patterns in Strings. Addison-Wesley, 2003.
[296] SourceForge.net: Project Info - Life Science Identifier (LSID). http://sourceforge.net/projects/lsid/.
[297] P. T. Spellman, M. Miller, J. Stewart, C. Troup, U. Sarkans, S. Chervitz, et al. Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol., 23, 2002. RESEARCH0046.
[298] Standards and Ontologies for Functional Genomics. http://www.sofg.org/.
[299] L. D. Stein. Integrating biological databases. Nat Rev Genet., 4:337–345, 2003.
[300] R. D. Stevens, A. J. Robinson, and C. A. Goble. myGrid: personalised bioinformatics on the information grid. Bioinformatics, 19:I302–I304, 2003.
[301] A. Stich, P. M. Abel, and S. Krishna. Human African trypanosomiasis. BMJ, 325:203–206, 2002.
[302] C. Stoeckert, A. Pizarro, E. Manduchi, M. Gibson, B. Brunk, J. Crabtree, J. Schug, S. Shen-Orr, and G. C. Overton. A relational schema for both array-based and SAGE gene expression experiments. Bioinformatics, 17:300–308, 2001.
[303] C. J. Stoeckert, H. C. Causton, and C. A. Ball. Microarray databases: standards and ontologies. Nat Genet., 32:469–473, 2002.
[304] C. J. Stoeckert and H. Parkinson. The MGED ontology: a framework for describing functional genomics experiments. Comp. Funct. Genom., 4:127–132, 2003.
[305] E. C. Strauss, J. A. Kobori, G. Siu, and L. E. Hood. Specific-primer-directed DNA sequencing. Anal Biochem., 154:353–360, 1986.
[306] L. W. Sumner, P. Mendes, and R. A. Dixon. Plant Metabolomics: Large-scale Phytochemistry in the Functional Genomics Era. Phytochemistry, 62:817–836, 2003.
[307] Sun Microsystems, Inc. http://www.sun.com/.
[308] Y. H. Sung, J. Song, and H-W. Lee. Functional Genomics Approach Using Mice. J Biochem Mol Biol., 37:122–132, 2004.
[309] SWISS-2DPAGE: Two-dimensional polyacrylamide gel electrophoresis database. http://ca.expasy.org/ch2d/.
[310] Swiss-Prot. http://www.expasy.ch/sprot/.
[311] The Systems Biology Markup Language. http://sbml.org/.
[312] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, and T. Golub. Interpreting gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A., 96:2907–2912, 1999.
[313] Tamino XML server. http://www.softwareag.com/tamino/.
[314] T. A. Tatusova, L. Karsch-Mizrachi, and J. A. Ostell. Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 15:536–543, 1999.
[315] C. F. Taylor, N. W. Paton, K. L. Garwood, P. D. Kirby, D. A. Stead, Z. Yin, et al. A systematic approach to modeling, capturing, and disseminating proteomics experimental data. Nat. Biotechnol., 21:247–254, 2003.
[316] S. W. Taylor, E. Fahy, B. Zhang, G. M. Glenn, D. E. Warnock, S. Wiley, et al. Characterization of the human heart mitochondrial proteome. Nat Biotechnol., 21:281–286, 2003.
[317] D. E. Terry and D. M. Desiderio. Between-gel reproducibility of the human cerebrospinal fluid proteome. Proteomics, 3:3, 2003.
[318] J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22:4673–4680, 1994.
[319] P. Toronen, M. Kolehmainen, G. Wong, and E. Castren. Analysis of gene expression data using self-organizing maps. FEBS Lett., 451:142–146, 1999.
[320] ToxoDB: The Toxoplasma Genome Resource. http://www.toxodb.org/.
[322] M. Tyers and M. Mann. From genomics to proteomics. Nature, 422:193–197, 2003.
[323] P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403:623–627, 2000.
[325] UniParc, The UniProt Archive. http://www.ebi.ac.uk/uniparc/.
[326] UniProt (Universal Protein Resource). http://www.uniprot.org.
[327] M. Unlu, M. E. Morgan, and J. S. Minden. Difference gel electrophoresis: a single gel method for detecting changes in cell extracts. Electrophoresis, 18:2071–2077, 1997.
[328] G. Van den Bergh, S. Clerens, F. Vandesande, and L. Arckens. Reversed-phase high-performance liquid chromatography prefractionation prior to two-dimensional difference gel electrophoresis and mass spectrometry identifies new differentially expressed proteins between striate cortex of kitten and adult cat. Electrophoresis, 24:1471–1481, 2003.
[329] F. J. van Deursen, S. K. Shahi, C. M. Turner, C. Hartmann, C. Guerra-Giraldez, K. R. Matthews, and C. E. Clayton. Characterisation of the growth and differentiation in vivo and in vitro of bloodstream-form Trypanosoma brucei strain TREU 927. Mol Biochem Parasitol., 112:163–171, 2001.
[330] F. J. van Deursen, D. J. Thornton, and K. R. Matthews. A reproducible protocol for analysis of the proteome of Trypanosoma brucei by 2-dimensional gel electrophoresis. Mol Biochem Parasitol., 128:107–110, 2003.
[331] S. Veeser, M. J. Dunn, and G. Z. Yang. Multiresolution image registration for two-dimensional gel electrophoresis. Proteomics, 1:856–870, 2001.
[332] V. E. Velculescu, L. Zhang, B. Vogelstein, and K. W. Kinzler. Serial analysis of gene expression. Science, 270:484–487, 1995.
[333] V. E. Velculescu, L. Zhang, W. Zhou, J. Vogelstein, M. A. Basrai, D. E. Bassett Jr, P. Hieter, B. Vogelstein, and K. W. Kinzler. Characterization of the yeast transcriptome. Cell, 88:243–251, 1997.
[334] J. C. Venter, M. D. Adams, and E. W. Myers. The Sequence of the Human Genome. Science, 291:1304–1351, 2001.
[335] K. Vickerman. On the surface coat and flagellar adhesion in trypanosomes. J Cell Sci., 5:163–194, 1969.
[336] E. O. Voit. Metabolic modeling: a tool of drug discovery in the post-genomic era. Drug Discov. Today, 7:621–628, 2002.
[337] C-W. von der Lieth, A. Bohne-Lang, K. K. Lohmann, and M. Frank. Bioinformatics for glycomics: Status, methods, requirements and perspectives. Brief. Bioinformatics, 5:164–178, 2004.
[338] T. Voss and P. Haberl. Observations on the reproducibility and matching efficiency of two-dimensional electrophoresis gels: consequences for comprehensive data analysis. Electrophoresis, 21:3345–3350, 2000.
[339] Voyager Version 5 with Data Explorer Software, published by Applied Biosystems. http://www.appliedbiosystems.com/.
[340] W3C Math home page. http://www.w3.org/Math/.
[341] W3C Recommendation for XML Schema. http://www.w3.org/XML/Schema.
[343] A. J. Walhout, R. Sordella, X. Lu, J. L. Hartley, G. F. Temple, M. A. Brasch, N. Thierry-Mieg, and M. Vidal. Protein interaction mapping in C. elegans using proteins involved in vulval development. Science, 287:116–122, 2000.
[344] M. P. Washburn, D. Wolters, and J. R. Yates III. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol., 19:242–247, 2001.
[345] V. C. Wasinger, S. J. Cordwell, A. Cerpa-Poljak, J. X. Yan, A. A. Gooley, M. R. Wilkins, M. W. Duncan, R. Harris, K. L. Williams, and I. Humphery-Smith. Progress with gene-product mapping of the Mollicutes: Mycoplasma genitalium. Electrophoresis, 16:1090–1094, 1995.
[346] W. Weckwerth. Metabolomics in systems biology. Annu Rev Plant Biol., 54:669–689, 2003.
[347] W. Weckwerth, V. Tolstikov, and O. Fiehn. Metabolomic characterization of transgenic potato plants using GC/TOF and LC/MS analysis reveals silent metabolic phenotypes. In Proceedings of the 49th ASMS Conference on Mass spectrometry and Allied Topics, pages 1–2. Chicago: Am. Soc. Mass Spectrom., 2001.
[348] G. Wiederhold. Intelligent integration of diverse information (invited talk). In Int. Conf. on Information and Knowledge Management, Baltimore, 1992.
[349] M. R. Wilkins, J. C. Sanchez, A. A. Gooley, R. D. Appel, I. Humphery-Smith, D. F. Hochstrasser, and K. L. Williams. Progress with proteome projects: why all proteins expressed by a genome should be identified and how to do it. Biotechnol Genet Eng Rev., 13:19–50, 1996.
[350] WordNet - a lexical database for the English language. http://www.cogsci.princeton.edu/~wn/.
[351] WORLD-2DPAGE: Index to 2-D PAGE databases and services. http://us.expasy.org/ch2d/2d-index.html.
[352] The World Wide Web Consortium. http://www.w3c.org/.
[353] WormBase. http://www.wormbase.org/.
[354] W. Zhou, B. A. Merrick, M. G. Khaledi, and K. B. Tomer. Detection and sequencing of phosphopeptides affinity bound to immobilized metal ion beads by matrix-assisted laser desorption/ionization mass spectrometry. J Am Soc Mass Spectrom., 11:273–282, 2000.
[355] S. Xirasagar, S. Gustafson, A. Merrick, K. B. Tomer, S. Stasiewicz, D. D. Chan, et al. CEBS Object Model for Systems Biology Data, CEBS MAGE SysBio-OM. Bioinformatics, 20:2004–2015, 2004.
[356] XML Metadata Interchange (XMI). http://www.omg.org/technology/documents/formal/xmi.htm.
[357] XQuery 1.0: An XML Query Language. http://www.w3.org/TR/xquery/.
[358] XSPAN - A Cross-Species Anatomy Project. http://www.xspan.org/.
[359] Xtect. http://xtect.cis.strath.ac.uk/.
[360] A. F. Yakunin, A. A. Yee, A. Savchenko, A. M. Edwards, and C. H. Arrowsmith. Structural proteomics: a tool for genome annotation. Curr Opin Chem Biol., 8:42–48, 2004.
[361] W. Yan, H. Lee, E. C. Yi, D. Reiss, P. Shannon, B. K. Kwieciszewski, et al. System-based proteomic analysis of the interferon response in human liver cells. Genome Biol., 5:R54, 2004.
[362] M. Yanagida. Functional proteomics; current achievements. J Chromatogr B Analyt Technol Biomed Life Sci., 771:89–106, 2002.
[363] X. Yang, M. L. Lee, and T. W. Ling. Resolving Structural Conflicts in the Integration of XML Schemas: A Semantic Approach. In 22nd International Conference on Conceptual Modeling (ER), pages 520–533, 2003.
[364] M. Yoshikawa, T. Amagasa, T. Shimura, and S. Uemura. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Transactions on Internet Technology, 1:110–141, 2001.
[365] N. Young, Z. Chang, and D. S. Wishart. GelScape: a web-based server for interactively annotating, manipulating, comparing and archiving 1D and 2D gel images. Bioinformatics, 20:976–978, 2004.
[366] Z3 published by Compugen. http://www.2dgels.com/.
[367] A. Zanzoni, L. Montecchi-Palazzi, M. Quondam, G. Ausiello, M. Helmer-Citterich, and G. Cesareni. MINT: a Molecular INTeraction database. FEBS Lett., 513:135–140, 2002.
[368] B. R. Zeeberg, W. Feng, G. Wang, M. D. Wang, A. T. Fojo, M. Sunshine, et al. GoMiner: A Resource for Biological Interpretation of Genomic and Proteomic Data. Genome Biol., 4:R28, 2003.
[369] R. Zeng, H. Q. Ruan, X. S. Jiang, H. Zhou, L. Shi, L. Zhang, Q. H. Sheng, Q. Tu, Q. C. Xia, and J. R. Wu. Proteomic analysis of SARS associated coronavirus using two-dimensional liquid chromatography mass spectrometry and one-dimensional sodium dodecyl sulfate-polyacrylamide gel electrophoresis followed by mass spectrometric analysis. J Proteome Res., 3:549–555, 2004.
[370] X. Zuo and D. W. Speicher. Comprehensive analysis of complex proteomes using microscale solution isoelectrofocusing prior to narrow pH range two-dimensional electrophoresis. Proteomics, 2:58–68, 2002.