Department of Computing Science,

and the Division of Infection and Immunity,

Institute of Biomedical and Life Sciences

The Development of Data Standards and a Database to Aid Proteomic Research

Andrew Jones

Submitted for the degree of Doctor of Philosophy in Computing Science at the University of Glasgow

October 2004

© 2004, Andrew Jones

Abstract

The thesis reports new developments in the area of database support for proteomics experiments. We have developed a proposal for a data standard that will facilitate sharing and archival of data. We have also developed a database implementing the standard, which is a prototype of a public repository capable of storing large volumes of data. Our technology allows for the integration of results from both microarrays and proteomics. The database has been evaluated in the context of two investigations performed by collaborating biologists. We have demonstrated that our technology enables the discovery of new results by facilitating complex queries and providing novel visualisations of experimental data.

Thesis statement

This work will highlight the requirements of proteomic research for standard formats and centralised databases that allow results to be well annotated and queried. We have developed a proposal for a data standard, and a prototype of a public repository, and the thesis will demonstrate how they facilitate the research process.

Declaration

I declare that this thesis describes my own work, that it has not been accepted in a previous application for a degree, and that all sources of information have been specifically acknowledged. The work reported in Chapters 4 (FGE-OM) and 5 (RAPAD) was initiated during a two-week period I spent at the Computational Biology and Informatics Laboratory, University of Pennsylvania, working with Prof. Chris Stoeckert and Angel Pizarro. During the two weeks, the framework for FGE-OM was developed and the SQL database schema for RAPAD was designed. The subsequent development of RAPAD, including refinements to the schema, the creation of the web interface and software for data visualisation, was performed by myself at the University of Glasgow.

Chapter 3 contains a revised version of material published in [176]. The material in Chapter 4 has been revised from [175].

Andrew Jones

Thesis Overview

There is a new research paradigm in molecular biology in which large data sets are obtained about genes and proteins, and the results enable researchers to formulate new hypotheses about the system they are studying. This methodology reverses the classical approach, in which an experiment is designed to test a hypothesis. The field of research is collectively known as functional genomics, as researchers attempt to assign functions to all genes that can be discovered in the genome sequence. The experiments can also give insights into the factors that are crucial in particular processes, such as disease, by discovering the differences between results from a diseased sample and a normal sample. The methods that investigate protein abundance, interactions and localisation on a large scale are known as proteomics. Proteomic investigations present significant computational challenges because data sets are very large and contain heterogeneous information from different laboratories, which could be useful to researchers working in a variety of domains. The thesis will describe proposals for data standards for proteomics, and a new relational database, which will alleviate some of the computational challenges presented by the experiments. The proposals for a standard should ensure that proteome data can be archived and will be accessible to querying in the future.

Chapter 1 will describe the experimental techniques of functional genomics, three case studies of proteomic research and the requirements for central databases and standardisation. There has been significant work in both bioinformatics and computing science research to improve methods for making data accessible and open to a wide range of queries, which will lead to the next generation of the Web. Chapter 2 will focus on the new developments in computing science, and will cover previous work on data standards for life sciences that allow information to be exchanged between research groups and deposited in central databases. There are a large number of databases for functional genomics that have different capabilities and access methods. The chapter will present the challenges in data integration that arise from the number of different systems that exist. An area that has attracted much recent attention in computing science is ontology development. Ontologies are structured controlled vocabularies of terms with definitions that describe a domain in a way that ensures there is a shared understanding of the concepts by different people. An ontology can also contain rules associated with the terms that allow computer systems to ask logical questions of the relationships between different parts of a data set. Chapter 2 will describe the ontologies that currently exist for life sciences.

Chapter 3 will focus on the standardisation of data formats for proteomics. There will be a description of the previous work in this area, which consists of an object model¹ to describe the experimental methodology. We have developed an alternative proposal for a data standard, which was released in October 2003 to describe additional information that should be captured in a community standard. It is essential that the finalised standard contains sufficient description of the results, and the methods that were used to obtain data, to ensure that future re-evaluation and statistical analysis are possible. The chapter will describe our proposal and will give an overview of the current progress towards a community-accepted standard for proteomics.

There is an established data standard for gene expression studies using microarrays. It is becoming feasible for researchers to perform both proteomic and microarray investigations on the same starting samples. In other cases, the results from different investigations using microarray or proteomic techniques could be integrated, leading to a much better understanding about the genes and proteins that are important in the sample conditions. We believe that microarray and proteomic data sets could be integrated more easily, and queried in parallel, if they have a single shared data standard. Therefore, we have integrated the microarray standard with the current models of proteomic data to form a single proposal for a data standard, known as FGE-OM (Functional Genomics Experiment - Object Model), which will be described in Chapter 4. Chapter 4 will also contain a discussion of the importance of using ontologies to describe the experimental protocols, to allow future comparison and querying of different data sets.

We have developed a database for storage of proteomic results, experimental protocols and details of the biological samples on which the experiments were performed, known as RAPAD (RNA And Protein Abundance Database), which will be described in Chapter 5. RAPAD is an extension of a microarray database system developed at the University of Pennsylvania. We have extended a microarray database into proteomics because we hypothesise that data integration across the two fields will be facilitated if the technologies are captured in a shared database schema and they have a similar user interface. There is a very close correspondence between FGE-OM and RAPAD, described in Chapter 5, which allows RAPAD to be used to test that FGE-OM correctly captures the data semantics. RAPAD also acts as a prototype of a public repository, and demonstrates that proteome data can be visualised and queried in complex ways using real data sets. Two investigations are supported by the current implementation of RAPAD, which will be described in Chapters 6 and 7. The investigations allow the core facilities of the database to be evaluated.

¹ An object model is a platform-independent notation for describing a software system. The importance of object models for developing data standards will be described in Chapter 2.

Chapter 6 will describe how the database assists an investigation performed in the laboratory of Dr Jonathan Wastling at the Institute of Biomedical and Life Sciences, University of Glasgow. The investigation aims to discover the proteins that are differentially expressed in a human cell culture when invaded by the intracellular parasite Toxoplasma gondii, compared with non-invaded cells. The results will enable a better understanding of host-parasite interactions. The chapter will demonstrate how gene expression and protein abundance values have been compared in practice.

Chapter 7 will describe another project at the Institute of Biomedical and Life Sciences that is supported by RAPAD. The project is attempting to catalogue all the expressed proteins in the disease-causing parasite Trypanosoma brucei, using a variety of experimental techniques. The genome sequence is nearing completion but the level of functional annotation is poor. The proteome catalogue facilitates the genome annotation, and the experiments give insights into the dynamic nature of proteins within the system. Chapter 7 will also describe visualisation software written by the author that allows new conclusions to be drawn from the results.

Chapter 8 will summarise and extend our arguments on standardisation, ontologies and archiving of data in public repositories. It will compare our approach with alternative methods that could have been employed, describe the work that is still required to solve the research challenges that follow directly from the thesis, and summarise our contribution.

There are four appendices at the end of the thesis. The first, Appendix A, will describe an investigation performed by the author into indexing large collections of biological data represented in Extensible Markup Language (XML), as an alternative to relational database storage. Appendix B contains detailed diagrams of FGE-OM, which supplement the work presented in Chapter 4. The RAPAD database schema is included in Appendix C. Finally, Appendix D will describe how difference gel electrophoresis data can be represented in Gla-PSI, FGE-OM and RAPAD.

Acknowledgements

I give thanks to my supervisors Ela Hunt and Jonathan Wastling. Throughout my PhD, Ela has given me great support, spending inordinate lengths of time discussing ideas, reading my work, and giving me encouragement to persevere with my ideas. At the outset of my research, Jonathan’s enthusiasm was infectious, which gave me great interest in the subject. I am very grateful to the MRC for funding my research through first an MRes degree, and then the PhD.

I would like to thank Chris Stoeckert for giving me the opportunity to visit his lab in Philadelphia, and thanks to Angel Pizarro for giving up so much of his time while I was there. The time spent in Philadelphia provided a big impetus for my work, for which I am very grateful. My thanks also to Mike Turner for giving valuable feedback on my work. I give thanks to Morag Nelson and Anne Faldas, who generated the data I have used in Chapters 6 and 7, for taking time to explain their experiments, for trying out all my software and for appearing interested when I talk about databases!

Finally, my biggest thanks to my partner Clare, for all her love and support.

Contents

1 Investigations in Functional Genomics
    1.1 Introduction
        1.1.1 Experimental methodology
        1.1.2 Systems biology
        1.1.3 Overview
    1.2 Proteomics
        1.2.1 Gel based proteomics
        1.2.2 Mass spectrometry
        1.2.3 Other proteomics techniques
        1.2.4 Post-translational modifications
        1.2.5 Case studies of proteomics research
        1.2.6 Case study 1
        1.2.7 Case study 2
        1.2.8 Case study 3
        1.2.9 Publication of proteomics data
    1.3 Gene expression techniques
        1.3.1 The development of microarrays
        1.3.2 Serial analysis of gene expression
    1.4 Other techniques used in functional genomics
        1.4.1 RNA interference
        1.4.2 Immunohistochemistry
        1.4.3 Metabolomics
        1.4.4 Protein interaction studies
        1.4.5 Three dimensional structures
    1.5 Investigations across the “omics”
        1.5.1 Comparative studies
    1.6 Summary

2 Databases, standards and ontologies for the life sciences
    2.1 Introduction
        2.1.1 Computational support for the life sciences
        2.1.2 The future accessibility of data
        2.1.3 Guide to the chapter
    2.2 Technology required for data standards
        2.2.1 Extensible Markup Language: XML
        2.2.2 Resource Description Framework
        2.2.3 DAML+OIL and the Web Ontology Language
        2.2.4 Unified Modeling Language
        2.2.5 The object management group
    2.3 Data standards in the life sciences
        2.3.1 Microarray standards
        2.3.2 PEDRo
        2.3.3 PSI-OM
        2.3.4 Mass spectrometry
        2.3.5 Protein interaction standards
        2.3.6 Other data standards in life sciences
    2.4 Databases for life sciences
        2.4.1 Microarray databases
        2.4.2 Proteomics databases
        2.4.3 Other Databases for Life Sciences
    2.5 Ontologies
        2.5.1 Software for developing ontologies
        2.5.2 Gene Ontology
        2.5.3 MGED Ontology
        2.5.4 Other ontologies in life sciences
        2.5.5 The Grid and data integration
        2.5.6 Data standards and ontologies in other fields
    2.6 Data integration
        2.6.1 Federation
        2.6.2 Warehouses
        2.6.3 Mediator approaches
        2.6.4 Schema integration
    2.7 Discussion
    2.8 Conclusions

3 An object model for proteomics
    3.1 Introduction
        3.1.1 The emergence of proteomics
        3.1.2 Publication of data
        3.1.3 A central repository for proteomics
        3.1.4 The status of proteomics standards
    3.2 Methods
    3.3 Previous work
        3.3.1 SWISS-2DPAGE
        3.3.2 GELBANK and HUP-ML
        3.3.3 PEDRo
    3.4 Gla-PSI: A model for 2-D gel electrophoresis and analysis
        3.4.1 Overview of the experiment and protein extraction
        3.4.2 Two-dimensional gel electrophoresis
        3.4.3 Image analysis
        3.4.4 Protein spots
        3.4.5 Two-dimensional difference gel electrophoresis
        3.4.6 Statistical analysis
        3.4.7 Annotation
    3.5 Future developments in proteomics standards
        3.5.1 An overview of PSI-OM
        3.5.2 Data model in PSI-OM
        3.5.3 An ontology for proteomics
        3.5.4 Minimum information about a proteomics experiment
    3.6 Discussion
        3.6.1 Web access to date
        3.6.2 Status of proteome standards
    3.7 Conclusions

4 Development of a data standard for functional genomics
    4.1 Introduction
        4.1.1 Requirements for standards
        4.1.2 Status of standardisation
    4.2 Methods
        4.2.1 Ontologies
    4.3 Overview of FGE-OM
        4.3.1 BioOM
        4.3.2 ArrayOM
        4.3.3 ProteomicsOM
        4.3.4 A workflow for proteomics
    4.4 Other work: CEBS object model for systems biology data
        4.4.1 SysBio-OM data model
        4.4.2 SysBio-OM Protocol and BioMaterial packages
        4.4.3 SysBio-OM BioAssay and SummaryData packages
    4.5 Discussion
        4.5.1 FGE-OM, SysBio-OM and future standards
        4.5.2 Developments to MAGE-OM
        4.5.3 Integrated standards
    4.6 Conclusion

5 A prototype public database for proteomics
    5.1 Introduction
        5.1.1 Extending existing technology
        5.1.2 The development of RAPAD
        5.1.3 Chapter guide
    5.2 Previous work
        5.2.1 GUS
        5.2.2 Proteomics database
        5.2.3 Ontologies
    5.3 Systems and Methods
        5.3.1 Schema development
        5.3.2 Interface development
        5.3.3 Data integration
        5.3.4 Visualisation
        5.3.5 Unique identifiers
    5.4 Implementation
        5.4.1 Data privacy
        5.4.2 Studies, protocols and contact details
        5.4.3 Protein separations
        5.4.4 2-D gel data
        5.4.5 Mass spectrometry and external databases
        5.4.6 RAPAD Querier
        5.4.7 Public data access
        5.4.8 Ontologies
    5.5 Discussion
        5.5.1 A prototype of a central repository
        5.5.2 The relationship between FGE-OM and RAPAD
        5.5.3 Support for current proteome studies
        5.5.4 Future developments
    5.6 Conclusions

6 Database support for proteomic studies of host-parasite interactions
    6.1 Introduction
        6.1.1 Host-parasite interactions
        6.1.2 Genomic investigation of Toxoplasma
        6.1.3 Microarray analysis
        6.1.4 Support for proteome studies
        6.1.5 Project status
    6.2 Methods
        6.2.1 Display of protein data from different gels
        6.2.2 Comparison of protein and gene expression data
        6.2.3 Functional classification of proteins
    6.3 Results
        6.3.1 Visualisation of differential expression
        6.3.2 Functional annotation of proteins
        6.3.3 Comparison with microarray data
        6.3.4 Post-translational modifications
        6.3.5 Public access to data
    6.4 Discussion
    6.5 Summary and conclusions

7 Software support for a proteome map of Trypanosoma brucei
    7.1 Introduction
        7.1.1 The biology of trypanosomes
        7.1.2 Annotating the genome
        7.1.3 Database support
        7.1.4 Project status
    7.2 Methods
        7.2.1 Generation of samples for proteome analysis
        7.2.2 Project requirements capture
        7.2.3 Visualisation
    7.3 Results
        7.3.1 Investigation into multiple protein forms
        7.3.2 Using data in RAPAD to improve genome annotation
        7.3.3 Search for post-translational modifications
        7.3.4 Results Summary
    7.4 Discussion
        7.4.1 Improving the annotation of genes
        7.4.2 Visualisation issues in the life sciences
        7.4.3 Analysis of modifications
        7.4.4 Future work
    7.5 Conclusions

8 Future work, discussion and conclusions
    8.1 Summary of thesis
    8.2 Discussion
        8.2.1 Alternative approaches
        8.2.2 Digital archiving and publication of life science data
        8.2.3 The role of data standards
        8.2.4 A functional genomics standard
        8.2.5 Proteomics standards
        8.2.6 A vision for future data sharing
    8.3 Future work
    8.4 Summary

A An XML indexing solution for data integration
    A.1 Introduction
    A.2 Previous work
    A.3 Results
        A.3.1 Index A
        A.3.2 Index B
        A.3.3 Index creation
        A.3.4 Queries
        A.3.5 Index A Results
        A.3.6 Index B Results
        A.3.7 Visualisation
    A.4 Discussion

B Detailed diagrams of FGE-OM

C Database schema for RAPAD

D Modelling and database storage of difference gel data
    D.1 Introduction
        D.1.1 Host-parasite responses
    D.2 Gla-PSI
    D.3 FGE-OM
    D.4 RAPAD

List of Figures

1.1 A conceptual view of the data flow in functional genomics.
1.2 The data flow in a proteomics experiment.
1.3 A sample image from 2-DE separation of proteins from Toxoplasma gondii
1.4 A schematic of a difference gel electrophoresis experiment.
1.5 An MS trace viewed with Voyager software [339].
1.6 A sample trace from tandem mass spectrometry
1.7 Two dimensional liquid chromatography coupled with MS
1.8 The ICAT method for quantitative proteomics
1.9 A two dimensional gel highlights possible different phosphorylation states of Protein disulfide isomerase
1.10 A summary of the technique involved in the creation of Affymetrix microarrays
1.11 A summary of Yeast Two-Hybrid experiments
1.12 Affinity methods for assaying protein interactions

2.1 A partial record from the PIR database, in the native PIR format.
2.2 A partial record from the PIR database, released in XML format.
2.3 An example partial PIR record stored in a relational database
2.4 The main components of a UML class diagram for a hospital computer system.
2.5 The top level of MAGE-OM
2.6 The BioMaterial package in MAGE-OM
2.7 A screenshot of the Protege editor displaying the Gene Ontology for Yeast.
2.8 The entry for actin in the Gene Ontology, displayed in the AmiGo browser

3.1 The data flow in a proteomics experiment. The parts of the analysis covered by Gla-PSI are boxed.
3.2 The complete PEDRo model represented in UML
3.3 The classes that record biological samples in PEDRo
3.4 The part of PEDRo covering protein separation techniques
3.5 The model of MS ionisation and protocol in PEDRo
3.6 MS data and database searches modelled in PEDRo
3.7 The complete Gla-PSI object model represented as a UML class diagram.
3.8 A model of 2-DE data, and a scanned gel image.
3.9 The classes capture data from image analysis applications, including multiple analysis across a number of gels.
3.10 The relationship between spot data (Spot) and identified proteins (Protein)
3.11 Classes for storing difference gel electrophoresis data.
3.12 The part of Gla-PSI modelling statistical analysis of a proteomics experiment.
3.13 Several classes are subclasses of Identifiable
3.14 A draft version of the main components of PSI-OM.
3.15 Part of PSI-OM showing the relationships between spots identified on a gel and the corresponding protein records.
3.16 A draft version of the protein data model in PSI-OM

4.1 A time line displaying the emergence of microarray and proteomics technology, and the efforts to standardise data formats.
4.2 An overview of the FGE-OM object model. The model is divided into three namespaces: BioOM, ArrayOM and ProteomicsOM.
4.3 A screenshot of the term “Age” in the MGED Ontology viewed with OilEd.
4.4 A complete listing of the packages within FGE-OM.
4.5 The packages and classes in the BioOM namespace of FGE-OM
4.6 The packages in the ArrayOM namespace
4.7 The ProteomicsOM namespace.
4.8 The ProteinSeparation package
4.9 The ProteomeBioAssay package
4.10 The ProteinData package.
4.11 The model of MS data and protocols, adapted from PEDRo.
4.12 The ProteinRecord package.
4.13 A workflow for a proteomics experiment involving 2-DE or liquid chromatography to separate proteins, followed by MS to identify proteins
4.14 A subset of classes in the QuantitationType package from SysBio-OM
4.15 The CommonBioAssayData package from SysBio-OM
4.16 The top image shows a small subset of classes from the Measurement package in SysBio-OM, the lower is the Measurement package in FGE-OM.
4.17 The Protocol package from SysBio-OM
4.18 The BioMaterial package from SysBio-OM.
4.19 The BioAssay package from SysBio-OM.

5.1 A summary of several workflows in functional genomics to illustrate the requirements for data integration.
5.2 A mapping from classes in FGE-OM to database tables in RAPAD.
5.3 The architecture of RAPAD.
5.4 The user interaction with RAPAD for entering a 2-DE experiment.
5.5 The interface for entering protocol information into RAPAD.
5.6 A web page for specifying sources of biological materials
5.7 A summary of the database schema for storing information about the design of a study
5.8 The database schema for protein separation techniques and the relationships to the BioAssayTreatment table.
5.9 Screenshots for loading 2-DE, scanning and image analysis data into RAPAD
5.10 The tables present in the database schema store data from gel spots, image analysis and the scanning of a 2-D gel
5.11 The database schema for linking protein records to gel spots
5.12 The database schema for mass spectrometry, adapted from PEDRo.
5.13 A screen shot of the 2-D Gel Viewer that provides search capabilities over protein data and links to MS results
5.14 A form for entering annotation about a gel spot and linking to protein records
5.15 A table displaying all the proteins identified on a single gel.
5.16 The query interface for searching for specific protein records.

6.1 The process of matching microarray data to protein abundance data.
6.2 Output from GoMiner, displaying the GO tree browser open for the gene Tropomyosin 1
6.3 Output from FatiGO showing the classification of up and down-regulated proteins in the Biological Process branch of GO
6.4 The interface for visualising spots across replicate gels
6.5 The interface for displaying data combined across replicates
6.6 The protein record for Cathepsin B in RAPAD has external links to various databases.
6.7 The table in RAPAD displaying protein abundance and gene expression values
6.8 Spots matched to vimentin from infected and non-infected samples
6.9 Spots matched to actin beta from infected and non-infected samples
6.10 Superoxide dismutase from infected and non-infected samples
6.11 Potential PTMs of protein disulphide isomerase
6.12 The result of a search for potential post-translational modification of protein disulphide isomerase.
6.13 A summary page displays all the gels present in the experiment, and a link exists to display the experimental protocols used for each gel.

7.1 The life cycle of Trypanosoma brucei
7.2 An electron micrograph of the bloodstream form of Trypanosoma brucei
7.3 The span of peptides that have been matched within a protein sequence
7.4 Protein spots matched to β-tubulin, overlaid with a graphic displaying the span of peptide hits
7.5 Protein spots matched to α-tubulin, overlaid with a graphic displaying the span of peptide hits
7.6 Protein spots matched to five different Elongation Factors
7.7 Protein spots matched to Elongation factor 1-α
7.8 Protein spots matched to EF-β and EF (putative) are displayed with the corresponding span of peptide hits
7.9 The span of peptide hits for protein spots matched to Elongation Factor 2
7.10 A multiple alignment of five Hsp 70 protein sequences from T. brucei
7.11 Protein spots matched to five different Hsp70 protein sequences
7.12 The interface for publishing T. brucei proteome data
7.13 A search using the Gel Viewer reveals 100 proteins, annotated as “hypothetical”
7.14 The protein spots that have been matched to different hypothetical proteins
7.15 Four spots containing arginine kinase. The MS results for spots 575 and 535 reveal possible modifications
7.16 There are four spots that match initiation factor 5, of which possible modifications were found for spots 554 and 575

8.1 A possible model for future data sharing and exchange

A.1 Index A has four components: the Data Path Tree, Data Stores, XML Locater Lists and an XML Dictionary.
A.2 Index B has four components: the Data Path Tree, Data Stores, the Structure Container and the XML Dictionary (not shown).
A.3 The method used to implement a join query in Index B is implemented in a six stage algorithm.
A.4 A prototype interface for querying an indexed store of XML data.

B.1 The ProteinSeparation package of FGE-OM.
B.2 The ProteomeBioAssay package.
B.3 The ProteinData package.
B.4 The ProteinRecord package.
B.5 The MassSpecProtocol package.
B.6 The MassSpecData package.

D.1 The part of Gla-PSI covering DIGE experiments
D.2 A DIGE experiment represented in Gla-PSI
D.3 A DIGE study represented in FGE-OM
D.4 Relative protein abundance data calculated from DIGE can be viewed in the Gel Viewer

List of Tables

1.1 Software available for image analysis of 2-D gels.
1.2 Software available for searching mass spectrometry data.

2.1 Summary table displaying features of microarray databases

3.1 A summary of the interviews held with researchers to formulate an understanding of proteomics research.

6.1 The correspondence between gene and protein abundance for HFF cells infected with T. gondii

A.1 Build times in seconds for Index A and B for four different sizes of data set
A.2 Summary of query timings for Index A, values are time in seconds
A.3 Summary of query timings for Index B, with different caching procedures.

D.1 Experimental plan for Cy labelling of proteins in the DIGE experiment


Commonly used abbreviations

2-DE - Two dimensional gel electrophoresis
API - Application Programming Interface
cDNA - complementary DNA
EST - Expressed Sequence Tag
FG - Functional Genomics
FGE-OM - Functional Genomics Experiment Object Model
Gla-PSI - Glasgow proposal for the Proteomics Standards Initiative
GO - Gene Ontology
HUPO - Human Proteome Organisation
IPG - Immobilized pH Gradient
LC-MS - Liquid Chromatography-Mass Spectrometry
LIMS - Laboratory Information Management System
MAGE-ML - Microarray and Gene Expression Markup Language
MAGE-OM - Microarray and Gene Expression Object Model
MALDI - Matrix-Assisted Laser Desorption Ionisation
MGED Society - Microarray Gene Expression Data Society
MIAME - Minimum Information About a Microarray Experiment
MIAPE - Minimum Information About a Proteomics Experiment
MO - MGED Ontology
mRNA - messenger RNA
MS - Mass Spectrometry
MW - Molecular weight
NMR - Nuclear Magnetic Resonance
PEDRo - Proteomics Experiment Data Repository
pI - Isoelectric point
PSI - Proteomics Standards Initiative
PSI-OM - Proteomics Standards Initiative Object Model
PSI-Ont - Proteomics Standards Initiative Ontology
PTM - Post-Translational Modification
RAD - RNA Abundance Database
RAPAD - RNA And Protein Abundance Database
RDF - Resource Description Framework
RDMS - Relational Database Management Systems
RNAi - RNA interference
SAGE - Serial Analysis of Gene Expression
TOF - Time Of Flight
UML - Unified Modeling Language
URI - Universal Resource Identifier
URL - Uniform Resource Locator
W3C - The World Wide Web Consortium
XMI - XML Metadata Interchange
XML - Extensible Markup Language

Chapter 1

Investigations in Functional

Genomics

1.1 Introduction

In recent years, the sequencing of the human genome has gained much deserved publicity

[164, 334]. The sequence of man, and all the model organisms, has generated a vast amount

of information about the basis of life at the molecular level. This was only possible due

to progress in the way in which DNA sequencing is performed [276, 305], and the work of

bioinformaticians to produce software that can assemble the huge genome sequences, find

genes and determine similarity between genes in different organisms. We can state to a

reasonable level of accuracy how many genes there are in man (23758 genes are currently

predicted in Ensembl [94]), mouse (26762 in Ensembl), and yeast (approximately 6,000 [190])

and new genomes can be sequenced on relatively short time scales. However, the genome

sequence is only a starting point: the actual DNA sequences comprising the genes tell us

nothing about how living systems function, and what happens when they go wrong, causing

disease. This knowledge requires information about the molecular function performed by

the proteins encoded by every gene, the interaction partners for the proteins, and the subtle

changes that are propagated to the whole system when a protein malfunctions, or is not

present. One of the most conclusive arguments about how far there is to go in molecular

biology is provided by the surprisingly small difference in the total number of genes between

the nematode worm Caenorhabditis elegans (about 20000 [353]) and humans (20000 - 40000

depending on different estimates) [59]. C. elegans contains only 959 cells and the difference

in biological complexity between it and man is vast, yet this is not caused by the number of

genes. We must ask how such a small number of genes in humans gives rise to the number of

different cell types, the complex development of organs and ultimately the intricacy of brain


circuitry that leads to consciousness. The answer must lie in several phenomena: the actual

number of functional proteins being far larger than the number of genes, caused by differential

splicing creating multiple products from a single gene; modifications to proteins that alter

their function; protein interactions giving rise to complex new functions not achieved by

single proteins; and exquisite regulation of when and where genes are expressed. Therefore,

simply assigning a single function to a gene is a major oversimplification, as it fails to

capture the richness of the whole system, including the possibility for a gene to encode more

than one protein. Furthermore, each protein form may have several different functions in

different physiological locations.

1.1.1 Experimental methodology

Technological developments have given rise to a number of new experimental approaches for
the large scale analysis of biological systems, collectively known as
functional genomics (FG). FG involves the analysis of very large data sets to find the genes

or proteins that are implicated in disease processes or the changes that result from external

stimuli, and to aid efforts to annotate all genes with information about their biological

function. The workflow displayed in Figure 1.1 gives an overview of how different experiments

can be used to gain insights into gene function. FG includes studies that determine gene

expression, protein abundance (in proteomics), protein localisation and others. The different

methods can be classified into seven categories [360], which can be used to assign a function

to a protein by investigating:

• The extent of expression of a protein under different conditions and in different locations.

• The interaction partners for a protein.

• The gene neighbourhood, including any co-expressed genes, such as bacterial operons.

• The phenotype of the gene knockout.

• The biochemical activity of the protein once isolated.

• Any post-translational modifications that are observed.

• The three dimensional structure of the protein.

The experiments present significant computational challenges due to the vast sizes of data
sets, the heterogeneity in the information generated by each different lab, and the frequency
at which new laboratory techniques are developed. It is vital that functional genomics data
sets can be integrated and adequately queried, linked to gene databases, and exchanged
between research groups [322]. This requires: (i) the development of new database
technologies, and (ii) standard formats to which published data must adhere. The focus for the
work presented in the thesis is to address these two requirements for proteomic studies.

Figure 1.1: A conceptual view of the data flow in functional genomics. [Workflow diagram: biological samples pass through genome sequencing, microarray analysis, proteomics, immunohistochemistry and metabolomics; the resulting data undergo image analysis, statistical processing and data integration to give a functionally annotated genome.]

1.1.2 Systems biology

A new research area in the life sciences, so-called systems biology, is the effort to understand
all the components and interactions that comprise an entire system. Systems biology

and functional genomics are not synonymous but there is a large overlap between the two

domains. Functional genomics is the acquisition of data about the function of genes on a

large scale, using various technologies. Systems biology is the discipline of trying to order

all the available information into an understanding about how components interact. One

of the main sources of data can be from functional genomics studies, although that alone

is not enough to build up a complete picture of the system. Critically, in many functional

genomic studies there is no information about causality. If a group of genes is up-regulated

under a particular biological condition, it is not possible to say if the genes are regulated


in response to the condition, or if the condition is caused by the change in gene regulation

[188]. A complete understanding of metabolic pathways requires experiments that assay the

biochemical reactions, such as the flux in the pathways under a certain condition, compared

with the steady state. New technological advances will enable single molecule measurements

and visualisation of molecular interactions that will be crucial to systems biology, by allowing
researchers to derive insights into cellular processes at previously impossible resolution.

These new technologies will require significant database support.

1.1.3 Overview

The scope of our work is restricted to developing technology to aid functional genomics

research. The main focus is the development of a database and a data standard for the

proteomic techniques that are used to detect and measure the abundance of proteins in

complex samples, and to integrate these data with results from other types of experiments.

In this chapter, the main techniques in functional genomics research are described, along

with the computational challenges they present. An outline of the experimental techniques

in proteomics, and three case studies that have been performed, is given in Section 1.2. The

experimental techniques that measure gene expression are described in Section 1.3. Other

types of functional genomics research are described in Section 1.4, and a summary of major

functional genomics investigations is given in Section 1.5.

1.2 Proteomics

The proteome is the complete set of expressed proteins in a sample of interest,
or the entire set of proteins that could be found in an organism. The term “proteomics”
was first used in the mid 1990s to refer to a newly emerging approach of analysing large
numbers of proteins expressed in a sample [345, 349]. Knowledge of the proteins expressed
in a sample can aid understanding of the entire system if the functions of the proteins are well
understood. Alternatively, proteomics experiments can give insights into the functions of
proteins that have little annotation, for example if a protein is strongly expressed in one
condition compared with another [362]. Researchers aim to define the proteome of a cellular
sample, tissue, organ or organism using various techniques. The proteome is highly dynamic:
the volumes of different proteins change, proteins are translocated to different organelles,
chemical modifications alter the behaviour of proteins and protein-protein interactions give
rise to complex new functions. Researchers are often limited to taking a snapshot of the
system at one time, but as the size of data sets continues to increase, it will be possible to
gain a more complete understanding [137]. Data sets produced by different laboratories may
comprise heterogeneous file formats produced from different sources, which are difficult to
compare; therefore the requirements for bioinformatics support continue to grow. Data sets

must be made publicly accessible, and software must be designed that allows researchers

to perform detailed re-analysis of data, using various statistical packages. This area is the

focus of Chapter 3, which describes our work on the development of a standard data format

for proteomics. A second issue is that there are currently no major public databases for

publishing proteome data sets, although several are in development. In Chapter 5, there

is a description of a database for proteomics that we have developed as a prototype for a

public repository. The database supports two on-going projects at the University of Glasgow,

described in Chapters 6 and 7.

The emergence of proteomics has been driven by the development of new technologies,
although one of the most commonly used approaches is still protein separation
by two dimensional gel electrophoresis (2-DE). 2-DE was first developed in the 1970s,
and pioneered in the 1980s by Angelika Görg and colleagues [136], and while 2-DE techniques
have improved, the experimental basis remains the same today [135]. The main technique
for identifying proteins is mass spectrometry (MS), in which major technical
advances, coupled with the development of software, have enabled clear identification of proteins,
even in mixed samples. Gel based proteomics is described in Section 1.2.1.

MS techniques are outlined in Section 1.2.2, other proteomics techniques are described in

Section 1.2.3, and investigations into post-translational modifications are outlined in Section

1.2.4.

1.2.1 Gel based proteomics

The majority of proteomics experiments involve a stage of protein separation, followed by a

technique for identifying proteins once isolated from the mixture. One of the most common

processes is the use of gel electrophoresis, coupled with mass spectrometry. Figure 1.2

displays a workflow from an experiment to determine the abundance of a large set of proteins.

Initially, proteins are extracted from a starting sample and solubilised using a protocol that

is dependent upon the origin of sample and the technique used. Proteomics is not restricted

to a particular area of the life sciences, but can be performed on almost any type of biological

substance, such as microbial cultures, tissues, organs, whole organisms and environmental
samples.

Figure 1.2: The data flow in a proteomics experiment. [Workflow diagram: experiment design, protein solubilisation, 2D-PAGE, image analysis, trypsin digestion, mass spectrometry (MALDI, MS/MS), sequence database search and statistical analysis; each gel yields a spot table with columns ID, Vol, X, Y and Protein.]

Figure 1.3: A sample image from 2-DE separation of proteins from Toxoplasma gondii (courtesy of A. M. Cohen). [Gel image; horizontal axis pH 4 to pH 7, vertical axis MW.]
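The final step of the data flow in Figure 1.2, adding protein identifications to the spot abundance data, amounts to a join between two data sets keyed on the spot identifier. The following Python sketch illustrates the idea under stated assumptions: the `Spot` class and the `annotate_spots` function are hypothetical names introduced here for illustration, and the values are illustrative only, loosely following the spot table shown in the figure (ID, Vol, X, Y, Protein).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Spot:
    """One row of a spot table produced by image analysis (hypothetical layout)."""
    spot_id: int
    volume: float
    x: int
    y: int
    protein: Optional[str] = None  # filled in once MS has identified the spot


def annotate_spots(spots: List[Spot], identifications: Dict[int, str]) -> List[Spot]:
    """Return new Spot records with protein identifiers attached.

    Spots without an entry in `identifications` keep their existing value,
    so unidentified spots remain in the abundance data.
    """
    return [
        Spot(s.spot_id, s.volume, s.x, s.y,
             identifications.get(s.spot_id, s.protein))
        for s in spots
    ]


# Illustrative values only.
gel_spots = [Spot(1, 454, 23, 24), Spot(2, 222, 28, 87),
             Spot(3, 12, 20, 12), Spot(4, 662, 262, 101)]
ms_results = {2: "abc1"}  # spot ID -> protein identified by a database search

for spot in annotate_spots(gel_spots, ms_results):
    print(spot)
```

Keeping unidentified spots in the result reflects the fact that abundance data exist for every detected spot, whereas a protein identification is available only for those spots that were excised and analysed.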

The solubilised protein mixture is applied to an IPG (Immobilized pH Gradient) strip

and an electric current is applied. A protein migrates to a specific position in the pH gradient

where it has no net charge, in a process known as isoelectric focusing. In the second dimension,
the strip is placed on top of a polyacrylamide gel¹ and a second current is applied. The

gel contains a denaturing agent, such as SDS, which causes the three dimensional structure

of the protein to unfold, and gives each protein a net negative charge. In this dimension

the proteins migrate into the gel to a distance that is dependent on their molecular weight.

Smaller proteins migrate furthest and tend to appear at the bottom of gels in most images.

The proteins can be visualised by staining (Figure 1.3). Different IPG strips can be used

to separate proteins with different pI (isoelectric point) values, for example a standard IPG

strip may separate on a 4 - 7 pH gradient. However, to achieve finer resolution of spots, a
narrower gradient can be used: a 5.5 - 6.5 pH gradient more accurately resolves spots with a
pI value in this range. Proteins with pI values at the extremes of the pH gradient may not be
observed on a 2-D gel.

This issue is discussed in the Limitations section.

¹ The abbreviation 2D-PAGE (Two dimensional PolyAcrylamide Gel Electrophoresis) is often used in the literature.
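The two separation dimensions described above can be illustrated with a short Python sketch that predicts roughly where a protein would appear on a gel, given its pI and molecular weight. This is a simplified illustration and not a model taken from this work: the function name `predict_gel_position`, the molecular weight limits and the assumption that migration distance is linear in the logarithm of molecular weight are all introduced here for the example.

```python
import math


def predict_gel_position(pi, mw_kda, ph_min=4.0, ph_max=7.0,
                         mw_min=10.0, mw_max=200.0):
    """Return (x, y) fractions in [0, 1], or None if the protein is off the gel.

    x is the fraction along the pH gradient of the IPG strip;
    y is the fraction down the gel (0 = top, 1 = bottom).
    The MW limits and the log-linear migration are assumptions for illustration.
    """
    # First dimension: the protein focuses where the pH equals its pI.
    if not (ph_min <= pi <= ph_max):
        return None  # pI outside the gradient of this strip
    # Second dimension: only proteins within the resolvable MW range are seen.
    if not (mw_min <= mw_kda <= mw_max):
        return None
    x = (pi - ph_min) / (ph_max - ph_min)
    # Smaller proteins migrate furthest, so they receive larger y values.
    y = (math.log(mw_max) - math.log(mw_kda)) / (math.log(mw_max) - math.log(mw_min))
    return (x, y)


# A standard 4 - 7 strip compared with a narrow range 5.5 - 6.5 strip.
print(predict_gel_position(6.0, 50.0))
print(predict_gel_position(6.0, 50.0, ph_min=5.5, ph_max=6.5))
print(predict_gel_position(8.2, 50.0))  # pI outside a 4 - 7 gradient
```

On the narrow range strip the same one pH unit is spread over the whole width of the gel, which is why spots with similar pI values are better separated.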


Image analysis and quantification of protein spot volume

A 2-D gel can be stained to visualise proteins, using Coomassie blue or silver (discussed

below), and scanned with a flat bed scanner. The scanned image is analysed with specialised

software that detects properties of protein spots, including their coordinates within the image

and an estimate of the volume of protein in the gel. Coordinates are usually specified as the

central point of a circular spot with a particular diameter, or as a set of boundary points

that specify the exact shape of the spot in two dimensions. The volume is estimated from the

darkness of each pixel within the spot. Different software packages have different methods

for quantifying the volume of protein in a spot, and most apply a strategy to normalise the

values across the gel, or a set of replicate gels. The software can match spots produced on

different gels which correspond to the same protein, and determine the relative difference in

the spot size and intensity across two or more gels. One problem that arises is that there

has been little work comparing the algorithms used for quantifying protein spots, or on the

relationship between the amount of visible spot and the actual volume of protein, which

is dependent upon the stain used. Generally fluorescent dyes give the best sensitivity and

linearity. Other stains include Coomassie blue and silver staining. Silver stains allow lower

volumes of protein to be visualised, but there is poor digestion of silver stained proteins with

trypsin and the stains are notoriously non-linear. Coomassie blue offers reasonable linearity

[200] and is widely used due to low cost, although it is less sensitive than either silver staining

or fluorescent dyes.
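As a rough illustration of how a spot volume can be estimated from pixel darkness and then normalised across a gel, consider the following Python sketch. It is not the algorithm of any of the software packages discussed here: the function names, the circular spot model, the constant background and the choice of normalising by total spot volume are assumptions made for this example.

```python
def spot_volume(image, cx, cy, radius, background=0.0):
    """Sum the background-subtracted darkness of pixels inside a circular spot.

    image  : list of rows of darkness values (0 = blank, higher = darker)
    cx, cy : coordinates of the spot centre, in pixels
    radius : spot radius, in pixels
    """
    total = 0.0
    for y, row in enumerate(image):
        for x, darkness in enumerate(row):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                # Pixels lighter than the background contribute nothing.
                total += max(darkness - background, 0.0)
    return total


def normalise_volumes(volumes):
    """Express each spot volume as a fraction of the total spot volume on the gel."""
    total = sum(volumes.values())
    if total == 0:
        return {spot_id: 0.0 for spot_id in volumes}
    return {spot_id: v / total for spot_id, v in volumes.items()}


# A tiny synthetic "gel image" of uniform darkness, for illustration only.
image = [[1.0] * 5 for _ in range(5)]
raw = {"spot_1": spot_volume(image, 2, 2, 1),
       "spot_2": spot_volume(image, 2, 2, 2)}
print(normalise_volumes(raw))
```

Because the relationship between darkness and the amount of protein depends on the stain, a sum of this kind is comparable between gels only if the stain responds approximately linearly, which is one reason why the choice of stain matters.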

A goal of computational research is to perform analysis of protein abundance values from

2-D gels produced by different laboratories, as is happening in the microarray field [86].

However, this cannot occur without significant efforts to determine how different software

packages perform gel image analysis. The ProteomeGRID is attempting this kind of analysis
by creating an automated infrastructure for analysing and comparing 2-D gels, using
high performance distributed computing [256]. Large scale analysis of images from different
sources requires software companies either to be open about the algorithms and statistical
techniques used by their software, or to collaborate to create a standardised

output. An alternative would be for researchers to release the original high-resolution scans

of images, in addition to lists of protein volumes, to enable future re-evaluation of large

collections of images in a single analysis. One analysis has been performed to compare the

quality of spot detection in two software packages (Z3 [366] and Melanie 3 [210]) [264]. It

was discovered that both perform reasonably well at detecting spots (approximately 90%
accuracy), and moderately well for detecting ratios of volumes where the ratio is not great
(less than 1:6). A more detailed analysis is required of all the different software packages that
perform image analysis. This work is beyond our scope, but a list of the software packages
available for image analysis is given in Table 1.1.

• ImageMaster published by Amersham Biosciences, http://www.amershambiosciences.com
• Melanie 4 - developed at the Swiss Institute for Bioinformatics, http://ca.expasy.org/melanie/
• DeCyder published by Amersham Biosciences, http://www.amershambiosciences.com
• PDQuest published by Bio-Rad, http://www.bio-rad.com/
• Z3 published by Compugen, http://www.2dgels.com/
• ProGenesis published by Prolific, Inc., http://www.prolificinc.com/progenesis.html
• Delta 2D published by Bio Imaging, http://www.raytest.de/bio imaging/products/delta2D/delta2d.html

Table 1.1: Software available for image analysis of 2-D gels.

At present there is little quality control over protein volume values; the values therefore
have limited use outside of the original experiment. There have been several
efforts to automate the process of comparing large collections of gel images, such as Veeser
et al. 2001 [331] and Rogers et al. 2003 [272]. These efforts are similar to the comparisons
that are being performed across large numbers of microarrays to detect patterns of gene
expression [86, 319]; however, there are several challenges that must be overcome before large
scale comparisons can be made over 2-D gels: variability in the appearance of gel spots causes
difficulties matching spots across a series of gels [338]; different staining protocols affect
the signal strength; and errors can be made in protein identification.

A review of current progress in the area of algorithms for detecting and quantifying protein

spots is given by Dowsey, Dunn and Yang [83]. This is an area in which significant future

research is required.

Difference gel electrophoresis

A major new technology in gel based proteomics is two-dimensional difference gel elec-

trophoresis [327], or DIGE2, in which two samples are labelled with different fluorescent

dyes, mixed and separated on a single gel. The gel is scanned at different wavelengths,

2Ettan DIGETM: Fluorescence 2D Difference Gel Electrophoresis [98] produced by Amersham Biosciences.


creating two images that can be compared. This removes the variability in resolving spots

on different gels thereby improving the matching of spots between gels. The system can

be adapted to use three dyes. The third dye is used to label a mixture of proteins formed

by pooling the two samples in the experiment, to improve normalisation of protein vol-

umes between different images, allowing smaller changes in protein level to be determined

as significant (Figure 1.4) [8].
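The role of the pooled standard in normalisation can be illustrated with a minimal sketch. It assumes that each spot has a volume measured in a sample channel and in the pooled-standard channel of the same gel, and that dividing one by the other gives a value comparable between images. The spot identifiers, volumes and function names are hypothetical, and commercial DIGE software applies more elaborate normalisation than this.

```python
def normalise_to_pooled_standard(sample_volumes, standard_volumes):
    """Return per-spot ratios of sample volume to pooled-standard volume."""
    ratios = {}
    for spot_id, volume in sample_volumes.items():
        standard = standard_volumes.get(spot_id)
        if standard:  # skip spots missing from, or zero in, the standard channel
            ratios[spot_id] = volume / standard
    return ratios


def fold_change(ratios_a, ratios_b):
    """Compare two samples through their standard-normalised ratios."""
    shared = ratios_a.keys() & ratios_b.keys()
    return {
        spot: ratios_b[spot] / ratios_a[spot]
        for spot in shared
        if ratios_a[spot] != 0
    }


# Hypothetical spot volumes from one three-dye gel.
standard = {"spot1": 100.0, "spot2": 50.0}
sample_a = {"spot1": 80.0, "spot2": 50.0}
sample_b = {"spot1": 160.0, "spot2": 25.0}

changes = fold_change(
    normalise_to_pooled_standard(sample_a, standard),
    normalise_to_pooled_standard(sample_b, standard),
)
print(changes)
```

Because every image is expressed relative to the same pooled mixture, ratios from different gels can be compared without matching raw volumes directly.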

Limitations of gel electrophoresis

There are several limitations of 2-DE technology. Firstly, membrane and nuclear proteins

tend to be highly hydrophobic and difficult to solubilise, therefore they often do not appear

on a gel [3]. Secondly, high molecular weight proteins do not migrate well through gels and

may not be detected. Thirdly, 2-DE tends to detect high abundance proteins and many

functionally important proteins may be present only in small quantities. Finally, it is fairly

common for multiple proteins to co-migrate to the same spot, causing problems quantifying

the volume of individual proteins. However, this limitation can be avoided by the use of

narrow range pH gels, or zoom gels that improve the resolution of gel spots. Another

advance in gel electrophoresis is sample prefractionation. A protocol reported by Zuo and

Speicher in 2002 [370] can resolve complex mixtures of proteins by first separating proteins

into separate pools based on the charge of proteins. Each fraction of the sample is analysed

by 2-DE, performed over several overlapping narrow range pH gradient gels. This technique

allows more low abundance proteins to be detected as there is a general improvement in

spot resolution, and high abundance proteins are less likely to mask or interfere with other

protein spots. The detection of membrane proteins by 2-DE has been improved by systematic

analysis of the different variables and constituents of buffers to maximise the solubility of

membrane proteins, allowing improved loading of the proteins onto gels [277]. A review

of optimised solubilisation procedures for resolving membrane proteins is given by Molloy

[217]. The poor reproducibility of 2-DE is often discussed as a major limitation, however

the gradual improvements in protocols for the two dimensions mean that reproducibility of

2-DE is now fairly high [317].


[Figure 1.4 schematic. Labels: Sample A; Sample B; Sample pooling; Extract proteins; Attach blue label, Attach green label, Attach red label; Recombine samples; Separate by 2-DE; Scan gel at three wavelengths; Image 1, Image 2, Image 3 (blue, green and red channels); Combined Image.]

Figure 1.4: A schematic of a difference gel electrophoresis experiment.


1.2.2 Mass spectrometry

Ionisation types

The most common method of protein identification in proteomics is mass spectrometry (MS;

a review of techniques is given by Mann [203]). In gel based proteomics, a protein spot

is excised from the gel and digested with a protease that cleaves the protein at specific,

predictable positions along its length to form a set of peptides. The most commonly used

protease is trypsin. The peptide mixture is applied to a matrix, and a laser is fired at it at

a particular wavelength. The matrix is chosen to absorb at that wavelength, causing

the peptides to become ionised. This process is matrix-assisted laser desorption ionisation

(MALDI) as developed by Karas and Hillenkamp in the late 1980s [180, 151], which is often

used for identifying proteins in conjunction with gel electrophoresis. An alternative ionisation

approach is electrospray, first developed by Fenn and colleagues [102], in which a liquid

containing the peptide mixture is forced through a gold or platinum plated glass capillary

with a fine tip, at a high voltage, causing small droplets to form in a spray. The droplets

evaporate, imparting their charge to the peptides.

Detection

There are various methods for detecting the mass of the peptides that have been ionised.

Time of flight (TOF) is often coupled with MALDI (MALDI-TOF), and functions in the

following way. A laser fires at the matrix, imparting a fixed amount of kinetic energy to the

peptides. The ionised peptides travel through the mass spectrometer and reach the detector

in an amount of time that is dependent on the mass of the peptide, hence smaller peptides

travel faster. Therefore, the mass of each peptide can be determined from the length of time

taken to reach the detector.
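The relationship between flight time and mass can be made concrete with a short sketch. It assumes an idealised linear instrument in which an ion of charge z accelerated through a voltage V gains kinetic energy zeV and then drifts a fixed length L, so that m/z = 2eVt²/L². The drift length and voltage used below are illustrative assumptions, and real instruments are calibrated against known standards rather than computed from first principles.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27  # kilograms per dalton


def flight_time(mass_da, charge, voltage, length):
    """Time in seconds for an ion to drift `length` metres (idealised model)."""
    mass_kg = mass_da * DALTON
    velocity = math.sqrt(2 * charge * ELEMENTARY_CHARGE * voltage / mass_kg)
    return length / velocity


def mass_from_flight_time(time_s, charge, voltage, length):
    """Invert the model: recover the mass in daltons from the flight time."""
    mass_kg = 2 * charge * ELEMENTARY_CHARGE * voltage * time_s ** 2 / length ** 2
    return mass_kg / DALTON


# Assumed settings: 20 kV accelerating voltage, 1 m drift tube, singly charged ions.
for mass in (500.0, 1000.0, 2000.0):
    t = flight_time(mass, 1, 20000.0, 1.0)
    print(f"{mass:7.1f} Da arrives after {t * 1e6:.2f} microseconds")
```

Under this model the flight time grows with the square root of the mass, which is why lighter peptides reach the detector first.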

A quadrupole detector is commonly used with electrospray ionisation. A quadrupole

consists of four electrically charged rods to which an oscillating voltage is applied. Peptides

travel through the quadrupole, but only at a particular amplitude of the electric field can

a peptide of a given mass reach the detector. Therefore, a range of amplitudes is scanned,

allowing the mass of a peptide to be inferred from the amplitude at that time. A similar

system is the quadrupole ion trap in which ions enter a device that comprises several elec-

trodes trapping the ions inside. Various voltages are applied to the electrodes to eject ions

according to their mass:charge ratios. The ions are focused and detected using an electron

multiplier [177].


Figure 1.5: An MS trace viewed with Voyager software [339].

A recent advance in detection is Fourier Transform Ion Cyclotron Resonance (FTICR)

mass spectrometry [205]. FTICR can be coupled with either MALDI or electrospray ionisation;

ions are collected in a cell (ICR trap), which is surrounded by a large electromagnet

that causes the ions to resonate. The resonance can be detected by an electrode and converted

into a mass:charge ratio, producing a similar spectrum to that produced from TOF

or quadrupole detection.

Data interpretation

Regardless of the method of ionisation, the result is a list of peptide masses on an MS

trace (Figure 1.5). Initially, a noise reduction procedure may be performed on a trace to

remove very weakly detected masses that are unlikely to be the result of genuine peptides.

The software supplied with the mass spectrometer can perform this task automatically but

the researcher may also manually select the strong peaks that they believe correspond to

peptides. The complete set of peptide masses, called the peptide mass fingerprint, can be

used to identify the protein. The list of masses is entered into a search engine that queries a

database of protein sequences, or translated DNA sequences, on which a theoretical digest

is performed. The search engine allows the researcher to specify which protease was used for

digesting the protein and calculates, for every protein in the database, the expected peptide

masses that would result from using that protease. Table 1.2 displays some of the software

that is available for searching peptide mass data. The software finds the proteins in the


• PROWL - http://prowl.rockefeller.edu/

• MOWSE - http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse

• ProteinProspector - http://prospector.ucsf.edu/

• MASCOT - http://www.matrixscience.com

• SEQUEST - http://fields.scripps.edu/sequest/

• PepMAPPER - http://wolf.bms.umist.ac.uk/mapper/

Table 1.2: Software available for searching mass spectrometry data.

database that have a set of predicted peptide masses that match most closely the observed

peptide masses. The software produces output that includes a statistical score indicating the

likelihood of a correct match, the number of peptides matched, and the percentage coverage

of the peptides matched out of the entire protein sequence. Each value has a statistical

basis, but the researcher uses a combination of these measures that is dependent on various

criteria, to decide if a protein has been correctly matched. In some cases, obtaining complete

coverage of the proteome may be of primary importance, and a low threshold will be used

that allows some false positives. In other situations, finding the exact identity of a single

protein is crucial and a high threshold will be used.
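The theoretical digest and matching step described above can be sketched as follows. This is a simplified illustration and not the algorithm of any of the packages in Table 1.2: it assumes trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline, allows no missed cleavages, uses neutral monoisotopic masses, and ranks proteins by a plain count of matched masses in place of a statistical score. The two database sequences are invented.

```python
import re

# Monoisotopic residue masses in daltons.
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(sequence):
    """Cleave after K or R, except before P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONOISOTOPIC[aa] for aa in peptide) + WATER


def count_matches(observed, sequence, tolerance=0.2):
    """Number of observed masses explained by the theoretical digest."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(
        any(abs(mass - t) <= tolerance for t in theoretical) for mass in observed
    )


def rank_proteins(observed, database, tolerance=0.2):
    """Order database entries by how many observed masses they explain."""
    scores = {
        name: count_matches(observed, seq, tolerance)
        for name, seq in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical two-entry database and a fingerprint taken from protein_A.
database = {"protein_A": "GASPVKTCLNR", "protein_B": "DQEMHFYWK"}
fingerprint = [peptide_mass("GASPVK"), peptide_mass("TCLNR")]
print(rank_proteins(fingerprint, database))
```

Raising or lowering the number of matches required before a hit is accepted corresponds to the choice of threshold discussed above.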

Tandem mass spectrometry

The peptide mass fingerprint method does not always identify a protein with sufficient confi-

dence. In these cases, an alternative approach called tandem mass spectrometry, or MS/MS,

can be used. MS/MS is so called because it involves two sequential MS stages. The first

stage separates the peptides by their mass but, rather than each ionised

peptide hitting a detector, a peptide is selected and collided with an inert gas such as

argon or nitrogen. The collision causes the bonds between amino acids to split, resulting in

a range of ionised fragments. The mass of each ionised fragment is detected in the second

MS stage. For example, if the selected peptide contains eight amino acids, the fragmentation

would produce new peptides containing 8, 7, 6, 5, 4 amino acids and so on in the second

stage. The difference in mass between successive fragments corresponds to the exact mass of

the amino acid that is lost between the two fragments. The masses of the fragments can be

read from right to left on a trace, revealing the amino acid sequence of the peptide (Figure

1.6). The peptide sequence, or the set of masses from the MS/MS trace, can then be searched

against a sequence database to find an exact match (or near exact) that will conclusively

identify the protein.
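Reading a sequence from the differences between fragment masses can be sketched in a few lines. The example assumes an idealised ladder from a single ion series in which successive fragments differ by exactly one residue; real spectra contain several overlapping series, gaps and noise. Leucine and isoleucine have the same mass and are reported together as L, and the function name and tolerance are illustrative choices.

```python
# Monoisotopic residue masses in daltons; "L" stands for leucine or isoleucine.
RESIDUE_MASSES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_sequence(ladder, tolerance=0.02):
    """Infer residues from successive mass differences in a fragment ladder.

    Residues are returned in order of increasing fragment mass; a gap that
    matches no single residue within the tolerance is reported as "?".
    """
    masses = sorted(ladder)
    residues = []
    for lighter, heavier in zip(masses, masses[1:]):
        diff = heavier - lighter
        best = min(RESIDUE_MASSES, key=lambda aa: abs(RESIDUE_MASSES[aa] - diff))
        if abs(RESIDUE_MASSES[best] - diff) > tolerance:
            best = "?"
        residues.append(best)
    return residues


# Build a hypothetical ladder by adding one residue at a time to a 200 Da fragment.
ladder = [200.0]
for aa in "GASPV":
    ladder.append(ladder[-1] + RESIDUE_MASSES[aa])

print("".join(read_sequence(ladder)))
```

Whether the resulting string reads from the N-terminus or the C-terminus depends on which ion series the ladder comes from, which this sketch does not attempt to determine.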


Figure 1.6: Three traces from a tandem mass spectrometry experiment, reproduced from [189]. Image (a) displays the first MS stage from which the two strongest peptides are selected for fragmentation. The results of the second stage fragmentation are shown in (b) and (c). The difference between the mass of the peaks, shown on the y-axis, corresponds to the mass of the individual amino acids that form the peptide sequences shown.


Standardising mass spectrometry

One of the major limitations of MS is that there is neither any standardisation across the

methods employed by different instruments to measure peptide masses, nor in the input

parameters for the instruments. One effort to remedy this situation is provided in a study

by Purvine and colleagues [259]. They created a standard mixture of peptides and proteins,

which they assayed by liquid chromatography and MS (LC-MS), coupled with a database

search engine. The system correctly identified 23 peptides and 12 proteins from the mixture.

The experimental methodology has been released as a standard for assessing the quality of

studies, to see how effectively other systems can identify different proteins from within the

mixture.

The peak list generated from MS is usually entered into a search engine to identify the

protein. Each different application has its own measures of the quality of a protein match

and a researcher often decides, using a combination of measures, whether an identification is

correct. The measures of correct matching often depend upon the software being used, and a

cut-off is determined by each laboratory, using their own criteria that depend upon the type

of experiment. This means that there is no standard method for comparing the likelihood

of a correct match between data produced from different laboratory setups. Therefore, it

is very difficult to ascertain in large data sets the statistical probability that a protein has

been correctly identified. The efforts of the Proteomics Standards Initiative to solve some of

the standardisation problems are described in the following chapter.

1.2.3 Other proteomics techniques

One of the main criticisms of 2-DE based proteomics is the unreliability of estimates of

protein volume made by image analysis. The stain used to visualise protein spots greatly

affects the linearity of the relationship between true protein volume and the spot density

measured by analysis software. There is little information in the published literature about

the accuracy of measurement of protein volumes, therefore in the past results have often

been qualitative: spots are present on one gel and absent on another, or clearly up or down

regulated with large fold differences. However, recent advances in staining or labelling of

proteins, such as DIGE analysis, and improvements in software have enabled quantitative

measurement of protein volume from 2-D gels [181, 328]. In the microarray domain there

has been substantial work on the quantification and statistical analysis of results to be able

to say what differences are statistically significant (examples include [130, 267, 312]). The


interpretation of results would be easier if quantitative analysis of proteomics data sets could

be performed. Towards this goal a set of new experimental techniques have been devised for

quantifying protein volumes in samples, as described below.

A limitation of 2-DE based proteomics is that highly abundant proteins are identified

much more readily than low abundance proteins. Many functionally significant proteins,

such as transcription factors, are present in low copy number in the cell, and it is vital

that these proteins can be assayed. Therefore, techniques have been developed that perform

proteomic analysis using separation techniques other than 2-DE, which detect proteins that

are expressed at low levels.

Liquid chromatography and tandem mass spectrometry

A technique has been developed in the labs of John Yates at the Scripps Institute, for identi-

fying large numbers of expressed proteins. This technique is unbiased with regard to protein

volume, protein charge or molecular weight, and can identify membrane proteins [344]. The

technique is known as MudPIT (Multidimensional Protein Identification Technology). Mud-

PIT is a further development of a technique reported in 1999, in which two dimensional

liquid chromatography (LC) is coupled directly with MS (LC-MS, Figure 1.7) [195].

There are many variations in the functionality of LC but the principle is that a solution

containing the proteins or peptides to be separated is applied to a column. The column

contains substances that create a gradient to fractionate the mixture based on the charge or

hydrophobicity of the proteins [290]. Reverse phase (RP) chromatography is often performed

in proteomics, in which a column is filled with an aqueous solution and there is an increasing

gradient of an organic solvent. Different fractions are eluted from the column according

to their hydrophobicity as the gradient of solvent increases. The fractions can be collected

for further separations or analyses, such as mass spectrometry, because RP can be directly

coupled to electrospray ionisation. One of the limitations of this technique is that a complex

mixture of proteins, such as the entire proteome of a sample, often cannot be adequately

resolved. This problem can be overcome by performing two-dimensional chromatography

in which two sequential stages are performed, which separate on different properties of the

mixture. The first stage is often ion-exchange chromatography, for instance eluting particular

proteins using different concentrations of KCl in stages, causing proteins or peptides to

separate differentially according to their charge.

MudPIT was used with the SEQUEST software for performing database searches [93] in


[Figure 1.7 schematic. Labels: Denatured protein complex; Peptides (pH < 3); 2D chromatographic separation of peptides; Peptide fragmentation using tandem mass spectrometry; Computational translation of tandem mass spectra to amino acid sequences using genomic sequences; Identified proteins in complex.]

Figure 1.7: Two dimensional liquid chromatography coupled with MS for identifying large numbers of proteins from a mixture, reproduced from [195]. Two phases of LC are performed: (i) strong cation exchange (SCX) for separating by charge, (ii) reversed phase (RP) separating by hydrophobicity, followed by tandem mass spectrometry.


a study reported in 2001 [344]. The technique was used to identify almost 1500 proteins from

the Saccharomyces cerevisiae proteome, including proteins with extremes in pI, MW, abun-

dance and hydrophobicity. Many studies have been performed to determine the proteome

of S. cerevisiae by 2-DE and MS, however previous to this analysis the largest study had

resolved only 279 proteins [245]. A later refinement of the process was reported by Peng et al.

2003 [243] in another study of the yeast proteome, using two dimensional chromatography,

coupled with tandem mass spectrometry. The study identified a similar number of proteins,

approximately 1500, and reported a very low rate of false positives (less than 1%).

ICAT

The technique of mass spectrometry for protein identification has been discussed above, but

if performed using a standard protocol, MS does not produce quantitative output. This is

because the heights of peaks on a trace are very poorly reproducible, and do not generally correlate

well with the amount of protein in the starting sample. In 1999, a new technique was

reported by Gygi and colleagues [142], in which MS was coupled with liquid chromatography

for protein separation, and proteins from two different samples could be compared concur-

rently. The scheme is shown in Figure 1.8 and consists of labelling proteins from two different

conditions with ICAT reagents (Isotope-Coded Affinity Tags). ICAT has a component that

binds cysteine residues in proteins, with an isotopically heavy reagent binding proteins in

one sample, and an isotopically light reagent binding proteins in the other sample. The sam-

ples are combined, and enzymatically cleaved to produce peptides. The ICAT reagent also

includes biotin which allows peptides to be extracted with an avidin affinity column because

avidin binds biotin with a very high affinity. Peptides labelled with the ICAT reagent are

captured in the affinity column. The peptides are then analysed in a mass spectrometer

which reveals a pair of adjacent peaks for each peptide. The adjacent peaks are separated

by a difference of 8 Da, which is the difference in mass between the heavy and light reagents.

The ratio of the peak heights represents the relative volume of protein that was present in

the two samples. At this stage, there is no information about protein identity. The peptides

undergo a second stage of MS, in which peptides are fragmented (MS/MS,

described above) to reveal the amino acid sequence that in many cases can be used to search

a sequence database, correctly identifying the protein. In the original paper describing the

method, ICAT was used to analyse the volume of proteins in two cultures of yeast growing

in different media. The authors were able to identify subtle changes in protein expression


[Figure 1.8 schematic. Labels: Cell State 1 (light ICAT); Cell State 2 (heavy ICAT); Combine samples and proteolyse; Affinity isolation of ICAT peptides; Quantify relative protein abundance by measuring ratio of peaks; MS/MS analysis to identify protein from sequence of peptide A. Plot axes: Relative abundance against Mass/charge, with peaks labelled Peptide A, Peptide B, Peptide C and Peptide D.]

Figure 1.8: The ICAT method for quantitative proteomics.


that correlate well with previously published data. One limitation of this method is that the

reagents bind cysteine residues, and cysteine is one of the rarest amino acids. However, the

first publication about ICAT suggests that the percentage of cysteine-free proteins in yeast

is only 8%.
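The quantification step can be sketched as a search for pairs of peaks separated by the 8 Da shift, followed by a ratio of their heights. The sketch assumes singly charged peptides carrying one labelled cysteine each, so that the heavy and light forms sit exactly 8 Da apart; the peak list, tolerance and function name are hypothetical.

```python
ICAT_SHIFT = 8.0  # Da between heavy and light forms, as stated in the text


def find_icat_pairs(peaks, shift=ICAT_SHIFT, tolerance=0.05):
    """Return (light_mass, heavy/light intensity ratio) for each peak pair.

    `peaks` is a list of (mass, intensity) tuples. Assumes singly charged
    peptides with a single labelled cysteine.
    """
    pairs = []
    for light_mass, light_intensity in peaks:
        for heavy_mass, heavy_intensity in peaks:
            separation = heavy_mass - light_mass
            if abs(separation - shift) <= tolerance and light_intensity > 0:
                pairs.append((light_mass, heavy_intensity / light_intensity))
    return pairs


# Hypothetical peak list: one labelled pair and one unpaired peak.
peaks = [(1000.5, 200.0), (1008.5, 400.0), (1500.2, 50.0)]
print(find_icat_pairs(peaks))
```

A ratio of 2.0 for the pair at 1000.5 Da would indicate that the peptide was twice as abundant in the sample labelled with the heavy reagent.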

SILAC

A similar approach for quantifying protein abundance is SILAC (stable isotope labelling by amino acids

in cell culture), presented by Blagoev in 2003 [36]. In this approach, an isotopically heavy form of

arginine or leucine, labelled with 13C, is incorporated into the medium in which cells are

growing. A separate culture is grown in normal medium for a different condition. The

proteins are then extracted, digested with a protease and analysed by mass spectrometry.

Each peptide that contains an arginine residue is represented by a pair of adjacent peaks,

caused by a slight increase in mass of the peptide in the heavy carbon medium. It is expected

that all proteins contain arginine residues. The method was utilised to examine the EGFR

(Epidermal Growth Factor Receptor) pathway. One culture was stimulated with EGF, the

other was not stimulated. The cells from both cultures were lysed, mixed in a 1:1 ratio, and

an affinity column was used to extract proteins that interact with EGFR. The differences

in volume for proteins implicated in EGFR processes were accurately determined from pairs

of peptides, as for the ICAT method. However, SILAC can only be used for cell cultures

growing in a medium whereas ICAT reagents are used to label the proteins after they have

been extracted from the sample, therefore there are fewer restrictions on the samples that

can be analysed with ICAT.
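The size of the mass increase seen in the heavy medium can be estimated with a small calculation. The sketch assumes that all six carbon atoms of arginine are replaced by 13C, which is an assumption made here for illustration and is not stated in the text, and uses the mass difference between 13C and 12C of about 1.003355 Da.

```python
C13_MINUS_C12 = 1.003355  # Da, mass difference between 13C and 12C


def silac_shift(peptide, labelled_residue="R", labelled_carbons=6):
    """Expected mass increase of a peptide grown in the heavy medium.

    Assumes every occurrence of the labelled residue is fully substituted
    and that `labelled_carbons` carbons per residue carry 13C.
    """
    return peptide.count(labelled_residue) * labelled_carbons * C13_MINUS_C12


for peptide in ("TCLNR", "GASPVK"):
    print(peptide, round(silac_shift(peptide), 5))
```

A peptide without arginine, such as the second one here, has a shift of zero and therefore gives a single peak, which is why only arginine-containing peptides appear as pairs.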

Other differential labelling strategies

ICAT and SILAC were two of the first procedures reported for labelling proteins to quantify

their abundance on a large scale by mass spectrometry. However there are various other

labels that have been used to create “heavy” and “light” isotopes that can be detected by

MS. An example is iTRAQ (isobaric Tags for Relative and Absolute Quantitation), which

functions in a similar way to ICAT but has the advantage that more than two samples can

be compared concurrently [15]. Labelling with 16O/18O water (H2O) [214], deuterium and various

other tags attached to amino acid sidechains has also been applied to protein quantification (reviewed

in [284]). It is likely that these methods will begin to overtake gel based quantitation of

protein abundance as they do not suffer from the same limitations in the range of proteins


that will be identified.

1.2.4 Post-translational modifications

The genome sequence is a static representation of biology, and while it is possible to predict

the amino acid sequence of proteins with a high degree of accuracy, this does not reflect the

complete picture of proteins as functional units in cells. The chemical alteration of proteins,

known as post-translational modification (PTM), is a common phenomenon that occurs in a

time and signal controlled manner. Modifications include the addition and removal of phos-

phate groups (phosphorylation and dephosphorylation), which are well known mechanisms

for controlling the catalytic and signalling activity of proteins [172]. For example, receptor

tyrosine kinases (RTKs) potentiate external signals to the inside of cells. RTKs reside in

cell membranes and, when bound by a ligand, change in conformation, switching on their

kinase activity. The RTK subsequently binds and phosphorylates other proteins within the

cell, transmitting the signal downstream [206].

The addition of carbohydrate molecules to proteins, termed glycosylation, is the most

common type of modification. Analysis of glycosylation, or “glycomics”, describes studies to

find all the carbohydrate molecules produced by a protein, and already 5000 genes have been

assigned as having a potential role in the synthesis of carbohydrates across all the sequences

deposited in GenBank [337]. Other types of modification are acetylation, methylation and

cysteine oxidation. In general, modifications cause proteins to change in conformation, lead-

ing to the protein translocating to another part of the cell, or causing new protein interactions

to form. Modifications play a role in maintaining the tertiary (the 3-D conformation of a

single protein unit) and the quaternary (multi-protein complex) structure of proteins, and

are therefore ultimately associated with function.

Identifying PTMs

Protein modifications can be identified by 2-DE coupled with MS, and various other methods

for their detection have been developed (a review of techniques is given by Mann and Jensen

[204]). Distinct protein spots can be observed on a 2-D gel that correspond to differentially

modified forms of the same protein. Phosphorylation can be observed in the case of different

spots positioned in a horizontal line, due to a change in the protein’s charge (pI) with only a

negligible change in molecular weight (Figure 1.9). Glycosylation of proteins (the addition of

chains of carbohydrates) causes a change in molecular weight and pI, causing variant forms of


Figure 1.9: A two dimensional gel highlights possible different phosphorylation states of Protein disulfide isomerase from a human cell line (image courtesy of M. Nelson).

proteins to appear in a diagonal line. MS can conclusively identify modifications, for example

if the peptide mass fingerprint reveals a peptide with a shift in mass that corresponds exactly

to the known mass of a modification type. Tandem mass spectrometry is even more accurate,

and can reveal the exact amino acid position of the modification if one amino acid displays a

characteristic increase in mass. However, there are several problems using this technique on a

large scale. Firstly, phosphopeptides are low in abundance and extract poorly from gel slices.

Secondly, during MALDI-TOF only a proportion of peptides reach the detector, therefore

often they may not be detected. Thirdly, while using electrospray ionisation, phosphorylated

peptides ionise poorly in acidified solvents. Finally, in MS/MS the situation is worse, as only

a few peptides in the entire sequence may be detected, therefore the majority of the protein

sequence is not analysed, and modifications on the rest of the protein are silent.
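Identifying a modification from a shift in peptide mass can be sketched as a lookup against a table of known mass changes. The three monoisotopic shifts below are standard values for the modifications named in this section; the tolerance, function name and example masses are illustrative assumptions, and practical tools also consider combinations of modifications and instrument error.

```python
# Monoisotopic mass added by each modification, in daltons.
MODIFICATION_SHIFTS = {
    "phosphorylation": 79.96633,  # addition of HPO3
    "acetylation": 42.01057,      # addition of C2H2O
    "methylation": 14.01565,      # addition of CH2
}


def identify_modification(observed_mass, expected_mass, tolerance=0.01):
    """Name the modification whose mass shift explains the difference, if any."""
    shift = observed_mass - expected_mass
    for name, delta in MODIFICATION_SHIFTS.items():
        if abs(shift - delta) <= tolerance:
            return name
    return None


# Hypothetical peptide with an expected unmodified mass of 605.29553 Da.
expected = 605.29553
print(identify_modification(expected + 79.96633, expected))
print(identify_modification(expected + 5.0, expected))
```

The second call returns None because a shift of 5 Da corresponds to none of the listed modifications.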

There are methods for improving the detection of modifications including the use of

affinity columns that bind phosphorylated proteins [103], to enrich for these proteins as they

often occur as a small proportion of the total amount of a single protein. One such method

is Immobilized Metal Affinity Chromatography (IMAC) in which columns are loaded with a

metal ion-containing resin that causes phosphopeptides to bind under acidic conditions [354].

Other techniques used to identify modifications include Western blot analysis whereby pro-


teins are treated with specific antibodies that are known to bind particular phosphorylation

sites on peptides. The antibodies can be fluorescently labelled, allowing differences in fluo-

rescence signal to detect the amount of phosphorylated protein. A similar approach is the

use of autoradiography, whereby radiolabelled 32P is incorporated into proteins, which can

then be quantified [179].

The development of new techniques means that data sets of PTMs are rapidly increas-

ing in size, and good database support is required to make the information available to

researchers to avoid manual analysis of the literature. One estimate suggests that there are

at least 200,000 published PTMs in PubMed [285]. It is a major research challenge to make

the information on PTMs available in the context of large scale investigations.

1.2.5 Case studies of proteomics research

In this section, examples are given of proteomic investigations we have studied. Chapters

3 and 4 will return to this topic and discuss the development of standard data formats for

proteomics, and Chapter 5 will outline a database system that has been implemented to aid

research.

A major part of the development process of the standard was the capture of the re-

quirements of proteomic research. Three case studies of current research activity at the

University of Glasgow, which use proteomic techniques, have been performed. Two case

studies of research in parasitology are summarised below (Case Studies 1 and 3), which

ultimately contributed to the work described in Chapters 6 and 7. Case study 2 outlines

a collaboration at the Beatson Institute (see footnote 3) with the research group of Prof. Walter Kolch,

investigating the MAP Kinase signalling pathway. The data from case study 2 were not

available for inclusion in RAPAD but the experimental setup was taken into consideration

during the development of the model presented in Chapter 3.

1.2.6 Case study 1

This case study is derived from work with researchers in the field of microbial pathogenesis

[61] at the Institute of Biomedical and Life Sciences, University of Glasgow. The researchers

wish to investigate the changes that occur in the proteome of a human cell line (the host)

during invasion with the parasite Toxoplasma gondii compared with non-infected host cells.

A set of replicate samples is obtained and the proteins are extracted from each sample,

Footnote 3: The Beatson Institute for Cancer Research, www.beatson.gla.ac.uk.


solubilised and separated by 2-DE. The gels are scanned and image analysis is performed

to match spots on different gels corresponding to the same protein. Protein spots showing

differential expression are extracted from the gel and prepared for MS. Many proteins are

identified conclusively by MS. The next stage involves characterising the large number of

hits that are obtained. There are a large number of Internet accessible resources about

human proteins which can only be searched manually. This process is very time consuming

for a large data set. If database searches could be automated, many more proteins could be

analysed in one study, and greater insights could be made. After a long period of manual

database searching, a significant amount of information is obtained about each protein, but

there is no simple mechanism for summarising or managing the information.

The researchers also wish to identify post-translational control mechanisms, to determine

if a protein expressed during parasite invasion has been modified, compared with the same

protein in non-invaded cells. Potential modifications can be found by 2-DE if a protein

migrates to a different position on one gel compared with another gel, the result of a slight

change in the charge or molecular weight of the protein caused by the modification. The

modification can be positively identified on an MS trace by discovering a peptide with a mass

that is different from the expected value, and the difference corresponds to the mass of an

additional group, such as an extra methyl residue. However, to discover modifications that

are functionally important, the researcher must have information about how the protein is

modified in other conditions. These efforts are hindered because there are no major databases

of MS traces or modifications available. An annotated database, containing a large number

of MS traces, would greatly improve the identification of modifications in two ways. Firstly,

annotated traces for proteins with confirmed modifications could be mined to improve the

algorithms for the detection of modifications in other proteins. Secondly, if a particular

protein already has an entry in the database, differences in the modification pattern could

be highlighted, and investigated further to determine if the modification is significant for the

function of the protein.

1.2.7 Case study 2

This case study was conducted at the Beatson Institute, in collaboration with Prof. Walter

Kolch. A cell line was obtained in which the protein Raf-1 is knocked out. The protein

is known to be involved in major metabolic processes in the MAP kinase pathway [38],

and researchers wish to discover the downstream effects of the loss of Raf-1. Gels are


run using a difference gel electrophoresis system, labelling proteins from the knockout cell

line with one dye, and from a normal cell line with a different dye. A series of replicates

are run, and the gel images are analysed. The researcher has a number of questions they

wish to pose, for example which spots show significant differential expression between

the samples, and what the identities of these proteins are. After statistical analysis, two

hundred spots showing the greatest difference in expression are highlighted for further study.

The two hundred spots are robotically picked from the gel and prepared for MS. MS traces

are analysed, peak lists are produced and entered into applications that search genome

databases. The searches identify approximately one hundred and fifty proteins that reside in

databases, of which many have only basic functional annotation. The researcher wishes to

further characterise the proteins by searching other relevant databases, of which about ten

exist. The researcher must manually browse Internet sites to assemble information and read

bibliographic references, which takes several hours for each protein, or up to days if extensive

literature searches are required. Therefore, to characterise all one hundred and fifty

proteins in detail could take several weeks for a single researcher.
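
The spot selection step described in this case study can be illustrated with a minimal sketch. It assumes, hypothetically, that each spot has log-transformed volumes from replicate gels for the knockout and normal samples, and it ranks spots by the absolute value of Welch's t-statistic. The spot names, the values and the choice of statistic are illustrative assumptions, not the analysis that was actually performed.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two lists of replicate measurements."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def top_spots(volumes, n):
    """Return the n spot ids with the largest absolute t-statistic.

    volumes maps a spot id to (knockout replicates, normal replicates).
    """
    scored = {spot: abs(welch_t(ko, wt)) for spot, (ko, wt) in volumes.items()}
    return sorted(scored, key=scored.get, reverse=True)[:n]


# Hypothetical log-transformed spot volumes from three replicate gels.
example = {
    "spot_001": ([2.1, 2.0, 2.2], [1.0, 1.1, 0.9]),
    "spot_002": ([1.5, 1.4, 1.6], [1.5, 1.6, 1.4]),
    "spot_003": ([0.8, 0.9, 0.7], [1.2, 1.3, 1.1]),
}
print(top_spots(example, 2))
```

In the case study the same ranking would be cut at two hundred spots; a real analysis would also need to handle spots with zero variance and to correct for multiple testing.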

Once the proteins have been characterised, the researcher wishes to build a mathematical

model of the changes that occur in the metabolic pathway, caused by the loss of function of

Raf-1. Data for the model are to be drawn from the 2-DE studies, a microarray experiment

that has been carried out by another research group on the same cell line, and biochemical

studies carried out over several decades by many different research groups. The process

of retrieving data from the biochemical studies is extremely laborious because little of the

data reside in accessible databases, therefore extensive literature searches are required. The

microarray data sets have been published by other research groups, and are available on

the Internet, but do not have any information about how the cell lines were cultured. In

addition, the database identifiers (accession numbers) for the features on the microarray do

not match the identifiers of the proteins identified by MS. Therefore, it is not possible to

make any direct comparison with changes observed in the 2-DE studies. The major problems

highlighted by this case study are lack of tools for the integration of data from distributed

databases, and insufficient information stored with published data for it to be re-used.
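
The identifier mismatch described above can be made concrete with a short sketch. It assumes a cross-reference table between microarray feature accessions and protein accessions is available; the table, the accession strings and the function name are all hypothetical and serve only to show where the integration breaks down when no mapping exists.

```python
def integrate(microarray, proteins, cross_refs):
    """Pair microarray features with MS-identified proteins.

    microarray maps a feature accession to an expression log ratio,
    proteins maps a protein accession to an abundance log ratio, and
    cross_refs maps a feature accession to a protein accession.
    Features that cannot be paired are returned separately so that the
    gap in the data is visible.
    """
    matched, unmatched = [], []
    for feature_id, expression in microarray.items():
        protein_id = cross_refs.get(feature_id)
        if protein_id in proteins:
            matched.append((feature_id, protein_id, expression, proteins[protein_id]))
        else:
            unmatched.append(feature_id)
    return matched, unmatched


# Hypothetical accessions and values, for illustration only.
microarray = {"MA_0001": 1.8, "MA_0002": -0.4, "MA_0003": 2.3}
proteins = {"PROT_A": 2.1, "PROT_C": 0.2}
cross_refs = {"MA_0001": "PROT_A", "MA_0002": "PROT_B"}

matched, unmatched = integrate(microarray, proteins, cross_refs)
print(matched)
print(unmatched)
```

Only one of the three features can be compared directly in this toy example, which mirrors the situation in the case study: without shared or mappable identifiers most of the two data sets cannot be related.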

1.2.8 Case study 3

This study was performed with Prof. Mike Turner at the Institute of Biomedical and Life

Sciences, University of Glasgow, in the context of an investigation to determine the proteome


of the parasite Trypanosoma brucei. The genome sequence of T. brucei is nearing completion,

but many genes have little functional annotation and it is hypothesised that proteome

investigations can aid the annotation process. The data from this investigation form the

basis for Chapter 7.

Proteomics experiments can aid annotation efforts by conclusively identifying proteins

that are expressed under particular conditions. There are many other examples of published

work in which researchers have used proteomics techniques to catalogue the set of proteins

present in a sample of interest, to determine the entire proteome of particular cell types,

organelles or microorganisms (examples include whole yeast cells [115], the human heart

mitochondrion [316] and the plasma membrane of yeast [223]). The organism being studied

may have no genome sequence, or the sequence may be incomplete, therefore there are

significant problems conclusively identifying spots found on a 2-D gel. In some cases, several

2-D gels may be run to separate proteins within different pH ranges. Spots from the gels

are picked, and prepared for MS. Four scenarios for the results of database searches with

peptide masses obtained from an MS trace are possible:

1. A good match to a sequence in the genome database, with functional annotation.

2. A match with no annotation but with homology to sequences from other organisms.

3. A match with no annotation and no homologous sequences.

4. No match in any genome database, for example if the identification has been made

from an expressed sequence tag (EST) database.
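
The four scenarios listed above can be expressed as a simple classification. The sketch below is an illustration under stated assumptions: the `SearchResult` record and its three flags are hypothetical simplifications of what a database search actually returns.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    genome_match: bool    # a match was found in the genome database
    annotated: bool       # the matched sequence has functional annotation
    has_homologues: bool  # homologous sequences exist in other organisms


def scenario(result):
    """Return the scenario number (1 to 4) for a database search result."""
    if not result.genome_match:
        return 4
    if result.annotated:
        return 1
    return 2 if result.has_homologues else 3


print(scenario(SearchResult(genome_match=True, annotated=False, has_homologues=True)))
```

Storing the scenario alongside each identification would make it straightforward to select the identifications worth re-examining when a new database version is released.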

Genome sequencing and annotation work is only partially complete for many organisms,

therefore a major problem arises due to the dynamic nature of the sequence databases. After

the release of each new database version, sequences are more likely to fall into scenarios

1 and 2. However, it is extremely difficult to identify which sequences have been updated

between database versions and the information cannot be accessed without repeating all

the initial searches. The sequence identifiers may also change between database releases,

therefore automating the process of searching for protein records that have been updated is

a major challenge. The sequence of peptides from an MS/MS experiment can also be used

to discover new genes within the genome, or act as an identifier for genes that previously

had not been sequenced and that fall into scenario 4.
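
One way to approach the versioning problem is sketched below. It assumes, hypothetically, that two releases of a sequence database are available locally as mappings from identifier to sequence and annotation, and it matches records on the sequence itself because identifiers may change. The identifiers and sequences are invented toy values; a sequence that has itself been revised would be reported as new, so a practical tool would need similarity searching as well.

```python
def updated_records(old_release, new_release):
    """Find records whose identifier or annotation changed between releases.

    Each release maps identifier -> (sequence, annotation). Records are
    matched on the sequence, because identifiers may not be stable.
    """
    old_by_seq = {seq: (ident, annot) for ident, (seq, annot) in old_release.items()}
    changes = []
    for new_id, (seq, new_annot) in new_release.items():
        if seq not in old_by_seq:
            changes.append((new_id, "new sequence"))
            continue
        old_id, old_annot = old_by_seq[seq]
        if old_id != new_id:
            changes.append((new_id, f"identifier changed from {old_id}"))
        if old_annot != new_annot:
            changes.append((new_id, "annotation updated"))
    return changes


# Hypothetical releases with toy sequences.
old = {
    "Tb0001": ("MKVL", "hypothetical protein"),
    "Tb0002": ("MAAL", "hypothetical protein"),
}
new = {
    "Tb0001": ("MKVL", "kinase, putative"),
    "Tb9002": ("MAAL", "hypothetical protein"),
    "Tb0003": ("MQQR", "hypothetical protein"),
}
for change in updated_records(old, new):
    print(change)
```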


1.2.9 Publication of proteomics data

There is a growing body of publications in which researchers have utilised a global approach

to study the proteins in a system. A search of PubMed for the word “proteomics” returns over

3500 articles (July 2004). Articles describing gel based proteomics usually include a printed

image of one or more gels, often with a table containing proteins that have been identified

(example [208]). In some cases, there is a comparative analysis across several conditions and

the ratios of the volume of proteins are displayed in a table (example [129]). Experiments

involving different separation techniques, such as liquid chromatography, coupled with MS

for protein identification often display the chromatograms for the different fractions, and

images of MS traces (examples [369, 361]). The proteins that have been identified are also

usually presented in a table. Most publications reproduce the protocols for MS, and give a

reference to the software used for protein identification, but rarely is there any detail about

the input parameters for the software or the version of the database that was searched, and

there is variability in the significance cut-off that was used for protein identifications. It

is therefore often not possible to assess the statistical probability that proteins have been

correctly identified without substantial manual effort.

The data from proteomic studies are usually not open to any kind of automated analysis,

even if publications are reproduced electronically on the Internet. This is because the results

are often embedded within images, which cannot be extracted, or the results are written in

the main body of text, which must be read manually to understand the context. This cannot

be automated using current information retrieval techniques. We focus on the challenges of

making proteome data widely accessible in Chapter 3.

1.3 Gene expression techniques

The techniques described above attempt to assess the status of the proteins within a system.

However, the experiments present technical challenges due to the difficulties of extracting

very low volumes of proteins from the cell. There is also no technique for proteins equivalent

to PCR (polymerase chain reaction), which amplifies nucleic acid sequences, that can amplify

the amount of a protein. Therefore, in the last decade, techniques have been developed for

assessing how strongly genes are expressed by measuring the messenger RNA (mRNA) levels

produced. These techniques are described in this section.


1.3.1 The development of microarrays

Microarrays were first developed in the mid 1990s from two different approaches. One of

the first developments in microarrays was achieved by Shalon and colleagues in 1996, who

developed a protocol for attaching DNA fragments to a glass slide, and hybridising two sets

of yeast chromosomes, labelled with different fluorophores [287]. A paper was published later

that year by DeRisi and colleagues outlining how microarrays, formed by spotting cDNA

(complementary DNA) onto a slide, can be used to assay gene expression in the context of classifying

differences in human tumour cell lines [76]. A different article was published at the same time

outlining the use of microarrays for detecting mutations in a gene implicated in breast cancer

from a number of patients [145]. Each cDNA “feature” corresponds to the complementary

sequence of the mRNA that is produced for each gene to be assayed.

Affymetrix arrays

An alternative approach was pioneered by the Affymetrix company in which very short (10-50

base pairs) stretches of DNA are synthesised on the chip using a technique inherited from

the semi-conductor industry, called photolithography [5]. Short sequences of DNA bases

(oligonucleotides) are synthesised on the chip, one base at a time in specific positions. The

process uses fine masks over the chip that allow light to reach particular positions, which

causes the specific degradation of a “blocking residue” that prevents additional bases being

added to an oligonucleotide chain. The chip is then washed with a solution containing

whichever base (A, C, G or T) is required in the next position at the unmasked oligonucleotide,

attached to a new blocking residue. A new mask is applied and the next set of

bases are added (Figure 1.10). In this way, chains of nucleotides can be built up one base at

a time.
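
The masking cycle described above can be simulated in a few lines. The sketch below is a simplified model, not the manufacturer's procedure: it assumes each chip position has a known target sequence and that bases are offered in a fixed order (an assumption made here), and it records which positions are unmasked at each wash.

```python
def synthesise(targets, base_order="ACGT"):
    """Build every oligonucleotide in targets one base at a time.

    targets maps a chip position to the sequence wanted there. For each
    wash, only the positions whose next required base matches the base
    in the wash are unmasked, so only those chains are extended.
    """
    chains = {pos: "" for pos in targets}
    steps = []
    while any(chains[p] != targets[p] for p in targets):
        for base in base_order:
            unmasked = [
                p for p in targets
                if len(chains[p]) < len(targets[p])
                and targets[p][len(chains[p])] == base
            ]
            if unmasked:
                for p in unmasked:
                    chains[p] += base
                steps.append((base, sorted(unmasked)))
    return chains, steps


chains, steps = synthesise({0: "AC", 1: "CA"})
print(chains)
for base, positions in steps:
    print(base, positions)
```

Each entry in `steps` corresponds to one mask and one wash, which shows why the number of masks, rather than the number of features, limits the length of oligonucleotide that is practical to synthesise.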

Measuring expression

Using either of the two approaches outlined above, the result is a chip or slide containing

up to tens of thousands of reporters. Each reporter detects the level of expression for

one gene. When a gene is expressed, mRNA is produced as a signalling molecule, which

is later translated into a protein, the functional unit in the cell. It is believed that the

relative amount of mRNA in one cell compared with another is indicative of the rate of gene

expression and can give insights into the genes that cause the differences between samples.

Figure 1.10: A summary of the technique involved in the creation of Affymetrix microarrays; image obtained from [5].

Two sets of mRNA from samples produced under different conditions (for example, one normal,

one disease) can be labelled with different fluorescent compounds (one red, one green) and

attached to the array. The ratio of red to green for each reporter gives the difference in

expression for each gene between the two samples. For Affymetrix arrays only one sample

is assayed at a time (a one-colour array), and two different samples must be compared on

two separate hybridisations to the chip. Statistical processing is performed to ensure that

values obtained from different assays can be compared. Large changes in expression for a

gene, between a normal and a disease sample, may implicate the gene in the disease process.
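
For a two-colour array, the red to green ratio is commonly reported on a log2 scale so that increases and decreases in expression are symmetric about zero. The following sketch assumes background-corrected intensities are already available for each reporter; the use of log2, the reporter names and the values are assumptions made for illustration, and normalisation between channels is omitted.

```python
from math import log2


def log_ratios(red, green):
    """Return log2(red / green) for each reporter with positive intensities."""
    return {
        reporter: log2(red[reporter] / green[reporter])
        for reporter in red
        if reporter in green and red[reporter] > 0 and green[reporter] > 0
    }


# Hypothetical intensities: red = disease sample, green = normal sample.
red = {"gene_a": 800.0, "gene_b": 100.0, "gene_c": 250.0}
green = {"gene_a": 200.0, "gene_b": 400.0, "gene_c": 250.0}
print(log_ratios(red, green))
```

A value of 2 indicates four-fold higher signal in the red channel, a value of -2 four-fold lower, and 0 no change.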

Since the early days of research the use of microarrays has grown at a remarkable rate.

A simple search of PubMed for the word “microarray” reveals almost 6000 articles published

since 1996. Each experiment generates a large amount of data: most studies involve many

parallel assays, with each assay containing thousands of data points. Therefore, as a general

estimate, each published study could generate several hundred thousand data points. In addition,

we should also consider the genes’ annotation, experimental protocols, and statistical

processing. The challenges in database support for microarrays are clearly very large. These

requirements were realised by the MGED (Microarray and Gene Expression Data) society

in the late 1990s [42], which was established to improve support for publishing, querying

and exchanging microarray data sets. The issues of data standardisation, and the creation

of public databases, are discussed in the following chapter.


1.3.2 Serial analysis of gene expression

The technique of serial analysis of gene expression (SAGE) was first reported in 1995 by

Velculescu and colleagues [332] as a method for quantifying the expression of genes, prior

to the invention of microarrays. The basic principle is that short tags (10-14 base pairs),

which uniquely identify the transcript of the gene, are obtained for each gene to be assayed.

A sample is obtained, and the tags are isolated from the transcripts, reverse transcribed

(converting mRNA back into DNA), and concatenated to form a long stretch of DNA. The

newly formed DNA is sequenced, and the number of times each tag appears indicates the

level of expression of each gene. In 1997 the technique was successfully used to assay the

expression of over 4000 genes in yeast, which was one of the first examples of a

technique to perform high-throughput analysis on a whole system [333].
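
The counting step at the heart of SAGE can be sketched as follows. This is a deliberately simplified model: it assumes the concatenated sequence consists of fixed-length tags joined end to end with nothing between them, which ignores the linker and enzyme recognition sequences present in real concatemers. The tag sequences and the tag-to-gene table are hypothetical.

```python
from collections import Counter


def count_tags(concatemer, tag_length=10):
    """Split a concatenated sequence into fixed-length tags and count them."""
    tags = [
        concatemer[i:i + tag_length]
        for i in range(0, len(concatemer) - tag_length + 1, tag_length)
    ]
    return Counter(tags)


def expression_levels(concatemer, tag_to_gene, tag_length=10):
    """Sum tag counts per gene; tags with no known gene are ignored."""
    levels = Counter()
    for tag, count in count_tags(concatemer, tag_length).items():
        if tag in tag_to_gene:
            levels[tag_to_gene[tag]] += count
    return levels


# Hypothetical tags and genes.
tag_to_gene = {"AAAAACCCCC": "gene_a", "GGGGGTTTTT": "gene_b"}
concatemer = "AAAAACCCCC" + "GGGGGTTTTT" + "AAAAACCCCC"
print(expression_levels(concatemer, tag_to_gene))
```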

1.4 Other techniques used in functional genomics

The main focus of our research is to improve computational support for proteomics, and to

integrate the results of protein abundance experiments with gene expression values. However,

it is also important that technology can be extended to capture and integrate data from all

types of functional genomics experiment. This section contains a brief overview of other

types of large scale experiments which may yield data needed for functional genomics.

1.4.1 RNA interference

RNA interference (RNAi) is a technique first developed in Caenorhabditis elegans [108]. It

is a powerful method for removing the function of a gene without having to develop genetic

crosses, or engineer complex methods for deleting the gene from the genome. In certain

species, simply injecting the organism with double stranded RNA of the same sequence as

the targeted gene prevents the gene being translated into protein. The same effect can also

be achieved to a lesser extent using single stranded anti-sense RNA. The resulting phenotype

of the gene knockout allows researchers to assign a function to a gene, as long as the knockout

is not lethal, and this has proved vital for investigating C. elegans. The vast majority of the

predicted 20000 genes have been tested with RNAi. Similar experiments have been performed

in plants, in Drosophila and in the disease causing parasites, trypanosomes (a review is given

by Hannon [147]). There is some evidence that RNAi may be effective in mammalian cells,

although this has not yet been conclusively demonstrated, and the complete mechanism for


RNAi is not currently understood. However, RNAi is a highly specific technique that allows

researchers to determine the function of genes on a large scale.

1.4.2 Immunohistochemistry

The position of a protein in a cell or tissue can be localised using immunohistochemistry,

which is a widely used technique in molecular biology [69]. A particular protein can be

viewed under a microscope using a specific antibody to which a fluorescent label has been

attached, such as green fluorescent protein (GFP [163]), or a radioactive tag. More generally,

proteins can be visualised in a sample using silver staining.

The position of the protein in the cell can be visualised, and differences in the pattern

of labelled proteins can be used to classify samples. Localisation information may provide

clues to the function of a protein. For example, a protein shown to be highly expressed

in cell membranes may prove to be a transporter or membrane receptor. The technique

can be modified to visualise two proteins concurrently, using two different fluorescent labels

attached to antibodies against the two proteins to be studied. In one study, 75% of the yeast

proteome was analysed by this method, totalling 4156 proteins, allowing researchers to infer

significant functional information [157].

1.4.3 Metabolomics

Proteins and mRNA sequences are not the only molecules that can give information about the

current state of a system. Biological reactions are catalysed by proteins, but the reactants are

in fact small molecules, such as citrate, glucose or NADPH, known as metabolites. Researchers

have developed techniques to analyse the metabolites within one system compared with

another, for example to determine the difference between bacterial strains, or to analyse

the critical changes in metabolite concentration during a disease process. The study of the

entire set of metabolites as a diagnostic tool has become known as metabolomics (current

progress is reviewed by Weckwerth 2003 [346]), and the term metabonomics has also been

used. According to Nicholson 2002 [227], metabonomics is the study of metabolic profiles in

vivo in whole organisms, biofluids or tissues.

In theory, mass spectrometry could be used directly to detect the metabolites present

in a sample, by detecting the mass of all the metabolites and comparing with a reference

database. In practice, an additional stage is used to separate metabolites according to their

molecular mass, prior to MS, to increase the resolution. The additional stage can be liquid


or gas chromatography (GC) [105]. The principle of GC is similar to LC but uses a column

filled with an inert gas, rather than a solution. The mixture undergoes a process that causes

it to become gaseous, and small molecules separate according to a property, such as mass

or charge. There have been several studies that determine the metabolites present in plant

samples using LC/MS or GC/MS, examples include [271, 105, 347].
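
The idea of comparing detected masses with a reference database can be illustrated with a small sketch. It assumes the observed values have already been converted to neutral monoisotopic masses, and it uses a relative tolerance in parts per million; the tolerance, the observed values and the three-entry reference table (with approximate monoisotopic masses) are assumptions chosen for illustration.

```python
# Approximate monoisotopic masses in daltons for a few common metabolites.
REFERENCE_MASSES = {
    "pyruvic acid": 88.0160,
    "glucose": 180.0634,
    "citric acid": 192.0270,
}


def match_masses(observed, reference=REFERENCE_MASSES, tolerance_ppm=20.0):
    """Map each observed mass to the reference metabolites within tolerance."""
    hits = {}
    for mass in observed:
        hits[mass] = [
            name for name, ref in reference.items()
            if abs(mass - ref) / ref * 1e6 <= tolerance_ppm
        ]
    return hits


print(match_masses([180.0630, 192.0275, 150.0000]))
```

Metabolites with identical or very similar masses cannot be distinguished in this way, which is one reason a chromatographic separation stage is placed before the mass spectrometer.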

An alternative approach for determining the metabolome is nuclear magnetic resonance

(NMR) [306]. NMR can detect a fingerprint of the metabolites in a sample that contain 1H,

13C, 15N, or 31P when pulsed with a radio frequency. The atomic nuclei give information

about the chemical environment within a magnetic field. NMR has the advantage over

MS that it is not destructive of the sample, and in some cases can be used in a non-invasive

manner for analysing tissues. This kind of metabolomics is used for diagnostics, to determine

the characteristic fingerprint of the metabolites present in a particular bacterial strain, or a

diseased tissue.

1.4.4 Protein interaction studies

Proteins rarely act as single units in cells, but form complexes with other proteins to create

new functions. It is therefore an essential part of functional genomics research to gain

insights into the interaction partners of proteins. The main experimental techniques for

such studies are summarised here.

One of the main technologies developed in the late 1980s is the Yeast Two-Hybrid system,

which works in the following way [107]. The DNA binding domain of a transcription factor A

is fused to protein X, and the activation domain of transcription factor A is separated and

fused to protein Y. Transcription factor A switches on a gene that causes a visible change in

a cell culture, causing cells to grow rapidly, or a particular colour to develop. Transcription

factor A can only switch on the gene if its two domains come into contact, caused by protein

X and protein Y interacting (Figure 1.11). The two-hybrid method has been employed on

a large scale to analyse protein interactions in yeast [323], C. elegans [343] and Helicobacter

pylori [263]. In the study on yeast, researchers plated 192 “bait” proteins, and assayed almost

all of the 6000 predicted proteins as “prey”, revealing 281 protein-protein interactions. The

reverse study was also performed, using all the predicted proteins as bait against a library

of prey proteins, revealing a further 700 protein interactions. This system has proved vital

for determining functionally significant interactions; however, it has disadvantages [56]. It is

based on transcriptional activation, and therefore forces interaction partners to localise together

in the nucleus, producing a large number of false positives. Therefore, other methods are

usually required to confirm the interactions identified by Yeast Two-Hybrid analysis. In

addition, the fusion of proteins to the transcription factor domains may block sites required

for interactions, or required for modifications that must occur before interaction, such that

genuine interactions may be missed.

Figure 1.11: A summary of Yeast Two-Hybrid experiments, reproduced from [56].

An alternative method for detecting protein-protein interactions is affinity purification

of multiprotein complexes. In this method a single protein A is fused to a tag that can be

purified using an antibody that is attached to an affinity column. Proteins that bind to A,

forming a complex, can be pulled out. The complex is separated on a one- or two-dimensional

gel and identified by MS. This system has been used in yeast to identify 3617 interactions

with 493 baits [152]. A similar method is tandem-affinity purification (TAP tagging) in

which protein A is fused to a tag that binds IgG beads in a column [120, 270]. Other proteins

interact with protein A forming a complex. The TAP tag contains a highly specific protease

cleavage site to enable the complex containing protein A to be extracted from the column

without disrupting the interactions. The proteins within the complex can subsequently be

identified by gel electrophoresis and mass spectrometry. The affinity based methods have

the advantage over Yeast Two-Hybrid that interactions take place under conditions that are

much closer to natural cellular conditions, although interactions may not be detected if the

interacting proteins are not in high abundance.

Figure 1.12: Affinity methods for assaying protein interactions, reproduced from [56].

A new advance in understanding protein interactions is the development of protein microarrays

(or protein chips) [251]. The basic technique involves immobilising a set of recombinant

proteins to a surface, such as a membrane or slide. The chip can then be assayed with

a protein or antibody attached to a fluorescent molecule. Any protein spot that fluoresces

is likely to be an interaction partner for that protein or antibody. Multiple proteins can be

tested against the chip in sequence, to generate data about protein interactions on a large

scale. There are currently several technical difficulties with the production of protein chips.

However, although protein chips are still at the “proof-of-concept” stage, new techniques

for printing protein spots, immobilising correctly folded proteins and detection should soon

make this technique widely available to researchers, enabling rapid, large scale surveys of

protein interactions.

1.4.5 Three dimensional structures

The three dimensional structure of a protein is one of the most insightful pieces of information

about its function, particularly if a structure is obtained in which a ligand is bound

to the active site. The resolution of 3-D structures is a major research field and might be

considered outside the scope of functional genomics. However, in recent years an effort has

been initiated to perform high-throughput generation of protein structures, which has been


termed structural genomics, or structural proteomics [360]. Large collections of recombinant

proteins are screened in parallel for the ability to form crystals, each using a range of experimental

conditions. An early example of the success of this approach was demonstrated

by Christendat and co-workers in 2000, in which 10 structures were published simultaneously

[57]. As of July 2004, the Protein Data Bank (PDB) contained over 26,000 structures,

and this number is likely to increase exponentially as the structural proteomics effort gains

momentum.

1.5 Investigations across the “omics”

Large scale investigations are being undertaken in many labs, working on a great range of

organisms. The techniques used depend upon the organism, for example in the nematode

worm, C. elegans, RNAi is one of the best methods for investigating the function of genes (a

review is given by Lee and colleagues [191]). However, RNAi is not a viable method for some

other species. In mice, more common techniques include the development of “knock-out”

mice, whereby targeted recombination replaces a specific gene in embryonic stem (ES) cells.

The ES cells are then injected into blastocysts, which can form embryos when implanted in

a pseudo-pregnant mouse. The resulting litter contains certain mice with the gene knocked out,

from which a strain of mice can be developed. The phenotype of the resulting strain

gives information about the function of a gene [308].

A summary of the FG approaches that have been used in yeast is given by Castrillo and

Oliver [50]. Yeast has been a very important model organism, and many of the techniques

described in this chapter were first developed in a yeast model. Current investigations in

yeast focus on finding all the genes in the genome, using bioinformatics approaches [190]. In

addition, various high-throughput approaches have been used to study the transcriptome4

[165], proteome [115] and metabolome [9]. Investigations in parasitology form the basis for

the work in Chapters 6 and 7, and FG studies on other organisms are too numerous to cover in

detail. However, in the following section a brief description is given of studies in which more

than one type of approach has been used to study a system, such as genome, transcriptome,

and proteome analysis.

4 The transcriptome is the complete mRNA abundance of a sample.


1.5.1 Comparative studies

There are several examples of published work in which researchers have characterised a

biological system by applying more than one type of functional genomics technique, and

in the next few years it will become common for researchers to perform parallel analysis

of the transcriptome, proteome and metabolome. In 1999, two papers reported similar

analysis on yeast to determine the global gene expression and to compare this with protein

abundance data, in an attempt to find the correspondence between the rate of transcription

and translation [115, 143]. The paper by Futcher and colleagues [115] compared protein data

from 2-DE, using LC-MS for identification, against mRNA data from SAGE and microarrays.

The results suggested that the correlation between gene and protein expression is high. They

found that approximately one molecule of mRNA gives rise to 4000 molecules of protein. The

study published earlier that year by Steven Gygi and colleagues at the University of Washington compared

data from 2-DE and SAGE, and found a very poor correlation between gene expression

and protein abundance [143]. In their study, certain groups of proteins that had the same

level of abundance had mRNA levels that varied 30-fold. Conversely, genes with similar

levels of mRNA produced proteins that varied up to 20-fold in volume. The difference in

the two studies may result from anomalies in the experimental techniques that produced the

data, or the statistical model used to perform the comparison.
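
The kind of comparison made in these studies can be sketched as a correlation between mRNA and protein measurements for the genes present in both data sets. The example below computes Pearson's correlation coefficient on log-transformed values; the choice of statistic, the log transformation and the data are assumptions made for illustration and do not reproduce the statistical models used in either paper.

```python
from math import log10, sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mrna_protein_correlation(mrna, protein):
    """Correlate log10 abundances over the genes measured in both data sets."""
    shared = sorted(set(mrna) & set(protein))
    return pearson(
        [log10(mrna[g]) for g in shared],
        [log10(protein[g]) for g in shared],
    )


# Hypothetical copy numbers in which every mRNA yields 4000 proteins.
mrna = {"g1": 1.0, "g2": 10.0, "g3": 100.0, "g4": 5.0}
protein = {"g1": 4000.0, "g2": 40000.0, "g3": 400000.0}
print(mrna_protein_correlation(mrna, protein))
```

In this idealised data set the correlation is perfect; the 20- to 30-fold variation reported in [143] would appear as a much lower coefficient, and the result is also sensitive to which genes are shared between the two data sets.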

A study by Lee and colleagues in 2004 performed comparative analysis of gene expression

and protein abundance in yeast, using microarrays and 2-DE, to establish which genes and

proteins were up-regulated in a particular mutant strain [192]. Of the 4290 genes assayed by

microarrays, fifty-four were found to be differentially expressed. Eighteen differentially

expressed proteins were observed by comparative 2-DE analysis, of which 14 were

identified by MS. The study revealed that many of the sequences differentially expressed in

both analyses had similar functions, but the overall data sets were too small to perform any

kind of statistical correlation analysis between the rate of transcription and translation. This

study exemplifies the current problems hindering large scale comparison of microarray and

protein abundance results. There are few studies that make protein abundance data publicly

available, and therefore it is difficult to determine how accurately the level of mRNA predicts

the volume of the corresponding protein. For this to be possible, data must be pooled from

several different studies, which requires the deposition of experiments in a public repository,

where the results are formatted in a standard way. Moreover, it is likely that there will

be significant variation in the relationship between mRNA and protein production. This


might occur both at the protein class level, and at the species level. Thus, the discovery of a

single process to govern transcriptional control of protein production may be unlikely. The

problems of standardisation, and public deposition of data, are addressed in the following

chapter.

There are many large FG studies that are currently being performed on a variety of

organisms, and in the next few years it is likely that studies analysing more than one level of

the central dogma5 will become widespread. It is clear from the studies that microarrays are

a powerful tool for finding genes that have an important role in a process, but single data

points may not be able to predict accurately the abundance of functional protein, if analysed

independently of the entire data set. Protein abundance values may be a more accurate

measure of the amount of functional material but the experiments are less reproducible, and

cannot be performed at the same throughput level as microarrays. Therefore, a combination

of approaches will provide a more complete picture of the status of the system and the data

will feed into models of cellular and physiological processes, allowing the vision of systems

biology (as described in Section 1.1.2) to be realised. The issues involved with integrating

data from microarrays and proteomics are explored in detail in Chapters 5 and 6.

1.6 Summary

The techniques described in this chapter provide insights into gene and protein function,

with new technological developments allowing researchers to generate very large data sets

on a previously unimaginable scale. The monetary cost of such ambitious experiments is

extraordinarily high, since they are dependent on an expanding range of complex machinery

requiring high levels of technical expertise. Therefore, there is an economic requirement

to maximise the amount of information from each experiment, and to provide flexible data

storage capable of repeated interrogation.

An important consideration is how to interpret data from large scale approaches, and how

to place statistical confidence on findings derived from the data. It is critical that more than

one experimental approach is utilised; for example, microarray results are often confirmed

by PCR or Northern analysis, and differential expression of proteins can be confirmed using

antibodies in a Western analysis. The combination of results from more than one level of the

“omics”, for example comparing mRNA and protein level, will enable much higher confidence

5 “The Central Dogma of Molecular Biology” was proposed by Francis Crick to explain that the information flow usually ran from DNA to RNA to protein [64].


to be placed on functional assignments. The data sets will ultimately feed into models that

are used to generate an overview at the level of the whole system. Before this can be achieved,

a significant body of work is required to improve public databases for functional genomics

data, and community wide agreement is required on standard formats to which published

experimental data must conform. An overview of the current work in this area is the focus

of the following chapter.

Chapter 2

Databases, standards and

ontologies for the life sciences

2.1 Introduction

In the previous chapter the techniques that comprise functional genomics research were

described, along with the computational challenges they present. In particular, the focus

was on proteomics research, for which we have developed proposals for a data standard, and

a new database system, described later in the thesis. This chapter contains a description of

the major research developments in database technology for functional genomics (FG) and

other life sciences domains. FG experiments require the development of standard formats

for transferring data between research groups and sending datasets to central repositories.

Ontologies are controlled vocabularies of terms describing a particular domain, and are vital

for data interchange and archiving in FG. Current advances in standards and ontologies are

described.

2.1.1 Computational support for the life sciences

In theory, building good databases for life sciences should be no different from building conventional

databases for commerce, banking and industry; however, in practice there are a

number of key differences. Relational database management systems (RDMS) have been

designed to support commercial applications with relatively simple data types: most concepts

required for a banking database can be represented by strings, integers and floating

point numbers. In addition, this area is standardised to a large extent, as there are well

designed packaged solutions that can be purchased. The huge growth in life sciences data, to

which massive public access is required, presents new challenges to the database community.

Consider that the human genome sequence, even without the annotation of genes, is a set

of 3 billion characters, which must be queried in a number of different ways. It is not easy

to query DNA code stored in tables in an RDMS; additional indexes and software

have therefore been designed de novo and run alongside database applications to provide access to

the data. The situation in functional genomics research is even more complex due to the

heterogeneity of data sets produced by different laboratories.

In proteomics, high resolution images of 2-D gels are an integral part of a data set, to

which significant information must be attached. RDMS can store images, but do not offer

any facilities for querying data within images, or any image comparison. As the field of

proteomics is developing rapidly, there are frequent changes and improvements in the types

of experiment, in laboratory equipment and new software. The number of different data

formats that a bench researcher must deal with is large, and providing an integrated view of

all the data within even a single experiment is a challenge. Once a study has been performed,

researchers often spend significant periods of time searching online databases to characterise

genes and proteins that have been highlighted by their study. Each year the Nucleic Acids

Research journal (NAR) has a special issue, the Molecular Biology Database Collection,

describing all the databases that are freely available over the Internet [117]. In 2004, the

collection contained 548 different databases, many of which are relevant to functional ge-

nomics. Most databases can be queried via the Internet, but the results of queries are often

embedded in web pages that are very difficult to process automatically. Alternatively, many

databases offer a download of their entire contents in a bespoke text format that requires

specific software for handling. A complete data set assembled by a researcher could contain a

great variety of file formats, high-resolution images with annotation, experimental protocols

written in lab books, and large quantities of raw and statistically analysed data. It is vital

that experimental data is made available to other research groups. The publication of results

only in journals is no longer sufficient because data sets are simply too large to comprehend

by reading alone. Research is required to develop local databases for laboratory manage-

ment, and centralised public repositories [273]. Standardisation of formats must occur to

enable developers to create software that can process results into a single file that can be

used for sending data to centralised repositories, or to other research groups.

2.1.2 The future accessibility of data

The remarkable growth of the World Wide Web in the last decade has changed the face of

business and research, by enabling information to be made globally accessible, in an instant.


The Web has altered the way scientists publish their data, as almost all journals are now

accessible on the Internet, and can be searched very rapidly with an index. Our libraries

are not yet defunct, but are certainly under threat. This model of Web publishing is still

far from ideal because almost all web pages are intended to be read and understood by

people, and not by computer systems. Additional software has been created to allow the

Web to be searched, but the search engines utilise only a fairly simple index of the text in

web pages, and generally ignore the context. For example, it would be desirable to be able

to find automatically all the databases in the NAR Molecular Biology Database Collection

which contain information about proteins, query them for a specific protein, and summarise

the results. Unfortunately this will not be possible in the near future because there is no

standard mechanism for automatically discovering the types of data stored in a particular

system, or how they can be accessed. The solution to these problems may be found by the

Semantic Web [342], the next generation architecture of the Web.

The Semantic Web has been proposed by Tim Berners-Lee, the founder of the WWW, as

a global network of resources that are machine understandable [31]. The basic premise is that

web sites will be created using technologies that allow them to specify the objects described

in the web pages, the relationships between objects and how the web sites can be accessed.

An essential component will be ontologies, which are controlled vocabularies containing terms

that have a strict definition, and a specified source location, to ensure that a version of a

term is used in different contexts with exactly the same meaning. Ontologies can contain

a set of rules associated with terms, which allow the terms to be processed in computer

systems. Software can discover the relationships between terms, and perform reasoning,

to ask logical questions of a resource described using an ontology [133]. A hypothetical

biological example is as follows. All databases within the NAR Molecular Biology Database

Collection are made accessible through the Semantic Web, using a software package that is

freely available, similar to the HTML editors that are used to produce current web pages.

A database specifies what it contains, such as the three-dimensional structures of proteins,

and that it can be accessed by querying with a URL (Uniform Resource Locator) followed

by the term ?query=PROTEIN NAME. The terms that describe the contents and methods of

accessing the database are obtained from a controlled vocabulary that resides elsewhere on

the Web, to ensure that the same terms are used by different databases. Software can then

be developed that automatically discovers the 3-D structure database, queries it for a protein

name, and processes the results as required by the user.
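The access pattern in this hypothetical example can be sketched in a few lines of Python. This is only an illustration: the base address `http://www.example.org/structures` and the helper name `build_query_url` are assumptions, and only the `query` parameter is taken from the text above.

```python
from urllib.parse import urlencode

# Hypothetical base address of a 3-D structure database (an assumption).
BASE_URL = "http://www.example.org/structures"


def build_query_url(protein_name: str, base_url: str = BASE_URL) -> str:
    """Return base_url followed by ?query=<protein name>, URL-encoded."""
    return base_url + "?" + urlencode({"query": protein_name})


if __name__ == "__main__":
    print(build_query_url("cytochrome c"))
```

A client would still need the controlled vocabulary to learn that the parameter is called `query` and that the database holds protein structures; the sketch covers only the final step of forming the request.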


This has clear implications for biomedical research, and it is one of the areas that will

benefit most from the Semantic Web [173]. The life sciences, unlike the axiom-based sciences,

rely on knowledge acquisition about a domain, and have been subject to an unavoidable

historical bias caused by the interests of the particular researcher investigating an area. The

advent of functional genomics removes much of the bias because, rather than an experiment

being designed to test a hypothesis, the experiment itself generates hypotheses about the

function of genes, proteins or entire systems. The results presented in a journal publication

could still be focused on a researcher’s particular interests, but the whole data sets will often

contain far more information than is highlighted in the original publication, which could be

valuable to many other research groups. The Semantic Web has the potential to maximise

the knowledge derived from a single experiment, by making it as widely accessible as possible.

For a knowledge-based science, clearly this will be a major advance.

The Semantic Web will be built using a number of technologies, of which several al-

ready exist (described in Section 2.2). Extensible Markup Language (XML) has become the

primary notation for exchanging information over the Web, and most standard formats for

the life sciences are expressed in XML. XML itself cannot express how concepts are related

to each other; this functionality is offered by the Resource Description Framework (RDF),

which can describe the location of objects on the Web, and how objects relate to each other.

Finally, the development of ontologies will be vital for ensuring that terminology is used in

a standard way, and various formats for expressing ontologies have been developed. Current

progress in ontologies for biomedical research is presented in Section 2.5.

The vision of the Semantic Web may be realised in the next decade, but in the nearer

term many of the concepts can be applied now, to improve the facilities for data publishing

and exchange. The results of functional genomics experiments must be made accessible in

public databases. Later in the chapter there is a description of the public databases that

currently exist for functional genomics data (Section 2.4), although neither the problem of

developing standard access methods, nor the challenge of data integration (Section 2.6), have

yet been solved. The development of central repositories is not possible without standard

exchange formats that researchers must use to express their data sets. A description of

current developments in data standardisation is also given (Section 2.3).


2.1.3 Guide to the chapter

The structure of the chapter is as follows. The formats used to express data standards and

ontologies are described first (Section 2.2). Since the development of public repositories is a

major challenge without common data formats, previous work in standardisation is described

in Section 2.3. A summary of databases that have been developed for life sciences is presented

in Section 2.4. There are a number of newly established efforts to design ontologies to capture

biological information, described in Section 2.5. Finally, there are major efforts by a number

of research groups to bring all the diverse parts of related information together in common

systems (data integration), described briefly in Section 2.6.

2.2 Technology required for data standards

2.2.1 Extensible Markup Language: XML

The emergence of data standards has been tied to the rise in usage of Extensible Markup

Language [101] (XML) as a data interchange format in e-commerce, industry and research.

The importance of XML for bioinformatics has been recognised for some time [2]. An XML

document has a hierarchy of tagged elements, in which the name of the tag describes the data

type that follows. XML has been described as semi-structured data because the document

is self-describing [44], unlike the tuples1 in a relational database, which have little meaning

in the absence of the database schema. An example of a partial record in the native format

from the PIR (Protein Information Resource) database [254] is given (Figure 2.1), along with

the same data stored in XML (Figure 2.2) and a representation of how the same data could

be stored in a relational database (Figure 2.3).

XML has become the most commonly utilised format for expressing data standards and

ontologies because there are a large number of applications that can automatically process

XML documents [279, 82], unlike bespoke text formats that require processing software to

be re-written every time there is a change to the format. Many life sciences databases now

offer a bulk download in XML format that could be used for data integration, as described in

Section 2.6. Data represented in XML can be validated using a document that specifies what

elements and relationships are allowed in the XML. The current specification for validation

documents is XML Schema [341] that has superseded the initial proposal of the Document

Type Definition (DTD) [75].

1 A tuple is a term for a row of data in a table of a relational database.


ENTRY CCHU #type complete iProClass View of CCHU

TITLE cytochrome c [validated] - human

ORGANISM #formal_name Homo sapiens #common_name man

...

SUMMARY #length 105 #molecular_weight 11749

SEQUENCE

5 10 15 20 25 30

1 M G D V E K G K K I F I M K C S Q C H T V E K G G K H K T G

31 P N L H G L F G R K T G Q A P G Y S Y T A A N K N K G I I W

61 G E D T L M E Y L E N P K K Y I P G T K M I F V G I K K K E

91 E R A D L I A Y L K K A T N E

Figure 2.1: A partial record from the PIR database, in the native PIR format.

<ProteinEntry id="CCHU">
  <protein>
    <name status="validated">cytochrome c [validated]</name>
  </protein>
  <organism>
    <source>human</source>
    <common>man</common>
    <formal>Homo sapiens</formal>
  </organism>
  ...
  <summary>
    <length>105</length>
    <type>complete</type>
  </summary>
  <sequence>MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW
GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
  </sequence>
</ProteinEntry>

Figure 2.2: A partial record from the PIR database, released in XML format.


Protein Entry Table

ID    Name          Status     Length  Type      Sequence  Organism
CCHU  cytochrome c  validated  105     Complete  MGD..TNE  1
...

Organism Table

Organism ID  Source      Common      Formal
1            human       man         Homo sapiens
2            chimpanzee  chimpanzee  Pan troglodytes
3            ....

Figure 2.3: An example of how a partial PIR record could be stored in two relations in a relational database.
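The correspondence between the XML record (Figure 2.2) and the two relations (Figure 2.3) can be illustrated with a short Python sketch using the standard library XML parser. The function name `to_relations`, the dictionary keys and the choice of organism identifier are assumptions made for illustration; the elided part of the record ("...") is simply left out.

```python
import xml.etree.ElementTree as ET

# The record of Figure 2.2, with the elided "..." section left out.
PIR_XML = """
<ProteinEntry id="CCHU">
  <protein>
    <name status="validated">cytochrome c [validated]</name>
  </protein>
  <organism>
    <source>human</source>
    <common>man</common>
    <formal>Homo sapiens</formal>
  </organism>
  <summary>
    <length>105</length>
    <type>complete</type>
  </summary>
  <sequence>MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW
GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
  </sequence>
</ProteinEntry>
"""


def to_relations(xml_text, organism_id=1):
    """Split one ProteinEntry into a protein row and an organism row."""
    entry = ET.fromstring(xml_text.strip())
    name = entry.find("protein/name")
    organism_row = {
        "organism_id": organism_id,
        "source": entry.findtext("organism/source"),
        "common": entry.findtext("organism/common"),
        "formal": entry.findtext("organism/formal"),
    }
    protein_row = {
        "id": entry.get("id"),
        "name": name.text,
        "status": name.get("status"),
        "length": int(entry.findtext("summary/length")),
        "type": entry.findtext("summary/type"),
        # Remove the line breaks used to wrap the sequence in the file.
        "sequence": "".join(entry.findtext("sequence").split()),
        "organism": organism_id,  # foreign key into the organism table
    }
    return protein_row, organism_row


protein, organism = to_relations(PIR_XML)
print(protein["id"], protein["status"], organism["formal"])
```

Note that the tag names carry the meaning in the XML version, whereas the rows produced here are only interpretable alongside the column names of the schema.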

XML was initially intended to be a format for transferring data over the Web, and soft-

ware has been developed for processing XML into different formats or to extract information

for database storage. In recent years there has been a growing momentum towards develop-

ing methods for storage and querying of “raw” XML, because it has been recognised that the

hierarchical, semi-structured nature of XML captures the semantics of certain data in a more

natural way than a relational database representation, particularly for data that has a tree

structure. There is a substantial body of research for improving the facilities for querying

data represented in XML format, and several proposals have been made for query languages,

such as XQuery [357]. In Appendix A, there is a report of work undertaken by the author

to develop a new type of index for fast querying of biological data, represented in XML. The

index has the potential to be extended to aid data integration, as highlighted in Section 2.6.
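To give a flavour of path-based querying over XML, the sketch below uses the limited XPath subset supported by Python's standard `xml.etree.ElementTree` module. It is not XQuery, and it is not the index described in Appendix A. The wrapper element `ProteinDatabase` and the second record (identifier `X0001`, length 42) are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical two-record collection using the element names of Figure 2.2.
# The second record is invented purely for illustration.
DOC = """
<ProteinDatabase>
  <ProteinEntry id="CCHU">
    <organism><formal>Homo sapiens</formal></organism>
    <summary><length>105</length></summary>
  </ProteinEntry>
  <ProteinEntry id="X0001">
    <organism><formal>Pan troglodytes</formal></organism>
    <summary><length>42</length></summary>
  </ProteinEntry>
</ProteinDatabase>
"""

root = ET.fromstring(DOC.strip())

# Path expression with an attribute predicate: select one entry by its id.
entry = root.find("./ProteinEntry[@id='CCHU']")
length = entry.findtext("summary/length")

# Select the ids of all entries from a given organism.
human_ids = [
    e.get("id")
    for e in root.findall("./ProteinEntry")
    if e.findtext("organism/formal") == "Homo sapiens"
]

print(length, human_ids)
```

Both queries follow the tree structure of the document directly, without first mapping the data into tables.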

2.2.2 Resource Description Framework

The Resource Description Framework (RDF), recommended by the W3C2, provides a way

of modelling metadata [269]. In the context of RDF, metadata is machine understandable

information describing web pages, but metadata can also have the general meaning of “data

about data”. In this sense, metadata is the real world meaning and context of the data values.

RDF is expressed in XML but, unlike XML, RDF can explicitly specify the properties of

other objects in the document, allowing automated reasoning. The following example is from

the article “What is RDF?” by Tim Bray [40]:

2 The World Wide Web Consortium (W3C) is an organisation for the development of technologies and best practice for the Web [352].


<rdf:Description about='http://www.textuality.com/RDF/Why-RDF.html'>
  <Author>Tim Bray</Author>
  <Home-Page rdf:resource='http://www.textuality.com' />
</rdf:Description>

In this example, the excerpt of RDF describes an article on a web page, specifying that the

author is “Tim Bray” and the home page of the web site is http://www.textuality.com. An

RDF description consists of three components: a Resource, a Property, and a Statement. A

resource is any object that has a Uniform Resource Identifier (URI), such as a web page,

or part of an XML document. A property is a resource that has a name, and is a facet

of, or belongs to, another resource. In the example, the author is a property of the article.

A statement is a combination of a resource, property and value, such as The Author OF

http://www.textuality.com/RDF/Why-RDF.html IS Tim Bray.
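The way statements can be read off such a description may be sketched as follows. The excerpt is wrapped in an `rdf:RDF` element that declares the `rdf` prefix; this wrapper and the function name `statements` are assumptions added so that the fragment parses with a standard XML parser. A dedicated RDF library would normally be used instead.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The excerpt from the text, wrapped in an rdf:RDF element that declares the
# rdf prefix (the wrapper is an assumption added so that the fragment parses).
RDF_XML = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}">
  <rdf:Description about='http://www.textuality.com/RDF/Why-RDF.html'>
    <Author>Tim Bray</Author>
    <Home-Page rdf:resource='http://www.textuality.com' />
  </rdf:Description>
</rdf:RDF>
"""


def statements(xml_text):
    """Return (resource, property, value) triples, one per child element."""
    triples = []
    root = ET.fromstring(xml_text.strip())
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        resource = description.get("about")
        for prop in description:
            # The value is either a linked resource or the element's text.
            value = prop.get(f"{{{RDF_NS}}}resource") or prop.text
            triples.append((resource, prop.tag, value))
    return triples


for resource, prop, value in statements(RDF_XML):
    print(f"The {prop} OF {resource} IS {value}")
```

Each child element of the description yields one statement, which is what allows software to extract the statements without knowing the vocabulary in advance.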

RDF could be used in the life sciences domain, for instance to describe protein records

in a web accessible database, in which the URI of the record is the resource, and the amino

acid sequence of the protein is a property. The following statement could be deduced auto-

matically:

The Protein Sequence OF www.myProtDB.org/query?myDBId=1A1B IS "MLENT...".

The RDF representation has advantages over a pure XML representation because, while

a person viewing an XML document may be able to deduce that a protein sequence is a

property of a protein record, this could not be done automatically [228]. There are various

biomedical ontologies described below that utilise extensions of RDF. In the field of chemistry

RDF is also used, for example to express the Chemical Markup Language that enables the

interchange of molecular data [220, 131].

2.2.3 DAML+OIL and the Web Ontology Language

The use of ontologies is a major research area in the life sciences. Several examples drawn

from this area are discussed in Section 2.5. There is a formal language for expressing on-

tologies, which was originally called DAML+OIL because it resulted from the fusion of two

separate efforts [154]. It is now set to become the W3C standard OWL (Web Ontology

Language [238]). OWL is expressed in XML and uses the RDF extension. OWL is a further

extension of RDF because it specifies what the associated objects are, and how they are

related, rather than only specifying a single object with a set of properties. An ontology


expressed in OWL consists of axioms that state the formal relationships between classes and

properties.

For example, an ontology describing genes, transcripts and proteins could be defined as

follows. One relationship could be specified: isTranslated, between the class:mRNA (mod-

elling an RNA sequence record) and the class:Protein (for the protein sequence record).

The class:Protein and class:mRNA both have a textual definition that describes exactly

what is meant by the term. This representation is powerful because it allows reasoning to be

carried out by a computer system, in combination with rules over other objects. The software

could find that the protein sequence is created by translating an mRNA sequence. This kind

of reasoning cannot be done in a purely relational database system, because the semantics

of a relationship are usually only captured by a record having a foreign key that references

another table. The meaning of a relationship in a database can be open to interpretation.

A well designed ontology ensures that every concept and relationship has a clearly indicated

meaning [39].
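The kind of reasoning described above can be illustrated with a deliberately simplified Python sketch. It is not OWL and does not use an OWL reasoner: it only applies a domain and range rule to the single relationship `isTranslated`. The record identifiers, the wording of the definitions and the function name `infer_types` are assumptions.

```python
# Axiom from the text: isTranslated links class mRNA to class Protein.
# Stored as property -> (domain class, range class).
AXIOMS = {"isTranslated": ("mRNA", "Protein")}

# Textual definitions attached to each class (wording is illustrative).
DEFINITIONS = {
    "mRNA": "An RNA sequence record.",
    "Protein": "A protein sequence record.",
}


def infer_types(assertions):
    """Infer the class of each record from the properties it takes part in."""
    types = {}
    for subject, prop, obj in assertions:
        domain, range_ = AXIOMS[prop]
        types.setdefault(subject, set()).add(domain)
        types.setdefault(obj, set()).add(range_)
    return types


# Hypothetical record identifiers; no class is stated for either record.
facts = [("rna_record_1", "isTranslated", "protein_record_1")]
types = infer_types(facts)

for record, classes in types.items():
    for cls in sorted(classes):
        print(record, "is a", cls, "-", DEFINITIONS[cls])
```

The classes of the two records are never stated in the facts; they follow from the axiom. A foreign key in a relational schema records that two rows are linked, but not what the link means.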

2.2.4 Unified Modeling Language

An important component of a data standard is an object model that describes a system

independently of the technology that is used for its implementation. Object models are

most commonly expressed in Unified Modeling Language [324] (UML), which is a standard

notation designed to improve the process of developing large software systems [274]. UML

includes components that represent the design and visualisation of the architecture of a

system during development. UML supports the definition of “use case” scenarios and work-

flows which could be used to model the biological research process. UML can also be used

for database design.

The most commonly used part of UML for representing a system is the class diagram. A

class diagram represents real world objects as a set of classes with attributes of certain types

(such as strings, integers, or user-defined), and relationships between classes (see Figure 2.4).

The concept of inheritance can also be represented in UML, in which one class inherits all

the attributes and relationships of another class. It is common in class diagrams to see

multiple subclasses inheriting from a single superclass. This design is intended to reduce the

amount of code required to implement the model because the attributes and relationships

only have to be programmed once for the superclass, rather than repeating code for each of

the subclasses. The concept of inheritance is exemplified in the description of MAGE-OM,


the object model for microarray experiments (Section 2.3.1).

[UML class diagram not reproduced. Its annotations state that: a package groups classes; a class represents a real world object; each attribute has a type; an open arrow indicates inheritance (Doctor and Patient are subclasses of Person and inherit the attributes name and DOB from the superclass); a diamond indicates containment (Ward cannot exist without Hospital); an arrow in a relationship indicates the direction in which the relationship should be implemented; the numbers refer to the multiplicity of the relationship (one instance of Hospital can be linked to one or more instances of Ward).]

Figure 2.4: The main components of a UML class diagram for a hospital computer system.
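The hospital model of Figure 2.4 could be realised in code roughly as follows. This is a sketch only: the assignment of attributes to classes is read from the figure, and the method `add_ward` is an assumption introduced to express that a Ward is only created within a Hospital.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Person:
    name: str
    dob: date


@dataclass
class Doctor(Person):
    # Inherits name and dob from Person; adds its own attribute.
    telephone: int


@dataclass
class Patient(Person):
    admission: date


@dataclass
class Ward:
    ward_number: int


@dataclass
class Hospital:
    name: str
    address: str
    postcode: str
    # One Hospital is linked to one or more Wards (multiplicity 1..n).
    wards: List[Ward] = field(default_factory=list)

    def add_ward(self, ward_number: int) -> Ward:
        """Create a Ward inside this Hospital (containment)."""
        ward = Ward(ward_number)
        self.wards.append(ward)
        return ward


doctor = Doctor("A. Example", date(1970, 1, 1), 5550100)
hospital = Hospital("Example General", "1 Example Road", "G12 8QQ")
ward = hospital.add_ward(3)
print(isinstance(doctor, Person), len(hospital.wards))
```

The attributes `name` and `dob` are written once in `Person` and reused by both subclasses, which is the saving in implementation effort that inheritance is intended to provide.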

An object model enables developers to have a shared understanding of the components

of a complex system, but it can also be converted into an XML validation document and a

database schema without significant effort. Another use of UML is to support the design of

code for an entire software system, for instance to provide database connectivity, produce

output in a file format, or describe user interactions with the system.
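As a rough illustration of how a class in an object model maps onto a database schema, the sketch below derives a `CREATE TABLE` statement from the attributes of a class. The type mapping in `SQL_TYPES` and the function name `create_table_sql` are assumptions; tools that generate schemas from UML must also handle relationships, keys and inheritance, which are ignored here.

```python
from dataclasses import dataclass, fields
from datetime import date

# Assumed mapping from attribute types to SQL column types.
SQL_TYPES = {str: "VARCHAR(255)", int: "INTEGER", date: "DATE"}


@dataclass
class Patient:
    name: str
    dob: date
    admission: date


def create_table_sql(cls):
    """Generate a CREATE TABLE statement from the attributes of a class."""
    columns = ", ".join(f"{f.name} {SQL_TYPES[f.type]}" for f in fields(cls))
    return f"CREATE TABLE {cls.__name__} ({columns});"


print(create_table_sql(Patient))
```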

2.2.5 The Object Management Group

The Object Management Group [231] (OMG) is a consortium formed to improve the interoper-

ability of software systems. The standards defined by OMG are expressed in UML, and other

notations, such as the MetaObject Facility (MOF) [231]. The main component of OMG is

the Model Driven Architecture (MDA). This is a notation for specifying the components of

large software systems for business, which is independent of the technology that will realise

them. A model is first specified in MDA, and it can then be instantiated with any program-

ming language such as Java [169], C++ [63], .NET [213], and so on. This model insulates

companies from evolution of technologies, and reduces the overhead of re-implementation.

A second benefit of ensuring that a system is described in a platform independent manner

is that it should help the sharing of applications and data across different domains. OMG

is also involved with checking the consistency of object models but it is left to domain ex-

perts to ensure that an object model correctly represents the concepts in the domain. The


OMG has been involved with verifying the object model for the microarray data standard,

described in Section 2.3.1.

2.3 Data standards in the life sciences

The problems of the incompatibility of data from different laboratories have been recognised

by researchers, leading to the development of data interchange formats. In the absence of a

data standard, even if published data is made available from authors’ web sites, the overhead

required to write software to interpret data from a number of different sources is often too

great, and the information is effectively inaccessible. A good data standard should ensure

that sufficient information is stored about the biological samples and experimental protocols

to enable future re-evaluation of the information. This is a major issue for digital archiving

because the volume of data continues to grow very rapidly. It cannot be assumed that it will

be possible to perform manual searches of the literature for all the relevant experiments in

the future, and automated methods will be required. In this section a brief introduction is

given to the established and proposed data standards.

The data format for microarrays, called MAGE-ML (Section 2.3.1), has influenced efforts

in other areas of functional genomics. The draft standard for proteomics, called PEDRo

(Proteomics Experiment Data Repository), is introduced in Section 2.3.2 and is one of the

main focal points of the following chapter. Mass spectrometry (MS) is a crucial part of

proteomic analysis, and was incorporated into the original PEDRo proposals. Data standards

for MS are now under development by a newly formed group, described in Section 2.3.4. In

the rest of the section, there is a description of other data exchange formats that are relevant

to life sciences research.

2.3.1 Microarray standards

Microarray experiments have now become widespread [55] and produce very large amounts

of data that could potentially be useful to researchers in a variety of contexts. The re-

quirements for central repositories of data, and standards for sharing and publishing, were

recognised several years ago [42]. A group of researchers formed the MGED (Microarray

Gene Expression Data) Society for improving the facilities for data sharing [212]. The first

stage of the standardisation process was the release of a checklist of information that should

be made available with a microarray data set to allow future re-evaluation of the data. The

checklist is known as MIAME [41] (Minimum Information About a Microarray Experiment).


[UML package diagram not reproduced. Packages shown: ArrayDesign, Array, BioAssay, BioMaterial, BioAssayData, Experiment, HigherLevelAnalysis, AuditAndSecurity, BioEvent, Description, Measurement, Protocol, BioSequence, BQS and DesignElement.]

Figure 2.5: The top level of MAGE-OM, reproduced from [212]. There are fifteen packages containing classes to capture different parts of a microarray experiment. There are three classes included at the top level: Identifiable, Describable and Extendable, which can be used by most other classes in the model for linking to additional attributes.

MIAME specifies the parts of experimental protocols, sample details, raw data and analysis

that must be released for an experiment to be understood and potentially reproduced, if

the same biological samples are available. The MIAME guidelines have been accepted by a

number of journals, and they must be satisfied for a publication to be accepted [23, 24, 25].

A formal specification of the microarray requirements was released as an object model,

MicroArray Gene Expression-Object Model (MAGE-OM), expressed in UML. The object

model serves two purposes. Firstly, the class diagrams allow developers to have a shared

understanding of the concepts and relationships in the standard. Secondly, the object model

has been used to generate a software toolkit, available from the MGED website, which allows

developers to create applications that process data into an exchange format, based on the

model. The data format, MAGE-ML [297] (MAGE-Markup Language), is expressed in XML,

and several major databases now accept MAGE-ML for loading data (Section 2.4.1). An

essential component of the standard is the MGED Ontology that consists of a controlled

vocabulary of terms used in microarray experiments (described in Section 2.5).


[UML class diagram not reproduced. Classes shown include BioMaterial, BioSource, BioSample, LabeledExtract, Treatment, BioMaterialMeasurement, CompoundMeasurement, Compound, Measurement, OntologyEntry, DatabaseEntry, Contact and NameValueType.]

Figure 2.6: The BioMaterial package in MAGE-OM, reproduced from [212].


The MAGE object model

The overview of MAGE-OM is displayed in Figure 2.5. There are fifteen packages, each

containing a number of classes to represent part of a microarray workflow. For example,

Array, ArrayDesign and DesignElement describe the features on a microarray, and BioAssay

describes the hybridization of mRNA to the array. MAGE-OM is designed to allow as much

flexibility as possible to ensure that it does not restrict the types of experiment that can be

captured. An example of this is in the BioMaterial package shown in Figure 2.6. The package

is intended to capture the substances that are processed at various stages in the experiment.

A BioMaterial can be one of three types: a BioSource (the source of biological material),

a LabeledExtract (for example the fluorescently labelled mRNA that is hybridized to an

array) or a BioSample (any intermediate between a BioSource and LabeledExtract). This

is an example of inheritance because the three classes inherit relationships from BioMaterial.

The use of inheritance should reduce the amount of programming required to capture this

part of the model because the relationships to other classes only need to be coded a single

time for BioMaterial, rather than once for each of the three more specific classes. One of

the relationships allows the class to reference OntologyEntry, which can be used to specify

a number of characteristics about the material, by obtaining the values from a controlled

vocabulary. Any kind of simple laboratory treatment can be described using a combination

of the class Treatment and the relationship to OntologyEntry, which captures the type of

treatment.

EXAMPLE: The mRNA that is hybridized to an array is captured in LabeledExtract.

LabeledExtract references the set of treatments that have been used to create it, via

Treatment, BioMaterialMeasurement and BioSample. Chemical compounds, such as the

fluorescent labels that are attached, are recorded in Compound. A cycle of treatments can be

described that points back to the original starting material in BioSource.
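The chain of treatments in this example can be sketched in Python. The sketch is a strong simplification of the BioMaterial package and not the MAGE software toolkit: `Treatment` points directly at its source material (the model routes this through BioMaterialMeasurement), `Compound` is omitted, and the attribute names, the vocabulary terms and the function `trace_to_source` are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OntologyEntry:
    category: str
    value: str  # a term taken from a controlled vocabulary


@dataclass
class BioMaterial:
    name: str
    characteristics: List[OntologyEntry] = field(default_factory=list)
    treatments: List["Treatment"] = field(default_factory=list)


@dataclass
class Treatment:
    order: int
    action: OntologyEntry  # the type of treatment
    source: BioMaterial    # the material the treatment was applied to


# The three specific classes inherit every relationship from BioMaterial.
class BioSource(BioMaterial):
    pass


class BioSample(BioMaterial):
    pass


class LabeledExtract(BioMaterial):
    pass


def trace_to_source(material: BioMaterial) -> BioMaterial:
    """Follow the first treatment of each material back to a BioSource."""
    while not isinstance(material, BioSource):
        material = material.treatments[0].source
    return material


source = BioSource("tissue sample")
sample = BioSample(
    "extracted mRNA",
    treatments=[Treatment(1, OntologyEntry("Action", "extraction"), source)],
)
extract = LabeledExtract(
    "labelled mRNA",
    treatments=[Treatment(1, OntologyEntry("Action", "labelling"), sample)],
)
print(trace_to_source(extract).name)
```

The relationships `characteristics` and `treatments` are declared once on `BioMaterial`, and the type of each treatment is supplied by an `OntologyEntry` rather than by a dedicated class for every laboratory procedure.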

This package does not contain any classes that are specific to a microarray experiment, and

therefore could potentially be used to model concepts from other types of functional genomics

experiment. This issue is expanded on in Chapter 4, in which MAGE-OM is combined with

a model of proteomics data to form a proposal for a data standard that we believe can be

extended to cover all functional genomics techniques.


2.3.2 PEDRo

In recent years, the success of MAGE-ML as a microarray standard has encouraged re-

searchers in proteomics to attempt a similar standardisation procedure. The status of pro-

teomics standardisation is the focus of the following chapter but a brief overview is given

here. The Proteomics Experiment Data Repository [315] (PEDRo) object model has been

released to initiate discussion in the community about the requirements for a data stan-

dard. Data standards for proteomics are managed by the Proteomics Standards Initiative

[257] (PSI), which was formed by the Human Proteome Organisation [161] (HUPO). PEDRo

represents a typical proteomics workflow, and consists of four parts:

• Biological sample origin.

• Protein separation techniques.

• Mass spectrometry laboratory protocols.

• Mass spectrometry data analysis.

PEDRo is designed to allow an experiment involving a number of stages of protein separation

to be described, including: 2-DE, affinity columns and chemical treatments. MS data is also

described in the PEDRo model, including support for storage of database searches and the

results of the searches. There are a number of organisations developing standards for MS

to serve different purposes (described below); it is therefore important that a consensus is

reached. A detailed description of PEDRo is given in the following chapter.

2.3.3 PSI-OM

PEDRo was presented to the PSI in 2003 as a proposal for a data standard for proteomics.

A new object model loosely based on PEDRo, called PSI-OM (Proteomics Standards

Initiative - Object Model), was developed in 2004, and the author contributed to it at the

annual meetings of the PSI. PSI-OM has a similar structure to PEDRo, covering protein

separation techniques and MS. In the following chapter, there is a description of an object model

we developed (Gla-PSI) that preceded the development of PSI-OM, therefore a complete

description of PSI-OM is given after the section on Gla-PSI.

2.3.4 Mass spectrometry

Mass spectrometry is used in proteomics to identify proteins. An experiment generates raw

data, in the form of a trace, and processed data comprising a list of peaks that correspond to


the masses of peptides. There is a major problem preventing re-analysis of MS data, which

is caused by the proprietary data formats generated by mass spectrometer manufacturers.

Instruments are supplied with software for data collection and analysis. This software can

only save an analysis in a data format that cannot be interpreted

by any other software. Researchers often manually enter the peak heights into a text editor,

for input into database search programs. Proprietary formats pose a major problem for

research throughput and data archiving. It cannot be assumed that the software needed to

interpret the spectra will still be available in the future. It is also not feasible for researchers

wishing to analyse the spectra deposited in databases, to obtain the software that produced

them. Therefore, there is a great need for a data exchange standard that can be interpreted

without specialist software. The standard must support algorithm development for large

scale database searches.
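
As an illustration of the kind of exchange format argued for above, the following minimal sketch writes a processed peak list to a plain XML document and reads it back using only the Python standard library. The element and attribute names (peakList, peak, mz, intensity), the instrument label and the peak values are invented for illustration and do not correspond to any of the formats discussed in this section.

```python
import xml.etree.ElementTree as ET


def peaks_to_xml(peaks, instrument):
    """Serialise (m/z, intensity) pairs into a small XML document.

    The element names are hypothetical, not those of a real standard.
    """
    root = ET.Element("peakList", {"instrument": instrument})
    for mz, intensity in peaks:
        # repr() keeps the full float precision so the values round-trip exactly.
        ET.SubElement(root, "peak", {"mz": repr(mz), "intensity": repr(intensity)})
    return ET.tostring(root, encoding="unicode")


def xml_to_peaks(document):
    """Recover the (m/z, intensity) pairs from the XML text."""
    root = ET.fromstring(document)
    return [(float(p.get("mz")), float(p.get("intensity"))) for p in root.iter("peak")]


# Invented example values, not taken from a real spectrum.
example_peaks = [(842.51, 1200.0), (1045.56, 860.5), (2211.10, 430.25)]
document = peaks_to_xml(example_peaks, instrument="hypothetical MALDI-TOF")
recovered = xml_to_peaks(document)
```

Because the document is ordinary XML, any XML parser can recover the peak list without the software that produced it, which is the property that the proprietary formats lack.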

There are several proposals for MS standards including GAML (Generalized Analytical

Markup Language [128]), SpectroML and the Analytical Information Markup Language

(AnIML) [13]. Both SpectroML and AnIML have been developed by the National Institute

of Standards and Technology in the USA [222]. GAML is an industry generated effort to

develop an XML-based data format for analytical instruments. GAML stores values of X/Y

coordinates from a trace, and the parameters entered in the instrument. SpectroML has

similar goals, and was originally developed in collaboration with ASTM, an internationally

recognised standards organisation [18]. SpectroML has now been superseded at the ASTM

by AnIML, which is a wider XML based format for analytical instruments. The PEDRo

model also supports MS data.

A recent project has been initiated at the Institute for Systems Biology, known as

mzXML, which is part of the SASHIMI open source software for downstream analysis of

MS data [278]. The goal of the project is to produce software for processing each of the out-

put formats produced by different instrument vendors, into a single XML file. The mzXML

format can then be analysed with a single piece of software that provides a statistical measure

of the likelihood that a correct match has been made to a protein. This should improve the

comparability of data produced by different types of instrument.

The efforts described above are being coordinated by a sub-group of the Proteomics

Standards Initiative, and meetings of the PSI have been well supported by MS instrument

manufacturers. A single proposal, mzData, has been formulated. It is agreed that vendors

will supply software with their instruments for creating output in mzData format. The


first version of mzData describes the raw data from MS, which is the list of peaks on the

trace, and the format also captures the input parameters that are produced by different

instruments [258]. The next version of the format will capture the input parameters and

results of database searches, in addition to the peak list used to identify proteins.

2.3.5 Protein interaction standards

Protein interaction experiments have become widespread, and there are a number of

databases that offer access to large volumes of data arising from Yeast Two-Hybrid and

affinity column experiments, such as BIND [32], DIP [65], MINT [367] and many others.

There is some overlap in the data coverage between the databases, and therefore it is desir-

able that data can easily be exchanged between different systems. This requirement led to

the development of the PSI interaction standard [150], which is now supported by most of

the publicly available databases. The format is being developed incrementally, and the first

release (level 1) covers the majority of data that is currently available. Level 1 can describe

both binary, and more complex interactions, but the format does not include detailed de-

scriptions of the experimental methodology used to generate the data, or a description of

the mechanism of interaction. This kind of data is not widely available at the present time

but may be supported in future versions of the standard.

2.3.6 Other data standards in life sciences

Mathematical models of biological data

The data generated by functional genomics, and traditional biochemistry experiments, reveal

information about the role of proteins and metabolites in a cell, and the interactions between

different components. Researchers have begun to create mathematical models of chemical

reactions and biological processes, which can in theory predict what changes would be prop-

agated to the system when part of it is perturbed. Mathematical models are published

in journals, often represented as a series of equations printed with mathematical symbols

that cannot be interpreted by a computer. Models are also represented by software, and

can therefore be released as computer code, however there are a large number of different

programming languages and different versions of code, therefore it is not easy to combine

models that have been developed independently. The problem is further complicated be-

cause processes can be modelled at different physiological levels: cellular, tissue, organ and

organ systems can all be represented mathematically. Researchers would ultimately like to


integrate models represented in different formats, and at different levels of detail.

CellML has been created to standardise the format in which mathematical models of

cellular functions are described [196]. CellML is expressed in XML, and uses constructs from

another well-established format known as MathML [340]. MathML describes mathematical

equations and consists of two types of encoding: content and presentation, the first for

expressing what is meant by a mathematical expression, the second deals with how the

expression should be presented for a web browser or printer.
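
The distinction between the two encodings can be sketched with a small example. The two fragments below both describe the product of k and S: the content form records the operation itself, so it can be evaluated, whereas the presentation form records only the symbols to be displayed. The variable names, the numerical values and the tiny evaluator (which handles only times and plus) are assumptions made for illustration; this is not a full MathML processor.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/1998/Math/MathML}"

# Content MathML: states what the expression means (k multiplied by S).
CONTENT = """
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <apply><times/><ci>k</ci><ci>S</ci></apply>
</math>
"""

# Presentation MathML: states only how the expression should be drawn.
PRESENTATION = """
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow><mi>k</mi><mo>&#xD7;</mo><mi>S</mi></mrow>
</math>
"""


def evaluate(node, values):
    """Evaluate a very small subset of content MathML (times and plus only)."""
    tag = node.tag.replace(NS, "")
    if tag == "math":
        return evaluate(node[0], values)
    if tag == "ci":
        return values[node.text.strip()]
    if tag == "cn":
        return float(node.text)
    if tag == "apply":
        operator = node[0].tag.replace(NS, "")
        args = [evaluate(child, values) for child in node[1:]]
        if operator == "times":
            result = 1.0
            for arg in args:
                result *= arg
            return result
        if operator == "plus":
            return sum(args)
    raise ValueError("unsupported element: " + tag)


def render(node):
    """Concatenate the displayed symbols of a presentation fragment."""
    return "".join(text.strip() for text in node.itertext())


rate = evaluate(ET.fromstring(CONTENT), {"k": 0.5, "S": 3.0})
label = render(ET.fromstring(PRESENTATION))
```

Only the content form carries enough information to compute a value, which is why a modelling format needs it in order to express relationships between variables.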

The main constructs of CellML are components and variables, and MathML is used to

specify a mathematical relationship between variables that have been declared by a compo-

nent of the model. CellML also has structures for describing reactions, units, and connections

between different components. The complete specifications for CellML are available through

the web site [51]. It is hoped that researchers wishing to publish a model of a physiological

process will release the model in CellML, allowing future integration with other relevant

models.

The Systems Biology Markup Language (SBML) has been created to model biochemical

networks, such as metabolic pathways or sets of co-regulated genes [155]. Conceptually,

a biochemical reaction can be broken down into a number of components that comprise

the main parts of SBML, including Compartment, Reaction, Rule and several others, each

of which has a textual description, and a number of associated attributes. The format is

expressed in XML and there are various software packages that support the first version

of SBML [311]. The second version of SBML may include MathML support, which could

enable some interchange between models represented in CellML and SBML.

Metabolomics

A new area of functional genomics, known as metabolomics, is the study of the composition of

small molecules (metabolites) in different samples, using NMR (Nuclear Magnetic Resonance)

and mass spectrometry. The metabolomics community does not currently have a data standard;

however, a data model has been created to record a generic NMR experiment. The

work is part of the Collaborative Computing Project for the NMR community (CCPN).

CCPN contains an object model and a programming interface for creating software [113]. It

is possible that CCPN could contribute to a data standard for metabolomics although it is

likely that additional modules will be required to capture the biological focus and intention

of a metabolomics experiment.


An object model called SysBio-OM has recently been released as part of the Chemical Effects in

Biological Systems (CEBS) database, developed by the National Center for Toxicogenomics in the

USA [355]. SysBio-OM covers various components of microarray, proteomics

and metabolomics experiments, however, due to its recent release, it is not possible to say

whether the metabolomics component will gain widespread use in the community. The CEBS

proposal is discussed in detail in Chapter 4.

2.4 Databases for life sciences

Databases are often created by small research communities wishing to disseminate their data

to a wider audience. The problem with this model is that no standard protocols exist for

accessing or querying databases, and many databases have their own text formats to allow

researchers to download the data in bulk. This presents several problems to the user. Firstly,

a researcher may not know about all the databases that exist which could be relevant. This

was the motivation for the creation of the NAR Molecular Biology Database Collection

to improve awareness of the databases that exist. Secondly, it is very slow to browse or

query all the relevant web sites manually, and assimilate the information by cutting and

pasting into a word processing document or spreadsheet. This problem is partly remedied

by systems like SRS (Sequence Retrieval System) [99], which present pointers to relevant

data items. However, the onus of data acquisition and assimilation of results is still on the

user. Thirdly, the databases are highly dynamic, and some are updated daily. Database

updates most commonly involve new data being added, but errors are also corrected and ID

numbers change with different database releases. Data that has been found by a researcher

may become out of date fairly rapidly, and there are no standard methods for automatic

repetition of the same searches. There are considerable efforts to alleviate these challenges

by employing data integration methods, described in Section 2.6.

A different aspect of the data integration challenge is the storage of heterogeneous data

types within unified systems that can be queried. Chapter 5 describes a database system for

proteomics, which is built on top of an existing microarray database system, as an extension

into a wider system for functional genomics. In this section, a comparison of the features

offered by different microarray databases is given, and the systems that already exist for

proteomics are described. There are several other databases that are highly relevant to

functional genomics research, outlined in Section 2.4.3.


2.4.1 Microarray databases

Chapter 5 describes the development of a database capable of storing both proteomics and

microarray data, built as an extension of the RAD (RNA Abundance Database)

system developed at the University of Pennsylvania. However, there are a large number

of different databases for microarrays that offer various different capabilities. A detailed

review of the main features of microarray databases was published by Gardiner-Garden and

Littlejohn in 2001 [119], which is brought up to date in this section (Table 2.1).

ArrayExpress

ArrayExpress at the European Bioinformatics Institute has been developed by researchers

who have been central to the efforts of MGED to standardise microarray data [16]. Ar-

rayExpress accepts public deposition of data, can be queried via a web based interface,

and is MIAME compliant. Data can be sent to ArrayExpress in MAGE-ML format, and

the database can store a significant amount of detail covering experimental protocols and

biological samples.

URL: www.ebi.ac.uk/arrayexpress/

RAD

RAD (RNA Abundance Database) is a system produced at the Center for Bioinformatics,

University of Pennsylvania [302]. RAD is capable of storing single or two channel arrays,

Affymetrix arrays and SAGE experiments (Serial Analysis of Gene Expression). There is

a web based interface for loading data and protocols known as the RAD Study-Annotator

[202]. The database schema for RAD, and the web interface, are freely available. As part

of the GUS (Genomics Unified Schema) system for functional genomics, it supports gene

expression data on several major web sites, such as PlasmoDB [21].

URL: www.cbil.upenn.edu/RAD

Stanford Microarray Database

The Stanford Microarray Database (SMD) [134] is a well established system that stores 160

published array experiments (March 2004), from a number of organisms. The web site can be

queried to retrieve particular studies, and a set of software is available for data visualisation

and statistical analysis, such as graphical output from ANOVA (analysis of variance [88]).

Searches can also be performed for a particular gene or clone across all microarrays. The


software used to generate SMD is freely available, and has been deployed by several other

organisations. SMD researchers are part of the MGED effort; SMD is MIAME compliant

and there are plans to enable export of MAGE-ML in the future.

URL: genome-www5.stanford.edu

BASE

The BioArray Software System (BASE) is freely available for researchers to download and

install locally [275]. BASE includes a database schema that can be deployed in MySQL

[221], and an interface, which runs on a web server, can be created using PHP [246], Java

[169] and Javascript [171]. Data produced by image processing software can be loaded in

tab-delimited files, and additional software is included for performing statistical analysis.

BASE has several advantages over other similar systems. Firstly, all the software required to

run BASE is freely available: PHP, MySQL and Java. Secondly, all the source code for the

project can be downloaded and altered as required. However, a system based on MySQL is

likely to be less robust than one based on a commercial RDMS, such as Oracle [235] or DB2

[70], therefore BASE may be more suited to smaller scale microarray databases.

URL: base.thep.lu.se

GEO

The GEO (Gene Expression Omnibus) database is hosted at the NCBI [85]. GEO has

different goals from the other microarray databases discussed so far. The support of the

MIAME guidelines and the MAGE format are not major goals of GEO. In contrast, GEO

aims to act as a large public repository for as wide a range of data as possible. Each

experiment is stored in a simple, tabular format that is indexed to allow searches. Data

can be submitted by any organisation, using either a web based interface or a bulk loading

facility. GEO has been incorporated in the Entrez system3, and therefore information can

be queried in parallel with bibliographic references, and databases of nucleotide or protein

sequences [123]. GEO does not store substantial information about protocols or biological

samples, and can be viewed as a very large data repository rather than as a database of microarray

experiments.

URL: www.ncbi.nlm.nih.gov/geo/

3 Entrez is the data retrieval system at the NCBI, which performs queries over a large number of different NCBI databases [97], described in Section 2.6.


Yale Microarray Database

Yale Microarray Database (YMD) [54] is in the final stages of testing with a number of data

sources, and is not as well established as ArrayExpress, SMD or RAD. However, YMD offers

certain features not present in other systems. Microarray images are fairly large, and each

experiment can contain hundreds of raw images, each being a TIFF file several megabytes

in size. Most databases choose to store only the processed data, created by software after

analysis of images. YMD includes an image server that enables researchers to obtain raw

images for future re-analysis. It remains to be seen how frequently images will be re-analysed,

but keeping the raw data ensures that future evaluation is possible, even if the amount

of data stored grows very rapidly. Experimental protocols can be entered via the Web,

and sample tracking can be performed to link DNA samples to the arrays. Data stored in

YMD can be linked to external resources, and a number of tools are available for performing

statistical analysis. The image server in YMD is both the main advantage and the main disadvantage of the

system: data can be re-analysed but the system may not scale up to very large data sets.

URL: info.med.yale.edu/microarray/

HugeIndex

HugeIndex is a gene expression database developed at Harvard [148]. The database schema is

very simple, containing only four tables and it is intended for storage of microarray results and

limited information about the experiment. HugeIndex is specialised to store gene expression

data from normal human tissues. The query interface allows particular genes to be specified,

or data can be accessed by the type of organ. The initial release in 2002 contained 59

experiments.

URL: HugeIndex.org

Integration across all databases

A scheme for how data can ultimately be integrated across all the databases has been out-

lined by Stoeckert and colleagues [303]. In essence, all databases have different structures,

reflecting the needs and requirements of the local users that are supported by the system.

If data is to be published, it should be made available via the Web, and conform to the

MIAME guidelines that are essentially a checklist of parts of the analysis that must be made

available. However, this alone is not amenable to large scale automatic analysis. For that

to be possible, researchers must either make data available in MAGE-ML format, or submit

data to a public database that has an export option for MAGE. Currently, few databases

actually create MAGE-ML, due to the complexity of the format, although almost all, with

the exception of GEO, plan to produce MAGE-ML in the future. When this is realised, it

will be possible to move data seamlessly between public repositories, and for researchers to

download and assemble large datasets, for analysis with locally installed software packages.

Database name  RDMS                       Web queries  Total expts.                 Source code available  MIAME compliant  MAGE Import-Export
----------------------------------------------------------------------------------------------------------------------------------------------
ArrayExpress   Oracle                     Yes          115                          Yes                    Yes              Import
BASE           MySQL                      N/A          Intended for local setup     Yes                    Yes              Export planned
GEO            Storage of indexed tables  Yes          605                          No or N/K              No               No
SMD            Oracle                     Yes          160                          Yes                    Yes              Export planned
YMD            Oracle                     N/K          N/K                          Not currently          Not currently    N/K
RAD            Oracle                     Yes          16 (RAD), many in GUS sites  Yes                    Yes              Both under dev.
HugeIndex      PostgreSQL                 Yes          59 (2002)                    Yes                    No               Future plans

Table 2.1: Summary table displaying features of microarray databases. Data is correct as of March 2004, except where stated. A N/K symbol (not known) indicates that the information is not readily available.

2.4.2 Proteomics databases

The following chapter contains a proposal for a standard data format for proteomics, and

covers the current output formats from several databases. There is also a detailed description

of other databases and a comparison with our system in Chapter 5. A brief overview of the

publicly available systems is given here.

There are a number of proteomics databases that can provide access via the Internet.

SWISS-2DPAGE was initially developed in 1993, storing 2-DE images and information about

proteins identified on gels. The proteins often have a link to a record in the annotated

sequence database, Swiss-Prot. SWISS-2DPAGE has an interface containing images of 2-D

gels, which can be used to access information about protein spots [153].

Another proteome database, developed by the Japanese Human Proteome Organisation

(J-HUPO [166]), has an output format known as HUP-ML. HUP-ML is centred on 2-DE data

and experimental protocols, allowing the constituents of solutions and timings to be specified,


similar to sample preparation stages described in MAGE-ML. There are a number of domain

specific proteome databases, storing 2-DE or MS data (a summary of proteomics databases

can be found at WORLD2D-PAGE [351]). In general, the databases store only limited

information about experimental protocols and are not fully integrated with other types of

protein databases. It is a major challenge to integrate distributed proteomics databases

because data is not formatted in a uniform manner, and the databases rarely offer flexible

query facilities.

The GELBANK system has recently been made available over the Internet [20], and

has similar functionality to SWISS-2DPAGE. There is also a 2D-PAGE database at the

Max-Planck Institute in Berlin, storing images of 2-D gels that can be annotated with spot

coordinates which link to pages describing proteins that have been identified [255]. Basic

information about protocols is stored, and gels can be browsed by species. The functionality

of these 2-D gel databases is described in more detail in Chapter 5.

There are no major repositories of mass spectrometry data which have query facilities,

possibly due to the size of the output format for MS and the problems of incompatible data

formats, as reported in Section 2.3.4. One effort that attempts to remedy this situation is

RADARS [106], which is a commercial relational database application for managing large

volumes of data from high-throughput studies. Due to the commercial nature of the soft-

ware it is not possible to assess the functionality of RADARS in practice. Another recent

development is the Open Proteomics Database that allows bulk downloads of raw MS data

in various formats, including mzXML [232]. This system allows public access to a large

amount of data (400,000 spectra), but it requires developers to obtain software to interpret

and manage the spectra once downloaded, and the spectra cannot be queried online. This

prevents it from being used by most researchers, who do not have the time or resources to

obtain software for managing this volume of data.

2.4.3 Other Databases for Life Sciences

The databases for microarrays and proteomics rely heavily on the existence of genome

databases for linking to annotation about gene products, and obtaining the original DNA or

protein sequences. The main databases containing nucleotide and protein sequence data are

GenBank at the NCBI [122], EMBL [91] and DNA Data Bank of Japan (DDBJ) [80]. These

databases are generally considered to contain raw sequence data, although they do contain

some basic annotation, including bibliographic references, the data source and the predicted


intron/exon structure of genes. Data are regularly transferred between the databases us-

ing an agreed mapping, called the Feature Table [72]. Certain records in GenBank have a

link to an external database, such as a curated record in Swiss-Prot [310]. Swiss-Prot has

cross-references to many different databases, including all the raw sequence databases, and

repositories of protein motifs and families.

There has been an effort to unify protein sequence databases in the Universal Protein

Resource (UniProt) system [326], which comprises several components. The main component

is a curated, non-redundant source of all the protein sequences that exist in any database.

There is also a separate archive (UniParc) containing all the identifiers with which sequences

have previously been annotated [325]. The archive contains links to the most recent record

of a protein in UniProt. The archive will enable software to be developed which performs

repeated searches, to find changes to identifiers. This will be particularly important for

datasets that have been assembled locally over time in a laboratory, and which contain

sequence identifiers that do not exist in the current version of a database.
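
The repeated search described above can be sketched as follows. The sketch assumes a hypothetical archive held as an in-memory mapping from every identifier ever issued to the identifier that replaced it; the identifiers are invented, and the sketch does not reflect the actual interface or contents of UniParc.

```python
# Hypothetical archive: each identifier maps to its replacement,
# or to None when the identifier is still current.
ARCHIVE = {
    "OLD001": "OLD002",
    "OLD002": "CUR100",
    "CUR100": None,
    "CUR200": None,
}


def resolve(identifier, archive):
    """Follow the chain of replacements until a current identifier is reached."""
    seen = set()
    current = identifier
    while True:
        if current not in archive:
            raise KeyError("unknown identifier: " + current)
        if current in seen:
            raise ValueError("cyclic replacement chain at " + current)
        seen.add(current)
        successor = archive[current]
        if successor is None:
            return current
        current = successor


def refresh(dataset, archive):
    """Map every identifier in a locally assembled dataset to its current form."""
    return {old: resolve(old, archive) for old in dataset}


local_dataset = ["OLD001", "CUR200"]
updated = refresh(local_dataset, ARCHIVE)
# Identifiers whose records have been superseded since the dataset was assembled.
changed = {old: new for old, new in updated.items() if old != new}
```

Running such a refresh at each database release would report which entries of a local dataset now point to superseded records.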

The protein structure community has initiated a high-throughput approach for obtaining

protein 3-D structures, known as structural genomics. Protein structures are currently stored

in the Protein Data Bank [253] (PDB). Each 3-D structure gives a strong indication of the

function of a protein, particularly if the structure shows a small molecule bound to an active

site. It is vital that, if a structure exists for a protein closely related to those highlighted

in a functional genomics study, the structure can be displayed within the context of

the experiment. This will enable protein or gene abundance studies to be correlated with a

detailed functional analysis.

This review covers a very small subset of the most important databases that exist. There

are a great number of resources about genes and proteins which could be relevant to an FG

experiment. The problem of integrating all the diverse databases is highlighted further in

Section 2.6.

2.5 Ontologies

One of the first definitions of an ontology and its potential for data integration was presented

by Gruber in 1993 [138]. The idea of conceptualisation was introduced, expressed as the

following problem: how can we digitally represent objects, concepts and their relationships

that arise from a real world situation? The representation required is, in effect, a simplified

view of the world that is useful for some purpose. The term “ontology” was coined to


describe the exact specifications of the conceptualisation. An ontology usually consists of

a set of terms that represent objects, and their relationships in the real world. The terms

must be associated with definitions that are human readable, describing what the term

means, along with a set of formal rules specifying how the terms can be used in a computer

system. Gruber suggests how ontologies can be used for data integration, using the example

of different bibliographic databases. For example, a rule is specified that describes what an

author is, and how the author relates to their publications. If different databases associate

records with a set of rules, the rules themselves can be used to query the source databases,

without an underlying knowledge of the particular database schema.
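
The bibliographic example can be sketched in a few lines. In the sketch below, two sources store the same kind of record under different field names, and each declares how its fields correspond to a pair of shared terms; a query is then phrased purely in terms of the shared vocabulary. The source names, field names and records are invented, and a plain dictionary stands in for the formal rules of a real ontology.

```python
# Shared vocabulary with human-readable definitions.
ONTOLOGY = {
    "author": "A person who wrote a publication.",
    "title": "The name of a publication.",
}

# Each (hypothetical) source maps the shared terms onto its own schema.
MAPPINGS = {
    "library_a": {"author": "creator", "title": "name"},
    "library_b": {"author": "written_by", "title": "heading"},
}

# Invented records, stored under each source's own field names.
SOURCES = {
    "library_a": [
        {"creator": "A. Example", "name": "First invented title"},
        {"creator": "B. Sample", "name": "Unrelated invented title"},
    ],
    "library_b": [
        {"written_by": "A. Example", "heading": "Second invented title"},
    ],
}


def query(term, value, result_term):
    """Find records where `term` equals `value`, returning `result_term` from each.

    Only ontology terms are used; the local field names are looked up per source.
    """
    for name in (term, result_term):
        if name not in ONTOLOGY:
            raise KeyError("term not in ontology: " + name)
    hits = []
    for source, records in SOURCES.items():
        mapping = MAPPINGS[source]
        for record in records:
            if record[mapping[term]] == value:
                hits.append((source, record[mapping[result_term]]))
    return hits


titles = query("author", "A. Example", "title")
```

The caller never refers to creator or written_by, which is the sense in which the rules, rather than the individual schemas, are queried.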

Ontologies will become widely used in the Semantic Web, as highlighted at the beginning

of the chapter, to describe the contents of a web site, and how it can be accessed. This will

enable software to discover automatically the resources that are relevant to the user. In

the rest of the section, a brief description of the software available for developing ontologies

is provided. The major proposals within the life sciences, and other related areas are also

reviewed.

2.5.1 Software for developing ontologies

A number of tools are available for generating ontologies, and they include Protege [230] and

OilEd [28]. The Protege software is available as open-source Java code, developed around a

‘plug-in’ architecture (Figure 2.7). This enables other research groups to adapt the software

for their own use, and develop new plug-ins. Examples of plug-ins include: software for

visualising ontologies in domain-specific ways, tools for merging ontologies, archiving and

querying. OilEd was developed at the University of Manchester and it is designed for the

development of DAML+OIL ontologies. It includes functionality for reasoning over the

ontologies for knowledge acquisition and inconsistency checking. Both Protege and OilEd

are freely available and can export data in DAML+OIL format, enabling the ontologies to be

transferred between editors, which should improve the accessibility of ontology information.

2.5.2 Gene Ontology

A major advance in computational biology is the development of the Gene Ontology

(GO) [125, 126]. GO includes three ontologies: cellular localisation, molecular function and

biological process, for a number of model organisms. Entries for genes are sorted according

to the categories defined by the ontology, and the controlled vocabularies ensure that terms


Figure 2.7: A screenshot of the Protege editor displaying the Gene Ontology for Yeast.

are used with the same meaning in different contexts. For example, the protein Raf-1, which is

involved in the MAP Kinase metabolic signalling pathway, has many entries in GO. One

entry in the biological process branch of the ontology is as follows:

GO:0003673 : Gene_Ontology ( 149784 )
  GO:0008150 : biological_process ( 99849 )
    GO:0009987 : cellular process ( 32926 )
      GO:0007154 : cell communication ( 9155 )
        GO:0007165 : signal transduction ( 6932 )
          GO:0007242 : intracellular signaling cascade ( 2389 )
            GO:0007243 : protein kinase cascade ( 904 )
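
The listing above is a single path from the root of the ontology down to one term. A minimal sketch of how such a path can be stored and traversed is given below, using only the identifiers and names from the listing. For simplicity each term is given one parent; in GO itself a term may have several parents, so a full implementation would need to follow every parent rather than one.

```python
# Terms and names copied from the listing above.
TERMS = {
    "GO:0003673": "Gene_Ontology",
    "GO:0008150": "biological_process",
    "GO:0009987": "cellular process",
    "GO:0007154": "cell communication",
    "GO:0007165": "signal transduction",
    "GO:0007242": "intracellular signaling cascade",
    "GO:0007243": "protein kinase cascade",
}

# Child -> parent links for this one path (a simplification of the real graph).
PARENT = {
    "GO:0008150": "GO:0003673",
    "GO:0009987": "GO:0008150",
    "GO:0007154": "GO:0009987",
    "GO:0007165": "GO:0007154",
    "GO:0007242": "GO:0007165",
    "GO:0007243": "GO:0007242",
}


def lineage(term_id):
    """Return the path from the root to term_id, root first."""
    path = [term_id]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))


def is_descendant(term_id, ancestor_id):
    """True if ancestor_id lies strictly above term_id on its path to the root."""
    return ancestor_id in lineage(term_id)[:-1]


# Print the path as an indented tree, one level per line.
for depth, term_id in enumerate(lineage("GO:0007243")):
    print("  " * depth + term_id + " : " + TERMS[term_id])
```

A query such as is_descendant("GO:0007243", "GO:0007165") then answers whether an annotation to a specific term also implies the more general term, which is what allows results to be grouped at different levels of the hierarchy.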

A database and a user interface have been developed that enable GO to be queried [126].

GO annotations are being added to Swiss-Prot, TrEMBL4 and Interpro5, in a project known

as GOA [46] (Gene Ontology Annotation). Each entry in Swiss-Prot has several keywords

that describe a protein’s function, which were developed prior to the creation of GO. The

keywords have been manually mapped to GO terms. This now allows for automatic retrieval

of GO annotations, once a protein sequence has been found in Swiss-Prot.

4 TrEMBL is an automatically annotated supplement of Swiss-Prot, which contains all the translations of the EMBL DNA database prior to their manual annotation within Swiss-Prot [37].

5 Interpro is a database of protein families and domains [219].


Figure 2.8: The entry for actin in the Gene Ontology, displayed in the AmiGO browser [12].

There are a number of other projects extending GO, and GO is being used by a number

of organisations to add levels of information to gene and protein products (links can be

found from the GO web site [124]). GO is a major advance in molecular biology because

it enables a high level view of large datasets, allowing researchers to generate functional

classifications very rapidly for all genes or proteins in a data set. However, it is vital that

the Gene Ontology is continuously curated and improved, to reduce the number of incorrect

or inaccurate functional assignments. It is becoming common practice for researchers to

obtain the top set of significant results from their study, say 100 genes or proteins, and

assign functional groupings based on GO. The conclusions drawn from the groupings must

be verified by external means, such as further experiments or literature surveys, because it

is possible that errors have been introduced into GO, which may be propagated into other

systems built on top.

Software for GO

A number of software applications are available for viewing and searching GO, of which

several are summarised here. Access from the Gene Ontology web site is provided by the

AmiGO browser [12]. AmiGO presents a view of the GO tree that can be browsed, allowing


users to move up or down the hierarchy of the ontology. Figure 2.8 displays the GO tree for

the human gene actin. In this example, GO suggests that actin is localised in the cytoplasm,

and more specifically to the cytoskeleton of the cell. A gene can be found at many different

places in GO if the gene has been implicated in several different processes, or possibly if there

is conflicting evidence about function. The AmiGO browser has basic search mechanisms for

retrieving entries by GO ID, ontology term or gene name. There is an alternative graphical

view of GO, and parts of the tree can be downloaded in XML format or as a text file.

GoMiner is a stand-alone application written in Java, which provides a view of GO for

a list of genes that are predicted to be up or down regulated between two conditions [368].

The software displays where gene names are located in the GO tree and provides statistics

to show branches of GO that contain more up-regulated (or down-regulated) genes. A DAG

(Directed Acyclic Graph) viewer is also included that displays graphically where genes appear

in the tree.

FatiGO offers similar functionality to GOMiner but in a web browser interface that

accepts two lists of gene symbols, corresponding to the genes that are up or down regulated in

a study [7]. Summary information is produced outlining where terms appear in the ontology,

for the three different ontology parts. Statistics are provided displaying which parts of the

ontology are matched to genes that are up or down regulated in the study. The software

can display information for a specified level in the ontology, from 2 down to 5 (lowest level)

and links to external databases are provided, such as Swiss-Prot, and the KEGG database

of metabolic pathways [184]. The usage of GoMiner and FatiGO in practice is demonstrated

in the study presented in Chapter 6.
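
The text above does not state which statistic these tools report. One common choice, assumed here purely for illustration, is a one-sided hypergeometric test, which asks how likely it is that a branch of GO would contain at least the observed number of regulated genes by chance. The function and parameter names below are hypothetical.

```python
from math import comb


def enrichment_p_value(total_genes, genes_in_term, selected_genes, selected_in_term):
    """One-sided hypergeometric p-value: P(X >= selected_in_term).

    total_genes      -- number of genes in the whole study
    genes_in_term    -- genes in the study annotated to the GO branch
    selected_genes   -- size of the regulated gene list
    selected_in_term -- regulated genes annotated to the GO branch
    """
    upper = min(genes_in_term, selected_genes)
    denominator = comb(total_genes, selected_genes)
    tail = sum(
        comb(genes_in_term, k) * comb(total_genes - genes_in_term, selected_genes - k)
        for k in range(selected_in_term, upper + 1)
    )
    return tail / denominator
```

A small p-value suggests that the branch contains more regulated genes than expected by chance. When many branches are tested, a correction for multiple testing would also be needed, which this sketch omits.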

GOblet also provides access to GO via the Web. DNA or protein sequences can be

submitted to a BLAST survey that returns the best matches to sequences in Swiss-Prot and

TrEMBL, which have been mapped to GO terms [149].

2.5.3 MGED Ontology

The MGED Ontology (MO) is a hierarchical collection of terms used to describe microarray

experiments. Each term has a textual description of its meaning, and a specification of

where it should be used in MAGE-OM. MO contains terms that can be used to describe the

origin and characteristics of biological samples, regardless of the usage of the sample. For

this reason, MO could be utilised to describe samples in a number of functional genomics

investigations. In Chapter 4, a proposal is made for a functional genomics data standard,


and a detailed description of the contents and structure of MO is given there.

2.5.4 Other ontologies in life sciences

Ontologies are being created to model various different domains within the life sciences. The

OBO (Open Biology Ontologies) project aims to bring together related ontologies into a

common structure [233]. A set of rules has been established for inclusion within OBO: the

ontologies have to be open and freely available, described in a common syntax (GO or OWL)

and must have a definition that can be understood by people. An organisation has also been

established for unifying the work in ontologies for functional genomics, known as Standards

and Ontologies for Functional Genomics [298] (SOFG). A brief description of some of the

ontologies within OBO is given below.

Taxonomy

The NCBI taxonomy ontology is an important resource for standardising the taxonomic

naming of organisms [224]. The ontology is accessible via the Web, and the records contain

links to other information about the organism through Entrez, such as nucleotide and protein

sequences, expression data and publications.

Anatomy

There are several ontologies covering the anatomy of organisms, such as C. elegans [353],

Drosophila [112], mouse [90, 218] and humans [100]. The SOFG organisation is coordinating

an effort to integrate them to produce a single anatomical ontology. A related project

is XSPAN from the University of Edinburgh, which aims to provide access to anatomical

information from embryos for several model organisms [358].

Sequence data

The Sequence Ontology (SO) project has recently been initiated to capture information

about features on DNA and protein sequences, such as chromosomal variations, gene features

(intron and exon structure) and RNA processing during transcription [286]. It is intended

that genomic databases will be annotated with these terms to facilitate integration across

systems offering different methods of querying.


Metabolic pathways

One of the first major proposals for ontologies for molecular biology was made by Karp in

1995 [183]. Karp presents the idea that knowledge representation could be used to determine

mappings between different databases to aid integration. The architecture proposed by Karp

was influential in a number of data integration projects described in Section 2.6. A database

of E. coli genes and biochemical pathways was later defined, known as EcoCyc [182]. EcoCyc

contains curated descriptions of the function and chromosomal location of all E. coli genes,

and uses an ontology of pathways to allow the knowledge to be formally queried. EcoCyc

presents an integrated view of data derived from a number of sources including genome

databases, bibliographic references, and protein structures.

Summary

In a functional genomics database, many of the ontologies described above could be used for

specifying characteristics of biological samples, genes, proteins or experimental techniques.

Database systems for functional genomics should provide the facility to link out to external

ontologies so that an object can be specified, which is accompanied by an exact definition

that has a meaning outside the scope of the source database. It is hoped that if databases

use ontologies extensively, the vision of the Semantic Web can be realised, and as Gruber

proposed, data integration can become automated. Software can then be developed to recog-

nise objects automatically in different databases which correspond to the same real world

objects.

2.5.5 The Grid and data integration

The Grid is the next generation architecture for high performance computing [132]. The

Grid is a network of computers joined by high bandwidth connections, allowing the creation

of software that assigns a computationally intensive job to the best available resource on the

network. There is a collaborative effort to perform data integration on a large scale via the

Grid, known as OGSA-DAI (Open Grid Services Architecture Data Access and Integration)

[234]. OGSA-DAI comprises many projects aiming to provide access to vast data sets, in

particular in the fields of astronomy, geoscience and biology. One of the major biological

proposals is called myGrid.

myGrid comprises a network of biological web services, such as BLAST and EMBOSS (an open

source package of software for performing common sequence analysis tasks [92]), which must

be registered at a central location [300]. Each resource must contain a standard

description of the type of service it offers, and how it can be accessed. Once this infrastructure

is in place, it will be possible to write software that automatically discovers applications that

are available for performing the task required by the user. Each service has a wrapper (footnote 7) to

enable standard queries to be submitted, and to convert between different input formats.

Queries can be written in OQL [1] (Object Query Language) and submitted to the source

database over the Grid. The system is specifically tailored to an organisation because a

database is maintained at each location, storing a record of the services that have been used

in the context of a particular workflow, thus facilitating their re-use. The local database also

stores an audit trail of what services have been used at what time, with a system that alerts

the developers if an external data source or service changes, such as a new database release,

which may require a search to be performed again.
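
The role of a wrapper and a central registry can be sketched as follows. This is not the myGrid implementation: the two services, their output formats, the accession strings and the names `REGISTRY` and `run_all` are all hypothetical, chosen only to show how service-specific outputs can be converted to one standard record.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


# Two hypothetical services that return results in different formats.
def service_a(query: str) -> str:
    # Pipe-delimited text: query|accession|score
    return f"{query}|ACC_A1|0.001"


def service_b(query: str) -> Dict[str, object]:
    # Nested dictionary with its own field names
    return {"q": query, "hit": {"acc": "ACC_B2", "evalue": 0.05}}


# Wrappers convert each service-specific output to one standard record.
def wrap_a(query: str) -> Record:
    q, accession, score = service_a(query).split("|")
    return {"query": q, "accession": accession, "score": float(score)}


def wrap_b(query: str) -> Record:
    raw = service_b(query)
    hit = raw["hit"]
    return {"query": raw["q"], "accession": hit["acc"], "score": float(hit["evalue"])}


# A central registry lets client code discover services by name.
REGISTRY: Dict[str, Callable[[str], Record]] = {
    "service_a": wrap_a,
    "service_b": wrap_b,
}


def run_all(query: str) -> List[Record]:
    """Submit the same query to every registered service, in name order."""
    return [REGISTRY[name](query) for name in sorted(REGISTRY)]
```

Client code only sees the standard record, so a new service can be added by registering one more wrapper without changing the code that consumes the results.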

2.5.6 Data standards and ontologies in other fields

Ontologies and data standards are becoming widespread in the life sciences but are also

widely used in commercial applications and other fields of research. A related area is the

development of ontologies of language, which could also have uses in the life sciences. The

WordNet project comprises an ontology of the English language, in which nouns, verbs,

adjectives and adverbs are organised into synonymous groupings, similar to a thesaurus [350].

Synonyms in the life sciences present considerable challenges. In particular, many genes and

proteins have been given more than one name over time, and the synonyms often persist. It

is becoming more common to store experimental protocols and descriptions of hypotheses

alongside raw data, to enable data sets to be retrieved. Resources such as WordNet will

be useful for defining particular concepts in a standard way that could be described using

different, synonymous terms.

Footnote 7: A wrapper is a piece of code that converts the specific inputs and outputs offered by a single application to a standard set of inputs and outputs.

2.6 Data integration

Data integration is one of the greatest challenges currently facing bioinformatics [299]. The

Molecular Biology Database Collection contains 548 databases at present, and this is likely

to be an underestimate of the total number of different systems that are available. The

integration challenge can be broken down into two parts. The first is bringing together

similar types of data, such as genome, transcriptome and proteome, into a single system that

can be adequately queried. The second is discovering and querying

all the resources on the Internet that relate to one particular gene or protein sequence.

The first challenge of data integration is addressed in Chapter 5, in which a framework is

described for storing different types of FG data in one system. A possible solution to the

second challenge has also been addressed, in the context of indexing large collections of XML

data, to generate an integrated query system to a number of databases. An investigation

by the author into XML indexing for biological data is described in Appendix A, which has

been continued in the Xtect project [359] by colleagues at the University of Glasgow and the

University of Strathclyde.

There has been substantial work in the area of data integration in e-commerce and

biomedical fields with the aim of generating single access points to heterogeneous data

sources. In a survey of approaches by Garcia-Molina et al. [118], three general methods are

identified: federation, warehousing and mediation. Federation involves a set of databases

supplying agreed additional information or software for accessing information in a standard way.

Warehousing is a large scale approach of reconstructing local copies of relevant databases by

creating an integrated schema that covers all the constituent databases, and importing data

on a regular basis. Mediation based approaches send queries to diverse databases, in some

cases via the Internet, and convert the results into a single format. Examples of biological

resources that have utilised these approaches are given below.

2.6.1 Federation

The Entrez system provides access to many different databases based at the NCBI [314].

Entrez queries GenBank, PubMed, GEO and many others (the web site has a complete list

[97]), and provides a number of output formats including HTML, XML and a text format.

However, there is no integration of results; instead, a list of the number of hits in each of

the databases is returned, which must be manually browsed by the user. This process is very

time-consuming, especially if a large number of genes are to be queried, such as the top 200

hits from a microarray experiment.

2.6.2 Warehouses

One of the largest efforts to integrate life sciences data has been demonstrated by SRS [99]

(Sequence Retrieval System), which provides access to a large number of databases using


pre-defined hyperlinks. SRS downloads all the source databases at a regular interval and

builds a text index. SRS accepts queries against any type of text in the entry, and allows

users to retrieve a record with a particular ID number. SRS does not post-process the query

results to integrate the information; instead, a list of entries from different databases is returned.

SRS does not support a major query language such as SQL, and therefore complex queries cannot

be made.

The GUS system from the University of Pennsylvania comprises a large relational schema

that is divided into different namespaces (footnote 8), which have been developed from separate source

databases. Data from various sources (GenBank, EMBL, DDBJ and others) are downloaded

at intervals, cleaned to remove erroneous annotation, and added to the database. A pro-

gramming layer resides on top of the database to allow queries to be performed. In addition

to genomic data, GUS also stores microarray data, and Chapter 5 describes a proposal for a

proteomics extension to GUS.
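
The benefit of a warehouse with a relational schema, namely that data from separate sources can be combined in a single SQL query, can be sketched with an in-memory SQLite database. The table and column names below are hypothetical and far simpler than the GUS schema. The sketch only shows the general pattern of loading local copies and querying across them.

```python
import sqlite3


def build_warehouse(sequence_records, expression_records):
    """Load local copies of two sources into one integrated schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sequence (accession TEXT PRIMARY KEY, gene TEXT)")
    conn.execute("CREATE TABLE expression (gene TEXT, fold_change REAL)")
    conn.executemany("INSERT INTO sequence VALUES (?, ?)", sequence_records)
    conn.executemany("INSERT INTO expression VALUES (?, ?)", expression_records)
    conn.commit()
    return conn


def upregulated_accessions(conn, threshold):
    """Join sequence and expression data: a query spanning both sources."""
    rows = conn.execute(
        "SELECT s.accession FROM sequence s "
        "JOIN expression e ON s.gene = e.gene "
        "WHERE e.fold_change >= ? ORDER BY s.accession",
        (threshold,),
    )
    return [row[0] for row in rows]
```

For example, loading two sequence records and two expression records and calling `upregulated_accessions(conn, 2.0)` returns only the accessions whose gene has a fold change of at least 2.0. A text index alone cannot answer this kind of question in one step.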

2.6.3 Mediator approaches

K2

An approach known as K2 has also been developed at the University of Pennsylvania, which

formulates queries over a number of databases, and presents an integrated view to the user.

K2 originated in a project known as Kleisli that introduced the idea of mediators [68, 348]

and a query language known as Collection Programming Language (CPL). The mediators

describe the data sources in terms of common objects, and provide a mapping from the un-

derlying data source to the objects. CPL can then be used to query the object representation

of the data, even if the underlying data sources do not have query capabilities. The system

has been compared with GUS by Davidson and colleagues [67]. Davidson concludes that the

data warehousing approach is preferable for larger scale, production-strength applications,

and the mediator approach may be favoured for smaller systems for users wishing to browse

data sources via web pages.

Footnote 8: A namespace is a subdivision of a database schema or object model in which all the names of the components are unique.

TAMBIS and BioDataServer

There are a number of other bioinformatics systems to integrate heterogeneous resources,

including TAMBIS, which has been developed as an interface to a number of databases

and tools frequently used by biologists [241]. TAMBIS is developed from the same software

used to generate K2, and uses mediators to access several databases. TAMBIS is supported

by a description logic known as GRAIL [22] that includes rules to link different concepts

together. For example, a protein is formally linked to motifs found in its sequence by the

rule hasComponentMotif. GRAIL is used to formulate queries, and automatically retrieves

information from the relevant database.

Another mediator based system is BioDataServer [114] that enables information retrieval

over a number of biological databases, with similar goals to K2. BioDataServer generates an

interface that maps the data sources into a relational database, and can be viewed as a cross

between a mediator and a warehousing approach. BioDataServer enables complex queries

to be formulated over the data, even if the underlying query capabilities of the data sources

do not support SQL.

DiscoveryLink

DiscoveryLink from IBM offers access using SQL to a number of databases in distributed

locations [144]. DiscoveryLink processes queries and decides which parts of the query need

to be sent to which database. Each data source has a wrapper that maps the structure

of the source data to the relational model employed by DiscoveryLink. The wrapper also

stores information about the query capabilities of the data source, and maps parts of queries

sent from DiscoveryLink into the format accepted by the data source. For a wrapper to

be developed, it is required that the underlying data source must include an interface that

accepts programmable queries, and must return data in the form of a table. The software

can then process the results after they have been returned. DiscoveryLink does not offer

any kind of semantic integration, for example the problem of synonyms is not solved, and

redundant data may be returned. If data is modelled differently by different databases, all

the results will be presented to the user, but will not be fully integrated.

2.6.4 Schema integration

An alternative approach, which could be used to develop a warehouse, is that of schema

integration, which involves matching elements in different database schemas believed to

correspond to the same real world object. Many approaches involve a manual process using

a graphical user interface to match elements from different schemas, which is time-consuming

and error-prone. Attempts have been made to automate schema matching [81], and recent


work has been done to integrate XML data sources. Integration using XML is gaining

popularity in molecular biology, because many databases now offer a bulk download of data

in XML. If a mapping can be produced across different XML Schemas, a warehouse could

be created by importing different databases in XML, and converting data to a standard

representation.

Yang et al. [363] have developed an algorithm that finds matching elements in XML

Schemas and removes differences in the hierarchical structure. The algorithm allows differ-

ent schemas to be weighted according to how representative they are of the system, and

produces an integrated schema. However, it is assumed that elements in different documents

have already been re-named so that real world objects all share the same name in different

schemas. This is not necessarily a trivial task if a great number of different databases are to

be integrated. In molecular biology many concepts have synonyms, and conversely, similar

but non-identical concepts may share the same name. A similar approach relevant to data

integration is XClust [194], which is an algorithm for clustering and integrating DTDs

(document type definitions, the original mechanism for validating XML). The algorithm first searches

for similar DTDs, and then integrates over clusters of related documents. The technique has

been demonstrated for real world data sets derived from e-business, and may be applicable

to biological database integration.
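
The renaming step that such algorithms assume has already been done can be illustrated with a simple synonym table. The sketch below is not the algorithm of Yang et al. or XClust. It only matches element names from two schemas after mapping each name to a canonical form, and the synonym table and element names are invented for illustration.

```python
# Hypothetical synonym table: variant element name -> canonical concept.
SYNONYMS = {
    "protein_name": "protein",
    "polypeptide": "protein",
    "organism": "species",
    "taxon": "species",
}


def canonical(name):
    """Map an element name to its canonical concept (case-insensitive)."""
    key = name.strip().lower()
    return SYNONYMS.get(key, key)


def match_elements(schema_a, schema_b):
    """Return (a, b) pairs of element names that denote the same concept."""
    index_b = {}
    for name in schema_b:
        index_b.setdefault(canonical(name), []).append(name)
    return [(a, b) for a in schema_a for b in index_b.get(canonical(a), [])]
```

A table of this kind handles synonyms only. It cannot detect the converse problem, in which similar but non-identical concepts share the same name in different schemas.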

A recent approach by Hunt and colleagues at Glasgow University aims to alleviate the

data integration challenge by developing indexes of the paths (footnote 9) found in XML documents. If

more than one identical path is found to the same leaf node, containing the same piece of

data, the additional paths are removed to avoid redundancy. The index is created on top of

the SRS system and is stored in a relational database. It is intended that the system will be

used to retrieve data from a large number of databases, for a set of genes or proteins that

are highlighted for further study from a functional genomics experiment [159].
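
The idea of indexing paths, and of discarding repeated occurrences of the same path leading to the same piece of data, can be sketched with the Python standard library. This is an illustration under simplifying assumptions, not the system built on SRS: the index is held in memory rather than in a relational database, attributes are ignored, and the function name `index_paths` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_paths(xml_text):
    """Map each element path to the set of text values found at its leaves."""
    root = ET.fromstring(xml_text)
    index = defaultdict(set)

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}"
        text = (element.text or "").strip()
        # Only leaf elements carrying text are indexed; using a set means
        # an identical (path, value) pair is stored once.
        if text and len(element) == 0:
            index[path].add(text)
        for child in element:
            walk(child, path)

    walk(root, "")
    return dict(index)
```

For a document in which the value "actin" occurs twice under the path /entry/gene/name, the index holds a single entry for that path and value, which is the kind of redundancy removal described above.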

Footnote 9: A path is the hierarchy of elements that precede textual data in XML.

2.7 Discussion

This chapter describes the current state of the art in database technology for biomedical

research. It is an area that is being driven by both the day-to-day requirements of

experimentalists, and strong theoretical work in computing science. The challenge of data

integration is so great because FG experiments often generate unexpected results that must

be investigated from various perspectives. In the past, a biological investigation required

a researcher to have a comprehensive knowledge of a particular organism, organ, or set of

genes. The situation now is far more challenging, as the results from a functional genomics

experiment could lead an investigator into a great variety of domains. For example, the

top 200 hits from a microarray investigation on liver samples could contain genes that had

previously not been implicated in liver function at all. This would require the investigator

to determine the function of the genes from a number of different angles: protein structure,

modifications, biochemistry analysis from databases or the literature, and several others.

Many of the new developments presented in this chapter aim to improve the facilities for

automating the retrieval of this information.

The vision of the Semantic Web is one of the driving forces of the work on standards and

ontologies, but its realisation is some way off. The technologies that will be used to create

the Semantic Web can be put into practice now, and will greatly improve the capabilities

of computer systems. It is clear that there are major advantages to the use of ontologies in

databases, for web publishing of data and in exchange formats, which are as follows. Firstly,

the problem of synonyms in the life sciences is significant. The names of genes and proteins

have been assigned in the last few decades, often based on some phenotypic characteristic that

has limited relevance now that comparative genomics can discover the same gene in different

organisms. For example, the “wingless” gene in Drosophila has a number of synonyms

reflecting the range of roles that it has in different parts of the organism. It was named

because when its function is removed, the flies have no wings, which is clearly of limited

relevance for the same gene in humans. It is becoming apparent that a new organised

naming system is required that takes into account the role of a gene in different organisms.

This is one of the areas that will be aided greatly by the Gene Ontology. A second problem

is finding a common description of how experiments have been performed (the methods).

An ontology-based description of experimental protocols will aid the retrieval of experiments

stored in databases, and may allow future reasoning over different experiments to find how

they are different. For example, it may be possible to find automatically the genes with

altered expression that differentiate two strains of an organism, if the description of each

strain is well structured using ontologies. The synonym problem also arises in the description

of protocols. For instance, in the description of microarrays in the previous chapter, I avoided

the term “probe”, which is frequently used in the methods section of microarray publications.

“Probe” is used by some groups to mean the features deposited on the array, and by others

to describe the labelled mRNA that is hybridized to the array. It is hoped that the MGED


Ontology (MO) can remove these kinds of problems because it contains terms with strict

definitions that are not open to confusion, and therefore software can be developed to search

for particular terms, knowing that queries will be answered correctly. If MO can be extended

to describe all functional genomics investigations, and gains widespread usage, we will be

some way towards solving the problem of imprecise language that hinders automated analysis.

The work on data standards is essential to allow the creation of public databases that

can be queried, and to allow data sets to be downloaded in bulk for re-analysis with new

statistical techniques. The object models are a vital component that allows developers to have

a shared understanding of large systems. The models also allow software to be developed

for creating standard output in an exchange format, and can act as a bridge between flat

files and database storage. In particular, MAGE-OM has influenced efforts in other parts

of functional genomics because it has gained widespread community acceptance, and it is

forward thinking in the use of ontologies.

It could be argued that many of the data integration methods currently being developed

will not be required if the Semantic Web is successful. In reality, it is the data integration

efforts that are currently on-going that will evolve into the vision of the Semantic Web. The

development of ontologies to describe biological knowledge as it is now, and to describe the

experimental process that was used to produce the knowledge, will be vital. In addition, the

schema matching techniques, which aim to find the commonality between different databases,

will be a vital intermediate step towards fully interoperable systems. The data integration

methods will help us to learn how different data are structured, and how they can be described

in common terms.

The solution to the data integration challenge is still an open research question. The

majority of research groups performing functional genomics investigations are left with a

laborious, time-consuming task, often involving manual Web browsing, to assimilate

information about genes, proteins and pathways. If systems can be developed that automate this

process, they will free up large amounts of research time that could be better spent else-

where, and new knowledge will be derived by discovering the relationships between different

components of biological systems.

2.8 Conclusions

In this chapter, a brief overview has been given of the different databases that exist for storing

functional genomics data, and the data integration challenges that they present. It is vital


that data standards and ontologies are created to allow researchers to exchange and transfer

data sets to central repositories. An overview of the major proposals has been given. In the

following chapter, the current status of proteomics data standards is described. The main

focus is the development of a new object model that supplements the first draft standard,

and there is a discussion of the future of data sharing and publishing in proteomics.

Chapter 3

An object model for proteomics

3.1 Introduction

The first two chapters outlined the computational requirements of functional genomics ex-

periments, and previous work in databases, standards and ontologies for life sciences. This

chapter comprises two parts. The first focuses on the development of an object model to

capture proteomics data, which we released as a proposal for a standard data format and

published in October 2003 [176]. The first model is referred to as Gla-PSI (Glasgow proposal

for the Proteomics Standards Initiative) and covers studies in which proteins are separated

by two-dimensional gel electrophoresis (2-DE), and identified by mass spectrometry (MS).

Gla-PSI was developed to supplement the draft standard for proteomics, the Proteomics

Experiment Data Repository (PEDRo) originating at the University of Manchester, which

was released in January 2003 [315]. The latter part of the chapter outlines the continued

development of the official standard of the Proteomics Standards Initiative with which we

have been involved. The new object model, PSI-OM (Proteomics Standards Initiative Ob-

ject Model), was initiated in 2004 at the annual meeting of the PSI. We have contributed

to the development in collaboration with other members of PSI. PSI-OM has evolved from

PEDRo, and includes parts of the data model from Gla-PSI.

3.1.1 The emergence of proteomics

The challenge in proteomics research is to characterise the expression of all, or as many as

possible, of the proteins in a sample of interest. Comparative analysis may also be carried

out to determine the difference in protein expression between two or more samples, and the

differences may provide clues as to proteins that are critical for the process being studied,

such as a disease. Two dimensional gel electrophoresis (2-DE) is frequently used to separate

proteins into discrete spots that may be quantifiable, and mass spectrometry (MS) is often

used to determine the observed mass of the peptides in the protein. Observed peptide masses

can be used in a search against a sequence database to identify the protein. There are also

new protein separation techniques, such as multi-dimensional chromatography and affinity

methods, for determining the proteins that are present in a sample. The core techniques of

2-DE and MS have been available to researchers for several decades but it is only in recent

years that large scale analysis has become feasible, forming the field of proteomics. There

have been gradual improvements in the experimental protocols for 2-DE and several new

stains have been designed that improve the linearity of the relationship between visible stain

and the actual amount of protein in a gel spot. Software has been developed for improved

detection and quantification of spots on gels, and matching spots between different gels. The

technology for MS has also moved forward with improved ionisation protocols and detection

mechanisms (described in Chapter 1). However, the main reason for the major shift in

research paradigm towards the global approach has not been related to the improvements

in 2-DE and MS, but can be attributed to the vast increase in the availability of DNA

and protein sequence data in the genome databases. MS is a good method of

high-throughput protein identification only if protein sequences are deposited in a database, or very

closely related sequences exist. Therefore, without the major sequence databases, it would

not be possible to perform large scale proteomic investigations.

3.1.2 Publication of data

In other areas of biology, the deposition of data in a central repository is a prerequisite for

publication: DNA sequences must be deposited in GenBank [30] and protein structures in

the PDB [253]. At present, public access to large amounts of proteomics data is limited.

There is a database of 2-D gels, hosted at the Swiss Institute of Bioinformatics, known as

SWISS-2DPAGE [153], and a similar effort at Argonne National Laboratory, called GELBANK [20].

Both databases offer images of 2-D gels that can be browsed, providing access to limited

information about spots identified on gels (described in Section 3.3). However, the general

availability of proteomics data is poor, and most journal publications only display gel images

and a table of proteins that have been identified. Essentially this information is inaccessible

to computational analysis, even if the data is placed on the author’s web site, because there

is no common mechanism for querying or finding the data. The same issue exists for the

related fields of phylogeny and immunohistochemistry where diagrams of trees, or images


of cells, are reproduced in journals but are not open to computational analysis. The rate

of production of data is too great for researchers to have complete awareness of the protein

expression data that could relate to the system they are studying, and if there is no change

in the way proteomics data is represented, the situation will become far worse.

3.1.3 A central repository for proteomics

There is a major requirement for the development of a central proteome database that

includes 2-DE images, their analysis and MS data. For such a plan to be realised, it is vital

that a standard data model is adopted by the research community to enable experiments

from different laboratories to be compared or queried. A central database must contain

sufficient detail about experimental protocols for the context of the experiment to be fully

understood. It is also important that statistical analysis is captured, to ensure that new

results derived from data are electronically accessible, and can be verified. It is only in

the last two years that efforts have been initiated to develop a standard data format for

proteomics, resulting in the PEDRo proposal released in January 2003 [315]. A community

wide proteomics standard is still some way off, even though in the related field of microarray

analysis the data format MAGE-ML [297] has become well established in a relatively short

time frame. There are several reasons for the delay in finding a consensus on a standard.

The most significant challenge is the complexity of proteomics experiments compared with

microarrays. The identity of each feature on a microarray is known in advance, and matching

data points across a set of microarrays is a trivial task. In proteomics, protein spots on a

2-D gel have to be identified by some process, which may be error prone; single spots may

contain multiple proteins; and multiple different forms of the same protein can appear in

several positions on one gel. The reproducibility of 2-DE has improved greatly but is still

far from 100%. There are also various statistical models of the match quality for proteins

identified with MS data, but no single standard that can be compared across experiments.

The result is that a single proteomics data set is complex, and the experimental methodology

is rarely homogeneous across different laboratories. This presents a major challenge because

it is difficult to create a model that captures all the methods used, and data that may arise

in proteomics experiments. In consequence, heterogeneous data formats are used, which are

difficult to load into a central database that supports queries over experiments produced by

different laboratories.

Comparing the expression of all the proteins in a disease sample with that in a normal sample

can facilitate understanding of the disease process, and the results can also provide an

additional level of information for sequence databases [4]. For example, if experiments to

determine the proteome of human liver cells reveal that a specific protein is abundantly

expressed, the information is functionally significant and should be available to researchers

accessing sequence databases. Additionally, gel spots analysed by MS may reveal peptides

that match a region of genomic sequence that has not been annotated. Therefore, the peptide

sequences can be used to discover new genes, or edit incorrectly annotated genes.

The global protein profile generated by experiments depends upon the conditions under

which the sample was produced and processed, prior to separation by 2-DE. The data may

be valuable to researchers in diverse fields, who could obtain new results from data sets

originally intended for another purpose. Therefore, it is vital that experimental protocols are

rigorously documented, according to a shared standard, and stored in a structured format

that allows searches over biological conditions, such as species, cell or tissue type, or over

experimental conditions, such as gel constituents, stain or MS instrument parameters.
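
As a minimal sketch of why structured protocol records matter, the following Python fragment filters a small in-memory list of experiment records by biological and experimental conditions. The class ExperimentRecord, its field names and the records themselves are illustrative assumptions; they are not part of PEDRo, Gla-PSI or any other proposed standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentRecord:
    """Hypothetical, much simplified annotation of one 2-DE experiment."""
    identifier: str
    species: str
    tissue: str
    stain: str
    ms_instrument: str


def search(records, **conditions):
    """Return the records whose fields match every supplied condition."""
    return [
        record
        for record in records
        if all(getattr(record, name) == value for name, value in conditions.items())
    ]


# Illustrative records only; these are not real experimental data.
RECORDS = [
    ExperimentRecord("exp1", "Homo sapiens", "liver", "silver", "MALDI-TOF"),
    ExperimentRecord("exp2", "Homo sapiens", "liver", "Coomassie", "MALDI-TOF"),
    ExperimentRecord("exp3", "Mus musculus", "liver", "silver", "Q-TOF"),
]

human_liver_silver = search(RECORDS, species="Homo sapiens", tissue="liver", stain="silver")
print([record.identifier for record in human_liver_silver])
```

A query of this kind is only possible because each condition is held in a named field; the same information recorded as free text would have to be parsed before it could be searched.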

3.1.4 The status of proteomics standards

The Proteomics Standards Initiative (PSI) [257] was formed by the Human Proteome Organ-

isation (HUPO). So far, there have been two annual meetings at the European Bioinformatics

Institute [236, 237] and one meeting in Nice, France in 2004. The PEDRo proposal for a stan-

dard was released to demonstrate that a universal proteomics data format could be feasible,

and to stimulate discussion from the proteomics community about the requirements for a

standard (described in detail in Section 3.3). PEDRo focuses on the experimental techniques

used by proteomics researchers. Gla-PSI was developed at the same time as PEDRo, but

was modified following the release of PEDRo to model in more detail the protein data that

arises in a proteomics experiment (described in Section 3.4). Gla-PSI models 2-DE data, dif-

ference gel electrophoresis, image analysis and statistical analysis of large data sets (Figure

3.1). These data types are not adequately covered in PEDRo and therefore Gla-PSI acts as

a proposal for additional information that should be captured in the community standard.

Gla-PSI allows researchers to store data from any of the image analysis applications that

are available. Statistical analyses performed on data produced from image processing, such

as software, algorithms and the associated parameters, can also be captured. The model

is further specialised to manage difference gel electrophoresis data. Gla-PSI links spots

visualised on a gel, to proteins that have been identified by MS. The model is not a proposal

for annotation standards for MS; however, there are a number of groups working towards a

standard for MS under the auspices of PSI, described in Chapter 2. PSI will oversee the

development of a complete model for proteomics that encompasses sample origin, 2-DE and

MS. The current status of proteome standards is presented in Section 3.5.

[Figure 3.1 diagram, "Overview of a Proteomics Experiment": experiment design, biological samples, solubilisation, 2D-PAGE, image analysis (tables of spot ID, volume, X, Y and protein), statistical analysis, global expression profile, mass spectrometry (MALDI, MS/MS), sequence database search and protein identification. Legend: sample flow, data flow.]

Figure 3.1: The data flow in a proteomics experiment. The parts of the analysis covered by Gla-PSI are boxed.

A new model, PSI-OM (PSI object model), is under development following several work-

shop meetings. The new model has evolved from PEDRo and includes part of the data

model from Gla-PSI. PSI-OM will ultimately be merged with the microarray data model to

form a single unified standard for functional genomics, as described in the following chap-

ter. It has been recognised during the development of microarray standards that controlled

vocabularies (ontologies) are critical for the creation of systems that have enough flexibility

to capture a wide range of experiment types, and allow the information to be queried in

complex ways. An ontology for proteomics is under development, as described in Section

3.5.3. A major contribution towards microarray standardisation was the release of a set

of guidelines for researchers wishing to publish, known as MIAME (Minimum Information

About a Microarray Experiment) [41]. A similar effort is underway in proteomics that will

be released in late 2004 or early 2005 (Section 3.5.4).

The rest of the chapter is structured as follows. Section 3.2 describes the methodology

used to develop Gla-PSI, and how requirements capture was carried out. The previous work

in proteomics data formats and standards is given in Section 3.3. A detailed description

of Gla-PSI is given in Section 3.4. The future development of a community wide proteome

standard, an ontology and guidelines for publication are described in Section 3.5. Section 3.6

includes a discussion of the importance of standards for proteomics, and the current status

of public access to proteomics data.

3.2 Methods

The early stages of developing a standard involved the creation of a prototype database for

2-DE and MS data by the author for a Master’s degree by research [174]. The database high-

lighted the challenges of integrating heterogeneous data types, and capturing experimental

protocols, in a structured format. The prototype demonstrated that many types of questions

that biologists posed could not be answered using the current technology, a limitation that

could be addressed by the development of a central repository and appropriate query tools.

Case studies of proteomics investigations have been carried out (Chapter 1), which

demonstrated the requirement for new bioinformatics tools to facilitate the analysis of large


protein data sets. The case studies also highlighted significant challenges in data integra-

tion and systems development, and found several areas in which proteomics techniques are

employed:

• Proteome cataloguing: determine the entire set of proteins expressed in a cell type,

organelle or microorganism.

• Hypothesis generation: discover proteins whose function may be important in the

condition of interest.

• Protein regulation: discover sets of proteins that share patterns of expression across

a range of sample conditions.

• Correlating gene and protein expression.

• Post-translational modifications: detect modifications such as phosphorylation, glycosylation

and acetylation.

The case studies also revealed that a critical factor required for aiding proteomics research

is the development of a data standard. Therefore, Gla-PSI was initiated to model data

from 2-DE, difference gel electrophoresis, image analysis and statistical processing. The

development of the model was driven by analysis of real data sets, and an understanding

of the types of queries that researchers would like to pose. The experimental basis for Gla-

PSI was established over a significant period in which requirements capture was performed

(Table 3.1). A number of interviews were held with principal investigators in laboratories

performing proteomics investigations. Time was also spent shadowing bench researchers to

gain a better understanding of the techniques involved in the research. Finally, literature

surveys were performed into functional genomics investigations, databases for life sciences,

and data standards in other fields to learn what procedures are commonly used to model

complex domains. During the development of Gla-PSI, regular meetings were held to present

the model to biological researchers, gaining feedback to ensure that a database based on the

model would cover all the data types that are required.

The data flow shown in Figure 3.1 outlines the stages in which information must be

captured in a proteomics experiment, and the boxed area represents the part of the anal-

ysis covered by Gla-PSI. Gla-PSI is expressed in Unified Modeling Language [324] (UML,

described in Chapter 2) and was developed using the UML modelling tool Rational Rose

[266]. Gla-PSI comprises class diagrams in UML to represent the concepts, objects and

relationships in a proteomics experiment.


Name | Position | Meetings | Time span | Description

Dr Jonathan Wastling | Principal investigator | 50 | 2001-2004 | Dr Wastling runs a laboratory that uses proteomic techniques to investigate parasitology. Many meetings were held in which different proteomic technologies were discussed along with the computational challenges they present.

Aude Foucher | PhD student | 5 | 2001 | Miss Foucher supplied data sets for the first prototype database and evaluated the system.

Adrian Cohen | PhD student | 5 | 2001 | Mr Cohen used proteomics to catalogue the expressed proteins in the parasite Toxoplasma gondii and supplied test data for the first prototype database.

Dr Chris Ward | Post-doctoral researcher | 5 | 2002-2003 | Dr Ward presented his work using proteomics to identify the proteome of an organelle from Toxoplasma gondii and supplied data for testing.

Prof. Walter Kolch | - | 5 | 2003-2004 | Prof. Kolch is head of a laboratory at the Beatson Institute for Cancer Research. The future developments of proteome databases have been discussed on several occasions.

Alex von Kriegsheim | PhD student | 3 | 2003 | Mr von Kriegsheim is a researcher at the Beatson Institute for Cancer Research and performs DIGE analysis. The coverage of the Gla-PSI model was discussed in a series of meetings.

Morag Nelson | PhD student | 30 | 2002-2004 | Miss Nelson is investigating the differential expression of proteins in host cells when invaded by a parasite, compared with non-invaded cells. Miss Nelson produced the data that is analysed in Chapter 6.

Prof. Mike Turner | - | 5 | 2003-2004 | Prof. Turner is head of a laboratory that investigates the mechanism of action of trypanosomes and malaria. One of the techniques employed is proteomics. There have been several discussions of the requirements for the database and the annotation of the genome sequence.

Anne Faldas | Research assistant | 20 | 2003-2004 | Miss Faldas is cataloguing the proteome of the parasite Trypanosoma brucei (Chapter 7).

Table 3.1: A summary of the interviews held with researchers to formulate an understanding of proteomics research.

3.3 Previous work

Gla-PSI was released as a proposal for information that should be captured in a community

standard for proteomics, in addition to what is captured in PEDRo. In this section a

detailed description of PEDRo is given, along with a brief description of other data formats

for proteomics.


3.3.1 SWISS-2DPAGE

The SWISS-2DPAGE system was first established in the early 1990s as a web repository of

2-D gel data [153]. The web interface contains gel images overlaid with a map of spots which

has hyperlinks to other web pages for individual spot records. The spot records can be linked

to corresponding entries in the protein sequence database Swiss-Prot. The functionality of

the database is discussed in more detail in Chapter 5. The system utilises a textual data

format for specifying 2-DE and protein spot data, which is similar to the format of the Swiss-

Prot database, and was considered as a candidate format during the standardisation process

(see the SWISS-2DPAGE website for a sample record [309]). The format contains some

information about how the protein was identified, such as the peaks produced from mass

spectrometry, and can incorporate links to bibliographic references and other databases.

However, there is limited information about the protocols employed to create the gel. The

format does not include the method of scanning to create the gel image, or the software used

to analyse the image. There is also only a very limited minimum set of information that

must be supplied; therefore, certain entries contain only the protein name, species of origin

and identifiers for the protein and gel. A data standard for proteomics requires a wider and

more complex specification of the minimum information that should be captured for each

protein entry.

3.3.2 GELBANK and HUP-ML

A similar format is produced by the GELBANK database (the data format is displayed in

Babnigg and Giometti 2004 [20]). The GELBANK text format is similar to SWISS-2DPAGE

but contains slightly different information about the gel protocol, and has different format-

ting. Protein spots are stored with the following information: gel position, the observed

molecular weight (MW) and isoelectric point (pI) of the protein, the theoretically calculated MW and

pI, the protein name and its accession number. There is no current facility for linking to MS

data that would enable the quality of the protein match to be assessed. The Japanese Human

Proteomics Organisation (J-HUPO) has also produced a proteomics data format, HUP-ML

(HUman Proteome Markup Language) represented in XML. HUP-ML has been presented at

past PSI meetings, and contains more detailed information about sample processing prior to

2-D gel electrophoresis. There is a DTD (document type definition) available for validating

HUP-ML [160]. The developers of HUP-ML are committed to the PSI development process

and will produce a mapping from HUP-ML to the finalised standard of PSI.


Figure 3.2: The complete PEDRo model represented in UML, reproduced from [315].


Figure 3.3: The classes that record biological samples in PEDRo, reproduced from [315].

3.3.3 PEDRo

The Proteomics Experiment Data Repository (PEDRo) from the University of Manchester

was created to address the requirements for a proteomics standard and covered four parts

of the analysis: sample generation, sample processing, MS protocols, and MS data analysis.

The complete PEDRo model is displayed in Figure 3.2; the four parts of the analysis are

represented by different shading in the four sections of the model. The sample generation

part is shown in Figure 3.3. An overview of the experimental hypothesis and citations for

methods and results are captured in the class Experiment. There is a relationship to the classes

Sample and SampleOrigin for recording basic details about the type of material on which the

experiment is being performed, along with genotype information in Organism. PEDRo was

originally designed for capturing data from experiments with yeast; therefore, the description

of sample is focused on cell cultures and has very limited facilities for recording any detailed

phenotype information about larger organisms.

Protein separation in PEDRo

Figure 3.4 summarises the classes for capturing protein separation techniques. The Sample

class is a subclass of Analyte (Figure 3.3) and separation techniques are modelled as sub-

classes of AnalyteProcessingStep. The substance on which a separation technique is per-

formed (the input) is modelled by a relationship from Analyte to AnalyteProcessingStep.

Sample, a subclass of Analyte, is thus directly related to the first separation technique

(AnalyteProcessingStep) performed on it. The separation techniques are modelled by


Figure 3.4: The part of PEDRo covering protein separation techniques, reproduced from [315].

classes, such as Gel, Column and ChemicalTreatment. The products of separation (outputs)

are modelled by the classes GelItem, Fraction and TreatedAnalyte. The inheritance re-

lationship enables a series of treatments to be specified where the product (output) of one

treatment becomes the input for another. 2-D gel data is represented by the attributes in

GelItem, and spots matched between gels can be captured in RelatedGelItem. The method

used to perform comparative gel analysis is not recorded in PEDRo.
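
The chaining of separation techniques described above can be sketched in Python as follows. This is a simplified illustration and not the PEDRo schema itself: the class names follow the text, but the attribute names (name, produced_by, input_analyte, outputs) and the helper functions are assumptions, and the treatment of GelItem and Fraction as subclasses of Analyte is inferred from the description of outputs becoming inputs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Analyte:
    """Any substance that can be the input of a processing step."""
    name: str
    # The step that produced this analyte; None for the starting sample.
    produced_by: Optional["AnalyteProcessingStep"] = None


class Sample(Analyte):
    """The starting biological material."""


class GelItem(Analyte):
    """A spot or band produced by a gel."""


class Fraction(Analyte):
    """A fraction collected from a column."""


@dataclass(eq=False)
class AnalyteProcessingStep:
    """A separation technique applied to exactly one input analyte."""
    input_analyte: Analyte
    outputs: List[Analyte] = field(default_factory=list)

    def add_output(self, output_class, name):
        """Create a product of this step and record where it came from."""
        product = output_class(name=name, produced_by=self)
        self.outputs.append(product)
        return product


class Column(AnalyteProcessingStep):
    """Chromatographic separation; its products are fractions."""


class Gel(AnalyteProcessingStep):
    """Gel separation; its products are gel items."""


def processing_history(analyte):
    """List the step types applied between the original sample and this analyte."""
    steps = []
    while analyte.produced_by is not None:
        steps.append(type(analyte.produced_by).__name__)
        analyte = analyte.produced_by.input_analyte
    return list(reversed(steps))


sample = Sample("cell lysate")
fraction = Column(sample).add_output(Fraction, "fraction 3")
spot = Gel(fraction).add_output(GelItem, "spot 17")
print(processing_history(spot))
```

Because every product is itself an Analyte, a column fraction can be passed to a gel without any special case, which is the property the inheritance relationship is intended to provide.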

Mass spectrometry in PEDRo

The third section models the type of ion source for a mass spectrometer and the machine

parameters (Figure 3.5). The protein sample, and its analysis, are represented by the re-

lationship from Analyte to MassSpecExperiment, enabling a link to a gel spot, column

fraction or output from another type of treatment.

MS data itself is represented in the fourth part of the model (Figure 3.6). The data in

MS is typically a list of peaks from an MS trace. Database searches that are carried out

to identify proteins from the MS data are captured by DBSearch and DBSearchParameters.

Peptides that are matched by the data are represented by PeptideHit and protein records


Figure 3.5: The model of MS ionisation and protocol in PEDRo, reproduced from [315].

Figure 3.6: MS data and database searches modelled in PEDRo, reproduced from [315].


that have been matched are modelled by ProteinHit and Protein. There is a relationship

between ProteinHit and RelatedGelItem that enables a direct link from gel spots to the

proteins to which they have been matched, without traversing the entire set of MS data and

analysis. There are a large number of attributes in most of the classes that are representative

of the properties of the experiment that researchers may wish to store. However, for certain

concepts it is very difficult to cover all the possible attributes, for example different database

search programs offer a large range of parameters that cannot all be explicitly specified in

the model. Therefore, the class OntologyEntry is used to specify additional attributes that

can be added where required, by obtaining the relevant term from a controlled vocabulary.

The development of an ontology for proteomics is introduced in Section 3.5.3, and there is a

detailed discussion of ontology usage in the following chapter.

3.4 Gla-PSI: A model for 2-D gel electrophoresis and analysis

This section includes a detailed breakdown of the components in Gla-PSI. The UML concepts

of classes and attributes are used to represent objects in a proteomics investigation, and

relationships have been created between classes to model the links between items in an

experiment. The complete model is shown in Figure 3.7, and the following sections describe

each part of the analysis in turn. A case study demonstrating how the model captures data

from a difference gel electrophoresis experiment is given in Appendix D.

3.4.1 Overview of the experiment and protein extraction

Gla-PSI does not contain a complete proposal for describing the overview of an experiment;

however, we believe that there are classes in MAGE-OM that can adequately describe the

hypothesis of a proteomics investigation and the biological samples used. Experimental

protocols for recording protein extraction and solubilisation can also be described in MAGE-

OM. In our original publication describing Gla-PSI [176], the exact details of how protein

samples and protocols can be recorded in MAGE-OM were not given; however, the following

chapter describes the complete integration.

3.4.2 Two-dimensional gel electrophoresis

A complex mixture of proteins can be separated by a number of techniques, including: two-

dimensional gel electrophoresis (2-DE), chromatography, affinity column and others. Gla-PSI

is focused around 2-DE, which is the most widely used technique for protein separation in


[Figure 3.7 diagram: UML class diagram. The classes and attributes recoverable from the figure are listed below; associations and multiplicities are not reproduced.

Identifiable: identifier : String; name : String

Describable, Description, Type, OntologyRef, DatabaseEntry: no attributes shown

Database: version : String; URI : String

ExternalReference: exportedFromServer : String; exportedFromDB : String; exportID : String; exportName : String

BibliographicReference: title : String; authors : String; publication : String; publisher : String; editor : String; year : Date; volume : String; issue : String; pages : String; URI : String

OntologyEntry: category : String; value : String; description : String

ExperimentDesign, ProteinPreparation, ExperimentParameters: no attributes shown

2D-PAGE: pI_start : Double; pI_end : Double; mW_start : Double; mW_end : Double; percentAcrylamide : Double; solubilizationBuffer : String; stainDetails : String; dimensionX : Double; dimensionY : Double; dimensionZ : Double; dimensionUnit : String

ScannedImage: scanner : String; fileURI : String; resolution : Double; contrast : Double; brightness : Double; wavelength : Double; dimensionX : Integer; dimensionY : Integer

ImageAnalysis: softwareName : String; version : String; fileURI : String; imageProcessing : String

Spot: volume : Double; normalisedVolume : Double; area : Double; peakHeight : Double; xCoord : Integer; yCoord : Integer; experiment_pI : Double; experiment_mW : Double; radius : Double

SpotRefs: spotID : String

MatchedSpots: quality : String; description : String

MultipleAnalysis: analysisType : String

Protein: id : String; mW : Double; pI : Double; accession : String; swissProtID : String; pirID : String

IDEvidence, MassSpec: no attributes shown

DIGEAnalysis, SpotSets: no attributes shown

DIGESingleImage: dyeLabel : String; isMasterGel : String; volumeAverage : Double

DIGESingleSpot: volume : Double; peakHeight : Double; normalisedVolume : Double

SpotRatio: id1 : String; id2 : String; normalisedRatio : Double; quality : String; ratioType : String

StatisticalAnalysis: software : String; version : String; algorithm : String; dataFile : String; analysisType : String

Parameter: parameterType : String; parameterValue : String; parameterUnit : String

Notes in figure: The stages preceding image analysis have been presented in models: MAGE http://www.mged.org and PEDRo http://pedro.man.ac.uk. All classes are subclasses of Identifiable and Describable (not shown). Therefore, all classes can have an identifier attached and be linked to annotation classes.

Legend: new classes in the model; classes derived from MAGE or PEDRo.]

Figure 3.7: The complete Gla-PSI object model represented as a UML class diagram.

[Figure 3.8 diagram: UML classes 2D-PAGE, ScannedImage and ImageAnalysis. Note in figure: attributes in 2D-PAGE have been derived from the PEDRo model: http://pedro.man.ac.uk]

Figure 3.8: A model of 2-DE data, and a scanned gel image.

proteomics. A standard for 2-DE must capture the conditions under which the gel was run.

The conditions include the dimensions and voltages applied to the pH strip, gel dimensions,

buffers, and staining procedures. Many of these parameters are covered in PEDRo, and

certain attributes from PEDRo have been reproduced in the 2D-PAGE class. Once a gel has

been run, there is a significant amount of information that must be captured, which is not

adequately covered in PEDRo. Initially, a gel is scanned and a raw image is produced. Gla-

PSI incorporates this process, recording the details of the scanner and the image produced, in

the class ScannedImage (Figure 3.8). The model allows multiple instances of a scanning event

to cover cases where researchers have re-scanned a gel, for example, at different resolutions.

ScannedImage has attributes for the resolution, contrast, and brightness of the image, to

allow different versions of the same image to be stored. The dimensions of the image in

pixels are also stored. The image derived from a scanned gel becomes the input for the next

part of the model: image analysis.

3.4.3 Image analysis

A number of software packages can be used to analyse scanned 2-D gels. The software is able

to perform edge detection on an image to determine the coordinates, volume, area, and other

properties of protein spots. The class Spot accommodates many of the properties produced

by current software packages; however, it is not possible to include all measures that may be

produced by current or future software versions (Figure 3.9). The class Parameter, with attributes

parameterType, parameterValue and parameterUnit, is used to cover data types that are

not explicitly included in the model (shown on Figure 3.7). Values for these attributes will

be obtained from a controlled vocabulary to ensure consistency. The class ImageAnalysis


[Figure 3.9 diagram: UML classes ImageAnalysis, Spot, SpotRefs (spotID : String), MatchedSpots and MultipleAnalysis (analysisType : String).]

Figure 3.9: The classes capture data from image analysis applications, including multiple analysis across a number of gels.

[Figure 3.10 diagram: UML classes Spot, Protein, IDEvidence, MassSpec, DatabaseEntry, Database (version : String; URI : String), SpotRefs (spotID : String), MatchedSpots, MultipleAnalysis (analysisType : String) and Parameter.]

Figure 3.10: The relationship between spot data (Spot) and identified proteins (Protein). The evidence for a spot being matched to a protein, such as MS data, can be added to the relationship, although Gla-PSI does not have a specification of MS data.

records the software package and a description of image processing that has occurred.
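
The combination of fixed spot attributes with a generic class for additional measures can be sketched as follows. This is an illustration under stated assumptions rather than the Gla-PSI definition: only a few of the Spot attributes are shown, the Python attribute names are adapted from the UML names, and the contents of CONTROLLED_PARAMETER_TYPES are a hypothetical controlled vocabulary invented for the example.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical controlled vocabulary of additional spot measurements.
CONTROLLED_PARAMETER_TYPES = {"circularity", "background", "saturation"}


@dataclass
class Parameter:
    """A measure not explicitly modelled, identified by a controlled term."""
    parameter_type: str
    parameter_value: str
    parameter_unit: str = ""

    def __post_init__(self):
        # Reject terms that are not in the controlled vocabulary.
        if self.parameter_type not in CONTROLLED_PARAMETER_TYPES:
            raise ValueError(f"unknown parameter type: {self.parameter_type!r}")


@dataclass
class Spot:
    """A subset of the explicitly modelled spot properties."""
    volume: float
    area: float
    x_coord: int
    y_coord: int
    parameters: List[Parameter] = field(default_factory=list)

    def parameter(self, parameter_type):
        """Return the value of an additional parameter, or None if absent."""
        for item in self.parameters:
            if item.parameter_type == parameter_type:
                return item.parameter_value
        return None


# Illustrative values only.
spot = Spot(volume=454.0, area=120.0, x_coord=23, y_coord=24,
            parameters=[Parameter("circularity", "0.92")])
print(spot.parameter("circularity"))
```

Checking the term when the object is created is one possible way of enforcing consistency; a repository could equally perform the check when data is loaded.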

3.4.4 Protein spots

There are separate classes for spots identified on a gel (Spot), and proteins (Protein) to

which spots may be matched (Figure 3.10). The relationship between Spot and Protein

allows one or more spot records to be linked to one or more protein records. The cardinality

is displayed by 0..n to 0..n on the relationship between the two classes. The relationship from

Spot to Protein is modelled in this way because there are known instances where a single

spot contains a number of different proteins. In the opposite direction, it is possible that

a particular protein arises in a number of different positions on one gel. The relationship


[Figure 3.11 diagram: UML classes DIGEAnalysis, DIGESingleImage, DIGESingleSpot, SpotSets, SpotRatio, Spot, SpotRefs (spotID : String), MatchedSpots and MultipleAnalysis (analysisType : String).]

Figure 3.11: Classes for storing difference gel electrophoresis data.

is linked via an attribute that includes the evidence for the match, such as any MS data

that is available. This is very important because any findings based upon the predicted

expression of a protein should take into account the probability that the protein has been

identified correctly. Gla-PSI does not model MS data but a finalised MS standard should

be integrated at this position. The Protein class contains sufficient information such that

a repository based on the model can link directly to external databases. A single protein

may have entries in a number of databases that may be relevant to the experiment, such as

GenBank, Swiss-Prot, PIR, or domain specific databases.
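
The many-to-many relationship between spots and proteins, with the evidence attached to each link, can be sketched as below. The class Identification, the helper functions and all identifiers are assumptions made for illustration; in Gla-PSI the evidence is represented by the class IDEvidence, whose content is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Spot:
    spot_id: str


@dataclass(frozen=True)
class Protein:
    accession: str


@dataclass(frozen=True)
class Identification:
    """One spot-to-protein link, carrying the evidence for the match."""
    spot: Spot
    protein: Protein
    evidence: Optional[str] = None  # for example, a reference to MS data


def proteins_in(spot, identifications):
    """All proteins identified in a given spot."""
    return {link.protein for link in identifications if link.spot == spot}


def spots_for(protein, identifications):
    """All spots in which a given protein was identified."""
    return {link.spot for link in identifications if link.protein == protein}


# Hypothetical identifiers: one spot holds two proteins, and one protein
# appears in two positions on the gel.
spot_1, spot_2 = Spot("spot-1"), Spot("spot-2")
protein_a, protein_b = Protein("PROT_A"), Protein("PROT_B")
links = [
    Identification(spot_1, protein_a, evidence="ms-run-1"),
    Identification(spot_1, protein_b, evidence="ms-run-1"),
    Identification(spot_2, protein_a),
]

print(sorted(protein.accession for protein in proteins_in(spot_1, links)))
print(sorted(spot.spot_id for spot in spots_for(protein_a, links)))
```

Placing the evidence on the link, rather than on the spot or the protein, allows the same protein to be supported by MS data in one spot and left unsupported in another.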

Image analysis applications have the ability to match spots on different gels, believed

to correspond to the same protein: from replicate gels for the same samples, or from gels

over which a sample condition is varied, such as a time course experiment. Spots that have

been matched are linked via a specific class, MatchedSpots, and the class MultipleAnalysis

stores a description of the type of matching that has been carried out.

3.4.5 Two-dimensional difference gel electrophoresis

Data produced from a difference gel electrophoresis experiment is captured in Gla-PSI as

shown in Figure 3.11. Amersham Biosciences produce DIGE (Difference In Gel Electrophoresis)

technology and DeCyder™ software for analysis of gels [74]. DeCyder can export data

in an XML format, known as DeCyderML (personal communication from Amersham Biosciences),

which we have mapped to Gla-PSI.

Figure 3.12: The part of Gla-PSI modelling statistical analysis of a proteomics experiment.

A single DIGE gel can produce several images,

corresponding to the fluorescent dyes used for different samples. DeCyderML contains a

class for the single channel image, with attributes such as dye type, which corresponds with

the class DIGESingleImage. DIGESingleImage is a subclass of ScannedImage and therefore

inherits all the attributes from ScannedImage. The class DIGESingleSpot models spots that

have been identified from a scan of the gel at a single wavelength. DeCyderML also has a

class for storing information about spots that have been calculated from a combination of

scans at several different wavelengths (co-migrated spots). This data can be recorded in the

general Spot class in Gla-PSI. DeCyderML includes information about spots that have been

matched across gels, which can be recorded using the MultipleAnalysis and MatchedSpots

classes that exist in Gla-PSI for storing non-DIGE data. DeCyder software calculates ratios

between pairs of single image spots that have co-migrated, captured in SpotRatio.
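
The DIGE classes can be sketched in Python as follows. The sketch is a simplification under stated assumptions: only some attributes are shown, isMasterGel is represented as a boolean for convenience, the wavelengths and volumes are illustrative values, and the ratio is computed as a simple quotient of normalised volumes, which is an assumption and not a description of how the DeCyder software calculates it.

```python
from dataclasses import dataclass


@dataclass
class ScannedImage:
    scanner: str
    file_uri: str
    resolution: float
    wavelength: float


@dataclass
class DIGESingleImage(ScannedImage):
    """A single-channel image; inherits all ScannedImage attributes."""
    dye_label: str = ""
    is_master_gel: bool = False


@dataclass
class DIGESingleSpot:
    """A spot detected in a scan of the gel at a single wavelength."""
    spot_id: str
    image: DIGESingleImage
    normalised_volume: float


@dataclass
class SpotRatio:
    id1: str
    id2: str
    normalised_ratio: float


def spot_ratio(first, second):
    """Ratio between two co-migrated single-image spots (assumed: a quotient)."""
    if second.normalised_volume == 0:
        raise ValueError("cannot form a ratio against a zero volume")
    return SpotRatio(first.spot_id, second.spot_id,
                     first.normalised_volume / second.normalised_volume)


# Illustrative values only.
cy3 = DIGESingleImage("scanner A", "gel1_cy3.tif", 100.0, 532.0, dye_label="Cy3")
cy5 = DIGESingleImage("scanner A", "gel1_cy5.tif", 100.0, 633.0, dye_label="Cy5")
treated = DIGESingleSpot("spot-17-cy3", cy3, normalised_volume=300.0)
control = DIGESingleSpot("spot-17-cy5", cy5, normalised_volume=150.0)
print(spot_ratio(treated, control).normalised_ratio)
```

Since DIGESingleImage is a ScannedImage, any code written for ordinary scanned images, such as a lookup by resolution, applies to single-channel DIGE images unchanged.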

3.4.6 Statistical analysis

Statistical analysis techniques, such as ANOVA [88] (analysis of variance), are used to locate

spots whose volume is significantly different between two samples, indicating a change in

protein expression under a certain condition. It is vital that the exact details of the analysis

are preserved to ensure that the same procedure can be reproduced by other research groups.

A number of statistical techniques can be applied to large data sets, such as analysis over a

number of replicates, or over a number of gels analysing a varying condition. An example

is cluster analysis, as performed on microarray data sets [86], to detect groups of proteins


sharing similar expression patterns over a number of gels. The StatisticalAnalysis class

accommodates a description of the software or algorithm used to perform the analysis, and

the appropriate parameters and significance levels used (Figure 3.12). Gla-PSI has a link from

a description of the analysis to the raw data. The analysis can be linked to individual spot

records, or spots matched between gels. A formal description of statistical analysis presented

by Papageorgiou [239] covers most of the attributes that are applicable to proteomics analysis,

but is possibly too complex for use in biological applications. Gla-PSI has few attributes,

with the intention that the details of the analysis will be described with data types obtained

from controlled vocabularies. It is desirable that future versions of a proteomics standard

incorporate future statistical standards.
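
As a worked illustration, the sketch below computes a one-way ANOVA F statistic for the volumes of one spot in two groups of replicate gels and stores the result together with a description of the analysis. The attribute names loosely follow the StatisticalAnalysis class, but the parameters, spot_ids and result fields, the software name and the data values are assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for two or more groups of values."""
    group_count = len(groups)
    total_count = sum(len(group) for group in groups)
    grand_mean = mean([value for group in groups for value in group])
    # Variation of the group means around the grand mean.
    ss_between = sum(len(group) * (mean(group) - grand_mean) ** 2 for group in groups)
    # Variation of the individual values around their own group mean.
    ss_within = sum((value - mean(group)) ** 2 for group in groups for value in group)
    return (ss_between / (group_count - 1)) / (ss_within / (total_count - group_count))


@dataclass
class StatisticalAnalysis:
    """What was run, with which settings, on which spots, and the outcome."""
    software: str
    version: str
    algorithm: str
    analysis_type: str
    parameters: Dict[str, str] = field(default_factory=dict)
    spot_ids: List[str] = field(default_factory=list)
    result: Dict[str, float] = field(default_factory=dict)


# Illustrative normalised volumes of one spot on three replicate gels per group.
control = [1.0, 2.0, 3.0]
treated = [2.0, 3.0, 4.0]

analysis = StatisticalAnalysis(
    software="example script",
    version="0.1",
    algorithm="one-way ANOVA",
    analysis_type="differential expression",
    parameters={"significance_level": "0.05"},
    spot_ids=["spot-17"],
)
analysis.result["F"] = one_way_anova_f([control, treated])
print(analysis.result["F"])
```

For these values the between-group sum of squares is 1.5 on 1 degree of freedom and the within-group sum of squares is 4.0 on 4 degrees of freedom, so F = 1.5. Keeping the algorithm, parameters and spot identifiers alongside the result is what allows another group to repeat the calculation.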

3.4.7 Annotation

Gla-PSI is designed to allow annotation of all aspects of the experiment including raw data,

experimental protocols and analysis (Figure 3.13). Annotation may be in the form of free

text or links to external databases or ontologies. MAGE-OM includes classes that allow

annotation to be added and linked to any other part of the model, which have been included

in Gla-PSI. Gla-PSI uses inheritance, such that all classes in the model are subclasses of

Identifiable, inheriting the attributes that allow a name and identifier to be added to each

class (Figure 3.13). Identifiable is a subclass of Describable, which has a relationship

to the annotation classes. Every class also inherits from Describable, enabling all classes

to be linked out to other database entries, or bibliographic references.
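The inheritance chain described above can be sketched in Python. This is a minimal illustration under stated assumptions: the attribute names, the volume attribute on Spot and the accession value are placeholders, not part of the Gla-PSI model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatabaseEntry:
    accession: str
    uri: str = ""


@dataclass
class BibliographicReference:
    citation: str


@dataclass
class Describable:
    """Holds the links to annotation: database entries and references."""
    database_entries: List[DatabaseEntry] = field(default_factory=list)
    references: List[BibliographicReference] = field(default_factory=list)


@dataclass
class Identifiable(Describable):
    """Adds an identifier and a name on top of Describable."""
    identifier: str = ""
    name: str = ""


@dataclass
class Spot(Identifiable):
    volume: float = 0.0   # hypothetical attribute for the sketch


spot = Spot(identifier="spot-001", name="Spot 1", volume=1234.5)
# Any class in the hierarchy can be linked to an external database entry.
spot.database_entries.append(DatabaseEntry(accession="ACC0001"))  # placeholder accession
```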

3.5 Future developments in proteomics standards

The PEDRo proposals have been presented to the research community and there have been

three annual meetings of PSI [236, 237] at which the model has been discussed and refined.

Currently, a new object model is in development, known as PSI-OM (Proteomics Standards

Initiative-Object Model). The rest of the section discusses proposed additions and changes,

giving a snapshot of the current development as of July 2004. There are several parts of

the original proposal that were not modelled correctly and certain omissions, including those

covered in more detail in Gla-PSI.


[UML class diagram. Classes shown include Protein, Spot, SpotSets, SpotRatio, DIGESingleSpot, 2D-PAGE, ScannedImage, ImageAnalysis, MultipleAnalysis, MatchedSpots, StatisticalAnalysis and DIGEAnalysis, together with Identifiable, Describable, Description and the annotation classes ExternalReference (exportedFromServer, exportedFromDB, exportID, exportName), BibliographicReference, Database (version, URI), OntologyEntry (category, value, description) and DatabaseEntry (accession, accessionVersion, URI).]

Figure 3.13: Several classes are subclasses of Identifiable, enabling a unique identifier and name to be attached. Each class is also a subclass of Describable, enabling links to bibliographic references and external database entries to be specified.

3.5.1 An overview of PSI-OM

An overview of the new model is displayed in Figure 3.14. The main features of

the experimental techniques are similar to PEDRo, with a cycle from Analyte to

AnalyteProcessingStep. There is currently no effort to specify a detailed description

of a sample within the model; however, there is a relationship from SourceInformation

to OntologyEntry to specify characteristics of a sample. At the top level is the class

MIAPEDataSet for clustering a set of related proteomics experiments, below which is the

top level of one complete analysis (Project). The concept of a StudyGroup has been

introduced for comparing one set of samples with another. For example, an experiment is

performed to compare mice with a gene knockout X, against wild-type mice. Ten gels are

performed, of which five are replicates from pooled samples of knockout mice tissue, and five

are replicates from wild-type. An instantiation of PSI-OM would contain one instance of

Project and two instances of StudyGroup (one for wild-type and one for knockout). The

source of biological material is captured in the class Source. The model allows either ten

sources of material to be specified for biological replicates (ten different mice), or two sources

of protein that are subsequently split, using the class Subdivision, to specify technical

replicates.
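The knockout example can be written out as a small Python sketch to show the two replicate designs side by side. The class names Project, StudyGroup and Source come from the draft model; the attributes, the subdivide helper and all identifiers are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Source:
    name: str

    def subdivide(self, n: int) -> List[str]:
        """Split one source into n portions (technical replicates)."""
        return [f"{self.name}-portion-{i + 1}" for i in range(n)]


@dataclass
class StudyGroup:
    name: str
    sources: List[Source] = field(default_factory=list)


@dataclass
class Project:
    hypothesis: str
    study_groups: List[StudyGroup] = field(default_factory=list)


# Design 1: ten sources, one per mouse (biological replicates).
biological = Project(
    hypothesis="Knockout of gene X alters protein expression",
    study_groups=[
        StudyGroup("knockout", [Source(f"knockout-mouse-{i + 1}") for i in range(5)]),
        StudyGroup("wild-type", [Source(f"wild-type-mouse-{i + 1}") for i in range(5)]),
    ],
)

# Design 2: two pooled sources, each split into five portions (technical replicates).
pooled = [Source("knockout-pool"), Source("wild-type-pool")]
portions = [portion for source in pooled for portion in source.subdivide(5)]
```

In both designs ten gels are run, but only the first records ten independent sources of biological material.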


Figure 3.14: A draft version of the main components of PSI-OM.


Figure 3.15: Part of PSI-OM showing the relationships between spots identified on a gel and the corresponding protein records.

3.5.2 Data model in PSI-OM

The diagram in Figure 3.15 displays the overview of a proteomics data set. A number

of experiments are packaged together using the class StudyGroupDataSet. The core data

point is an ExpressedProtein which can be linked to a set of classes describing the result

of separation techniques (PhysicalGelSpot, PhysicalBand, Fraction and so on). The

class ExpressedProtein will capture a complex concept, as follows. In a 2-DE experiment

particular proteins may appear in multiple positions on a 2-D gel, which may be the result of

differential splicing of gene products or chemical modifications to the protein. These variant

forms of the protein will usually only be identified by a single protein name or accession

number, however it is vital that the alternative forms are differentiated in the model. An

ExpressedProtein is intended to capture the idea of a single protein form that arises in one

position on a gel, or in one column fraction, resulting from the set of modifications that it

has. If the nature of the modification is known, it can be captured in ProteinModification,

and a reference to a record in a sequence database can be captured in ProteinRecord and

DatabaseEntry. The current model has no detailed specification for MS standards because

these are in development by a separate organisation, and will be added to the model when

finalised.
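The distinction between a sequence database record and the protein forms observed in an experiment can be illustrated with a short Python sketch. The class names follow the draft model; the attributes, the location strings and the accession are placeholders assumed for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ProteinRecord:
    """Reference to a record in a sequence database."""
    accession: str


@dataclass
class ExpressedProtein:
    """One protein form observed at one position (gel spot or column fraction)."""
    location: str
    record: Optional[ProteinRecord] = None
    modifications: List[str] = field(default_factory=list)


record = ProteinRecord(accession="ACC0001")   # placeholder accession

# Two spots on the same gel that resolve to the same database record
# but represent different forms of the protein.
forms = [
    ExpressedProtein("gel1-spot-12", record),
    ExpressedProtein("gel1-spot-13", record, ["phosphorylation"]),
]
distinct_records = {form.record for form in forms}
```

Both forms share one accession, yet they remain separate data points, which is the behaviour the text above requires of the model.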

Figure 3.16: A draft version of the protein data model in PSI-OM. The classes on the left model conventional 2-DE and the classes on the right represent difference gel electrophoresis.

The draft model of protein spot data arising from image analysis has been influenced

by Gla-PSI, and is displayed in Figure 3.16. There are two separate sets of classes for

modelling gel electrophoresis data. The classes on the left of Figure 3.16 model standard

gel electrophoresis, in which one sample is applied to one gel, and multiple samples are

compared on different gels. The classes on the right model data resulting from a DIGE

experiment, in which there are two kinds of spot data: spots arising from scanning a gel at

a single wavelength (DIGESingleSpot), and spots arising from a composite image that has

been calculated from the single channel images (DIGECompositeSpot). The attributes that

will be assigned to classes are still to be finalised, but one issue that must be resolved is the

extent to which ontologies will be utilised. It is possible to include many attributes in the

model for describing protein data, or put the types of attributes in a controlled vocabulary

and link many classes to OntologyEntry. This is an area for future discussion but we believe

that there are considerable advantages to using ontologies extensively, because the controlled

vocabularies can be updated at regular intervals, allowing gradual evolution of the coverage

of the model. It is not possible to update an object model at regular intervals without

generating backward compatibility problems.
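The trade-off between fixed attributes and ontology terms can be made concrete with a small Python sketch. SpotMeasurement and OntologyEntry appear in the draft model; the attribute names, the vocabulary contents and the make_measurement helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class OntologyEntry:
    category: str
    value: str


@dataclass
class SpotMeasurement:
    value: float
    measurement_type: OntologyEntry
    unit: Optional[OntologyEntry] = None


# Hypothetical controlled vocabulary of measurement types.
measurement_types: Set[str] = {"volume", "area", "intensity"}


def make_measurement(kind: str, value: float, unit: Optional[str] = None) -> SpotMeasurement:
    """Create a measurement whose type must come from the vocabulary."""
    if kind not in measurement_types:
        raise ValueError(f"'{kind}' is not in the controlled vocabulary")
    unit_entry = OntologyEntry("Unit", unit) if unit else None
    return SpotMeasurement(value, OntologyEntry("MeasurementType", kind), unit_entry)


volume = make_measurement("volume", 1523.0)

# Extending coverage is a vocabulary update; the SpotMeasurement class is unchanged.
measurement_types.add("circularity")
circularity = make_measurement("circularity", 0.87)
```

Adding the new measurement type required no change to the class definition, so data written against the earlier vocabulary remains valid.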

3.5.3 An ontology for proteomics

The original PEDRo model used ontologies sparingly, taking the approach that an initial

model for proteomics should function as a document for specifying the main components of


a typical workflow to stimulate discussion in the community. The Gla-PSI proposal specifies

that ontologies are required to capture certain parts of the analysis, but there are currently no

major ontologies containing proteomic experimental terms. Therefore, it has been recently

proposed that the MGEDOntology should be extended for proteomics. The MGEDOntology

(MO) includes a controlled vocabulary of terms describing microarray experiments, including

the details of biological samples (described in more detail in the following chapter). There

is no difference between the sample prior to mRNA extraction for a microarray assay or

protein extraction for proteomic analysis, hence parts of MO can describe biological samples

for proteomics. A new ontology, PSI-Ont, is in development and will include terms describing

proteomic experimental techniques. PSI-Ont will be developed as an extension to the MGED

Ontology, and will follow the same structure.

3.5.4 Minimum information about a proteomics experiment

An essential stage in improving the process of exchanging and publishing microarray data

was the release of the MIAME guidelines [41]. MIAME is a checklist of the information that

should be made publicly available to allow the data sets to be re-analysed, or to allow the

experiment to be reproduced, if identical biological samples are available. An equivalent effort

has been initiated by PSI to develop MIAPE (Minimum Information About a Proteomics

Experiment). The guidelines will be formalised after a series of meetings and discussions

via the mailing list. In overview, we believe that MIAPE should contain the following. It

is vital that sufficient description of the biological samples is given so that the validity of

each study group can be established. Researchers should also publish the protein extraction

protocols, detailed descriptions of the protein separation techniques, and the equipment and

protocols utilised for MS. Any software that is used to analyse data should be reported with

a version number, vendor name and contact details. If database searches have been carried

out to identify proteins, there should be a date stamp of when the search was carried out if

the database is updated daily, or a version stamp if the database is released less frequently.
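Since MIAPE had not been formalised at the time of writing, the following Python sketch is purely hypothetical. It shows how a checklist of the items suggested above might be checked automatically; the item names and the example report are invented for illustration.

```python
from typing import Dict, List

# Hypothetical checklist items, paraphrasing the suggestions in the text.
REQUIRED_ITEMS = [
    "sample_description",
    "extraction_protocol",
    "separation_technique",
    "ms_equipment_and_protocol",
    "analysis_software",        # name, version, vendor and contact details
    "database_search_stamp",    # date stamp or version stamp
]


def missing_items(report: Dict[str, str]) -> List[str]:
    """Return the checklist items that are absent or empty in a report."""
    return [item for item in REQUIRED_ITEMS if not report.get(item, "").strip()]


def search_stamp(updated_daily: bool, search_date: str, release: str) -> str:
    """Date stamp for daily-updated databases, version stamp otherwise."""
    return f"searched {search_date}" if updated_daily else f"release {release}"


report = {
    "sample_description": "pooled tissue, five animals per study group",
    "analysis_software": "example-tool 2.1, Example Vendor, support@example.org",
}
still_needed = missing_items(report)
```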

3.6 Discussion

3.6.1 Web access to data

It has been recognised that past funding for large databases of scientific data has not been

sufficient, and as a result, important information is lost [209]. An activity which attempts


to remedy this situation is the effort to develop biochemical pathway databases, such as

KEGG [184]. Information regarding reaction kinetics and functional information has been

published over several decades, but is not generally available in electronic form. Only papers

published in the last decade may be available on the Internet, and data is not presented in any

kind of format that can be mined automatically. Instead, information retrieval techniques

must be used with significant manual intervention. This process is time consuming and will

miss substantial amounts of information. Today, data regarding one biological system is

often too extensive for a single researcher to gain access to by reading published literature,

and automated methods are required. Microarray experts have previously recognised these

needs and efforts are underway to develop large central repositories [42]. In recent years a

parallel effort has been initiated by proteomics researchers, however there are currently no

major central repositories of proteomics data [252]. A standard data format will facilitate

the creation of a central repository that will allow re-analysis of published data as new

statistical techniques are developed. Microarray and proteomics experiments generate large

amounts of data that is of potential use to researchers in many other fields. In particular,

the studies can improve genome annotation by demonstrating conditions in which genes or

proteins have been shown to be up or down regulated, allowing researchers to improve the

functional annotation.

3.6.2 Status of proteome standards

This chapter documents the development of the Gla-PSI model, which we released in October

2003. Gla-PSI represents data from one section of a proteomics workflow and complements

other work undertaken by various organisations. PSI is overseeing the development of a

standard, and is using PEDRo as an initial framework from which to develop a unified

model. Gla-PSI covers image analysis of 2-DE, multiple gel comparison, DIGE and statistical

analysis of large data sets, and represents additional information that should be included

in the next version of the community standard. Capturing experimental protocols in a

structured format is a major challenge due to the enormous range of possible experiments

that could be performed. The MAGE format for microarray has been designed with a flexible

structure that allows it to be extended into new technologies by using ontologies. Gla-PSI

utilises parts of MAGE for adding additional annotation and bibliographic references to

the model. In our original publication on Gla-PSI [176], we stated that classes derived

from MAGE should be used for capturing information about experimental protocols and the


biological samples on which experiments are performed but at that time the integration had

not been completed. The following chapter describes later work, which is the integration

of Gla-PSI, PEDRo and MAGE to create a framework for capturing data from a range of

functional genomics techniques.

In Section 3.5, the development of the next version of the official PSI object model (PSI-

OM) was discussed, which incorporates parts of Gla-PSI and PEDRo. The development of

the object model will take place in conjunction with the creation of an ontology for

proteomics (PSI-Ont), which will be regulated by PSI. An important first stage will be the

creation of a document that specifies the minimum information set that must be published

alongside proteomics data to allow future re-analysis (MIAPE). The development of all three

components (PSI-OM, PSI-Ont and MIAPE) will continue with discussions at official

meetings of PSI, and via an email mailing list. The development of a finalised standard requires

significant contribution from the proteomics community before consensus can be reached.

The complete model should be flexible with regard to new technologies and experimental

protocols. A data standard should not prescribe how researchers carry out experiments,

but should capture enough detail to ensure that useful data archives can be developed. If

a standard is to be accepted, tools must be developed which enable researchers to capture

data conforming to the standard without substantial manual data entry. Laboratory

Information Management Systems (LIMS) are available from commercial software vendors. They

capture instrument parameters, and track solutions using bar-coding. It is likely that future

versions will be specifically tailored for proteomics applications, and software vendors should

provide an output file conforming to the proposed standard. A data set containing 2-DE

images, MS traces, analysis and annotation is fairly bulky; therefore, the development of a

single public database covering all aspects of proteomics for all species is unlikely. A more

feasible solution is the development of distributed, domain-specific proteome databases, such

as databases for a single organism or disease, with data transfer between databases occurring via an XML

data format, created from the object model. It is essential that databases provide wide

ranging query facilities to enable the development of applications that search for data sets

of interest. Data integration applications will be developed to link proteome databases to

other repositories, such as databases of sequences, motifs and structures.
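Data transfer between distributed databases via XML can be sketched with the Python standard library. The element and attribute names below are invented for this example and do not represent the XML format of any proposed standard.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def spot_to_xml(identifier: str, name: str, accession: str) -> str:
    """Serialise one spot record, with its database link, to an XML string."""
    spot = ET.Element("Spot", {"identifier": identifier, "name": name})
    ET.SubElement(spot, "DatabaseEntry", {"accession": accession})
    return ET.tostring(spot, encoding="unicode")


def spot_from_xml(text: str) -> Tuple[str, str, str]:
    """Read the record back, as a receiving database would."""
    spot = ET.fromstring(text)
    entry = spot.find("DatabaseEntry")
    return spot.get("identifier"), spot.get("name"), entry.get("accession")


document = spot_to_xml("spot-001", "Spot 1", "ACC0001")   # placeholder values
received = spot_from_xml(document)
```

Because both sides agree on the element names, a record exported by one database can be imported by another without loss.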


3.7 Conclusions

Gla-PSI has been developed to represent 2-DE, image analysis, difference gel electrophoresis

and statistical processing. It was initially developed at the same time as the PEDRo proposal,

however it was later modified and released to document additional information that should

be recorded in a community wide standard. The model has influenced the development of

the next version of the standard, PSI-OM.

The microarray field has recognised the need for central data repositories and exchange

standards for some time. The additional complexity of proteomics experiments means that

the efforts are some way behind, and there are still no databases that offer access to protein

separation information, quantification data and mass spectrometry. The development of

a proteomics data standard will enable data to be sent to a public database. Chapter 5

describes a prototype system that could serve as a centralised public database for proteomics.

The database stores protocols and data from 2-DE and MS, and facilities for integration with

microarray results are demonstrated. We believe that the efforts of MGED in the microarray

field can be used directly for proteomics, and in the following chapter there is a description

of the unification of the proteomics proposals with MAGE-OM, to create a proposal for a

standard across the whole range of functional genomics techniques.

Chapter 4

Development of a data standard for

functional genomics

4.1 Introduction

In Chapter 2, the importance of data standards for life sciences was outlined and this was

further exemplified in the previous chapter with a description of the development of a data

model for proteomics. The success of the MAGE-ML format for microarrays demonstrates

the feasibility of a community wide standard for capturing data from a diverse range of

experiment types. This chapter covers the integration of the Gla-PSI model into a wider

proposal for functional genomics, published in July 2004 [175], which also includes

substantial detail from MAGE-OM and the draft standard for proteomics, PEDRo. The

new model is known as FGE-OM (Functional Genomics Experiment - Object Model) and has

been presented to the standards organisations for proteomics and microarrays as a proposal

for the integration of the current efforts in both fields.

URL: www.gusdb.org/fge.html

4.1.1 Requirements for standards

The motivation for integrating the current proposals for microarrays and proteomics is as

follows. It is becoming common for research groups to carry out experiments using multiple

types of technology as the cost of performing experiments has fallen. Several institutions

have semi-automated facilities offering a service for performing parts of experiments that

were previously very labour intensive. The functional genomics facility in Glasgow is one

example, offering a sequencing, microarray and proteomics service to researchers [293].

Researchers now generate large volumes of data from diverse techniques that they wish to

compare, or analyse side by side. There are several facets of experiments that can be

described using the same terms. An overview of a functional genomics (FG) experiment can

be described with a text description of the hypothesis, and a parameter that is varied

between different samples, such as the different time points in a time course experiment. The

biological samples used in any type of FG experiment should be described using common

terms because this stage precedes the extraction of mRNA, proteins or metabolites and could

potentially be analysed downstream using any of the experimental techniques. Experimental

protocols from microarrays, 2-DE and other separation techniques can be described as a set

of sequential steps involving substances, actions and equipment. It may also be desirable

that all experiments are annotated with an audit trail, capturing when, where and by whom

the experiments were carried out. Data points in an FG experiment are usually genes or

proteins which may be quantified or localised in one sample compared with another. It is

therefore possible to create a framework containing the common parts of FG analysis as

part of an all-encompassing data format. A shared format that has wide community

acceptance would allow developers to create software capable of formatting all locally generated

FG data into one format that can be exchanged with other researchers or sent to public

databases. The format should be suitably designed such that there is no great overhead if

research groups wish to use only a subset of the entire model, for example if they are only

performing proteomics. It is likely that one single model for FG will require significantly less

effort for developers than creating software to manage four or five separate formats. Finally,

if experimental protocols are captured in a common format it will open up new possibilities

for comparing data produced from different methodologies, allowing researchers to have a

view of the biology that is nearer to the whole system level.
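The idea that protocols from different techniques share one description can be sketched in Python. The class names, attributes and example steps are assumptions made for this illustration and are not taken from MAGE-OM or any other model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Audit:
    """When, where and by whom an experiment was carried out."""
    date: str
    institution: str
    performer: str


@dataclass
class ProtocolStep:
    action: str
    substances: List[str] = field(default_factory=list)
    equipment: List[str] = field(default_factory=list)


@dataclass
class ProtocolApplication:
    technique: str
    steps: List[ProtocolStep]
    audit: Audit


def describe(application: ProtocolApplication) -> List[str]:
    """Render the protocol of any technique as numbered steps."""
    return [
        f"{number}. {step.action} ({', '.join(step.substances + step.equipment)})"
        for number, step in enumerate(application.steps, start=1)
    ]


audit = Audit("2004-07-01", "Example Institute", "A. Researcher")   # invented values
gel = ProtocolApplication(
    "2-DE",
    [ProtocolStep("rehydrate strip", ["rehydration buffer"], ["IPG strip"]),
     ProtocolStep("focus proteins", [], ["isoelectric focusing unit"])],
    audit,
)
array = ProtocolApplication(
    "microarray",
    [ProtocolStep("hybridise sample", ["labelled cDNA"], ["hybridisation chamber"])],
    audit,
)
```

The same describe function handles both techniques, which is the practical benefit of a common protocol structure for software developers.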

An integrated data format will also facilitate the development of public repositories for

storage and querying of functional genomics data. Microarray experiments are used widely

because a large number of assays can be performed concurrently, producing a large number

of possible leads about the genes that are significantly associated with a particular condition

or disease. However, while it had previously been believed that there is a correlation between

the expression of mRNA and protein [115], more recent studies have indicated that mRNA

level is a poor indicator of protein abundance [178]. Proteomics experiments can determine

the relative level of protein produced, and would therefore be expected to be a better indicator

of the level of protein activity. Proteomics experiments can also give information about

post-translational modifications, which may have important effects on the function of the


protein [240]. It is therefore desirable that microarray and proteomics data can be queried

in parallel to determine the extent of gene expression and the level of encoded protein that

has been observed for a particular gene. Protein and RNA expression data should also

be accessible with genomic data, to allow better annotation of the genome with functional

information derived from FG studies, such as protein X is up regulated under condition Y.

A current example of this functionality is offered by the SOURCE database [78], which can

be queried by gene name, and returns textual annotation about the gene, and the relative

expression values from different microarray studies in which it has been assayed. Single data

points from a microarray experiment may not be sufficiently powerful to determine how much

active protein was present in the sample at that time, but can provide functional evidence

if a gene is strongly expressed in a sample or condition, or conversely not expressed where

it might be expected. These kinds of results can be assayed by further experimentation and

lead to the formation of new hypotheses about the function of genes and systems as a whole.
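Querying microarray and proteomics results in parallel can be illustrated with a minimal Python sketch. The gene names, expression ratios and threshold below are invented placeholder values, not experimental results, and the function names are hypothetical.

```python
from typing import Dict, List, Optional

# Invented placeholder ratios (condition versus control), keyed by gene name.
mrna_ratio: Dict[str, float] = {"geneA": 2.1, "geneB": 0.9, "geneC": 3.0}
protein_ratio: Dict[str, float] = {"geneA": 1.1, "geneC": 2.8, "geneD": 0.5}


def query_in_parallel(gene: str) -> Dict[str, Optional[float]]:
    """Return the mRNA and protein ratios observed for one gene, if any."""
    return {"mrna": mrna_ratio.get(gene), "protein": protein_ratio.get(gene)}


def discordant(threshold: float = 1.5) -> List[str]:
    """Genes measured by both techniques whose mRNA is raised but whose protein is not."""
    shared = mrna_ratio.keys() & protein_ratio.keys()
    return sorted(
        gene for gene in shared
        if mrna_ratio[gene] >= threshold and protein_ratio[gene] < threshold
    )
```

A query such as discordant() picks out cases where transcript and protein levels disagree, which are candidates for further experimentation.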

Functional genomics databases should also incorporate information from other types

of study: immunohistochemistry and protein interaction studies, such as yeast two-hybrid

[107]. Such systems would enable data mining applications to be developed that search for

the factors that affect regulation of transcription and translation, and ultimately, protein

function. Integrated databases will aid the development of mathematical models capturing

the effects of changes at the system level, and could provide source data for the modelling

of metabolic pathways [336]. Data mining algorithms could then be employed to search for

genes that may be important in a condition of interest, such as drug targets for a particular

disease.

4.1.2 Status of standardisation

Data standards for proteomics, and other FG experiments, are at a much earlier stage than

microarrays (Figure 4.1). PEDRo was released as a draft proposal to stimulate community

discussion about what was required in a data standard and, aside from the data capture

tool released with PEDRo (PEDRoDC), there have been few implementations of PEML

(Proteomics Experiment Markup Language), the XML-based data exchange language based

on PEDRo. This is because PEML is a complex format, and therefore considerable effort

is required by developers to create software that produces PEML. Furthermore, the benefits

of producing output in PEML at this time are limited, because there are no major public

repositories that accept PEML as input. There are also several parts of PEDRo that do not

adequately capture a proteomics workflow, the most important being insufficient descriptions

of biological samples, and no support for auditing. These two areas are captured in MAGE,

and this part of the object model has been refined over a significant period by a team

of experienced developers. It is vital that the next round of development in proteomics

standards makes extensive use of the experience gained in the development of MAGE. This

process has already begun with several MAGE developers giving oral presentations at the

2004 meeting of the PSI in Nice, France [257].

[Time line with two tracks, Microarray Standards and Proteomics Standards, spanning 1996 to 2006.]

Figure 4.1: A time line displaying the emergence of microarray and proteomics technology, and the efforts to standardise data formats.

FGE-OM offers a possible framework for developing a standard across all FG experiments;

however, an alternative proposal has been released, known as CEBS (Chemical Effects in

Biological Systems) SysBio-OM. SysBio-OM was released after the creation of FGE-OM and

was therefore not available for analysis at the time of development. A comparison of the

features offered by the two systems is made in Section 4.4. The future development of MAGE-

OM and the PSI data standard should take place jointly, using FGE-OM and SysBio-OM

as a framework around which it can be coordinated.

FGE-OM captures microarray and proteomics data, including separation techniques such

as two-dimensional gel electrophoresis (2-DE), and protein identification by mass

spectrometry (MS). The model also stores experimental protocols, raw data and data analysis.

FGE-OM comprises three namespaces that organise the classes in logical subsets: BioOM,

ArrayOM and ProteomicsOM (Figure 4.2). Substantial detail from MAGE-OM has been

used to develop BioOM (the part of the model that is generic), and ArrayOM (the parts

of the model specific to microarrays). BioOM contains a set of packages and classes that

describe an experiment using microarrays, proteomics, or potentially other functional

genomics techniques. The ProteomicsOM namespace captures information from

proteomics-specific technologies. The object model has been implemented as a relational database,

known as RAPAD (RNA And Protein Abundance Database), which is described in the

following chapter.

[Diagram: FGE-OM, the top level of the object model, contains three namespaces. BioOM: components common to all functional genomics experiments. ArrayOM: microarray-specific components. ProteomicsOM: classes modelling proteomics technologies.]

Figure 4.2: An overview of the FGE-OM object model. The model is divided into three namespaces: BioOM, ArrayOM and ProteomicsOM.

The rest of the chapter is structured as follows. Section 4.2 outlines the methodology used

to create FGE-OM. A detailed description of FGE-OM is given in Section 4.3 and Section 4.4

briefly describes the contents of the alternative SysBio-OM proposal, and compares it with

FGE-OM. Finally, a plan for how the development of an integrated standard for functional

genomics can take place is outlined in Section 4.5.

4.2 Methods

FGE-OM was developed using an evolutionary software development model. MAGE-OM

was imported into a UML editing tool, and changes were made to accommodate proteomics

data. The PEDRo object model has not been released in UML format, however a database


schema has been released in SQL, which matches the object model very closely. Therefore,

the PEDRo database schema was reverse engineered and imported into the editing tool.

Additional classes were added manually from Gla-PSI where required. The initial

development involved the creation of class diagrams to model parts of proteomics experiments, using

components derived from MAGE-OM where possible. This was followed by a phase of

discussion between several developers to test whether hypothetical proteomics workflows were

adequately covered in the object model. In cases where FGE-OM did not correctly model

a possible workflow, refinements were made to the model. The model was further refined

after the objects had been mapped to relations, and deployed as a relational database. At

the time of development there had been no complete implementation of the PEDRo model

or database schema, therefore several classes had to be refined to reflect real data sets.
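The reverse engineering step can be illustrated in miniature. The following Python sketch is not the tool used in this work; it reads table definitions from an in-memory SQLite database and lists each table as a candidate class with typed attributes. The table shown is a hypothetical stand-in, not part of the PEDRo schema.

```python
import sqlite3
from typing import Dict


def tables_to_classes(conn: sqlite3.Connection) -> Dict[str, Dict[str, str]]:
    """Map each table to a candidate class: {table name: {column: SQL type}}."""
    classes: Dict[str, Dict[str, str]] = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Each row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        classes[table] = {column[1]: column[2] for column in columns}
    return classes


conn = sqlite3.connect(":memory:")
# Hypothetical table used only to demonstrate the mapping.
conn.execute("CREATE TABLE Gel2D (id INTEGER PRIMARY KEY, description TEXT)")
classes = tables_to_classes(conn)
```

A mapping of this kind recovers classes and attributes only; associations between classes would still need to be derived from foreign keys or added by hand.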

FGE-OM was developed in UML using the modelling tool Poseidon™ [249], into which

the source models were imported. Poseidon has the advantage over other tools that there

is a version that is freely available, offering sufficient functionality to view and edit UML

class diagrams. It is vital that as many developers as possible have access to the object

model, beyond being able to view images of class diagrams. The main alternative, Rational

Rose [266], is expensive software which precludes many researchers from analysing models.

There is a major compatibility problem between the UML versions specified by different

vendors. UML is intended to be a standard notation but there is currently no robust method

of transferring models between tools. An interchange format for UML, XML Metadata

Interchange (XMI) [356], has now been defined that may improve compatibility in the future,

but the current implementations of XMI only specify the contents of the model, not the

diagrams that have been drawn to represent the model. Therefore, once an object model

has been imported, diagrams must be redrawn by the developers, which is a laborious task.

4.2.1 Ontologies

An ontology can be described as the result of knowledge capture about a particular domain, in

a formal structure [138]. The use of ontologies in life sciences is rapidly increasing, because it

is believed that they can improve facilities for data re-use and integration [300]. The MGED

Ontology (MO) has been created to capture terms used in a microarray experiment [304].

Each entry contains a term, a definition and a specification for where the term should be used

in the model. An example term viewed with the OilEd editor [28] is displayed in Figure 4.3

(OilEd is described in Chapter 2). The ontology contains classes, properties and instances


Figure 4.3: A screenshot of the term “Age” in the MGED Ontology viewed with OilEd.

(individuals in OilEd). A class is the type of information (e.g. Age), the properties of the

class are its attributes (e.g. “has Measurement” and “Initial time point”) and the actual

values are the instances (e.g. years). There is also a definition of the term, in the case of

Age: The time period elapsed since an identifiable point in the life cycle of an organism. If

a developmental stage is specified, the identifiable point would be the beginning of that stage.

Otherwise the identifiable point must be specified such as planting.

The class OntologyEntry from MAGE-OM is used widely to store terms obtained from

controlled vocabularies, along with the source of the vocabulary. Ontologies are vital for

capturing the complexity of biological samples used in functional genomics.

EXAMPLE: Two FG experiments are performed. The biological material of the first is a

cell culture grown in a specific medium, and the second is a tissue sample from a person

suffering from heart disease.

It would be extremely difficult to engineer a schema to capture this range of information in

a structured way. For example, without an ontology, a model to capture the species of origin


could be designed with a class Species and an attribute scientificName. However, this

can pose major problems for querying because a name can be represented in different ways:

consider abbreviations, different classification systems and user errors. This problem was

avoided in MAGE-OM, by designing classes that have a relationship to OntologyEntry, for

instance called speciesName. The model would be instantiated by obtaining the value from

a taxonomic database, along with an ID number and a URL pointing to the source data.

In FGE-OM, OntologyEntry is used in this way in all three namespaces, and many of the

terms in the MGED Ontology can be used for both microarrays and proteomics.
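
The querying problem described above can be illustrated with a minimal Python sketch. The class below is a simplified stand-in for OntologyEntry rather than the MAGE-OM definition; its field names, the accession "TAXON:0000" and the URL are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyEntry:
    """Simplified stand-in for OntologyEntry (field names are assumed)."""
    category: str    # the kind of term, e.g. "speciesName"
    value: str       # the term itself, taken from the external source
    accession: str   # ID number in the source database
    source_url: str  # URL pointing to the source data


# Free-text names: three spellings of one species do not compare equal.
free_text = ["Mus musculus", "M. musculus", "mus musculus "]
print(len(set(free_text)))  # 3

# Ontology-backed entries: every record carries the same accession.
# The accession and URL are illustrative placeholders, not real identifiers.
mouse = OntologyEntry("speciesName", "Mus musculus", "TAXON:0000",
                      "http://example.org/taxonomy")
samples = {"sample_a": mouse, "sample_b": mouse}


def samples_for(accession: str) -> list[str]:
    """Return the names of samples whose species entry has this accession."""
    return sorted(name for name, entry in samples.items()
                  if entry.accession == accession)


print(samples_for("TAXON:0000"))  # ['sample_a', 'sample_b']
```

A query on the accession finds both samples regardless of how the name would have been typed by a user.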

EXAMPLE: A comparative 2-D gel analysis is being used on tissue from the hearts of two

samples of mice, one of which has a genetic defect. One characteristic that the researchers

want to capture is the gender of the mice.

The gender is specified by a relationship from the class BioMaterial to OntologyEntry

called Characteristics. OntologyEntry captures the category (Gender), the value (Male)

and the term’s definition. In many cases the usage is more complex because classes in the

ontology can have subclasses to build up a hierarchical structure; in fact, Gender is a subclass

of Sex. The hierarchy is expressed in the object model by a reference from one instance of

OntologyEntry to another. The overall effect of the use of ontologies is the delegation of the

task of describing the domain to a different process, the ontology development, instead of

representing all concepts in the object model. This is advantageous because ontologies can

be easily extended without affecting the core functionality, but an object model must stay

fixed for a significant period of time, and cannot gradually evolve.
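
The gender example can be sketched as follows. This is an illustrative simplification, not the FGE-OM class definitions: the attribute name parent (for the reference from one OntologyEntry to another) and the method lineage are assumptions introduced here.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntologyEntry:
    category: str
    value: str
    definition: str = ""
    parent: Optional[OntologyEntry] = None  # assumed name for the entry-to-entry reference

    def lineage(self) -> list[str]:
        """Categories from this entry up to the top of the hierarchy."""
        names = []
        node: Optional[OntologyEntry] = self
        while node is not None:
            names.append(node.category)
            node = node.parent
        return names


@dataclass
class BioMaterial:
    name: str
    characteristics: list[OntologyEntry] = field(default_factory=list)


sex = OntologyEntry(category="Sex", value="")
gender = OntologyEntry(category="Gender", value="Male", parent=sex)
heart_tissue = BioMaterial("heart tissue, mouse group 1", [gender])

print(heart_tissue.characteristics[0].lineage())  # ['Gender', 'Sex']
```

New terms or deeper hierarchies only add OntologyEntry instances; neither class needs to change, which is the delegation to ontology development described above.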

The MGED ontology will be extended further to incorporate standard terms used in

protein studies. There are examples of how the ontology has been implemented in a relational

database in the following chapter (page 170). Other ontologies, such as the Mouse Anatomy

Ontology [45] and the Plant Ontology [247], can also be used to describe biological samples

where required. The usage of other external ontologies will be vital because the MGED

Ontology will never contain all the terms to describe any kind of sample on which microarrays

could be performed. However, separate ontologies will become available from specific research

communities and, as long as the source and definition of a term are clearly stated, then

structured descriptions of biological samples can be captured. This will greatly improve the

facilities for querying databases in the future to find relevant data sets.


Figure 4.4: A complete listing of the packages within FGE-OM.


[Figure 4.5 (diagram not reproduced). Packages: Experiment, Protocol, BioMaterial, Measurement, BioAssay, BioAssayData, BioEvent, Description, BioSequence, BQS, HigherLevelAnalysis, AuditAndSecurity. Classes: Identifiable, Extendable, Describable.]

Figure 4.5: The packages and classes in the BioOM namespace of FGE-OM. The boxed packages have been altered from MAGE-OM, others are identical to packages in MAGE-OM. Open arrows indicate inheritance, for example Identifiable is a subclass of Describable (the superclass) and inherits all the attributes from Describable.

4.3 Overview of FGE-OM

FGE-OM models microarray and proteomics data and a complete listing of the packages

and classes is given in Figure 4.4. All classes in BioOM and ArrayOM are derived from

MAGE-OM. In ProteomicsOM, classes in the packages MassSpecData, MassSpecProtocol

and ProteinSeparation have been derived from PEDRo, classes in ProteinData and

ProteomeBioAssay are from Gla-PSI, and ProteinRecord contains newly created classes. In the

rest of this section there is a description of the three namespaces, and the relationships that

exist between classes residing in different namespaces. The use of the model in the context

of a sample biological workflow is also described. A set of detailed diagrams, displaying the

attributes of classes and the cardinality of relationships, is displayed in Appendix B.

4.3.1 BioOM

Figure 4.5 shows the packages in the BioOM namespace. BioOM covers the components

in FGE-OM that are common to all experiment types. The majority of the packages are

identical to packages of the same name in MAGE-OM, as described in Chapter 2, and the

technical documentation that describes MAGE-OM can be obtained via the MGED web site

[212]. There are components of packages BioAssay and BioAssayData (from MAGE-OM)

that contain array specific information, which have been placed in newly created packages

within the ArrayOM namespace.

[Figure 4.6 (diagram not reproduced). Packages: Array, ArrayBioAssay, ArrayDesign, ArrayBioAssayData, QuantitationType, DesignElement.]

Figure 4.6: The packages in the ArrayOM namespace. The boxed packages are newly created in FGE-OM but contain a number of classes derived from MAGE-OM. The other packages are identical to packages with the same name in MAGE-OM.

The three abstract classes at the top level, Extendable, Describable, and Identifiable, are unchanged from MAGE-OM, and most classes inherit

their attributes. Identifiable allows a name and an identifier to be added to classes.

Describable enables links to external ontologies, data ownership and an audit trail to be

attached. Extendable enables a triplet of attributes: Name, Value, Type to be attached to

any class for storage of properties that are not recorded in other parts of the model.
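
The role of the three abstract classes can be sketched in Python. The chain Identifiable, Describable follows Figure 4.5; deriving Describable from Extendable is our reading of MAGE-OM and should be treated as an assumption, as should every attribute name and the class ExampleEntity, which is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NameValueType:
    """The (Name, Value, Type) triplet that Extendable attaches to a class."""
    name: str
    value: str
    type: str


@dataclass
class Extendable:
    property_sets: list[NameValueType] = field(default_factory=list)


@dataclass
class Describable(Extendable):
    # Stand-in for links to ontologies, data ownership and the audit trail.
    descriptions: list[str] = field(default_factory=list)


@dataclass
class Identifiable(Describable):
    identifier: str = ""
    name: str = ""


@dataclass
class ExampleEntity(Identifiable):
    """Hypothetical concrete class that inherits all three capabilities."""


entity = ExampleEntity(identifier="example:001", name="pH 4-7 gel")
# A property with no dedicated attribute in the model is stored as a triplet.
entity.property_sets.append(NameValueType("stain", "silver", "string"))

print(isinstance(entity, Extendable), entity.property_sets[0].value)  # True silver
```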

The BioAssay package in MAGE-OM contains a class describing the hybridization

of mRNA to an array. This class has been relocated in our model to ArrayOM,

and a new package (ArrayBioAssay) has been created in ArrayOM containing the

Hybridization class. The rest of the classes in BioOM:BioAssay are the same as in

MAGE-OM. The BioOM:BioAssayData package contains only five classes: BioAssayData,

BioAssayDimension, MeasuredBioAssayData, BioDataTuples and BioDataValues. The

five classes are identical to those in MAGE-OM. These classes specify the general structure

and location of data from any type of experiment and therefore reside in the BioOM names-

pace. BioAssayDimension allows experimental data to be packaged together across a range

of conditions, such as multiple array or multiple gel comparisons.

4.3.2 ArrayOM

Packages unchanged from MAGE-OM

The ArrayOM namespace (Figure 4.6) contains the packages derived from MAGE-OM which

are microarray specific. The packages Array, ArrayDesign and DesignElement describe the


layout of features on a microarray and have not been altered. QuantitationType includes

details of how array data is analysed using any of the available statistical packages, and is

therefore also included in ArrayOM. However, various data types from functional genomics

experiments could be quantified in similar ways, using standard statistical tests. Therefore,

an alternative design would be to include a generic package in BioOM modelling statistical

processing, recording the software used, and the parameters employed. This design was

considered but has not been implemented at this stage. The software for statistical analysis

of microarray data is continuously evolving and, apart from image analysis, there are no

dedicated statistical packages for quantifying proteomics data.

Differences from MAGE-OM

The ArrayBioAssayData package is a modified version of the BioAssayData package in

MAGE-OM. ArrayBioAssayData includes the MAGE-OM derived class BioDataCube that

represents the three dimensions of data: the array features; the parameter that is varied

across a multiple array experiment; and the values calculated for each array feature, such

as the relative fluorescence. BioDataCube captures the order of the three dimensions, and

stores pointers to separate files containing large quantities of numerical data. The three

dimensions of data also exist in a proteomics experiment, and potentially in other functional

genomics experiments, therefore in theory it should be possible to create a generic data

model in BioOM that models the dimensions of data. However, the BioDataCube is possi-

bly too simplistic to capture proteomics data, having only an ordering and pointers to lists

of values in files. In proteomics, a multiple 2-DE experiment may detect certain proteins

present on one gel and not another, calculated by image analysis software. The comparison

of multiple gels can be error prone and spots matched across multiple gels may have scores

assigned to the quality of the match. Spots may also be matched based on experimental

evidence, such as MS data. A generic data model covering all types of functional genomics

experiments would have to be more complex and would require major changes to the rela-

tionships between classes derived from PEDRo. The ArrayBioAssay package contains only

Hybridization, which is linked to classes in BioOM:BioAssay.
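
The statement that BioDataCube holds only an ordering and pointers to lists of values can be made concrete with a short sketch. It assumes that the order is written as a three-letter code over B (bioassay), D (design element) and Q (quantitation type), and that the first letter varies slowest in the external file; both points are our assumptions and should be checked against the MAGE-OM documentation. The function flat_index is hypothetical.

```python
def flat_index(order: str, sizes: dict[str, int], position: dict[str, int]) -> int:
    """Offset of one value in a flat list of numbers, for a given dimension order.

    order:    a permutation of "B", "D" and "Q"; the first letter varies slowest
    sizes:    number of entries along each dimension
    position: zero-based index along each dimension
    """
    if sorted(order) != ["B", "D", "Q"]:
        raise ValueError(f"not a permutation of B, D, Q: {order!r}")
    index = 0
    for axis in order:
        if not 0 <= position[axis] < sizes[axis]:
            raise IndexError(f"{axis} index out of range")
        index = index * sizes[axis] + position[axis]
    return index


sizes = {"B": 2, "D": 4, "Q": 3}  # 2 arrays, 4 features, 3 value types
where = {"B": 1, "D": 2, "Q": 0}  # second array, third feature, first value type

print(flat_index("BDQ", sizes, where))  # 18
print(flat_index("QDB", sizes, where))  # 5
```

Nothing in this scheme records a match score or supporting evidence for a value, which is why the text suggests it may be too simple for spots matched across multiple gels.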

4.3.3 ProteomicsOM

The proteomics namespace (Figure 4.7) is a further development of PEDRo and Gla-PSI.

[Figure 4.7 (diagram not reproduced). Packages: ProteinSeparation, MassSpecProtocol, ProteomeBioAssay, MassSpecData, ProteinRecord, ProteinData.]

Figure 4.7: The ProteomicsOM namespace.

The design of PEDRo was based on different principles from that of MAGE-OM. MAGE-OM was intended to be future proof, by including generic attributes in classes, and allowing

data types to be specified using controlled vocabularies of terms, rather than specifying

explicitly in the model which data types should be stored in which position. PEDRo contains

specific named attributes for all the data types that may need to be recorded. In 2-DE, a

gel is used to separate thousands of proteins into individual spots. An image of the gel is

analysed with specialised software that produces output about gel spots, such as an estimate

of volume, area, the coordinates on the gel and many others. PEDRo aims to explicitly define

all of the data types that are produced by current image analysis software and therefore will

require modification in the future. A model following MAGE design principles would have

a placeholder for the first data type and value, followed by the second data type and value,

and so on. ProteomicsOM includes the classes from PEDRo in new packages; however, the

classes have been linked explicitly to components in BioOM that allow generic protocols and

parameters to be attached, as required. The following sections describe the classes that are

contained within the six packages of ProteomicsOM.
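
The two design principles can be contrasted in a small sketch. Neither class is taken from PEDRo or MAGE-OM; ExplicitSpot, GenericSpot and their attributes are hypothetical and serve only to show the trade-off.

```python
from dataclasses import dataclass, field


@dataclass
class ExplicitSpot:
    """PEDRo style: one named attribute per data type (names are illustrative)."""
    x: float
    y: float
    area: float
    volume: float


@dataclass
class GenericSpot:
    """MAGE style: an ordered list of (data type, value) placeholders."""
    values: list[tuple[str, float]] = field(default_factory=list)

    def get(self, data_type: str) -> float:
        for name, value in self.values:
            if name == data_type:
                return value
        raise KeyError(data_type)


explicit = ExplicitSpot(x=10.5, y=42.0, area=3.2, volume=118.0)
generic = GenericSpot([("x", 10.5), ("y", 42.0), ("area", 3.2), ("volume", 118.0)])

# A new output from image analysis software needs no change to GenericSpot.
generic.values.append(("circularity", 0.87))
print(generic.get("circularity"))        # 0.87

# ExplicitSpot would need a new attribute, that is, a new version of the model.
print(hasattr(explicit, "circularity"))  # False
```

The explicit form is easier to validate and query, while the generic form survives changes in software output; ProteomicsOM keeps the explicit classes and links them to the generic BioOM components.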

ProteinSeparation Package

The ProteinSeparation package describes a number of separation techniques, including 2-

DE and liquid chromatography, and is summarised in Figure 4.8. Classes modelling sep-

aration techniques are subclasses of BioAssayTreatment within BioOM. An instance of

BioAssayTreatment can be linked to Protocol, which allows any type of protocol information from hardware or software to be added, along with a set of parameters.

[Figure 4.8 (diagram not reproduced). Separation techniques: Gel2D, LCColumn. Separation products: PhysicalGelSpot, Fraction. Source biomaterial: BioMaterial, BioAssayTreatment, BioMaterialMeasurement. The legend distinguishes BioOM classes from ProteomicsOM classes.]

Figure 4.8: The ProteinSeparation package contains classes that model the relationship between separation techniques and the products of those techniques.

This mechanism can be used to store additional information about proteome experiments, if the attributes

specified in the part of the model derived from PEDRo do not cover the information that

must be recorded. This mechanism will be particularly important for storing information

about nascent technologies that cannot be covered by PEDRo as it stands. The products of

a separation technique, such as a gel spot, or column fraction are modelled as classes, with

a set of attributes capturing the relevant parameters, and are subclasses of BioMaterial.

The classes Gel2D and LCColumn have a large number of attributes that are not displayed in

Figure 4.8 for clarity (Gel2D records the gel dimensions, pI and molecular weight range and

so on). However, more detailed diagrams displaying all the attributes and the cardinality of

relationships are included in Appendix B. A separation product can become the input for

another separation technique, therefore the model utilises a link from BioAssayTreatment

to BioMaterial via BioMaterialMeasurement to specify the source of material. These three

classes are all contained within BioOM.
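
The chaining of separation techniques can be sketched as below. The class names follow Figure 4.8, but the attributes, the back-reference produced_by, the method produce and the function history are assumptions added so that the chain can be traversed in a few lines.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class BioMaterial:
    name: str
    produced_by: Optional[BioAssayTreatment] = None  # assumed back-reference


@dataclass
class BioMaterialMeasurement:
    material: BioMaterial
    amount: float
    unit: str


@dataclass
class BioAssayTreatment:
    name: str
    source: Optional[BioMaterialMeasurement] = None

    def produce(self, product: BioMaterial) -> BioMaterial:
        """Record this treatment as the origin of a separation product."""
        product.produced_by = self
        return product


class LCColumn(BioAssayTreatment):
    """Separation technique (attributes omitted)."""


class Gel2D(BioAssayTreatment):
    """Separation technique (attributes omitted)."""


class Fraction(BioMaterial):
    """Separation product (attributes omitted)."""


class PhysicalGelSpot(BioMaterial):
    """Separation product (attributes omitted)."""


def history(material: BioMaterial) -> list[str]:
    """Names of the treatments that led to this material, earliest first."""
    steps = []
    treatment = material.produced_by
    while treatment is not None:
        steps.append(treatment.name)
        treatment = treatment.source.material.produced_by if treatment.source else None
    return steps[::-1]


extract = BioMaterial("soluble protein extract")
column = LCColumn("ion exchange column", BioMaterialMeasurement(extract, 200.0, "ug"))
fraction = column.produce(Fraction("fraction 7"))
gel = Gel2D("2-D gel", BioMaterialMeasurement(fraction, 50.0, "ug"))
spot = gel.produce(PhysicalGelSpot("spot 112"))

print(history(spot))  # ['ion exchange column', '2-D gel']
```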

ProteomeBioAssay package

The ProteomeBioAssay package contains only one class, GelImageAnalysis; however, new relationships have been added to enable the re-use of classes in BioOM:BioAssay in the protein context (Figure 4.9).

[Figure 4.9 (diagram not reproduced). Classes shown: BioAssayTreatment, PhysicalBioAssay, BioAssay, Image, Channel, ImageAcquisition, GelImageAnalysis, MeasuredBioAssay, FeatureExtraction, MeasuredBioAssayData, BioAssayData, OntologyEntry. Relationship labels: target, treatment, format. The legend distinguishes BioOM classes from ProteomicsOM classes.]

Figure 4.9: The relationship between the GelImageAnalysis class, in the ProteomeBioAssay package, with classes from the BioAssay package in the BioOM namespace.

These relationships have the following semantics. FeatureExtraction from MAGE-OM models the process by which data is extracted from a scanned microarray. In ProteomicsOM, GelImageAnalysis is a subclass of FeatureExtraction, and models

the process of analysing a 2-D gel with specialist software. FeatureExtraction is linked

to PhysicalBioAssay, which is linked to the source image (Image), the scanning process

(ImageAcquisition) and information about a specific channel or wavelength at which the

array has been scanned (Channel). These classes can be re-used in proteomics, to refer to the

scanning of a 2-D gel. The Channel class is re-used from MAGE to model the technique of

difference gel electrophoresis, in which a single gel is scanned at a number of different wave-

lengths. Data that is obtained from image analysis is stored in classes linked to BioAssayData

in the ProteinData package. There are two relationships from MeasuredBioAssay, one to

the data model in MeasuredBioAssayData, the other to FeatureExtraction. This en-

ables the raw data, MeasuredBioAssayData, to be linked to the process by which it was

generated (scanning and image analysis are referenced through FeatureExtraction and

PhysicalBioAssay).
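
The two relationships from MeasuredBioAssay can be followed in a short sketch that traces a data file back to the scan that produced it. The class names are those used in the text; the attribute names, the direction of each reference and the function provenance are simplifying assumptions, not the FGE-OM definitions.

```python
from dataclasses import dataclass


@dataclass
class ImageAcquisition:
    scanner: str


@dataclass
class PhysicalBioAssay:
    name: str
    acquisition: ImageAcquisition


@dataclass
class FeatureExtraction:
    software: str
    physical_bioassay: PhysicalBioAssay


class GelImageAnalysis(FeatureExtraction):
    """2-D gel image analysis, modelled as a subclass of FeatureExtraction."""


@dataclass
class MeasuredBioAssayData:
    data_file: str


@dataclass
class MeasuredBioAssay:
    feature_extraction: FeatureExtraction  # the process that generated the data
    data: MeasuredBioAssayData             # the raw data itself


def provenance(assay: MeasuredBioAssay) -> dict[str, str]:
    """Collect the processing steps behind one set of raw data."""
    extraction = assay.feature_extraction
    return {
        "data": assay.data.data_file,
        "analysis": extraction.software,
        "assay": extraction.physical_bioassay.name,
        "scan": extraction.physical_bioassay.acquisition.scanner,
    }


scan = ImageAcquisition("flatbed scanner")
gel_assay = PhysicalBioAssay("gel 3", scan)
analysis = GelImageAnalysis("image analysis software", gel_assay)
measured = MeasuredBioAssay(analysis, MeasuredBioAssayData("gel3_spots.txt"))

print(provenance(measured)["scan"])  # flatbed scanner
```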

ProteinData Package

The ProteinData package models information about gel spots (Figure 4.10). Spot data is

captured in IdentifiedSpot, which has attributes covering data types produced by image

analysis software. The model also captures data from difference gel electrophoresis.

[Figure 4.10 (diagram not reproduced). Classes shown: GelImageAnalysis, FeatureExtraction, IdentifiedSpot, PhysicalGelSpot, BioMaterial, DIGESingleSpot, BioDataTuples, BioDataValues, MultipleAnalysis, MatchedSpots, PhysicalBioAssay, BioAssayData, BioAssayDimension, SpotRatio. The legend distinguishes BioOM classes from ProteomicsOM classes.]

Figure 4.10: The ProteinData package.

Spots from the single channel image are captured in DIGESingleSpot, and co-migrated spots from the composite image are stored in IdentifiedSpot. Spot data is linked to the gel from which

it was produced because IdentifiedSpot is a subclass of PhysicalGelSpot, which is directly

linked to Gel2D in the ProteinSeparation package (Figure 4.8). Spot data is linked back to

the image analysis from which it was produced via BioAssayData and MeasuredBioAssay,

as described above (Figure 4.9). The ProteinData package also captures multiple gel com-

parisons. BioAssayDimension in BioOM models multiple sample comparisons, and is used

in ProteomicsOM by the addition of a link to MatchedSpots, modelling spots matched across

multiple gels to capture differential expression of proteins. MultipleAnalysis is a subclass

of GelImageAnalysis and records the software used for the multiple gel comparison, and

this groups together a set of MatchedSpots in one analysis.

MassSpecProtocol and MassSpecData packages

The packages capturing MS data and protocols contain classes derived from PEDRo (Figure

4.11). MS protocols are modelled by a package called MassSpecProtocol which contains

a class at the top level called MassSpecExperiment. MassSpecExperiment is a subclass

of BioAssayTreatment that can be used to link to the biological substance on which MS

has been performed (in BioMaterial). The substance can be the product of a series of

separation techniques, such as a spot from a 2-D gel.

[Figure 4.11 (diagram not reproduced). MassSpecProtocol package: MassSpecExperiment and PEDRo-derived classes modelling the MS protocol. MassSpecData package: PeakList, Peak and PEDRo-derived classes modelling database searches. BioOM classes shown: BioAssayTreatment, BioMaterialMeasurement.]

Figure 4.11: The model of MS data and protocols, adapted from PEDRo.

[Figure 4.12 (diagram not reproduced). Classes shown: Protein, ProteinModification, Location, OntologyEntry, DatabaseEntry. Relationship labels: species, modificationType.]

Figure 4.12: The ProteinRecord package.

PEDRo-derived classes specify many of the parameters that are associated with an MS instrument, along with the type of ionisation employed, such as electrospray or MALDI (described in Chapter 1). Additional text and

parameters not covered in these classes can be attached using the generic Protocol class

in BioOM, linked to BioAssayTreatment. This ensures that the model can be extended

to include protocols from different MS instrument manufacturers, new software, and new

technologies. A new package, MassSpecData, has been defined to capture the list of peaks

from a trace and the database searches that are subsequently carried out. Proteins identified

by MS analysis and database searches are stored in the ProteinRecord package.

ProteinRecord package

A new package was designed to store details of proteins identified in an investigation (Figure

4.12). The class Protein can be referenced from MS data that has been used for protein


identification. The protein identifier and database URL are captured in DatabaseEntry, and

the species of origin in OntologyEntry (from BioOM:Description). ProteinModification

stores information about modifications that have been observed. The type of modification,

such as glycosylation or phosphorylation, is obtained from a controlled vocabulary and cap-

tured in OntologyEntry. The position of the modification is captured in Location.
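
A minimal sketch of the ProteinRecord package is given below. The class names follow Figure 4.12; the attribute names, the accession "ACC0001", the URL and the interpretation of Location as a residue number are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyEntry:
    category: str
    value: str


@dataclass
class DatabaseEntry:
    accession: str
    url: str


@dataclass
class Location:
    position: int  # assumed here to be the residue number of the modified amino acid


@dataclass
class ProteinModification:
    modification_type: OntologyEntry  # term taken from a controlled vocabulary
    location: Location


@dataclass
class Protein:
    entry: DatabaseEntry
    species: OntologyEntry
    modifications: list[ProteinModification] = field(default_factory=list)


# The accession and URL are placeholders, not real database identifiers.
protein = Protein(
    entry=DatabaseEntry("ACC0001", "http://example.org/proteins/ACC0001"),
    species=OntologyEntry("species", "Mus musculus"),
)
protein.modifications.append(
    ProteinModification(OntologyEntry("modificationType", "phosphorylation"), Location(15))
)

phospho_sites = [m.location.position for m in protein.modifications
                 if m.modification_type.value == "phosphorylation"]
print(phospho_sites)  # [15]
```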

4.3.4 A workflow for proteomics

A sample workflow is displayed in Figure 4.13, demonstrating how FGE-OM captures pro-

teomics data. The overview of the experiment is modelled by the class Experiment. If

the experiment includes multiple samples, for example comparing a number of 2-D gels,

the parameter that is varied between samples, such as the different genotypes of groups of

organisms, is attached to classes referencing Experiment. A biological substance must be

processed to extract proteins, and make the proteins soluble in a multi-stage process. This

is modelled by a series of treatments (Treatment) applied to a substance (BioMaterial), to

produce the final soluble mixture of proteins, on which certain separation techniques may be

performed. Protein separation techniques, such as 2-DE or liquid chromatography, are mod-

elled as specialised subclasses of BioAssayTreatment. Each BioAssayTreatment has a mea-

sured source of material, which is captured in BioMaterial and BioMaterialMeasurement.

When data is produced after imaging a 2-D gel, an instance of PhysicalBioAssay is created.

PhysicalBioAssay can be referenced by the class ImageAcquisition, representing the scan-

ning of the gel. 2-DE image analysis is represented by GelImageAnalysis, which is a subclass

of FeatureExtraction. Gel spot data produced by image analysis can be stored in specific

classes in the ProteomicsOM namespace, linked to image acquisition via MeasuredBioAssay.

If MS is performed on a spot excised from a gel, or a fraction from a column, an instance of

BioMaterial is created, modelling the physical entity that is the excised spot or fraction.

MassSpecExperiment is a subclass of BioAssayTreatment, which can be linked to the source

of material. MS data obtained from a particular gel spot is linked directly to data produced

by image analysis of the spot, which is captured in MeasuredBioAssayData.

4.4 Other work: CEBS object model for systems biology data

Subsequent to the development of FGE-OM, a new model covering several functional ge-

nomics techniques has been published [355]. This section reviews the new model, and dis-

cusses how it can contribute to the on-going standards work for FG.


[Figure 4.13 (diagram not reproduced). Elements shown: Experiment, Treatment, BioMaterial, Material Type (DNA, RNA, Protein, Cell, ...), BioMaterialMeasurement, BioAssayTreatment, Gel2D, LCColumn, MassSpecExperiment, PhysicalBioAssay, Image, ImageAcquisition, Acquisition Protocol, FeatureExtraction, GelImageAnalysis, MeasuredBioAssay, MeasuredBioAssayData.]

Figure 4.13: A workflow for a proteomics experiment involving 2-DE or liquid chromatography to separate proteins, followed by MS to identify proteins. Diamonds indicate events, rectangles are physical entities and ovals represent data.

[Figure 4.14 (diagram not reproduced). Classes shown: QuantitationType (attribute isBackground : boolean), SpecializedQuantitationType, StandardQuantitationType, Intensity, IonCount, MassValueType, Ratio, Time, Volume, DerivedSignal, Score, PValue, ConfidenceIndicator. Association ends: confidenceIndicators (0..3) and targetQuantitationType (1).]

Figure 4.14: A subset of classes in the QuantitationType package from SysBio-OM. Darker boxes are newly created classes in the model, lighter boxes represent classes that have not been changed from MAGE-OM.

The CEBS object model, SysBio-OM, has been created with similar goals to FGE-OM

and will support a database for toxicogenomics. Toxicogenomics is the study of the effects

of toxicological compounds on gene and protein expression. The model has been created

by merging MAGE-OM and PEDRo, and adding additional classes to model metabolomics

data. SysBio-OM has been developed with the requirements of toxicogenomics in mind, but

the authors claim that it covers generic types of microarray, proteome or metabolome study.

There is no division of technologies into separate namespaces, as in FGE-OM, but new classes

have been added to the packages in MAGE-OM, and two packages, CommonBioAssayData

and SummaryData, have been newly designed. CommonBioAssayData covers protein ex-

pression, protein-protein interaction and metabolomics data, and SummaryData captures

a textual overview of the data to allow a researcher to decide whether a data set may be

relevant without requiring a full data analysis. At the top level there is very little difference

between SysBio-OM and FGE-OM, both have the classes Identifiable, Describable and

Extendable linked to many of the classes in the model. SysBio-OM is identical to MAGE-OM (and FGE-OM) in the packages AuditAndSecurity, Array, ArrayDesign, DesignElement, BQS, HigherLevelAnalysis and Description. The BioAssayData package in SysBio-OM is identical to that in MAGE-OM, whereas in FGE-OM it has been split into two packages.


4.4.1 SysBio-OM data model

The SysBio-OM QuantitationType package contains two superclasses at the top level,

SpecializedQuantitationType and StandardQuantitationType. There are several newly

designed classes in SysBio-OM, including PeakAbundance, Intensity, Percentage and Volume,

all of which are subclasses of SpecializedQuantitationType (Figure 4.14 displays a subset

of classes in the package). These classes capture measurement data for various types of FG

experiment. The MAGE-OM derived classes for quantifying microarray data are subclasses

of StandardQuantitationType. SysBio-OM is not restrictive in the kinds of measurement

that can be used for different technologies, and is therefore more generic than the equivalent

section of FGE-OM. FGE-OM captures measurement data for proteomics in specific classes

in the ProteinData and MassSpecData packages, and microarray data in QuantitationType.

The approach taken in SysBio-OM may be superior for this section, and should be considered

as a possible design for an extension to the QuantitationType package in the next version of

MAGE.

The CommonBioAssayData package is a new feature in SysBio-OM (Figure 4.15) to

model proteomics and metabolomics data. Rows of numerical data are represented by

CommonBioDataTuples, and single data points are subclasses of DataElement (boxed in

Figure 4.15). The raw data values are stored in the class QuantitationDimension in the

CommonBioAssayData package. It is not clear how the model captures information about

spots matched across multiple gels.

The Measurement package in SysBio-OM is an extension of the MAGE-OM package,

incorporating many different types of measurement and units that could be used in functional

genomics. In MAGE-OM, and SysBio-OM, each class has an attribute unitNameCV with

an enumeration of values, e.g. the class TimeUnit has an enumeration containing the values:

years, months, weeks, d, h, s, us, ns, fs, other. The option other is included in almost all

classes in the Measurement package and causes problems for developing applications based

on the model because it is not specified how the type other is controlled or used. The FGE-

OM Measurement package does not have any of the specific classes for units but instead has

two links to the OntologyEntry class to specify the type and name of the unit (Figure 4.16).

This design may be superior because the names of units are not hard coded in the model,

avoiding the problem of the attribute other, and it is therefore unlimited in what can be

captured. It would be a simple task to incorporate all the measurement types and units into the

MGED Ontology, which already includes most of those added to SysBio-OM.
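
The difference between the two Measurement designs can be sketched as follows. The enumeration reproduces the TimeUnit values listed above; the classes OntologyEntry and Measurement, their attribute names and the helper parse_time_unit are simplified assumptions rather than the definitions in either model.

```python
from dataclasses import dataclass
from enum import Enum


class TimeUnit(Enum):
    """Enumeration style: the permitted unit names are fixed in the model."""
    YEARS = "years"
    MONTHS = "months"
    WEEKS = "weeks"
    D = "d"
    H = "h"
    S = "s"
    US = "us"
    NS = "ns"
    FS = "fs"
    OTHER = "other"  # nothing records what "other" actually stands for


@dataclass(frozen=True)
class OntologyEntry:
    category: str
    value: str


@dataclass
class Measurement:
    """Ontology style: unit type and unit name are two OntologyEntry links."""
    value: float
    unit_type: OntologyEntry
    unit_name: OntologyEntry


def parse_time_unit(text: str) -> TimeUnit:
    """Map a unit name onto the enumeration, falling back to OTHER."""
    try:
        return TimeUnit(text)
    except ValueError:
        return TimeUnit.OTHER


# A unit outside the enumeration collapses to OTHER and its name is lost.
print(parse_time_unit("decades"))  # TimeUnit.OTHER

# With ontology links the same unit is stored by name and type.
age = Measurement(2.0, OntologyEntry("UnitType", "time"), OntologyEntry("TimeUnit", "decades"))
print(age.unit_name.value)  # decades
```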


Figure 4.15: The CommonBioAssayData package from SysBio-OM. The boxed classes are discussed in the text.

Figure 4.16: The top image shows a small subset of classes from the Measurement package in SysBio-OM, the lower is the Measurement package in FGE-OM.

Figure 4.17: The Protocol package from SysBio-OM. The boxed classes are newly created.


4.4.2 SysBio-OM Protocol and BioMaterial packages

The Protocol package in SysBio-OM diverges from MAGE-OM (Figure 4.17) by introducing

new classes for different types of protocol (1-D gel, 2-D gel, MS database search and NMR).

The model does not specify what attributes belong to these classes, therefore this may create

confusion for developers using this part of SysBio-OM. The Protocol package in MAGE-OM

was intended to be independent from technology and can therefore be re-used with no change

for any type of FG experiment. The addition of new classes without attributes does not add

significantly to what can be captured by this part of the model. A new design that can

capture all the information in Protocol of SysBio-OM but remain generic would introduce

a new relationship from the Protocol class to OntologyEntry called protocolType, which

captures the type of protocol, such as 2-D or 1-D gel.
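
The proposed protocolType relationship can be sketched in a few lines. This illustrates the suggestion made above and is not part of SysBio-OM or MAGE-OM; the attribute names, the Parameter class and the example protocols are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OntologyEntry:
    category: str
    value: str


@dataclass
class Parameter:
    name: str
    value: str


@dataclass
class Protocol:
    name: str
    protocol_type: OntologyEntry  # the proposed protocolType relationship
    parameters: list[Parameter] = field(default_factory=list)


protocols = [
    Protocol("First dimension IEF", OntologyEntry("ProtocolType", "2-D gel"),
             [Parameter("pH range", "4-7")]),
    Protocol("SDS-PAGE", OntologyEntry("ProtocolType", "1-D gel")),
    Protocol("Second dimension", OntologyEntry("ProtocolType", "2-D gel")),
]

# One generic class serves every technique; the type is data, not a subclass.
two_d = [p.name for p in protocols if p.protocol_type.value == "2-D gel"]
print(two_d)  # ['First dimension IEF', 'Second dimension']
```

A new kind of protocol then requires only a new ontology term, not a new class in the model.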

The BioMaterial package in SysBio-OM has several new classes, modelling gel spots and

column fractions, derived from PEDRo (Figure 4.18). These classes also exist in FGE-

OM but reside in the ProteomicsOM namespace in order to leave the BioMaterial package

independent of any technology; however, the core functionality of the two models is very

similar for this part. It may be advantageous to put technology specific classes in separate

packages, as in FGE-OM, so that it is easier for developers to understand the intended usage

of the model and focus only on the parts of the model that are required.

4.4.3 SysBio-OM BioAssay and SummaryData packages

The BioAssay package in SysBio-OM is displayed in Figure 4.19. The intended usage of the

package is very similar to a combination of BioAssay, ArrayBioAssay and ProteomeBioAssay

in FGE-OM. A new class, GelFeatureExtraction, models the process of gel image analysis

enabling the classes Image, Channel and ImageAcquisition from MAGE-OM to be re-used

in the proteomics context. Another new class, CommonBioAssayCreation, models techniques

such as a 2-D gel, NMR or a column separation, and links to data acquisition and raw data,

such as images, through the PhysicalBioAssay class. CommonBioAssayCreation functions

in a very similar way to BioAssayTreatment in FGE-OM (although BioAssayTreatment

also exists in SysBio-OM with a different function). CommonBioAssayCreation references

the source material for a treatment through BioMaterialMeasurement in exactly the same

way as in FGE-OM. PhysicalBioAssay has associations with classes modelling column or

NMR data files for metabolomics data (NMROutputFile and ColumnFractionOutputFile).

Figure 4.18: The BioMaterial package from SysBio-OM.

Figure 4.19: The BioAssay package from SysBio-OM.

The SummaryData package is a new development proposed in SysBio-OM (diagram not shown), which contains only two classes, QualitativeOrSummaryData and

DataInterpretation. These two classes are for adding textual descriptions onto the ex-

periment and it remains to be seen how this differs from what can be captured in the

Experiment package.

4.5 Discussion

The object model, FGE-OM, was created in UML to represent both proteomics and microar-

ray experiments. FGE-OM is based on MAGE-OM and incorporates additional information

from PEDRo and Gla-PSI. There are three namespaces in the new model: BioOM, ArrayOM,

and ProteomicsOM. The BioOM namespace is suitable for describing a generic functional

genomics experiment, encompassing microarrays, 2-DE, histochemistry and others. The

ProteomicsOM namespace was defined from PEDRo and Gla-PSI, and includes classes with

attributes covering 2-DE, MS and data analysis. ProteomicsOM has been integrated with

BioOM, enabling generic protocols, including details of hardware or software, to be attached

to specific classes. FGE-OM uses inheritance from several key superclasses: experimental

techniques are modelled as subclasses of BioAssayTreatment and the products of treatments

are subclasses of BioMaterial. This framework will allow new models describing other tech-

nologies to be added into FGE-OM without significant difficulty, allowing a unified model

for functional genomics to be created in the future. An important use of FGE-OM will be to

generate an XML Schema, to allow research groups to format data in a consistent manner

into FGE-ML, a markup language based on the model. A software toolkit is also required,

based on the microarray software toolkit (MAGEstk), for creating FGE-ML from the object

model.
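
The inheritance pattern described above can be pictured with a minimal sketch. Only the superclass names BioAssayTreatment and BioMaterial come from the model; the subclasses GelSeparation and GelSpot, and all attributes, are hypothetical names chosen for illustration and are not taken from FGE-OM itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BioMaterial:
    """Generic superclass for any material consumed or produced by a treatment."""
    name: str


@dataclass
class BioAssayTreatment:
    """Generic superclass for an experimental technique applied to material."""
    name: str
    inputs: List[BioMaterial] = field(default_factory=list)
    outputs: List[BioMaterial] = field(default_factory=list)


# Hypothetical technology-specific subclasses (illustrative names only).
@dataclass
class GelSeparation(BioAssayTreatment):
    stain: str = "unknown"


@dataclass
class GelSpot(BioMaterial):
    spot_id: int = 0


def trace_products(treatments):
    """Return the names of every material produced by a list of treatments."""
    return [m.name for t in treatments for m in t.outputs]


sample = BioMaterial("host cell lysate")
gel = GelSeparation("2-D gel run 1", inputs=[sample], stain="silver")
gel.outputs.append(GelSpot("spot 17", spot_id=17))
print(trace_products([gel]))
```

Because generic code such as trace_products depends only on the two superclasses, a new technology could in principle be added as further subclasses without changing it.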

FGE-OM has been created by merging models that have slightly different design principles.

MAGE-OM was intended to be “future proof” by including generic classes that could

be used for various technologies. Conversely, PEDRo aimed to describe the current status of

proteomics experiments, recognising that future developments would require changes to the

model. The forthcoming versions of both MAGE-OM and the protein model, PSI-OM, will

undergo changes that may bring about the convergence of the different design principles. In

other words, MAGE-OM will include classes for some parts of the model that capture the

standard cases more simply, and PSI-OM will utilise more generic classes to model exper-

imental protocols and biological samples. This issue is outlined in detail in the following

section. We believe that the design process for the next version of both MAGE-OM and


PSI-OM should be guided by the experience of developers who have attempted to create

software based on the two models. It is our view that ontologies should be used extensively,

to reduce the burden on the developers to create an object model that captures all possible

uses of the technology.

FGE-OM demonstrates that the integration of the two current versions of the object

models is feasible. We believe that even if the next versions of the models are developed

independently, the framework described here can be easily evolved, reflecting the changes to

the new object models, and there are significant benefits to capturing both microarray and

proteomic technology in the same structure.

4.5.1 FGE-OM, SysBio-OM and future standards

The CEBS SysBio-OM model is an alternative proposal for an FG data standard. There are

currently no major proposals specifically for metabolomics; however, CCPN (A Collaborative

Computing Project for the NMR Community) is fairly well established in the NMR com-

munity and contains an object model and programming interface [113]. The metabolomics

part of SysBio-OM comprises a simple model of NMR data, therefore the CCPN proposals

may be able to contribute to the efforts, and both models should stimulate discussion in the

metabolomics field as to the requirements for a data standard.

In overview of SysBio-OM, new classes have been added to seven MAGE packages to

cover proteomics, and two new packages have been created. The object model has been used

for generating code that acts as a bridge between flat data files and the CEBS database,

and it is planned that future functionality will enable import and export of MAGE-ML and

the future proteome standard, PSI-ML. Another function of SysBio-OM is to act as a pro-

posal for the future development of an integrated data standard across several fields. The

design of certain packages, such as the QuantitationType package, serves this purpose well,

because it is generic, and can capture a wide range of quantitation types. The design of

other packages such as BioMaterial and Protocol mixes the generic approach of MAGE with

technology specific classes. It is our view that this may cause problems because the design of

MAGE will change for the next version, and the PEDRo proposals are changing to become

PSI-OM, as reported in the previous chapter. Therefore, it is likely that a large amount of

work will be required to redesign these packages to reflect the changes to MAGE-OM and

PSI-OM, but this should not be the case for FGE-OM. FGE-OM separates different tech-

nologies with only a few key relationships linking classes in different namespaces, and the


original functionality of MAGE-OM packages is maintained in almost all cases. Therefore,

when PSI-OM is finalised it can be easily merged with the next version of MAGE, using

FGE-OM as a guide. The packages CommonBioAssayData and BioAssay in SysBio-OM

function in a similar way to a combination of ProteinData and the three related BioAs-

say packages in FGE-OM. The CommonBioAssayData package (SysBio-OM) appears to be

more generic than ProteinData (FGE-OM) and utilises inheritance from the superclasses

DataElementDimension, DataElement and QuantitationType for the three dimensions of

data. It remains to be seen how this works in practice, but if a successful implementation

of this part of the model is demonstrated in the CEBS database, this may represent a good

framework for developing a generic data model across all FG experiments. It is likely that

the best design of a standard for FG will take parts of both SysBio-OM and FGE-OM and

a potential framework for this integration is described below.

4.5.2 Developments to MAGE-OM

The division of FGE-OM into namespaces is a simple but important concept that should

make a large object model easier to understand, allowing developers to focus more quickly on

the relevant parts. The next version of MAGE is planned to contain a core of components

that are shared across all types of FG experiment, similar to the BioOM namespace. A

structured description of the purpose of the experiment, the biological samples and the

parameter that is varied across samples is the most important part of the core. All types

of FG experiment can be described in this way and the use of the MGED Ontology, and

extensions to it, will be an essential component. This part of the design ensures that the

purpose of the experiment can be determined very easily by manual or automated inspection

of files rather than having to parse all the information in the document and search for the

differences between the samples. For example, the purpose of an experiment may be to

determine the changes in gene expression between two cell lines, one of which had gene X

knocked out. This information must be easy to search for as it is one of the most crucial

parts of the experimental annotation. FGE-OM, MAGE-OM and SysBio-OM have classes

at the top level, ExperimentFactor and ExperimentFactorValue, which allow the critical

characteristics and differences between the samples under comparison to be specified. These

classes are vital for the purpose of the experiment to be easily understood and therefore the

FG data standard should retain them at the top level. A database should ensure that these

attributes are stored in a way that allows rapid querying and programmatic access to this part


of the annotation. We believe that the next version of MAGE will benefit from the proposed extensions

of SysBio-OM and FGE-OM. The Quantitation and CommonBioAssayData packages from

SysBio-OM offer a generic framework for capturing FG data and could be incorporated into

the core namespace. In FGE-OM, the simplification of the Measurement package may be

advantageous and should be considered.
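
As a rough illustration of storing experimental factors so that they can be queried rapidly, the following sketch uses an in-memory SQLite database. The table and column names are assumptions made for this example and do not reproduce the RAD or RAPAD schema; only the concepts ExperimentFactor and ExperimentFactorValue come from the object models.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables mirroring ExperimentFactor and ExperimentFactorValue.
conn.executescript("""
CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE experiment_factor (
    id INTEGER PRIMARY KEY,
    experiment_id INTEGER REFERENCES experiment(id),
    category TEXT
);
CREATE TABLE experiment_factor_value (
    id INTEGER PRIMARY KEY,
    factor_id INTEGER REFERENCES experiment_factor(id),
    value TEXT
);
-- An index on the factor category keeps lookups by factor fast.
CREATE INDEX idx_factor_category ON experiment_factor(category);
""")
conn.execute("INSERT INTO experiment VALUES (1, 'Knock-out comparison')")
conn.execute("INSERT INTO experiment_factor VALUES (1, 1, 'genetic modification')")
conn.executemany(
    "INSERT INTO experiment_factor_value VALUES (?, ?, ?)",
    [(1, 1, "wild type"), (2, 1, "gene X knocked out")],
)


def experiments_varying(category):
    """Return (experiment name, factor value) pairs for one factor category."""
    rows = conn.execute(
        """SELECT e.name, v.value
           FROM experiment e
           JOIN experiment_factor f ON f.experiment_id = e.id
           JOIN experiment_factor_value v ON v.factor_id = f.id
           WHERE f.category = ?
           ORDER BY v.id""",
        (category,),
    )
    return rows.fetchall()


print(experiments_varying("genetic modification"))
```

The purpose of an experiment is then recovered with a single query over the factor tables, rather than by comparing the full annotation of every sample.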

The next version of MAGE aims to fix semantic annotation problems with the current

version that have been discovered over several years since its release. PEDRo has been

widely accepted as a draft standard from which the first formal proteomics standard can

be developed. It is vital that PSI-OM, which will supersede PEDRo, utilises the experience

gained from MAGE to avoid the same problems. One general criticism of MAGE-OM is that

for certain concepts it is “over engineered”, in other words, the designers attempted to define

a model that could cover all eventualities but the most common case is captured in a complex

way. Large efforts are required from software developers to create applications that produce

MAGE-ML, and there are still relatively few public databases that offer MAGE-ML input

and output, although this feature is in development for almost all microarray databases.

The next version of MAGE is likely to make greater use of the OntologyEntry class, and

PSI-OM should also utilise ontologies to capture complex concepts. The PSI ontology (PSI-

Ont) will become an extension of the MGED Ontology. PSI-OM will be designed with

the consideration of future integration with MAGE, and the separate mass spectrometry

standards that are under development (as described in the previous chapter).

4.5.3 Integrated standards

The development of an integrated standard requires joint meetings between PSI and MGED.

The two organisations are now committed to co-developing a standard, however the devel-

opment of MAGE will first focus on the creation of a core module, based around similar

principles to BioOM. The last meeting of PSI (Nice, France 2004) was attended by sev-

eral key developers of MAGE, and the previous MGED programming workshop (European

Bioinformatics Institute, Cambridge, UK Dec 2003) had presentations by members of PSI.

FGE-OM was presented at both meetings by the author. It is vital that collaboration con-

tinues between the two organisations. This requires principal investigators to present work

to the wider biological research community to ensure that there is a good awareness and

support for the standard. The Object Management Group (OMG) was involved with the

development of MAGE-OM, providing a framework for checking the consistency of the object


model. The future FG standard should also be vetted through OMG, because while this in-

troduces extra developmental stages, there are likely to be fewer problems that arise once the

model is being used by a large community. Finally, there needs to be a number of workshop

meetings in which developers of MAGE and PSI-OM work together to define a format that

captures everything that is required in the two fields. The format should support functional

genomics, not just microarrays and proteomics, therefore researchers in other parts of FG

research should also be aware of the efforts. It is likely that a data format will only gain

widespread support once several major databases are committed to its development. One

other consideration is that a data format that can encompass a range of functional genomics

techniques may be too bulky for many users who use only a single technique and wish to

utilise a subset of the standard. If the different namespaces are well designed, it will be pos-

sible to derive the single technology data formats from the model, MAGE-ML and PSI-ML,

for transferring results to databases storing only microarray or proteomics experiments.
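
The idea of deriving a single-technology format from a namespaced model can be sketched as a simple selection over namespaces. The assignment of class names to namespaces below is an assumption made for illustration, and the mapping of each format to a pair of namespaces is a simplification of what a real schema generator would need to do.

```python
# Assumed assignment of a few class names to namespaces, for illustration only.
NAMESPACES = {
    "BioOM": {"Experiment", "BioMaterial", "BioAssayTreatment"},
    "ArrayOM": {"ArrayBioAssay"},
    "ProteomicsOM": {"ProteomeBioAssay"},
}

# Each single-technology format keeps the shared core plus one namespace.
FORMATS = {
    "MAGE-ML": ["BioOM", "ArrayOM"],
    "PSI-ML": ["BioOM", "ProteomicsOM"],
}


def classes_for(format_name):
    """Return the sorted class names a given format would need to serialise."""
    selected = set()
    for namespace in FORMATS[format_name]:
        selected |= NAMESPACES[namespace]
    return sorted(selected)


print(classes_for("PSI-ML"))
```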

In the following chapter, the development of an Internet accessible database is described,

which will ultimately form part of a large system for functional genomics. The CEBS

database will also offer access to various types of FG data, and it is likely that several

other systems will come on-line in the next few years. It is important that developers of

different systems collaborate at an early stage to avoid the data incompatibility problems

that have arisen over the last decade in biomedical research, which make the challenge of

data integration so great.

4.6 Conclusion

The chapter has described the development of an object model for functional genomics. FGE-

OM comprises three namespaces that have been created to reflect the different components

in a large biological investigation. BioOM contains twelve packages and ArrayOM contains

six packages that match very closely the structure of MAGE-OM. The third namespace,

ProteomicsOM, comprises six packages that contain classes derived from PEDRo and Gla-

PSI. FGE-OM is intended to demonstrate a potential schema for the integration of microarray

and proteomics data standards, and acts as a proposal from which the next version of MAGE-

OM can be developed. The division into namespaces should allow the model to evolve as

the proteomics and microarray proposals change, and also creates a framework that enables

object models from other types of FG experiment to be integrated. FGE-OM has been

presented to PSI to influence the design of the finalised standard for proteomics, and to


MGED to generate discussion about the next version of MAGE-ML. The model has been

verified against real data by the development of a database implementation that matches

the structure of the object model very closely, described in the following chapter.

Chapter 5

A prototype public database for

proteomics

5.1 Introduction

The main aim of the research presented in this thesis is to improve the facilities for data

sharing and querying in functional genomics (FG). In the previous chapter, the definition of

a functional genomics object model was given, which acts as a proposal for a data standard.

In this chapter, a database implementation is discussed which is capable of storing data from

both microarrays and proteomics. The RAPAD (RNA And Protein Abundance Database)

system is an extension of the RAD microarray database from the University of Pennsylvania,

into which a proteomics component has been incorporated. There are many database systems

for storing microarray data (ArrayExpress, GEO and SMD, summarised in Chapter 2) and

several initial attempts to capture proteomics data, of which SWISS-2DPAGE is the most

well established. However, there is no major public repository that covers both protein

separations experiments and mass spectrometry, and an integration of data from microarrays

and proteomics has not previously been demonstrated in a database.

5.1.1 Extending existing technology

A description of different experiment types used in FG was given in Chapter 1. In overview,

a typical proteomics experiment involves obtaining a set of samples produced under different

conditions and attempting to separate, identify and (possibly) quantify the proteins present

in the different samples. Where proteomics differs from microarray analysis is the range of

different methods that could be used at each stage to get the final result, including: multiple

separation stages, novel techniques for quantifying protein abundance, and identification of-

ten through mass spectrometry (MS) accompanied by database searches. A significant part

of the challenge of formally describing this information occurred during the development of

the object models described in the previous two chapters. Therefore, the major implementa-

tion challenges involved creating interfaces for capturing data and protocols, development of

complex query facilities, visualisation of results and data integration. In Section 5.2, there

is a description of databases that exist for capturing proteomics data, however none offer

a complete solution storing 2-DE, MS, and experimental protocols. Therefore, a system is

required that can capture a complete proteomics workflow in a structured format that can be

queried. The decision to extend the RAD system into proteomics rather than develop a new

database from scratch was based on several criteria. Firstly, it is important that microarrays

and proteomics data can be queried side by side, and in conjunction with other functional

genomics data. This will be facilitated by having a shared database schema and user inter-

face, and it will be easier to produce a mapping from an object model, such as FGE-OM,

to a database if the general structure is similar. There is already a close correspondence

between RAD and MAGE-OM, therefore a large part of FGE-OM is already mapped to

a relational representation. Secondly, RAD is a part of the GUS system which is a major

public repository, providing access to genomic sequence data, ESTs, RNA, SAGE [332] and

gene expression data. One of the long term goals of GUS is to incorporate proteomics, im-

munohistochemistry, and cell anatomy components, creating a single access point to many

types of functional genomics data (Figure 5.1). Therefore, RAPAD also serves as a prototype

for developing a proteomics namespace in GUS which, when complete, will provide access to

2-DE, MS and other proteomics data for major web sites such as PlasmoDB [21], ToxoDB

[187] and GeneDB [127]. Thirdly, the time required to develop a large system is significantly

reduced if developing on top of established software, compared with developing de novo. In

summary, RAPAD was developed with several major goals that are explored in the rest of

the chapter:

• RAPAD functions as a prototype for a major public repository for proteomics data,

and ultimately will form part of GUS.

• The implementation was created to provide a framework for developing tools for in-

vestigating the correlation between gene expression and protein abundance, stored in

the same database.

• The current implementation, while serving as a prototype for the future development

of a public resource, also acts as a platform for supporting on-going proteomics

research. RAPAD currently supports two projects at the University of Glasgow: changes

in protein expression of host cells following invasion by Toxoplasma gondii, described

in Chapter 6; and the determination of the proteome of Trypanosoma brucei (Chapter

7). It is planned that the current implementation will be extended and used to manage

large volumes of data produced by the Functional Genomics Facility in Glasgow [293].

Figure 5.1: A summary of several workflows in functional genomics to illustrate the requirements for data integration. (Workflows shown: genome sequencing, microarray analysis, proteomics and immunohistochemistry, all feeding into data integration.)

5.1.2 The development of RAPAD

The approach taken during the development of RAPAD is as follows. A large database

schema has been designed (174 tables), based closely on the RAD system, and new table

definitions have been created covering proteomics (51 tables). The main advantage of devel-

oping on top of RAD is that it is already MIAME compliant, and a set of tables exist that

correspond to MAGE-OM objects. Therefore, the same tables can be used to store objects

defined in the BioOM (generic) and ArrayOM (microarray specific) namespaces in FGE-OM.

There is software under development for transferring data between MAGE-ML and RAD

(MAGE - RAD Translator, Mr T [202]), which can be adapted to import MAGE-ML into

RAPAD, and extended to map data stored in RAPAD to the finalised functional genomics

data format, based on FGE-OM. The proteomics component was derived primarily from

the PEDRo database schema and the Gla-PSI object model. The PEDRo database schema

matches the PEDRo object model very closely, much of which contributed to FGE-OM; therefore,

mapping concepts from the object representation (in the ProteomicsOM namespace) to

the RAPAD database schema was not a major challenge. Additional tables were imported

from the Core namespace of GUS for storing login and security data, and from the SRes

namespace of GUS for storing taxonomic information, bibliographic references and contact

details. The different namespaces that exist in GUS are described in more detail in Section

5.2. The RAPAD schema, and the web interface, are freely available for download from

the FGE-OM web site (www.gusdb.org/fge.html). Figure 5.2 summarises the correspondence between classes in FGE-OM and

tables in RAPAD. There are additional components that are modelled in FGE-OM using

ontologies, which are stored in RAPAD using other GUS-derived tables, described in Section

5.3.1.


Figure 5.2: A mapping from classes in FGE-OM to database tables in RAPAD.


5.1.3 Chapter guide

The rest of the chapter is structured as follows: previous work on databases and ontologies

is described in the next section. The methods used to develop the schema, the user interface

and perform data integration are outlined in Section 5.3. Section 5.4 describes the current

implementation and the database schema, using the web interface to illustrate examples.

Section 5.5 includes a discussion of the technology, and how it can be extended in the future.

5.2 Previous work

5.2.1 GUS

The GUS database developed at the University of Pennsylvania is an established system

storing functional genomics data. GUS provides the database facilities for several major web

sites that allow access to genome, transcriptome and EST data for various organisms,

including Plasmodium falciparum [248], Trypanosoma brucei [127], Toxoplasma gondii [320]

and several others. GUS also supports AllGenes [10], which is a gene index for human

and mouse created from collections of ESTs and mRNA sequences (described in more detail

in the following chapter). GUS consists of several namespaces that have been developed

independently and added into one large schema. The tables comprising the namespaces can

be viewed in the GUS Schema browser [141], and fall into the following categories: Core, App,

DoTS, RAD, SRes and TESS. Core stores details of users, projects, and information about

a specific database implementation. The App namespace stores help pages and information

specific to the application that is using GUS. The DoTS database supports the AllGenes web

sites and consists of tables for storing details of genes, mRNAs and ESTs. The different types

of sequence data can be associated together, for example if an EST and mRNA sequence

both arise from a single gene sequence, all entries can be linked together through a single

entity (a DoTS gene). This enables a user to map different types of database identifier back

to the same gene. The RAD database is the gene expression component of the database, and

various features are described in the rest of the chapter. SRes contains a variety of tables for

storing contact details, taxonomy, phenotype, general ontologies, associations to the Gene

Ontology and others. TESS (Transcription Element Search System) stores information about

transcription factor binding sites and can be used for predicting new sites according to a

statistical model.
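
The linking of different sequence records through a single gene entity can be illustrated with a small lookup structure. The identifiers and record layout below are invented for the example and are not real DoTS entries or the actual DoTS table design.

```python
from collections import defaultdict

# Hypothetical records: (identifier, record type, gene entity it is linked to).
LINKS = [
    ("EST:0001", "EST", "GENE.1"),
    ("MRNA:0042", "mRNA", "GENE.1"),
    ("EST:0002", "EST", "GENE.2"),
]

# Forward lookup: any identifier maps back to its gene entity.
gene_of = {identifier: gene for identifier, _type, gene in LINKS}

# Reverse lookup: each gene entity lists all records linked to it.
members = defaultdict(list)
for identifier, record_type, gene in LINKS:
    members[gene].append((record_type, identifier))


def same_gene(id_a, id_b):
    """True if both identifiers are known and linked to the same gene entity."""
    gene_a = gene_of.get(id_a)
    return gene_a is not None and gene_a == gene_of.get(id_b)


print(same_gene("EST:0001", "MRNA:0042"))
print(members["GENE.1"])
```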


5.2.2 Proteomics database

There are several existing proteomics databases that are described in this section. The most

established proteomics database is SWISS-2DPAGE [153] that offers static gel images that

can be clicked on to access protein data, and allows searches on proteins by accession number,

text search over descriptions and the author of the study. The data format used by SWISS-

2DPAGE was described in Chapter 3. There are a number of other systems developed using

the software available from SWISS-2DPAGE that offer similar capabilities. In general, 2-D

gel databases on the Web tend to offer only static pictures of 2-D gels with links to pages

about spots identified on the gels, but at best have very limited search facilities, and little

or no information about experimental protocols or biological samples. The data from these

systems can usually only be queried by manual browsing of web pages.

The GELBANK system [20], also described in Chapter 3, has facilities for searching,

and has a visualisation system for gels that allows zooming, but stores only very basic

information about experimental protocols: the gel stain, a brief description of the starting

sample and a description of the first and second dimension separation. GELBANK does

not store mass spectrometry data, and SWISS-2DPAGE only stores basic MS information

without any information about the quality of the match, therefore in these systems it is

difficult to place a confidence value on the correct identification of a protein.

There are several systems that manipulate gel images and enable Web publishing of data.

The GelScape system is one example, which allows researchers to register gel images and lists

of identified spots, storing data in text files [365]. GelScape is not supported by a DBMS

therefore cannot offer complex query facilities.

There are several commercial systems for storing MS data, such as RADARS [106], how-

ever these systems tend to be very expensive and are therefore not feasible for many labora-

tories. LIMS (Laboratory Information Management Systems) applications are also available

from software companies that have facilities for storing 2-D gels, but are usually only acces-

sible at high cost and are generally geared toward sample tracking in a generic laboratory

experiment, rather than specifically capturing protein separations and mass spectrometry.

5.2.3 Ontologies

GUS makes extensive use of ontologies to store concepts that cannot easily be represented in

a database schema. The MGED Ontology (MO) was described in the last chapter in terms

of its use in data standards, however it is also important in the database context. In RAD,


MO is used to populate data entry forms in the interface and we have followed this in the

design of RAPAD. This is particularly important for storing the characteristics of biological

samples and details of experimental protocols. The NCBI Taxonomy is stored in the SRes

part of GUS and can be referenced in RAD, and RAPAD, for storing the species of origin

of a sample.

5.3 Systems and Methods

5.3.1 Schema development

The database schema was created using a database design application (PowerDesigner

9™ [250]). The schema was developed manually, guided by information from the object

model, FGE-OM, rather than using an automatic conversion application. There are currently

no reliable methods of automating schema evolution, and all schema generation packages,

where a database is generated from an object model, assume that the database is newly cre-

ated. The database schemas for RAD [262] and PEDRo [242] were imported into the design

application and new tables were created as required, to store objects that were defined in

the Gla-PSI model. Section 5.4 describes the constituents of the database in detail but in

overview it has the following structure: tables covering an overview of the experiment, bio-

logical samples and protocols are all re-used from RAD. Other parts of GUS (SRes and Core

namespaces) have been installed alongside to capture data privacy, bibliographic references

and species data. Specific details of protein separation techniques and mass spectrometry

are stored in tables similar to the PEDRo specifications. Tables covering image analysis and

2-D gel data are derived from Gla-PSI. In the following chapter there is a description of

the integration of microarray and protein abundance data. The microarray data is stored

in a set of temporary tables within RAPAD rather than tables derived from RAD due to

time constraints, and because the aim of the work is to demonstrate that integration of results

is possible, not that microarray experiments can be stored in RAD, which has been well

documented in the past [302]. The complete RAPAD schema is displayed in Appendix C.

5.3.2 Interface development

The interface for loading data (the RAPAD Study-Annotator) has been developed from the

existing RAD Study-Annotator [202] and functions within a web browser. The query inter-

face (RAPAD Querier) has been created de novo, after consultation with researchers about


their requirements for publishing, visualising and querying data. A significant period of time

has been spent testing the interface with both real and artificially generated data by database

developers and bench researchers. Feedback arising from interviews with researchers has en-

abled improvements to the interface, such as providing help pages, and adding comments on

data entry forms to make the interface easier to use. The interface allows data and protocols

to be entered manually but an option also exists to load data about gel spots and protein

records in bulk. Two file formats have been specified for bulk loading, following consultation

with researchers and principal investigators in the Functional Genomics Facility at Glasgow

University.

5.3.3 Data integration

One of the main goals of developing RAPAD is to facilitate the integration across different

types of functional genomics data. The diagram in Figure 5.1 displays several different types

of experiment and the requirements for data integration. One of the goals of RAPAD is

to test whether core RAD tables can capture experimental protocols and sample tracking

information from other types of FG experiments. It is planned that the proteomics compo-

nent will become part of GUS to produce a system that is capable of integrating all the data

types shown in Figure 5.1. This section explores how integration can take

place in theory across a complete system for FG. In the following chapter a specific example

is given describing integration across microarrays and proteomics for a project supported by

the current implementation of RAPAD.

Database identifiers for proteins

The core data point in a proteomic investigation is an identified protein, which may have been

quantified, such as a volume ratio between different conditions produced by image analysis, or

by differences in fluorescence or radioactivity, as measured from a labelling experiment. The

database identifier of the protein will depend upon the organism being studied, but often the

identifier will point to a record in a sequence database, such as GenBank. For organisms with

incomplete genome sequences, mass spectrometry data may be searched against predicted

translations of the latest release of the genome data or EST databases. These data sets

contain only partial, or inaccurate protein sequences, and many sequences have no record in

GenBank. In these cases integrating data points is an even greater challenge.


Identifiers for microarray features

The data points in a microarray experiment are measured for every clone deposited on the

array, or oligonucleotide position, collectively known as the features. The type of identifier

given to an array feature depends on the type of microarray. The features on arrays produced

by cDNA deposition are identified by unique IDs supplied by the array manufacturer, which

often have entries in the manufacturer’s own database, and may be supplied with GenBank

identifiers of the cDNA or EST records from which the clones were produced. Affymetrix

arrays are supplied with their own unique identifiers of the features, and GenBank identifiers

can be obtained from the company’s web site using a software toolkit. The data values

associated with every clone are usually a single fluorescence measure, or a ratio of fluorescence

from scanning an array at two different wavelengths.

Matching different types of identifier

Immunohistochemistry data sets tend to be far smaller than microarrays or proteomics,

arising from antibodies raised against particular proteins that usually have a known entry

in GenBank. It is desirable that the data points from all experiments are related back to

the genome, gene and protein databases, allowing a user to search for particular genes, and

to discover studies in which modulated expression or localisation data exists. To realise this

goal there must be mapping across the identifiers from: i) protein sequences identified by

MS, ii) cDNA or EST sequences on microarrays, iii) protein records in immunohistochemistry

experiments, and iv) the DNA and protein records in GenBank, SWISS-PROT and other

major databases. Some protein records in GenBank have a link to the corresponding gene

record, however there is usually no direct link to the corresponding record for the cDNA

sequence that has been used on a microarray. The only robust method for mapping across

all identifiers is to perform sequence similarity searches.

The AllGenes web site provides access to the Database of Transcribed Sequences (DoTS)

at the University of Pennsylvania [10]. DoTS is a part of GUS and has predefined mappings

for human and mouse sequences for the different identifiers that exist for sequences from

microarrays, EST, cDNA and genome databases. In the following chapter, a process is

described outlining how microarray and proteomics data have been integrated using DoTS,

from studies of a parasitic infection of human cell culture. However, for organisms other

than human and mouse, the following algorithm is required to integrate data:

1. Create a new database of sequence clusters that will comprise entries containing clusters


of different database identifiers that correspond to individual gene sequences.

2. For every protein record matched by MS data, or identified by another method in a

proteomics experiment, obtain the protein sequence (for example, ABC).

3. Perform a sequence search with ABC against a translation of the most recent version

of the organism’s genome database. For sequences that match a gene exactly, create a

new database record for this cluster, including the DNA or protein sequence.

4. For sequences that do not match anything in a genome database, search against EST

or cDNA databases for very close or exact matches. Assign all the identifiers that can

be found to a new record in the cluster database.

5. Retrieve cDNA or clone sequence data from microarrays and search against the genome

database. For exact matches check to see if a corresponding entry exists in the cluster

database. If the entry exists, add the microarray clone ID to the record in the cluster

database, otherwise create a new entry.

6. Perform the same process for immunohistochemistry, or other FG data points, to

retrieve the most closely matching sequence entries, and add records to the cluster

database.

7. Integration occurs by performing queries over the cluster database to find proteins

and microarray clones that have been assigned to the same cluster. For these records,

quantitative data may be comparable if the starting samples from different experiments

have been treated in the same way. However, substantial statistical analysis is required

to determine the validity of correlating protein and mRNA abundance data [143]. This

is an issue which will require significant future efforts from statisticians working with

bench biologists.

8. An entry should be created for every known gene in the cluster database, containing

the identifier from GenBank, Swiss-Prot, PIR and the genome database specific to that

organism.

9. When a gene is highlighted by a user, there should be an option to perform a query

of the cluster database to find all the other sequence identifiers that the gene has been

assigned. A further query can then be performed to retrieve any FG experiments in

which modulated expression, localisation or interaction data exists.
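The steps above can be sketched in code. The following is a minimal illustration only: exact string comparison stands in for a real sequence similarity search, translation between DNA and protein sequences is ignored, and every identifier and sequence is invented for the example. The function names (`find_gene`, `build_clusters`, `integrate`) are hypothetical and are not part of RAPAD.

```python
from collections import defaultdict


def find_gene(sequence, genome):
    """Stand-in for a similarity search: return the ID of the gene whose
    sequence equals `sequence` exactly, or None if nothing matches."""
    for gene_id, gene_sequence in genome.items():
        if gene_sequence == sequence:
            return gene_id
    return None


def build_clusters(genome, records):
    """Group (source, identifier, sequence) records by the gene they match.

    Records with no genome match receive a cluster of their own, keyed by
    their sequence (compare step 4 of the algorithm).
    """
    clusters = defaultdict(set)
    for source, identifier, sequence in records:
        key = find_gene(sequence, genome) or ("unmatched", sequence)
        clusters[key].add((source, identifier))
    return clusters


def integrate(clusters, source_a, source_b):
    """Return sorted (id_a, id_b) pairs assigned to the same cluster (step 7)."""
    pairs = []
    for members in clusters.values():
        ids_a = [ident for source, ident in members if source == source_a]
        ids_b = [ident for source, ident in members if source == source_b]
        pairs.extend((a, b) for a in ids_a for b in ids_b)
    return sorted(pairs)


# Invented toy data: two genes, two proteins and two microarray clones.
GENOME = {"geneX": "ATGGCC", "geneY": "ATGTTT"}
RECORDS = [
    ("protein", "P1", "ATGGCC"),
    ("microarray", "clone7", "ATGGCC"),
    ("microarray", "clone9", "ATGTTT"),
    ("protein", "P2", "GGGCCC"),
]

clusters = build_clusters(GENOME, RECORDS)
pairs = integrate(clusters, "protein", "microarray")
```

In this toy data set only protein P1 and clone7 fall in the same cluster, so they are the only pair for which quantitative data could be compared. A real implementation would replace `find_gene` with a similarity search and a threshold for what counts as a very close match.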


The process described above allows different types of FG experiment to be integrated

at the level of individual gene and protein records, however, one problem that this does

not address is the challenge of different protein forms. Proteomics experiments can reveal

differentially modified proteins, different splice forms and protein complexes. In these cases

it possibly does not make sense to attempt to correlate results with quantitative changes at the

mRNA level, but qualitative results may be of interest. For example, if three spots appear

on gel A, produced by different phosphorylation states of protein X, and only one gel spot

appears on gel B, from a different condition, it would be difficult to attempt to correlate the

total difference in protein volume with changes in the amount of mRNA between condition A

and B as measured by microarrays. However, it may be interesting to note that microarray

analysis reveals up-regulation of gene X (in condition A), and 2-DE analysis reveals three

differentially modified forms.

The integration of different FG experiments is a major database challenge; however, it also

raises issues in data visualisation. If a set of good visualisation tools is created, coupled with

complex query facilities, a researcher can begin to build a global picture of gene and protein

regulation, and the changes that occur during disease. Visualisation issues are addressed in

the following section.

5.3.4 Visualisation

Proteomics data is visualised in RAPAD using several different methods. Information about

experimental protocols, samples and bibliographic references can be viewed in web pages

created dynamically within the RAPAD Study-Annotator. 2-DE data is viewed in a Java

Applet [168] that resides within a web browser. The Gel Viewer has controls for navigating

around a gel and zooming to unlimited resolution, and manages several gels simultaneously

using tabbed panes that the user can switch between. Individual records of mass

spectrometry data are viewed using the web interface supplied with the MASCOT software [207],

however certain parts of the results are also summarised in a tabular format within HTML

pages.

There are several issues with the visualisation of 2-DE data in the Gel Viewer. In theory

a Java Applet should load in any web browser that has Java installed, however this is not our

experience. The Gel Viewer is loaded within an HTML page that is created dynamically by

PHP code [246]. The PHP code writes out parameters encoding gel spots and proteins for

the Applet to read in, however there appears to be a flaw in the way in which different web


browsers load the Applet. When the number of parameters becomes large, for example with

four gels each with several hundred proteins, the Applet starts before all the parameters

have been read in, leading to missing spots, or gel images not displaying correctly. This

problem does not occur for Internet Explorer version 6 but is a major problem in Netscape

and Mozilla. There does not appear to be a simple solution to this bug. Therefore in the

future, the Gel Viewer may need to be re-coded, to enable greater flexibility with regard to

accessibility of the database.

There are several alternatives for developing applications that function within a web

browser, including Macromedia Flash [199], Javascript [171] and Scalable Vector Graphics

(SVG) [281]. However, each of these technologies has certain limitations for complex

applications, and there are few good examples of their use in the life sciences. SVG is useful for

drawing regular shapes and objects, such as graphs or diagrams, but is not suitable for

loading high resolution images such as 2-D gels. Javascript is used widely in web applications but

is not suitable for developing complex software. For example, Javascript could not be used

for zooming on a gel image without loading a new web page with each zoom factor, which

would be too slow. An example of an application developed using Flash is the Human-Mouse

Homology Map at the NCBI, which provides a visualisation of the homology between mouse

genes along a specified human chromosome, or vice versa [158]. For large chromosomes, the

visualisation of genes is difficult to read and the software runs too slowly, which may be

an implementation problem or a limitation of the technology. A preferable solution for the

continued development of the Gel Viewer may be to create a stand-alone desktop application

that database users must download, using a technology such as Java Web Start [170].

5.3.5 Unique identifiers

It is essential for archiving data that protein records in RAPAD are assigned a unique ID

that persists even if a new version of the database is created. Protein records in the current

implementation are assigned a sequential numerical ID that is managed by the RDMS. The

ID number can be used to query the database if it is prefixed with RPD, and suffixed with the

database version number (1). For example, record 101 can be queried via the web interface,

as long as it has been specified as a public record, using the string RPD101.1. This system

is not ideal from the point of view of security, and would be improved by creating a record

in the DatabaseEntry table for each protein that is publicly accessible, with an identifier

that is unrelated to the RDMS identifier.


There is an effort to create universal public identifiers in the IBM Life Science Identifier

(LSID) project [296]. The aim is to create identifiers that are globally unique and persistent,

therefore they can never be re-used and will outlive the objects that they identify. An

LSID is created by concatenating the web address of the organisation, the database name

followed by the type of identifier, the identifier itself and finally the database version number,

separated by colons. The examples below demonstrate how a uniform resource name (URN)

is formulated for three major databases.

URN:LSID:ebi.ac.uk:SWISS-PROT.accession:P34355:3

URN:LSID:rcsb.org:PDB:1D4X:22

URN:LSID:ncbi.nlm.nih.gov:GenBank.accession:NT_001063:

An equivalent LSID for RAPAD would be:

URN:LSID:brc.gla.ac.uk:RAPAD:RPD101:1

A foreseeable problem is that RAPAD does not currently have a permanent

home and the web address is likely to change. However, this problem can be avoided as

long as the Bioinformatics Research Centre in Glasgow (brc.gla.ac.uk) does not develop

an alternative database called RAPAD, which is unlikely. The LSID project can be easily

implemented if databases adhere to the guidelines and provide programmatic access to the

database, accepting the LSID of an object as a query string.
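As a rough illustration of the format described above, the sketch below builds and parses identifiers of the six-field form shown in the examples. It assumes that a trailing colon with an empty version is acceptable, as in the GenBank example; the `LSID` class, `parse_lsid` and `rapad_lsid` are hypothetical names and do not come from the LSID project or from RAPAD.

```python
from typing import NamedTuple


class LSID(NamedTuple):
    """The four variable fields of an LSID in the form used in this section."""
    authority: str     # web address of the organisation, e.g. "ebi.ac.uk"
    namespace: str     # database name, optionally with the identifier type
    object_id: str     # the identifier itself
    version: str = ""  # database version number; empty if not given

    def __str__(self):
        fields = ("URN", "LSID", self.authority, self.namespace,
                  self.object_id, self.version)
        return ":".join(fields)


def parse_lsid(text):
    """Split a colon-separated LSID string into its fields."""
    parts = text.split(":")
    if len(parts) != 6 or [p.upper() for p in parts[:2]] != ["URN", "LSID"]:
        raise ValueError(f"not an LSID of the expected form: {text!r}")
    return LSID(*parts[2:])


def rapad_lsid(query_string, authority="brc.gla.ac.uk"):
    """Convert a RAPAD query string such as 'RPD101.1' into an LSID."""
    record, _, version = query_string.partition(".")
    return LSID(authority, "RAPAD", record, version)
```

For instance, `str(rapad_lsid("RPD101.1"))` reproduces the RAPAD identifier given above, and `parse_lsid` recovers the accession P34355 and version 3 from the SWISS-PROT example.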

5.4 Implementation

RAPAD has been deployed in Oracle 9i [235] as part of a standard three-tier architecture

(Figure 5.3). A web interface has been created, written in the PHP language [246]. The

database schema is large (174 tables), and has the capability to store information from a

wide range of technologies. Therefore, web pages have been developed for data capture as

they are required by the users. In this section, an overview of each part of the database is

described, using examples of data capture in the Study-Annotator to illustrate graphically

how the database has been implemented.

A workflow is displayed in Figure 5.4 summarising the stages at which data is entered

by the user. There are several stages at which queries are made of the database to retrieve

terms from an ontology to populate drop-down boxes in the user interface, described in more

detail below.


[Figure 5.3 diagram; labels recovered from the figure. User Interface: Study-Annotator, Querier, Gel Viewer, MASCOT Results. Middleware: PHP (interface generation), Java (batch queries for specific investigations and Gel Viewer code), Perl (scripts supplied with MASCOT). Data Storage: Oracle database, image server, MASCOT server.]

Figure 5.3: The architecture of RAPAD.


[Figure 5.4 workflow diagram; labels recovered from the figure. Columns: User interaction, RAPAD Study-Annotator, Oracle database. Study-Annotator pages: Login Page; Study, contacts and references; BioSource and solubilisation protocol; 2-DE, image, scanning and image analysis; Visualise 2-DE in Gel Viewer. Database actions: check username and password; add details to DB; 1) query DB for ontology terms, 2) add details to DB; query for gel and spot details. User actions: data entry at each page, and bulk load in two files.]

Figure 5.4: The user interaction with RAPAD for entering a 2-DE experiment.


5.4.1 Data privacy

The first entry point for the RAPAD Study-Annotator requires users to log in, and select

their data privacy preferences (Figure 5.4). Essentially, this requires selecting the Project,

Group and Study settings. The Project setting specifies the database namespace in which the

data will be stored, which will be required when RAPAD is integrated with GUS. The value

of Project is set to “RAPAD” in the current implementation. The Group setting is the top

level for dividing researchers into different classifications. It is envisaged that each laboratory

will have its own Group value. The Study value is a further specification, and captures a

complete investigation, consisting of many different 2-DE experiments. For example, the

entire Trypanosoma brucei proteome investigation is currently captured as one study. All

tables in RAPAD have the following attributes, to ensure data integrity:

    Attribute               Null?      Type
    MODIFICATION_DATE       NOT NULL   DATE
    USER_READ               NOT NULL   NUMBER(1)
    USER_WRITE              NOT NULL   NUMBER(1)
    GROUP_READ              NOT NULL   NUMBER(1)
    GROUP_WRITE             NOT NULL   NUMBER(1)
    OTHER_READ              NOT NULL   NUMBER(1)
    OTHER_WRITE             NOT NULL   NUMBER(1)
    ROW_USER_ID             NOT NULL   NUMBER(12)
    ROW_GROUP_ID            NOT NULL   NUMBER(3)
    ROW_PROJECT_ID          NOT NULL   NUMBER(3)
    ROW_ALG_INVOCATION_ID   NOT NULL   NUMBER(12)

The attributes ROW_USER_ID, ROW_GROUP_ID and ROW_PROJECT_ID hold foreign keys

linking to the corresponding user, group and project records for every record that is

entered in the database. Additional tables exist for linking information to the Study in which

it belongs. Data security issues are discussed in more detail in Section 5.4.7.
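One way these attributes could be evaluated is sketched below. The precedence of the flags is not specified here, so the sketch assumes that access is granted if any applicable flag is set (other, then group, then owner); the `RowPermissions` class and `is_allowed` function are hypothetical and are not taken from the RAPAD code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowPermissions:
    """Privacy attributes carried by every row (names follow the listing above)."""
    row_user_id: int
    row_group_id: int
    user_read: int = 1
    user_write: int = 1
    group_read: int = 0
    group_write: int = 0
    other_read: int = 0
    other_write: int = 0


def is_allowed(row, user_id, group_id, action):
    """Return True if the user may perform `action` ("read" or "write") on the row.

    Assumption: a set flag at any applicable level grants access.
    """
    if action not in ("read", "write"):
        raise ValueError(f"unknown action: {action!r}")
    if getattr(row, f"other_{action}"):
        return True
    if group_id == row.row_group_id and getattr(row, f"group_{action}"):
        return True
    return user_id == row.row_user_id and bool(getattr(row, f"user_{action}"))


# A row owned by user 7 in group 2 that the whole group may read.
shared_row = RowPermissions(row_user_id=7, row_group_id=2, group_read=1)
```

With this row, a colleague in group 2 can read but not write, while a user from another group can do neither until OTHER_READ is set to 1.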

5.4.2 Studies, protocols and contact details

Bibliographic references, experimental protocols and contact details can be entered in

RAPAD, and are not linked to any particular study, allowing their re-use in many different

contexts. The web page for entering Protocol data (Figure 5.5) has drop-down menus for

selecting the type of protocol; options include nucleic acid extraction, protein solubilisation,

gel stain, and so on. These options are populated from the OntologyEntry table, and are

used for linking the protocol to the correct page in the Study-Annotator. For example, any

protocols entered with the option gel stain will appear as options for linking to a staining

protocol in the 2-DE Assay page of the interface.

Figure 5.5: The interface for entering protocol information into RAPAD.

A set of web pages exists for capturing the intention of the study as a textual description,

and a set of parameters can also be entered, with a different parameter value for each

experiment in the study. For example, in a time course experiment, samples from 1, 2, 4,

6, and 24 hours post infection are each analysed by 2-DE. This information can be captured

in RAPAD, linking the parameter to the 2-DE details, and in turn the 2-DE details

can be linked to a description of the protein sample (BioMaterial). The source of material

can be entered in RAPAD, linked to contact details for the provider of material, the

species of origin, type of material (e.g. DNA, protein, cells, generated from entries in the

OntologyEntry table), and a general description (stored in the table BioSource, Figure

5.6). A series of treatments can be applied to convert a source of material (BioSource) to

a substance (BioMaterial), such as a protein mixture, which can be linked to a 2-D gel

record. Alternatively, BioMaterial could store labelled mRNA that has been hybridised to

a microarray. Treatments correspond to basic laboratory procedures such as additions of

solutions, washes, incubations and many more, allowing a researcher to store a structured

definition of lab protocols, such as the extraction and solubilisation of proteins from cells.

These features have been inherited from RAD; however, additional tables have been added to

the database schema: StudyAssayProt, StudyDesignAssayProt and so on, for linking study

and biomaterial details to the corresponding proteomics experiment (table ProteomeAssay)

rather than a microarray (table Array, Figure 5.7).

Figure 5.6: A web page for specifying sources of biological materials.
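The relationship between a source of material, the treatments applied to it and the resulting substance can be pictured with a small data model. This is a simplified sketch, not the RAPAD schema: the class names echo the tables BioSource and BioMaterial, but the fields, the `Treatment` class and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class BioSource:
    """Origin of material, e.g. a cell culture (fields are illustrative)."""
    name: str
    species: str
    material_type: str  # in RAPAD this term would come from OntologyEntry


@dataclass(frozen=True)
class Treatment:
    """One basic laboratory step, such as a wash or an incubation."""
    order: int
    action: str


@dataclass
class BioMaterial:
    """A substance derived from a BioSource by an ordered series of treatments."""
    name: str
    source: BioSource
    treatments: List[Treatment] = field(default_factory=list)

    def protocol(self):
        """Return the treatments as a single string, in the order applied."""
        steps = sorted(self.treatments, key=lambda t: t.order)
        return " -> ".join(t.action for t in steps)


# Invented example: a protein mixture prepared from a cell culture.
cells = BioSource("host cell culture", "Homo sapiens", "cells")
lysate = BioMaterial(
    "solubilised protein mixture",
    cells,
    [Treatment(2, "incubate"), Treatment(1, "add lysis buffer"),
     Treatment(3, "centrifuge")],
)
```

Storing the steps as ordered records, rather than as free text, is what allows a protocol to be queried and re-used as structured data.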

5.4.3 Protein separations

RAPAD has capabilities to store information describing a series of protein separation

treatments (Figure 5.8), although the focus of the current implementation is 2-DE. Every

experiment type has an entry in a specific table (e.g. Gel2D, Gel1D or LCColumn) and an entry in a

generic table, BioAssayTreatment. BioAssayTreatment can be linked to a measured input

of a biological material, captured in AnalyteMeasurement and a view² (BioMaterial) on

the table BioMaterialImp. The output of each treatment produces a set of entries in specific

tables, such as PhysicalGelItem and Fraction, which are linked to BioMaterialImp,

enabling a series of treatments with specified inputs and outputs to be captured in a structured

format. The BioAssayTreatment table has a relationship to Protocol, which enables

additional protocol information to be attached to a technique, if the attributes specified in

the table specific to the technique do not cover what is required.

² A view in SQL is a single table that is derived from other tables. A view may not be physically stored in the relational schema but is a notation representing certain information that is frequently required [89].

[Figure 5.7 diagram; table names recovered from the figure. Proteomics: Study, StudyDesign, StudyFactor, StudyDesignAssayProt, StudyAssayProt, StudyFactorValueProt, ProteomeAssay, BioAssayTreatment, Gel2D. Microarrays: Study, StudyDesign, StudyFactor, StudyDesignAssay, StudyAssay, StudyFactorValue, Assay, Array.]

Figure 5.7: A summary of the database schema for storing information about the design of a study. Three RAD derived tables have been replicated in the RAPAD schema with changes to one relationship, referencing ProteomeAssay rather than Array. Each box represents a database relation (table) and arrows represent a relationship between two tables, such as Gel2D has a foreign key from BioAssayTreatment.

5.4.4 2-D gel data

The details about a 2-D gel are entered on the 2-DE Assay page (Figure 5.9). The parameters

of the gel are entered in the Gel2D table, and the table ProteomeAssay stores the name of

the experiment and a link to the experiment’s operator. ProteomeAssay is used to link

indirectly to protocols for the first and second dimension separation, protein solubilisation

and staining (all stored in the table Protocol). Following input of 2-DE data, scanning

information can be entered into the table ImageAcquistion, capturing: the type of scanner

used, the operator, the date, a protocol if required and any associated parameters with values.

Multiple scans can be entered, each associated with a particular channel or wavelength, which

can also be used to store a difference gel electrophoresis experiment, in which a single gel

is fluorescently labelled and scanned at two or three wavelengths. Each scan is assigned

a unique name that appears on the Gel Image Analysis page. On this page, the user can


[Figure 5.8 diagram; table names recovered from the figure: Gel2D, Gel1D, LCColumn, Fraction, PhysicalGelItem (link to image analysis data), BioAssayTreatment, AnalyteMeasurement, Protocol, BioMaterialImp. Relationship labels: Source, Product.]

Figure 5.8: The database schema for protein separation techniques and the relationships to the BioAssayTreatment table.

enter a protocol and name of the software used to analyse the gel image (inserted in the

GelImageAnalysis table). The image scan must also be associated with a gel image that is

stored on the file system, and the URI (Uniform Resource Identifier) of the file is updated

in the ImageAcquistion table. Two further pages exist for bulk loading data: gel spot files

and protein files. Spot data files contain lists of spot ID numbers, coordinates and volume

values (calculated by image analysis), which are stored in the tables IdentifiedSpot and

PhysicalGelItem. Each IdentifiedSpot record links to the image analysis that produced

it (in GelImageAnalysis). Data files can also be loaded that contain tab delimited data

about the proteins to which spots have been matched, including: the protein name, species,

MW (molecular weight), pI (isoelectric point), links to external databases, and a link to MS data on a

separate file server. The data is loaded in batches, linked to the correct spot using the table

AnalyteMeasurement, linked to BioMaterial and PhysicalGelItem (Figure 5.11).
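A spot data file of the kind described above could be read as follows. The exact file layout is not given here, so the sketch assumes a tab-delimited file with a header row and the column names `spot_id`, `x`, `y` and `volume`; these names, the `Spot` type and the example values are all hypothetical.

```python
import csv
import io
from typing import NamedTuple


class Spot(NamedTuple):
    """One row of a spot data file (column names are assumed)."""
    spot_id: int
    x: float
    y: float
    volume: float


def read_spot_file(handle):
    """Parse a tab-delimited spot file with a header row into Spot records."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        Spot(int(row["spot_id"]), float(row["x"]),
             float(row["y"]), float(row["volume"]))
        for row in reader
    ]


# Invented file content; a real file would be produced by image analysis software.
EXAMPLE_FILE = (
    "spot_id\tx\ty\tvolume\n"
    "1\t10.5\t20.0\t1500.0\n"
    "2\t33.0\t41.5\t320.0\n"
)

spots = read_spot_file(io.StringIO(EXAMPLE_FILE))
```

Each parsed record would then be written to the spot tables in a single batch, so that a failure part way through a file does not leave a gel half loaded.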

The schema design for this section is fairly complex (Figure 5.11), however this reflects

the nature of a proteomics experiment: a spot may be excised from a gel and could be used in

a number of different experiment types: MS, chromatography, or additional gel separations.

Therefore, an entry exists to model a gel spot as a physical entity (a BioMaterial), to enable

further treatments on the spot to be captured. A gel spot does not have an identifier until

the gel image has been analysed and spot data has been input. Therefore, to correctly specify

a gel spot, a record is required in the IdentifiedSpot table (from image analysis), in the


[Figure 5.9: screenshots with panels labelled 2-DE assay, Image acquisition and Image analysis.]

Figure 5.9: Screenshots for loading 2-DE, scanning and image analysis data into RAPAD. The scanner image is obtained from http://biology.berkeley.edu/EML/scanner.jpg, the image analysis software is a screenshot of DeCyder™ [74].

[Figure 5.10 diagram; table names recovered from the figure: ProteomeAssay, BioAssayTreatment, Gel2D, ImageAcquisition, Channel, GelImageAnalysis, IdentifiedSpot, PhysicalGelItem, DIGESingleSpot, MatchedSpots, MultipleAnalysis.]

Figure 5.10: The tables present in the database schema store data from gel spots, image analysis and the scanning of a 2-D gel. The database also records information about spots matched across a number of gels in MatchedSpots and MultipleAnalysis, and difference gel electrophoresis data in the table DIGESingleSpot.

[Figure 5.11 diagram; table names recovered from the figure: ProteinRecord, MassSpecExperiment, PeakList, DBSearch, ProteinHit, IdentifiedSpot, PhysicalGelItem, BioMaterial, BioAssayTreatment, AnalyteMeasurement. Relationship label: Direct link to top protein hit.]

Figure 5.11: The database schema for linking protein records to gel spots. A protein record is linked to the gel spot via the raw MS data and database searches that have been performed for identification. A direct link from the gel spot (PhysicalGelItem) to the protein record has also been implemented to enable fast queries.


PhysicalGelItem table (referring to the actual spot on the gel) and in the BioMaterial view

when required, to enable the gel spot to be linked to additional treatments in the database

(via BioAssayTreatment). If spots corresponding to the same protein have been matched

across gels, this information can be captured in the table MatchedSpots, and spots appear

with a different symbol in the Gel Viewer.

5.4.5 Mass spectrometry and external databases

Mass spectrometry data can be stored in tables derived from the PEDRo database schema.

The tables are linked to the rest of the schema via the BioAssayTreatment table (Figure 5.12).

BioAssayTreatment references a source of biological material, enabling MS data to be linked

to a protein sample arising from a series of separation techniques, which could be a gel spot.

However, in the current implementation only a URI is stored in the DBSearch table, linking

to the results of searches with MS data, generated using the MASCOT software. Certain

data are extracted automatically from the MS results using a script developed by Karl

Burgess (IBLS, University of Glasgow), and stored in the ProteinHit table, such as the

match score, e-value, the number of peptides hit in a sequence, and the sequence coverage³.

These factors enable the quality of match to be determined, allowing researchers to exclude

data from certain views in the interface, if the MS data does not conclusively identify a

protein. The table ProteinRecord stores properties of each protein in the database, such as

MW, pI, the protein’s name and a reference to the species of origin, stored in the SRes Taxon

table. The table ProteinRecordEntry links a record to external database entries, stored in

DatabaseEntry. DatabaseEntry captures the database accession number, and has three

external links to OntologyEntry in which the database name, database URI and database

version are captured. In this way, a protein record can be linked to any external database

required, as long as it is Internet accessible.
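Sequence coverage, one of the quality measures stored in ProteinHit, is the percentage of the protein sequence covered by the matched peptides. The sketch below shows one way to compute it, counting each residue once even where peptides overlap. It is an illustration only and is not the script used to extract values from MASCOT results; the function name and the example sequences are invented.

```python
def sequence_coverage(protein, peptides):
    """Return the percentage of residues in `protein` covered by `peptides`.

    Every occurrence of each peptide is marked, and overlapping peptides
    are counted once per residue.
    """
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        start = protein.find(peptide) if peptide else -1
        while start != -1:
            for position in range(start, start + len(peptide)):
                covered[position] = True
            start = protein.find(peptide, start + 1)
    return 100.0 * sum(covered) / len(protein)
```

For a ten-residue sequence `MKTAYIAKQR`, the peptides `MKT` and `AKQR` cover seven residues, giving a coverage of 70 per cent. A researcher could then exclude hits below a chosen coverage or score threshold from a view.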

5.4.6 RAPAD Querier

An important feature of a database system for functional genomics is the ability to perform

complex queries. The current RAPAD implementation includes a set of tools that enable

data to be visualised and queried, to support biological research. The use of the query

interface, the RAPAD Querier, is outlined in Chapters 6 and 7 with regard to two biological

investigations: the proteome of host cells when invaded with the parasite Toxoplasma gondii

³ Sequence coverage is the percentage of the protein sequence that is covered by the peptides that have been matched.

[Figure 5.12 diagram; labels recovered from the figure: BioAssayTreatment, PhysicalGelItem, MSExperiment, PeakList, Peak, ProteinHit, ProteinRecord, ProteinModification, with groups labelled Tables for protocol and Tables for database searches.]

Figure 5.12: The database schema for mass spectrometry, adapted from PEDRo.

and the determination of the proteome of Trypanosoma brucei. Specific features of the

interface have been geared towards providing the queries required by the two projects, to

solve specific goals. An overview of the main features of the RAPAD Querier is given in the

rest of this section.

There are several different methods for accessing data in RAPAD. Firstly, for researchers

annotating data in a particular study there is an option to load any of the gels in that study in

the Gel Viewer (Figure 5.13). Researchers can also perform a search to find all 2-D gels within

their Project-Group preference settings, within a particular study, performed by specific

operators, or containing a certain protein name. The Gel Viewer has been implemented as

a Java Applet [168], an application that runs within a web browser, thereby enabling any

users to view data without needing to install new software (except Java). The Gel Viewer

is capable of loading multiple gels simultaneously in different tab windows. Within the Gel

Viewer basic searches can be performed to find particular protein names, label a specific spot

by ID number, or highlight a set of proteins with a range of molecular weights or pI values.

Controls exist for moving around the gel and zooming on particular regions for highlighting

subtle differences in spot patterns between two or more gels. Once the Gel Viewer has been

loaded, there is a set of options for viewing data about a single gel: 1. Display All Spots,

2. Display All Proteins, 3. Search This Data, 4. Display Gel Details, 4. Show Gel Info, 5.

Show Microarray Data, and if two gels have been loaded 6. Show Matched Spots.

Figure 5.13: A screen shot of the 2-D Gel Viewer that provides search capabilities over protein data and links to MS results. There is a feature for loading multiple gels in different tabbed windows, for example for comparing gels for samples for different conditions.

Figure 5.14: A form for entering annotation about a gel spot and linking to protein records. Links are provided for adding data about protein modifications and updating the protein details.

Figure 5.15: A table displaying all the proteins identified on a single gel.

1. There is an option to view a table, created dynamically in HTML, showing all the discrete

spots that have been identified on a gel. Hyperlinks exist for each spot ID number which

load the specific record about each spot (Figure 5.14), which enable additional annotation to

be entered, and for linking a gel spot to protein data, such as MS information. There is also

a page for entering the type, location and description of post-translational modifications.

2. Similar output is provided displaying only the gel spots that have been matched to protein

records (Figure 5.15).

3. An option is given for loading an HTML form that enables searches to be performed over

a data set arising from a single gel. Search criteria include approximate matches to multiple

protein names entered, ranges of values for molecular weight, pI, and statistics from MS

data about the quality of a match (Figure 5.16). Boolean “AND” or “OR” searches can be

performed, and the resulting data can be ordered by any of the above criteria. The results

of a search are displayed in a table on a web page, with links to the source data, and an

option exists for highlighting the spots found by a search in the Gel Viewer.

4. Clicking the Display Gel Details button loads a page displaying the parameters and

protocol employed for the gel. There are links to separate protocols for the first and second

dimension separation, staining, and protein solubilisation. If the gel has been linked to

information about a biological sample (BioSample) or source of material (BioSource) in

RAPAD, this information is displayed, along with the protocol for gel image scanning and

gel image analysis.

Figure 5.16: The query interface for searching for specific protein records.

5. If the gel has been associated with a microarray study, a table can be loaded displaying

all the proteins on the gel, alongside the microarray expression values for the corresponding

gene. This feature is illustrated in the following chapter.

6. RAPAD has options for loading information about spots on different gels that correspond

to the same protein. Clicking the Show Matched Spots button loads an HTML page display-

ing all the proteins on the two gels. Spots that match across the two gels are highlighted

in bold, and if spot volume information has been entered, the ratio of volumes is displayed,

corresponding to an approximation of the change in expression of the protein between the

two conditions.
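The search described in option 3 above combines several criteria with a Boolean operator and orders the results. A minimal sketch of that logic is given below, assuming a case-insensitive substring test as a stand-in for approximate name matching. The `ProteinSpot` type, the helper functions and the example values are hypothetical; in RAPAD the equivalent query would be issued against the database.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ProteinSpot:
    name: str
    mw: float     # molecular weight (kDa assumed)
    pi: float     # isoelectric point
    score: float  # MS match score


Criterion = Callable[[ProteinSpot], bool]


def name_contains(fragment: str) -> Criterion:
    """Case-insensitive substring match, standing in for approximate matching."""
    return lambda spot: fragment.lower() in spot.name.lower()


def in_range(attribute: str, low: float, high: float) -> Criterion:
    """Match spots whose numeric attribute lies within [low, high]."""
    return lambda spot: low <= getattr(spot, attribute) <= high


def search(spots: Iterable[ProteinSpot], criteria: List[Criterion],
           mode: str = "AND", order_by: str = "name") -> List[ProteinSpot]:
    """Return spots meeting all ("AND") or any ("OR") criteria, sorted."""
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    combine = all if mode == "AND" else any
    hits = [spot for spot in spots if combine(test(spot) for test in criteria)]
    return sorted(hits, key=lambda spot: getattr(spot, order_by))


# Invented example data for a single gel.
SPOTS = [
    ProteinSpot("tubulin", 50.0, 4.8, 85.0),
    ProteinSpot("actin", 42.0, 5.3, 120.0),
    ProteinSpot("heat shock protein 70", 70.0, 5.5, 40.0),
]
```

For example, `search(SPOTS, [name_contains("in"), in_range("mw", 40, 55)])` returns actin and tubulin, since the third protein fails the molecular weight criterion.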

An important feature is the ability to summarise all the data within a study, especially if it

results from proteins identified on a number of different gels. An option exists to classify gels

within a study into two groups, for example one set of gels from “disease” samples, versus a

set of “normal” samples. The proteins identified in the two groups appear in separate tables,

with links to the source protein records, and an option to load the Gel Viewer highlighting

selected spots on the gel.


5.4.7 Public data access

The standard interface contains pages displaying protein spot records that can be updated,

intended for researchers to modify and insert new data as required. Clearly, this system

is not suitable for external access, even if updates could only be performed by researchers

with a specific login, because it would be difficult to ensure that data was always secure.

Therefore, a separate interface has been created allowing anyone to view publicly

accessible data in RAPAD, which only has views of the data, with no facilities for updating.

Data can be accessed in this interface through a page that displays all the public studies

in RAPAD, giving the option to load particular gels. A query page is also available to

search for particular proteins identified on any gel in the public system. The page displaying

protein records in RAPAD can be queried by a web link, thus providing basic

programmatic access. The following URL can be used to link to any record on the public system

(http://balabio.dcs.gla.ac.uk/jonesa/RAPAD/ProteinView.php?Query=RPD123), whereby

RPD123 is a unique identifier assigned to each protein. This system enables other databases

to link to protein records in RAPAD. This feature will be especially important for proteins

identified by MS for which there is no annotation in public databases, or the protein is only

annotated as “hypothetical”. In effect, the MS data proves that a protein is expressed under

the particular sample conditions.
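
As an illustration of the link-based access described above, the following Python sketch builds the URL for a given protein identifier. The base address and the Query parameter are taken from the URL quoted in the text; the function name is hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

# Base address and parameter name as quoted in the text above.
BASE_URL = "http://balabio.dcs.gla.ac.uk/jonesa/RAPAD/ProteinView.php"


def protein_view_url(identifier: str) -> str:
    """Build a link to the public protein record with the given identifier."""
    return f"{BASE_URL}?{urlencode({'Query': identifier})}"


url = protein_view_url("RPD123")
```

An external database could store only the identifier and generate the link in this way whenever a record is displayed.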

It is essential that only data intended for public access can be viewed through this inter-

face in RAPAD. This is ensured because of the design of tables inherited from RAD, whereby

every record across the entire schema is assigned data privacy settings. Every table

has a set of permissions that specify which individuals can view a particular piece of data:

the researcher who entered the data (USER_READ), members of the group (GROUP_READ)

and anyone (OTHER_READ). The group level setting can be used for releasing data to a set of

different laboratories without making data publicly available, for example to allow

collaborators in a different location to view or update records. If researchers wish to make their

data publicly accessible, every record in the study has the attribute OTHER_READ changed

from 0 to 1. Therefore, when any web page is accessed through the public data interface, a

simple check is performed to ensure that no protein data will be accessed where OTHER_READ

= 0. Similar attributes exist in every table for ensuring that data can only be changed by

certain individuals (write access). At the present time, the studies supported by RAPAD

have not been published, therefore the interface for making data publicly accessible has only

been tested with artificial data.
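
The public-access check can be sketched with a small in-memory database. This is an assumption-laden illustration, not the RAPAD implementation: it uses Python's sqlite3 module in place of the production DBMS, a hypothetical table named protein, and a column named other_read standing in for the public-read attribute described above. The identifiers and study names are invented.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a hypothetical protein table carrying a public-read flag."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein ("
        "protein_id TEXT PRIMARY KEY, study_id TEXT, "
        "other_read INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?)",
        [("RPD1", "S1", 1), ("RPD2", "S2", 0), ("RPD3", "S2", 0)],
    )
    return conn


def public_proteins(conn: sqlite3.Connection) -> list:
    """Return only the identifiers of records flagged as publicly readable."""
    rows = conn.execute(
        "SELECT protein_id FROM protein WHERE other_read = 1 ORDER BY protein_id"
    )
    return [row[0] for row in rows]


def publish_study(conn: sqlite3.Connection, study_id: str) -> None:
    """Make every record of one study public by setting the flag from 0 to 1."""
    conn.execute("UPDATE protein SET other_read = 1 WHERE study_id = ?", (study_id,))


conn = build_demo_db()
before = public_proteins(conn)   # only the record from study S1
publish_study(conn, "S2")
after = public_proteins(conn)    # all three records
```

Placing the condition in the query itself means that a page in the public interface never receives a non-public row, rather than receiving it and hiding it afterwards.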


5.4.8 Ontologies

In the previous chapter, the importance of developing ontologies to support the develop-

ment of standard exchange formats was outlined. In this section, the implementation of

ontologies within RAPAD is addressed. The OntologyEntry table in RAPAD stores a flat

representation of the MGED Ontology (MO) [211] (used for specifying protocols, characteristics

of biomaterials and many other parts of the analysis), following the design of RAD. The

OntologyEntry schema is as follows (data security attributes not shown):

Name                                      Null?    Type
----------------------------------------- -------- ----------------------------
ONTOLOGY_ENTRY_ID                         NOT NULL NUMBER(10)
PARENT_ID                                          NUMBER(10)
TABLE_ID                                           NUMBER(8)
ROW_ID                                             NUMBER(12)
EXTERNAL_DATABASE_RELEASE_ID                       NUMBER(10)
SOURCE_ID                                          VARCHAR2(100)
URI                                                VARCHAR2(500)
NAME                                               VARCHAR2(100)
CATEGORY                                  NOT NULL VARCHAR2(100)
VALUE                                     NOT NULL VARCHAR2(100)
DEFINITION                                         VARCHAR2(500)

The attribute CATEGORY captures the type of term: ProtocolType, DevelopmentalStage,

DataType and so on. An example would be:

• CATEGORY = ProtocolType

• VALUE = nucleic acid extraction

• DEFINITION = "The procedure of extracting nucleic acid from the

biomaterial"

In RAPAD, additional entries have been included in the OntologyEntry table to cover

properties of proteins, such as types of chemical modifications. The storage of post-translational

modification (PTM) data is an important feature of RAPAD, which for instance may be

generated from tandem MS or from a phosphate labelling experiment. The type of PTM,


such as glycosylation, phosphorylation or biotinylation is obtained from the OntologyEntry

table. This has two clear advantages: firstly to reduce manual entry, as terms do not have

to be typed in each time, but are selected from a drop-down menu; secondly, errors and

imprecision should be reduced if the term is presented to the user with a clear definition,

ensuring that there is a shared understanding of exactly what is being specified. It would

not be possible to design an ontology capable of capturing all terms used in any type of

study. The approach taken in RAD is that users can enter new terms when required, once the

terms have been checked by a member of MGED. A similar feature has been implemented in RAPAD,

whereby new terms can be added to the OntologyEntry table by contacting the author.

Terms are annotated as “user defined” along with a URI specifying the source of the term

and a definition to ensure that the origin of the term is clear.
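
The way in which a flat table of terms can drive a drop-down menu is sketched below in Python. Only the CATEGORY, VALUE, DEFINITION and URI attributes and the ProtocolType example come from the schema above; the category name ModificationType, the definitions of the modification terms, the example URI and the user_defined flag are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyEntry:
    category: str
    value: str
    definition: str
    uri: str = ""
    user_defined: bool = False  # hypothetical flag for terms added on request


entries = [
    OntologyEntry(
        "ProtocolType",
        "nucleic acid extraction",
        "The procedure of extracting nucleic acid from the biomaterial",
    ),
    # The entries below use an assumed category name and illustrative definitions.
    OntologyEntry("ModificationType", "phosphorylation",
                  "Illustrative definition: addition of a phosphate group"),
    OntologyEntry("ModificationType", "glycosylation",
                  "Illustrative definition: addition of a carbohydrate group"),
    OntologyEntry("ModificationType", "biotinylation",
                  "Illustrative definition: attachment of biotin",
                  uri="http://example.org/terms/biotinylation",
                  user_defined=True),
]


def menu_options(all_entries, category):
    """Return the sorted term values offered in the menu for one category."""
    return sorted(e.value for e in all_entries if e.category == category)


ptm_menu = menu_options(entries, "ModificationType")
```

Because the user selects a value rather than typing it, each stored annotation is guaranteed to match one row of the table, together with its definition.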

A number of terms describing proteins and proteomics experiments have been added

to the OntologyEntry table during the development of RAPAD. It is important that this

controlled vocabulary is made available to others developing similar systems. The PSI is

developing an ontology (PSI-Ont) as an extension to MO, covering protein terms, and ul-

timately will provide a repository where developers can obtain and add new terms used in

proteomics studies. The vocabulary developed for RAPAD will contribute to PSI-Ont.

A separate part of GUS, known as SRes, stores phenotype information such as disease

states, bibliographic references and taxonomy information. SRes has been installed alongside

RAPAD, and stores a flat representation of parts of the NCBI taxonomy [224], which is in

effect an ontology of species. This means that the names of species are captured in a

controlled way, which facilitates database queries.

5.5 Discussion

The RAPAD system was developed with several main aims: to support the local proteomic

research requirement, to test the extension of RAD into proteomics as a prototype of a future

public repository for proteomics, to assess if FGE-OM correctly models the data semantics

and to test facilities for correlating changes in protein abundance with gene expression values.

In this section, the progress towards these goals is discussed.

5.5.1 A prototype of a central repository

RAPAD has been developed on top of RAD, which is a well established system grounded in

a significant amount of database research. RAD has robust facilities for storing structured


descriptions of biological samples and experimental protocols, and uses ontologies to create a

standard representation of certain concepts. Protocols stored in this way can be queried more

easily than a free text description, and this opens the possibility for data mining in the future.

RAPAD makes use of the features from RAD that ensure data integrity and security, with

facilities for tracking which individuals have entered data, and restricting access to certain

information where necessary. The successful implementation of a proteomics database using

core RAD tables also demonstrates that parts of the schema could be used for other types

of functional genomics study, such as immunohistochemistry.

RAPAD has been tested by the developers of GUS. The developers have taken the

database schema and interface code, and work is underway to add the proteomics

component to GUS. The addition of proteomics support in GUS will be a major advance for

web sites such as PlasmoDB, which provides access to FG data for Plasmodium falciparum,

the causative agent of malaria. Large volumes of proteome data are being produced for P.

falciparum [110, 229] but there is currently no method for publicly releasing the material in

a format that can be queried, and it cannot be integrated with microarray or genomic data.

One of the goals of developing RAPAD was to build a prototype of a public proteomics

repository. The proteome extension of GUS is underway, utilising the RAPAD database

schema and interface code, demonstrating that the prototyping stage has been successful.

5.5.2 The relationship between FGE-OM and RAPAD

The object model specified in the previous chapter is a proposal for a data standard. However,

a specification expressed solely in UML cannot be used to test if the concepts of the domain

have been correctly modelled, or if real data can be captured in practice. One of the functions

of RAPAD is to demonstrate that real data can be captured by our proposal. We must first

establish the correspondence between FGE-OM and RAPAD, because the database schema

was not created automatically from the object model. Figure 5.2 displays the names of

classes in FGE-OM and tables in RAPAD that cover the same parts of the domain. The

attributes for the majority of tables are identical or very similar to those belonging to classes

in FGE-OM (the database schema and additional diagrams of FGE-OM are displayed in the

appendices). The BioOM and ArrayOM namespaces in FGE-OM contain classes derived

from MAGE-OM. The relationship between these classes and tables in RAD (now inherited

in RAPAD) has been established previously, and software is in development for automatically

converting between MAGE-OM and RAD [202]. Many of the tables in RAPAD that store


proteome data are derived from the PEDRo database schema, and the PEDRo schema and

object model are virtually identical. Therefore, the parts of ProteomicsOM that are derived

from PEDRo are highly similar to the corresponding part of RAPAD. Finally, tables have

been created in RAPAD that exactly correspond with the parts of FGE-OM that are derived

from Gla-PSI. The overall result is that FGE-OM and RAPAD have a very similar structure,

and therefore it is reasonable to state that by illustrating the use of RAPAD in a real research

environment, it is demonstrated that FGE-OM correctly models proteome workflows. The

integration of gene and protein expression results was one of the main goals of developing

FGE-OM, and this functionality is demonstrated in RAPAD in the following chapter.

5.5.3 Support for current proteome studies

A second goal of developing RAPAD was to produce a system capable of supporting on-going

proteomics research, because the currently available databases do not offer all the facilities

that are required. The following two chapters describe projects that are supported by the

current implementation; however, in this section a brief description of the main advantages

of RAPAD is given.

The database allows a structured description of experimental protocols and biological

samples to be specified using ontologies. This should improve the capabilities for querying

in the future as data sets become large. This feature is not included in SWISS-2DPAGE

and GELBANK, which only offer fairly simple descriptions of protocols. The data security

features inherited from RAD also provide a simple mechanism for allowing particular re-

searchers or groups to access or modify information in the database. This feature is vital for

large organisations in which many different levels of security could be required.

Data security models

In modern database management systems (DBMS) there are two broad approaches used to

ensure data security: discretionary and mandatory [66]. The security policy can be enforced

at various levels, such as over the entire database, on particular relations, or down to the level

of a single attribute of one row of data. The discretionary approach gives particular rights

to a specific user on different objects in the database, and different users may have different

rights on the same object. Therefore, this model is very flexible but it has a large overhead

if security settings have to be checked for many different objects and users. The alternative

security approach is the mandatory scheme in which certain database objects are assigned


a particular classification, and users are given a clearance level that specifies which data are

accessible or can be modified. The mandatory approach is used in situations where data

fall into particular levels of accessibility, such as government or military databases where

controlling data access is of utmost importance. Security settings can be managed by the

security subsystem of the DBMS, and encoded as a set of rules that must be checked every

time an object is accessed or modified.

In RAPAD, a security system closer to the discretionary approach is employed at the level

of individual rows of data (tuples). However, there are currently no formal rules specified in

the DBMS; instead, checks are made by the user interface to ensure that data has been

specified as publicly accessible, or can be modified by a certain user and so on. The attributes that

specify the security setting exist for every tuple in the database. This approach is possibly

not as robust as having security rules set in the DBMS, but this would require a permanent

database administrator to update the rules with every new user or group that utilises the

database. The approach taken in RAPAD should in theory be more robust than ensuring

data security only at the level of the user interface. Additionally, the security settings can

be updated automatically without requiring a permanent database administrator.
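
A tuple-level read check of the kind performed by the interface could take the following form. This Python sketch is illustrative only: the owner and group fields, the function name and the example users are hypothetical, and the three flags stand in for the user, group and public read settings described earlier.

```python
from dataclasses import dataclass


@dataclass
class Row:
    owner: str        # hypothetical: the researcher who entered the tuple
    group: str        # hypothetical: the group the tuple belongs to
    user_read: bool
    group_read: bool
    other_read: bool


def can_read(row, user=None, user_groups=frozenset()):
    """Decide whether a tuple may be shown to the requesting user.

    An anonymous visitor (user=None) sees only publicly readable tuples.
    """
    if row.other_read:
        return True
    if user is None:
        return False
    if row.group_read and row.group in user_groups:
        return True
    return row.user_read and user == row.owner


row = Row(owner="alice", group="lab-a",
          user_read=True, group_read=True, other_read=False)
```

Since the settings are stored with each tuple, adding a new user or group changes only the data, and no rule held in the DBMS has to be edited.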

Query capabilities

RAPAD has a query system that enables users to generate fairly complex queries to find

particular proteins in a study. The details of MS search results are stored which enable

the quality of a match to a protein to be determined, for example allowing a researcher

to exclude particular proteins that are only weakly identified. The results of a search over

different gels can be displayed in the Gel Viewer, which can load several gels simultaneously

for comparing the proteomes of different samples. The Gel Viewer has other features that are

advantageous compared with other databases, such as the facilities to zoom to an unlimited

depth to visualise small spots. The same region can be highlighted on a different gel to find

differences in the pattern of spots. The Gel Viewer can also display the name of proteins,

and the predicted pI and MW, which can be toggled on or off. There are capabilities that

enable researchers to search for possible post-translational modifications. These features are

exemplified in Chapters 6 and 7.


Integration of gene and protein experiments

In the introduction, it was hypothesised that extending a database schema and graphical

user interface intended for microarray experiments into proteomics, would facilitate the in-

tegration of data across the two domains. In the following chapter, there is a description of

how the results can be integrated by matching the identifiers associated with gene expression

values to the identifiers for protein abundance. However, this is only part of the process.

An advantage of our approach is that biological samples and experimental protocols can be

entered into RAPAD, and are not linked to a particular experiment, but can be used in

any context. For this reason, a sample could be described a single time, using ontologies to

record the type of material, the source (company, organisation, contact details and so on),

and the species of origin. The sample description could then be associated with a microarray

hybridization, 2-DE, or an LC-MS analysis. When a large number of studies of this type

have been entered in the system, the RAPAD Querier will be capable of retrieving all the

experiments that have been performed on a particular sample. Therefore, integration occurs

at the level of results, as described in the following chapter, and at the level of the biological

samples and experimental protocols.
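
Integration at the level of samples can be pictured as a simple index from a sample description to the analyses that used it. The Python sketch below is a hypothetical illustration, not the RAPAD Querier; the sample identifiers and records are invented.

```python
from collections import defaultdict

# Hypothetical records: (sample identifier, type of analysis performed on it)
analyses = [
    ("sample-1", "microarray hybridization"),
    ("sample-1", "2-DE"),
    ("sample-2", "LC-MS"),
    ("sample-1", "LC-MS"),
]


def experiments_by_sample(records):
    """Group analysis types under the single sample description they refer to."""
    index = defaultdict(list)
    for sample_id, analysis_type in records:
        index[sample_id].append(analysis_type)
    return dict(index)


by_sample = experiments_by_sample(analyses)
```

The grouping is only possible because each sample is described once and referenced by every analysis, rather than being re-described within each experiment.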

Availability of RAPAD

The database schema, the RAPAD Study-Annotator and the code for the Gel Viewer are

all freely available for download on the web site. Therefore, other developers can install

RAPAD locally to manage their own proteomics data, and there should not be a significant

overhead installing the current version. However, the current version has not undergone

several rounds of testing and therefore may require some modification or bug fixes once

implemented elsewhere.

Features of RAPAD demonstrate the feasibility of integrating proteomics and microar-

ray data in a single system (a specific example of this facility is described in the following

chapter). At present there are no well publicised systems offering this facility. The CEBS

SysBio system [355] hopes to offer similar capabilities in the future for data mining across

a range of experiment types, but a working prototype is not currently available. An inte-

grated database enables researchers to begin asking questions about the correlation between

gene expression and protein abundance at the global level. It is also thought that post-

translational modifications are important for protein function, and their relationship with

gene expression and protein abundance values has not previously been investigated. It is also


likely that a proteome database could discover instances where proteins display modulated

regulation, which would not be observed at the transcriptome level.

5.5.4 Future developments

The current implementation of RAPAD supports proteomics research and can store microar-

ray data. It has been demonstrated that experimental hypotheses, biological samples and

protocols can be stored in common tables, regardless of whether a microarray or proteome

experiment has been performed. RAPAD could therefore be extended to cover metabolomics

experiments, given that metabolome data comprise column separations and mass spectrom-

etry. This would allow for integration across the transcriptome, proteome and metabolome,

giving a broad view of the biological system to the researcher. A future version of the

database could also incorporate a number of features that will improve facilities for data

mining. A number of links to external databases are already provided but this could be ex-

tended. For example, proteins that have a 3-D structure could be displayed using structure

visualisation software, such as RasMol [280] or Chime [216]. For certain studies it would also

be useful to correlate protein abundance with chromosomal location; this could be achieved

using the Expressionview software, which can display a microarray data set, and visualise

the position of genes on the chromosomes [109]. The relationship between chromosomal loca-

tion and gene expression is particularly important for bacterial studies because sets of genes

are often co-expressed from operons, and the genes within one operon often have related

functions.

Functional classification of genes and proteins in RAPAD is provided through dynamic

links to the Gene Ontology (GO); however, a great variety of new software is currently in

development by a number of groups for summarising and correlating functional categories

with expression values. Additional software for querying and summarising GO, such as

GoMiner [368] (described in Chapter 2), will be installed alongside RAPAD when it becomes

available.

The current RAPAD implementation does not provide support for any detailed statisti-

cal analysis of data sets. The R software has a programmable interface that allows direct

connection to relational databases [261]. Therefore, pre-defined packages can be used to

search for significant differences in protein volumes, and correlations between gene and

protein abundances. New packages can also be written in R for normalising across mRNA and

protein volume data, and for mining data to search for patterns of co-regulation. These


features would enable protein abundance data to be queried in parallel with gene expression

studies, functional classifications and 3-D structures to improve the facilities for knowledge

acquisition. This kind of statistical analysis requires large data sets from gel electrophoresis,

and more research is required into the accuracy of relative protein volume between two or

more gels, detected by image analysis applications.
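
To make the proposed correlation analysis concrete, the sketch below computes a Pearson correlation coefficient between paired gene expression and protein abundance values. It is written in Python as a stand-in for the R packages mentioned above, and the log ratios are invented values chosen so that the two series are perfectly correlated.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log2 ratios (infected / non-infected) for four matched genes.
gene_log2 = [1.0, 2.0, -1.0, 0.5]
protein_log2 = [2.0, 4.0, -2.0, 1.0]   # exactly twice the gene values

r = pearson(gene_log2, protein_log2)
```

With real data the coefficient would be interpreted with caution, given the uncertainty in relative protein volumes between gels noted above.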

5.6 Conclusions

RAPAD supports proteomics research, and comprises a relational database with a web based

interface, which has been created by extending existing technologies. The system uses on-

tologies to capture knowledge in a standardised, controlled manner. This demonstrates that

re-using and integrating existing systems can facilitate integration of different types of data,

and that the time to develop a large system is significantly reduced, compared with develop-

ing de novo. The implementation also acts as a prototype for a major, public repository for

proteomics, which is currently in development. In the following two chapters, two specific

projects are described that allow the core features of RAPAD to be evaluated. The results

will illustrate how the software has enabled researchers to improve annotation of their data,

and formulate queries that facilitate new biological discoveries.

Chapter 6

Database support for proteomic

studies of host-parasite interactions

6.1 Introduction

The RAPAD system was created to test the feasibility of extending an established microarray

database into proteomics, as a step towards creating a single, integrated database for func-

tional genomics. In this chapter, an example is given of a project that is supported by the

current implementation of RAPAD, including an outline of how facilities of the database have

been specifically tailored for making new discoveries in this area. The biological investigation

aims to characterise changes in the proteome of host cells when invaded with the intracellular

protozoan parasite Toxoplasma gondii, from an in vitro culture. This chapter outlines how

the data from this investigation allows the core facilities of RAPAD to be evaluated. A de-

scription is given of additional software that has been developed for: (i) the visualisation of

proteins with modulated expression, (ii) the integration of the proteomics data with previ-

ously published microarray studies and (iii) the discovery of post-translational modifications.

The results enable researchers to formulate hypotheses about the biological processes that

occur during parasite invasion, and gain a better understanding of host-parasite relationships

in general.

6.1.1 Host-parasite interactions

The species Toxoplasma gondii, along with the other closely related parasites Plasmodium,

Cryptosporidium and Eimeria, pose major global problems to human and animal health.

Genome projects are well underway, and functional genomics investigations are being used to

elucidate the biological processes involved in the infectivity of the parasites [6]. Toxoplasma

is used as a model organism for studying related parasites because it is relatively easy and

safe to culture in vitro, can invade animal models or host cell cultures, and possesses many

of the characteristics of its phylum, Apicomplexa [186]. T. gondii can infect a remarkably

wide range of hosts, including birds, livestock, humans and even oceanic mammals, such as

whales. The parasite is found in almost all geographic regions and infects 10-30% of human

populations [49]. Infection occurs after ingestion of oocysts from the faeces of cats, the

definitive host, or from tissue cysts in infected, undercooked meat. In the majority of cases,

T. gondii forms cysts in the deep tissues, including the brain, where it maintains a life-long

chronic infection. Toxoplasma induces disease in certain cases: (i) the parasite can cross the

placenta to the foetus, causing congenital defects or abortion; (ii) T. gondii can also be fatal

in immuno-compromised patients, for example in individuals with AIDS. It is believed that

the tissue cysts rupture, enabling the parasite to switch from the latent form (bradyzoite)

to a rapidly dividing form (tachyzoite), killing host cells. Therefore, one of the areas for

further investigation is to discover the factors that cause a switch between the two forms

[198]. Substantial work has also been carried out to identify the parasite and host proteins

that are critical for infectivity, and to elucidate the pathways in which they function. It

is believed that the method of invasion is conserved across the Apicomplexa, therefore any

discoveries made in T. gondii could have far reaching consequences.

The parasite invades by the following mechanism (reviewed by Sibley 2004 [291]). Toxo-

plasma releases molecules (adhesins) that attach to surface receptors on host cells. The para-

site actively penetrates the membrane, enclosing itself within a vacuole (the parasitophorous

vacuole) that is primarily formed from the host’s cell membrane, thereby reducing the ability

of the host to recognise and reject the parasite. The parasite releases the contents of a set of

organelles into the cytosol, including rhoptries that are crucial for parasite infectivity [29].

Rhoptries release a set of proteins that cause the parasitophorous vacuole to interact with

host cell mitochondria and endoplasmic reticulum, allowing the parasite to scavenge glucose

and cholesterol. An understanding of the proteins and pathways involved in infectivity has

been developed over several decades using classical techniques, such as gene knockout experi-

ments [185], but developments in technology have now opened up the possibility of analysing

the systems on a much larger scale.

6.1.2 Genomic investigation of Toxoplasma

The genome of T. gondii is currently being sequenced [187] and access to large EST databases

has been available for several years [321]. Therefore, T. gondii can now be investigated using


functional genomics techniques, allowing researchers to gain a wider view of the systems

involved with infectivity than previously possible. The genome is 80Mb (Megabases) in

size, contains 11 chromosomes and, as of early 2004, there is a ten times coverage of the

sequence [320], created using the “shotgun” approach [52]. Many genes have little or no

functional assignment, therefore any studies that provide insights into gene function will aid

the annotation efforts. A previous study investigated the constituents of the proteome of the

tachyzoite (rapidly dividing stage) of T. gondii by two dimensional gel electrophoresis (2-DE)

[61]. The study discovered that the same proteins appear in a number of positions on a single

gel, indicating that differential splicing of gene products, or post-translational modifications

are common. A separate investigation into the proteomics of Toxoplasma demonstrated a

protocol for a 2-D gel map of the tachyzoite stage [79]. Microarray studies have also been

carried out by Gail [116] and Blader [35], discussed below.

6.1.3 Microarray analysis

A more detailed understanding of the function of proteins from Toxoplasma, and the com-

plex networks of interacting proteins, will greatly facilitate the search for new drug targets.

However, researchers also wish to focus on how the parasite interacts with host cells, and

what changes occur in the functioning of the host cells. Microarray studies by Blader and

colleagues [35] determined the genes that are significantly up or down-regulated in a host

cell culture (Human Foreskin Fibroblasts, HFF) when invaded by the parasite, compared

with non-infected host cells, at a number of time points after invasion (1-24 hours post in-

fection). Several groups of genes displaying modulated expression were defined, leading to

hypotheses about the mechanisms of parasite invasion, and the recruitment of host processes

for its own survival. It is believed that the parasites arrest the cell cycle to enable them to

continue utilising host resources as long as possible. An important mechanism for host cell

defence against parasites and viruses is the apoptosis cascade, which causes host cells to die,

thus preventing further development of the intracellular pathogen. Evidence suggests that

Toxoplasma switches on a number of host genes that inhibit and prevent the propagation

of the apoptosis cascade [292]. The microarray results revealed down-regulation of genes

implicated in mitosis and meiosis (cell cycle processes), apoptosis genes, and cytoskeletal

proteins. The role of calcium dependent signalling during parasite invasion has also been

studied in detail (reviewed by Arrizabalaga and Boothroyd [17]). Some evidence suggests

that Toxoplasma utilises its own calcium dependent pathways, unlike other parasites that


sequester host pathways, therefore one area of study is to determine if there are also changes

in the host genes implicated in these processes. Blader also discovered up-regulation of

genes involved in glycolysis and cholesterol synthesis for energy generation. Infection by

the parasite is resisted by host cells, and therefore it is expected that an up-regulation of

genes involved in the immune response would be observed. In the microarray study, an early

up-regulation of these genes was observed, at one hour post-infection.

A later study by de Avalos, Blader and colleagues [73] performed a microarray experi-

ment, similar to Blader 2001, on the related organism Trypanosoma cruzi. T. cruzi is also

an intracellular pathogen believed to invade by a similar mechanism. The results indicated

that very few host genes were up-regulated early in infection, unlike the T. gondii data,

and across the whole data set, the correspondence in up-regulated genes between T. gondii

and T. cruzi was very low. This has important implications for general understanding of

host-parasite interactions. It has previously been thought that the response of host cells

to invasion by a parasite would be the same, or similar, regardless of the type of parasite.

However, the comparison of the T. gondii and T. cruzi data suggests that there may be

different mechanisms used by host cells to respond to invasion by different parasites. The

consequence of this finding is that drug development should be targeted towards disrupting

very specific processes for specific parasites, rather than targeting a single set of processes

to prevent invasion by any kind of parasite. It is important that the host responses to a

number of parasites are studied in more detail to elucidate the mechanisms involved.

6.1.4 Support for proteome studies

RAPAD is supporting a project from the laboratory of Jonathan Wastling in the Institute

of Biomedical and Life Sciences at the University of Glasgow. The investigations were

performed by Morag Nelson, a PhD student, as part of a project to investigate the changes

in the proteome of mammalian host cells when invaded with T. gondii, compared with non-

infected host cells, at 24 hours post infection. The investigation uses 2-DE for protein

separation, coupled with MS (mass spectrometry) for protein identification. The specific

aims of the biological investigation are as follows. Firstly, to verify if changes observed at

the transcriptional level (by microarray analysis) are confirmed by changes in the amount of

protein produced. Secondly, it is believed that because proteins are the functional unit in the

cell, protein abundance is a better indicator of functional significance than gene expression

values. Therefore, new groups of proteins could be discovered with modulated expression,


which were not found by microarray analysis, leading to the formation of novel hypotheses.

A third aim is to investigate what role post-translational modifications (PTMs) might play

in parasite infectivity.

The experiments present considerable computational challenges that enable the evalu-

ation of the core facilities of RAPAD in three key areas: managing large volumes of data

across replicates, enabling complex queries, and visualisation of results to allow new findings

to be derived. In this chapter, we report on additional work by the author to develop specific

queries and visualisation software, in order to enable differential expression of proteins to

be detected across two conditions from a number of replicate gels. Facilities have also been

developed to integrate microarray data points with the corresponding proteins identified by

MS, and to support the storage and querying of PTMs, in conjunction with gene expres-

sion and protein abundance data. The integration of transcriptome and proteome data may

answer several questions:

• The interval between changes in gene expression and protein abundance. If

genes are up-regulated immediately after infection, when are changes observed in the

level of protein?

• Translational control: are there groups of proteins with modulated expression that

were not associated with a change in gene expression?

• Post-translational modification: do groups of proteins undergo changes in modifi-

cation status that are functionally significant, where there is no change in the rate of

transcription?

6.1.5 Project status

The current status of the biological study is as follows. 14 gels produced by 2-DE, from seven

infections with T. gondii and seven non-infected cell lines, have been loaded into RAPAD.

From the gels, approximately 350 differentially expressed spots have been identified. There

are 130 distinct proteins out of the 350, because some proteins appear in multiple copies in

different places on the 2-D gels, and in some cases the same protein has been identified on

replicate gels. Currently, about 40 protein spots (14 distinct) have been matched to the

corresponding microarray clone, although it is expected that this number will increase as the

number of protein records in RAPAD increases (discussed in Section 6.4).

The rest of the chapter is structured as follows. Section 6.2 briefly describes the biological


methodology, how differential protein expression data is visualised, how microarray data

points are matched to protein records, and the techniques employed to assign, display and

summarise functional classification of proteins. An overview of the results is given in Section

6.3, focusing on how RAPAD has supported the generation of new hypotheses. Discussion

is provided in Section 6.4.

6.2 Methods

The source of biological material for the investigations was a human foreskin fibroblast

(HFF) cell line, which was prepared and infected with Toxoplasma gondii, using a protocol

reproduced from the microarray study by Blader et al. 2001 [35]. This should ensure that

the proteome data from these studies are, as far as possible, comparable with the earlier

microarray analysis. The experimental protocols for protein solubilisation, the IPG strip

(first dimension separation), the gel electrophoresis stage (second dimension) and staining

are stored in RAPAD. Eleven biological replicates (infected versus non-infected, 22 gels)

were performed but most of the examples given are from a single replicate (replicate 11).

Coomassie blue stain was used to visualise proteins, gels were scanned with a standard

laboratory scanner and images were analysed using the ImageMaster 2D Elite software [162].

The matching of spots on two different gels (pairwise between replicates) was performed by

the 2D Elite software, which also measured the spot volumes. Differential expression of

proteins was determined as follows. Spots with a volume difference of greater than 30% were

picked for MS analysis, as were spots that were present on one gel and not on the other, as determined

by manual inspection. The gels were normalised to background on a per spot basis, after

background subtraction had taken place, using the method “normalisation at lowest on

boundary”. The spot coordinates and volumes determined by 2D Elite were imported into

RAPAD. Samples were sent for MALDI-TOF (Matrix Assisted Laser Desorption Ionisation

- Time of Flight) analysis and identifications were made using the MASCOT software [207].

The samples that did not produce a significant protein identification were analysed using a

tandem MS system (AB Q-Star Pulsar).
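
The spot selection rule described above can be sketched in a few lines of Python. This is an illustration only, not the procedure carried out in the 2D Elite software: the function name, the use of None to encode a spot that is absent from a gel, and the choice to measure the difference relative to the smaller of the two volumes are all assumptions made for the example.

```python
from typing import Optional

THRESHOLD = 0.30  # "volume difference of greater than 30%" (from the text)


def pick_for_ms(vol_infected: Optional[float],
                vol_control: Optional[float],
                threshold: float = THRESHOLD) -> bool:
    """Return True if a matched spot pair would be picked for MS analysis.

    None stands for a spot that is absent from that gel (assumed encoding).
    """
    if vol_infected is None and vol_control is None:
        return False  # no spot on either gel, nothing to analyse
    if vol_infected is None or vol_control is None:
        return True   # present on one gel and not on the other
    smaller = min(vol_infected, vol_control)
    larger = max(vol_infected, vol_control)
    if smaller <= 0:
        return larger > 0
    # Relative difference measured against the smaller volume (assumption).
    return (larger - smaller) / smaller > threshold
```

Under these assumptions, a pair with volumes 1.4 and 1.0 would be picked, a pair with volumes 1.2 and 1.0 would not, and a spot seen on only one gel would always be picked.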

The contribution of the author was: (i) to develop the core RAPAD system, as described

in the previous chapter; (ii) to create additional displays of differential expression (Section

6.2.1); (iii) to write software for matching gene expression values to protein abundance data

(Section 6.2.2); and (iv) to develop scripts to retrieve identifiers that enable hyperlinks to be

created from RAPAD to external software, in order to provide a summary of the functional


classification of each protein in this specific study (Section 6.2.3).

6.2.1 Display of protein data from different gels

The previous chapter described facilities in the RAPAD Gel Viewer. The Gel Viewer enables

multiple gels to be loaded simultaneously, the results of searches to be viewed, and offers the

display of the predicted charge and molecular weight of the proteins. These features allow

a researcher to search for PTMs and analyse the proteins that have been identified in the

study. For the Toxoplasma investigation an additional interface was created to improve the

visualisation of proteins that are differentially expressed on 2-D gels. The interface addresses

the problem of spot interpretation where certain proteins appear in several different positions

on individual gels, corresponding to particular PTMs or differentially spliced forms of the

protein. A series of gels have been performed from replicate samples, in which there may be

supporting or contradictory evidence, and this information must be assimilated. The first

goal was to develop additional software to aid researchers to define the spots on different

gels that correspond to the same protein.

EXAMPLE: Spots matching protein XYZ1 appear ten times on gels from infected samples

(across replicates), and three times from non-infected samples (exemplified in the Results

section, Figure 6.4). A visualisation has been created that shows the exact regions in which

XYZ1 appears on the different gels, to enable the researcher to say how many different

forms of XYZ1 exist in total, and which different forms are up or down-regulated. It may

be the case that the three spots containing XYZ1 from the non-infected sample correspond

to a particularly modified form of the protein that has the same abundance in infected and

non-infected samples. However, the additional spots from infected samples correspond to a

different form of XYZ1, which is produced in greater abundance, and is crucial for parasite

infectivity. The visualisation system displays which forms of the protein are up-regulated,

down-regulated or have stable abundance.

A query has been developed in RAPAD that returns a page that lists the proteins that

have been identified across all replicates. The researcher selects the proteins they wish to

investigate and the Gel Viewer opens, highlighting the proteins selected on all replicates,

with each replicate gel loaded in a separate tabbed window of the viewer. The researcher

can zoom on the proteins and note the ID numbers of spots in the same position. The ID

numbers of spots in the same positions are manually entered into a text file by the researcher,


and it is loaded into the database (into the tables MatchedSpots and MultipleAnalysis).

This allows a spot set to be defined that corresponds to the same form of the same protein on

different gels. After a spot set has been defined, a second interface displays the spot sets and

the volume of individual spots on different gels, if these values have been entered in RAPAD,

to display which spots appear in greater or lesser volume. This should allow the researcher

to define a particular variant of a protein (one spot set) as up or down-regulated during

infection. Section 6.3.1 gives an example of how the software has been used in practice to

identify differentially expressed proteins in the biological investigation.
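
The comparison made for a single spot set can be sketched as follows. The record layout is hypothetical (the text names the tables MatchedSpots and MultipleAnalysis but does not give their columns), and summarising each condition by its mean volume is an assumption made for illustration rather than a description of the RAPAD interface.

```python
from statistics import mean
from typing import Iterable, Optional, Tuple

# One member of a spot set: (condition, replicate, spot_id, volume).
# volume is None when no value has been entered (assumed encoding).
Member = Tuple[str, int, int, Optional[float]]


def spot_set_ratio(members: Iterable[Member]) -> Optional[float]:
    """Mean infected volume divided by mean non-infected volume.

    Returns None when either condition has no recorded volume, or when
    the non-infected mean is zero, because no ratio can then be given.
    """
    rows = list(members)
    infected = [v for cond, _, _, v in rows
                if cond == "infected" and v is not None]
    control = [v for cond, _, _, v in rows
               if cond == "non-infected" and v is not None]
    if not infected or not control:
        return None
    control_mean = mean(control)
    if control_mean == 0:
        return None
    return mean(infected) / control_mean
```

A ratio above 1 would suggest that this particular form of the protein is more abundant in infected samples, and a ratio below 1 that it is less abundant. A spot set with volumes on only one condition returns None and would need to be assessed by presence or absence instead.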

6.2.2 Comparison of protein and gene expression data

The experimental protocol for infecting an HFF cell line with T. gondii for the proteome

study was reproduced from Blader’s study, as detailed above. It should therefore be possible

to compare the expression of a gene measured by a microarray with the

protein abundance value obtained in this study. The microarrays of Blader were created

according to a standard protocol, and are supplied with an identifier of the cDNA clone

(example IMAGE:123456) and of the GenBank cDNA record. The cDNA record does not

share an identifier with either the protein record returned by MASCOT [207] (the software

used to identify proteins following MS), or with the corresponding nucleotide record found by

following a link from the protein record. Therefore, performing matching between microarray

clones and protein sequences is not a trivial task.

An initial attempt to find corresponding gene and protein records used pattern matching

over the names of the microarray features and the protein names, expressed as an SQL query,

in the following way. A query is deployed to match the first word of both the microarray

clone name, and the protein name. A list of exceptions is generated where the first word

occurs frequently and is not informative, such as “hypothetical” or “protein”. In these cases,

other words in the protein name are analysed to find matches. A list of potential matches is

supplied to the user, and sensible matches are returned in only approximately 50% of cases

because the following problems arise:

• If synonyms exist for gene names, one name may be used for the cDNA clone and

another for the protein.

• Certain words occur frequently in gene names which cause incorrect matches to be

found, such as “alpha” or “beta”.


Figure 6.1: The process of matching microarray data to protein abundance data. [Flow diagram. Inputs: 2-D gel data and mass spectrometry data held in RAPAD, giving a list of protein IDs, and a list of cDNA clone IDs from the microarrays. Database labels shown: GenBank, Swiss-Prot, PIR, PDB. Steps: retrieve GenBank nucleotide IDs using BioJava; retrieve the DoTS ID number for each sequence from DoTS at AllGenes.org; store a local copy of the DoTS ID for each sequence; store the mapping from DoTS ID to DoTS gene record; retrieve the DoTS gene number for every microarray result and for every protein; join query. Output: microarray gene name | gene expression value | protein name | protein volume.]


• Gene families exist with a number of closely related entries, such as Tropomyosin 2,3

and 4 which have closely related sequences, therefore microarray clones or protein

records may have been annotated incorrectly, or a search with the MS data may return

the incorrect entry.

• More generally, annotation in the databases is prone to inaccuracy and is being con-

stantly refined.
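
The name-based matching described above, and the reason it fails, can be illustrated with a short Python sketch. The thesis expresses the matching as an SQL query that is not reproduced in the text, so this is a simplified stand-in: the function names are hypothetical, and skipping uninformative words before comparing the first remaining word is an assumption about how the exception list might be applied.

```python
# Uninformative first words; the two entries are the examples given in the text.
UNINFORMATIVE = {"hypothetical", "protein"}


def informative_words(name, stop_words=UNINFORMATIVE):
    """Lower-case a name and drop words that carry no identifying information."""
    return [w for w in name.lower().split() if w not in stop_words]


def candidate_matches(clone_names, protein_names):
    """Pair clone and protein names whose first informative word agrees."""
    index = {}
    for protein in protein_names:
        words = informative_words(protein)
        if words:
            index.setdefault(words[0], []).append(protein)
    pairs = []
    for clone in clone_names:
        words = informative_words(clone)
        if words:
            for protein in index.get(words[0], []):
                pairs.append((clone, protein))
    return pairs
```

With this rule a clone named "Tropomyosin 2" is paired with a protein named "Tropomyosin 4", which is exactly the kind of gene-family false positive listed above, while a synonym pair with no shared first word is missed entirely.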

Further improvements to the algorithm for matching names would improve specificity but

it would be very difficult to engineer a robust method that would succeed in all situations.

Therefore, a different approach has been implemented in RAPAD using AllGenes [10]. All-

Genes is a web site that provides access to the Database of Transcribed Sequences (DoTS)

that collects all the different identifiers that a particular sequence (cDNA, mRNA, DNA)

could be assigned, which correspond to the same underlying gene. For example, the gene:

“heterogeneous nuclear ribonucleoprotein F” has a GenBank record for the protein sequence

(gi|4826760), nucleotide record (NM_004966), a microarray specific ID (IMAGE:345833), and

the corresponding cDNA GenBank ID (W72693). The DNA, cDNA and microarray identi-

fiers are each assigned a DoTS number, and collections of DoTS entries that correspond to

the same underlying object (gene) are assigned a single DoTS gene number (DG.36388269).

DoTS entries have been created by performing sequence similarity searches, and assembling

clusters of sequences that correspond to the same object. A significant number of DoTS

entries have been manually curated.

The following series of actions is used to match protein records to microarray clones

(summarised in Figure 6.1).

1. RAPAD stores a URL referencing a web page on an external server for visualising

MASCOT results. A script retrieves GenBank protein IDs from the web page.

2. Protein records are retrieved from GenBank using the API (Application Programming

Interface) provided by BioJava [34]. Many GenBank protein records have a link to

the corresponding nucleotide record under the data type: DB Source, except for cases

where the protein sequence originated from a 3-D structure, or a database other than

GenBank, such as Swiss-Prot. In these cases, the nucleotide record must be found by

following a series of links manually (approximately 10% of proteins), or performing a

sequence similarity search on the GenBank nucleotide database.


3. The DoTS web site allows programmatic access for single entries, and has batch capa-

bilities, but does not currently scale up for accepting very large numbers of identifiers.

Therefore, the DoTS database has been downloaded in flat files, and the UNIX grep

utility was used to search the files for the DoTS identifiers for GenBank nucleotide

records (found automatically from MASCOT or found manually) and cDNA records

(from the microarrays).

4. DoTS identifiers for microarray clones or proteins are stored in a newly created table

in RAPAD. A mapping from all DoTS identifiers to the corresponding DoTS genes is

stored in a table in RAPAD that can be queried when required.

5. An SQL query finds DoTS numbers for every protein, and retrieves the corresponding

DoTS gene number. A search is performed to find any microarray features that have

a DoTS number that has been mapped to the same DoTS gene ID.
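
Steps 4 and 5 amount to a join through the DoTS gene number, which can be sketched with an in-memory SQLite database. The table and column names below are invented for the example (the text does not give the RAPAD schema), and the protein is mapped directly to a DoTS identifier rather than via its nucleotide record, to keep the sketch short. The accessions are the heterogeneous nuclear ribonucleoprotein F example quoted earlier; "dots-A" and "dots-B" are placeholders, not real DoTS numbers.

```python
import sqlite3

# Hypothetical schema: names are assumptions, not the RAPAD table definitions.
SCHEMA = """
CREATE TABLE protein_dots (protein_id TEXT, dots_id TEXT);
CREATE TABLE clone_dots   (clone_id   TEXT, dots_id TEXT);
CREATE TABLE dots_gene    (dots_id    TEXT, dots_gene_id TEXT);
"""

# Protein -> DoTS ID -> DoTS gene <- DoTS ID <- microarray clone.
MATCH_QUERY = """
SELECT p.protein_id, c.clone_id, gp.dots_gene_id
FROM protein_dots AS p
JOIN dots_gene  AS gp ON gp.dots_id = p.dots_id
JOIN dots_gene  AS gc ON gc.dots_gene_id = gp.dots_gene_id
JOIN clone_dots AS c  ON c.dots_id = gc.dots_id
ORDER BY p.protein_id, c.clone_id
"""


def build_demo_db():
    """Create an in-memory database holding one protein and one clone."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO protein_dots VALUES ('gi|4826760', 'dots-A')")
    conn.execute("INSERT INTO clone_dots VALUES ('IMAGE:345833', 'dots-B')")
    conn.executemany(
        "INSERT INTO dots_gene VALUES (?, ?)",
        [("dots-A", "DG.36388269"), ("dots-B", "DG.36388269")],
    )
    return conn


def match_proteins_to_clones(conn):
    """Return (protein_id, clone_id, dots_gene_id) for every shared gene."""
    return conn.execute(MATCH_QUERY).fetchall()
```

The protein and the clone carry different DoTS identifiers, so they are linked only because both identifiers map to the same DoTS gene number. This is the property that the name-based approach lacked.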

The results of matching protein data to microarray results are displayed in the RAPAD

interface in a table, showing properties of the protein with links to the full protein record.

The microarray results from the different time points are displayed alongside. If the protein

has been matched across the two gels (infected and non-infected in this case), and volume

measures have been found for the two gel spots, the ratio in protein volume is displayed

alongside the microarray results. When large datasets are assembled it should be possible

to determine the correlation between gene expression and protein abundance for a series of

time points. This will enable the lag between the up-regulation of a gene and the production

of new protein to be calculated on a large scale.

6.2.3 Functional classification of proteins

Proteomics experiments generate large quantities of complex data, therefore analysis is re-

quired that can provide summaries, to generate a better understanding of the whole system.

The biological investigation reported in this chapter is analysing the changes that occur in

the human proteome, and there are a great number of resources available for characterising

human proteins. One example is the Gene Ontology (GO) project [126] (described in Chap-

ter 2), which has assembled a large amount of information about the function of proteins. In

RAPAD, GO ID numbers are stored for all proteins identified in this study and hyperlinks

have been created to the AmiGO browser [12]. AmiGO graphically highlights the position of

the term, and has controls for traversing up and down the GO tree, enabling the researcher


Figure 6.2: Output from GoMiner, displaying the GO tree browser open for the gene Tropomyosin 1.

to view the hierarchical classification of a gene (or protein). However, this system is not ideal

for a large collection of proteins because the knowledge about function must be manually

assembled by browsing, and is difficult to summarise because it is difficult to know from

which depth of the tree to store functional information. For certain proteins, the lowest

depth may provide useful annotation, but in other cases a more general classification (higher

up the hierarchy) may be more informative. Therefore, additional tools have been used that

summarise GO classifications: GoMiner [368] and FatiGO [7].

GoMiner accepts a list of gene symbols1 from one or two experiments, and displays

summaries of where genes have been found in the hierarchy. GoMiner also displays which

branches of GO are linked to genes that are up or down-regulated with statistics (described in

more detail in Chapter 2). For example, if three genes involved with cytoskeletal development

are up-regulated and one is down-regulated, this result would be displayed graphically, with

a statistic indicating that, for this set of conditions, cytoskeletal proteins tend to be up-

regulated (Figure 6.2).

FatiGO provides access to GO over the Internet, and has similar goals to GoMiner.

1 A gene symbol is an official annotation for every human gene from the Human Genome Organisation (HUGO) [156]. Example: the gene actin beta has the gene symbol ACTB.


Figure 6.3: Output from FatiGO showing the classification of up and down-regulated proteins in the Biological Process branch of GO at a depth of 3, the third lowest (Query = infected cells, Reference = non-infected cells).


FatiGO provides summaries of where up and down-regulated genes appear in GO. FatiGO

accepts lists of gene symbols that have been highlighted from two experiments, and allows

the user to select the depth of the hierarchy and which branch of the three classifications

in GO to display. A visual summary of results is displayed with p-values to indicate the

significance of the association between one of the two conditions in the experiment and a

particular branch of GO (Figure 6.3).
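
To make concrete what such a p-value measures, the sketch below computes a two-sided Fisher's exact test for a 2x2 table of counts (genes annotated to a GO term or not, in the query list or the reference list). The text states only that p-values are displayed; the choice of Fisher's exact test here is an assumption made for illustration and is not a description of the internals of either tool. The function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a = query genes annotated to the term,     b = query genes not annotated,
    c = reference genes annotated to the term, d = reference genes not annotated.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x annotated genes in the query list.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_obs * (1 + 1e-9))
```

For small lists the test has little power: a table of 3 and 1 against 1 and 3 gives a p-value of 34/70, roughly 0.49, which would not be regarded as significant.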

FatiGO and GoMiner can also be used to classify proteins instead of genes, but both tools

require gene symbols as input rather than GO identifiers or GenBank accession numbers.

A set of scripts were developed by the author to retrieve the gene symbols from GenBank

nucleotide records for all the proteins highlighted in this investigation. The gene symbols are

stored in RAPAD, and are also used to create web links to the Ensembl genome browser [58]

for visualising the chromosomal location of the gene, as well as linking to GenAtlas [121] and

GeneCards [268]. GenAtlas and GeneCards summarise information about the function of

genes, display the intron/exon structure, provide physical maps showing other genes in the

localised region, give expression values in different human tissues, and display the domains

of the protein.

6.3 Results

The introduction outlined several key changes that are thought to occur in host cells when

invaded with Toxoplasma gondii. The proteome project had several major hypotheses to test,

which required significant database support. In this section, an outline of the results of the

analysis is given in four areas: the display of differentially expressed proteins, software that

aids the functional annotation of proteins, the integration with microarray results and the

search for post-translational modifications. The purpose of this chapter is to focus on how

RAPAD has facilitated these processes for the experiments with Toxoplasma, using several

examples of proteins highlighted by the study which may have a role in the infectivity of

the parasite. The proteome investigation is still continuing, and a complete report of the

biological results is beyond the scope of this work.

6.3.1 Visualisation of differential expression

The development of software for the visualisation of spots on different gels corresponding to

the same protein was described in Section 6.2.1. In this section an example is given of the

usage of the software, in the context of the T. gondii infection data.


Figure 6.4: The interface for viewing spots across replicate gels. A table displays proteins ordered by name, allowing the researcher to select entries that have been identified as the same protein across different replicates, in this case ACTB. The Gel Viewer opens, highlighting the proteins in different windows to allow the researcher to assess which spots correspond to each other on different gels. A polygon has been overlaid to demonstrate that spots 42 and 41 from non-infected replicate 1 appear to correspond with spots 29 and 27 from non-infected replicate 3. Gel images courtesy of M. Nelson. (Labels in the image: spot IDs 42 and 41; spots 29 and 27; spots 25 and 24.)


The process is demonstrated in Figure 6.4 for six spots containing the protein ACTB,

which appears in 26 spots in total. In this example, there is a cluster of four spots matched

to ACTB on one gel, and two spots on a replicate gel. The corresponding region has been

highlighted for the two gels, and a polygon has been drawn2 to demonstrate that spots 41

and 42 from non-infected replicate 1 correspond with spots 27 and 29 from non-infected

replicate 3. In this example, spot 42 (replicate 1) and spot 29 (rep. 3) form one spot set 3

and spot 41 (rep. 1) and spot 27 (rep. 3) form a different spot set. The region can then be

compared on gels from infected samples to see if this particular form of the protein, in this

exact position, is up or down-regulated.

The Gel Viewer, in combination with the RAPAD query system, allows differential ex-

pression of proteins to be visualised. An additional view of the data has been created which

will allow the results to be made public when the study is published in a journal. A total of

130 differentially expressed proteins have been identified by the researcher, which are stored

in RAPAD. Figure 6.5 displays the interface for viewing data that is combined across repli-

cates. The data can be viewed in a table that provides links to the individual protein records,

and enables any number of proteins to be selected and opened within the Gel Viewer. There

are facilities for investigating the function of the proteins, addressed in the following section.

6.3.2 Functional annotation of proteins

The software described in Section 6.2.1 facilitates the determination of a set of proteins that

show changed expression between infected and non-infected host cells. Each protein record

has links to a number of external databases: GenBank displays the nucleotide and protein

sequence; Harvester [33], GenAtlas, and GeneCards summarise a large amount of information

that has previously been assembled for each entry; and Ensembl enables a researcher to

visualise the chromosomal location of a gene. A link to the Gene Ontology record for the

protein is also provided, allowing the researcher to build a complex picture of the function

of each protein. RAPAD includes an option for annotating a protein spot with a textual

description, thereby allowing new findings, that have been derived from external sources, to

be recorded in the database.

Proteins with modulated expression in this study could potentially fall into three cate-

gories:

2 The polygon was created manually by the author to clarify which spots correspond to each other across the two gels.

3 A spot set is defined as a group of spots in the same position on different gels, corresponding to a specific isoform of the protein.


Figure 6.5: The interface for displaying data combined across replicates. The top image displays the option for assigning groups of gels to two different conditions (infected versus non-infected). The lower image shows the table of proteins that have been identified in each group of gels.


• Host proteins actively up or down-regulated by the parasite, required for invasion or

maintaining infection.

• Proteins expressed by host cells in an attempt to resist parasite infectivity.

• Proteins with altered expression, caused indirectly, as a result of other proteins being

up or down-regulated.

It is therefore important, when analysing changes to the host proteome, to consider whether

or not the change is facilitating parasite infectivity, as this has major consequences for the

interpretation placed on the result.

Example: Differential expression of Cathepsin B

One of the proteins found to be differentially expressed by the researchers was the protein

Cathepsin B, which cleaves proteins, transforming them from their initially translated form

(the prepro protein) into the functional form. Previous studies have suggested that Cathepsin

B from T. gondii is required for infectivity and rhoptry protein processing [260]. The pro-

teome studies described here, along with the previous microarray experiments, suggest that

human Cathepsin B is down-regulated during infection. A study by Que et al. in 2002 [260]

demonstrated that inhibition of Toxoplasma Cathepsin B prevented the parasite from infect-

ing cells, and was therefore a potential drug target. The study by Que also demonstrated

a significant sequence and structural similarity between human and Toxoplasma Cathepsin

proteins. Therefore, the finding that human Cathepsin B is down-regulated during infection

raises the possibility that human Cathepsin interferes with correct processing of Toxoplasma

proteins. If this proved to be correct, induction of expression of human Cathepsin could

prove to be an inhibitor of Toxoplasma infectivity. However, the situation is more complex

because human Cathepsin has also been implicated in the apoptosis pathway [139], and

one of the critical factors enabling a parasite to maintain infection is inhibition of apop-

tosis. Therefore, Toxoplasma may cause the down-regulation of Cathepsin to prevent the

cell entering apoptosis. This demonstrates that there is a significant information retrieval

task required to understand the results after particular proteins have been highlighted. The

interface provided by RAPAD allows the researcher to assimilate the results from past experi-

ments rapidly, via other Internet accessible resources (Figure 6.6), and record the information

within the database.


Figure 6.6: The protein record for Cathepsin B in RAPAD has external links to AmiGO [12], GenBank [30] and GeneCards [268].


Summary of biological results

Since the results of the investigation with T. gondii will be published by Dr Wastling and

Morag Nelson at a later date, a complete description of the results of the biological inves-

tigation is outside the scope of this work. When the results are ready for publication, the

RAPAD interface will provide public access to the data, as described in Section 6.3.5.

Cathepsin B, described above, is one of many proteins found to have modulated expres-

sion during parasite infectivity, which demonstrates the effectiveness of the experimental

approach adopted by Dr Wastling and the software developed in this investigation. Initial

results from the proteomics investigation have discovered down-regulation of proteins in-

volved in the formation of the cytoskeleton, as expected due to the ability of the parasite

to halt new cell growth and cell division. Other proteins implicated in apoptosis, such as

cytochrome c, are also down-regulated, and there is an up-regulation of proteins involved in

the host’s response to stress. The following section describes work by the author to match

the protein abundance values from this study, to gene expression values from the previously

published microarray experiments. Several examples are given of proteins that have been

shown to be differentially expressed, which have been highlighted for further investigation.

6.3.3 Comparison with microarray data

We have developed software to match proteins identified by MS to the corresponding clones

from the microarray study by Blader and colleagues, in order to discover the correlation be-

tween gene expression and protein abundance. The Blader experiment contains two relevant

datasets. The first is a time course experiment to highlight genes with altered expression

at 1, 2, 4, 6 and 24 hours post infection with T. gondii. The second data set from Blader’s

microarray experiment contains an analysis, from two independent infections, of the genes

that were most strongly up or down-regulated at 24 hours post-infection. The proteomics

experiment carried out at Glasgow determines the abundance of proteins at 24 hours post-

infection. It is likely that there is a lag between an up-regulation in gene expression, and

the production of new protein, although the length of time is not known exactly.

The technique to match data points present in both data sets performed correct matching

between gene and protein identifiers. However, due to the limited coverage of both exper-

iments, the datasets are not currently large enough to infer global information about the

rate of translational control for Toxoplasma proteins. The results of the matching, displayed

in Table 6.1, provide qualitative information about the correspondence between the rate of


Figure 6.7: The table in RAPAD displaying protein abundance and gene expression values. The column headings are as follows: 1 = Spot ID, 2 = Protein name, 3 = cDNA clone name, columns 4 to 8 are relative gene expression values from a time course experiment, and columns 9 and 10 are relative expression values from a separate microarray hybridization (24 hour time point, see Section 6.3.3). Column 10 = spot ID of matching spot on a second gel and column 11 is the ratio of protein volume between the two gels.


Protein Name                                            1h     2h     4h     6h     24h    24h(i)  24h(ii)

Up-regulated
Annexin-1                                               2.02   0.65   1.18   0.95   0.79   —       —
Heterogeneous ribonucleoprotein F                       —      —      —      —      —      2.77    2.30
HS70kDa protein 8 isoform 1                             —      —      —      —      —      2.58    2.22
Nucleoside diphosphate kinase 1                         1.82   0.75   1.37   1.12   2.55   —       —
Phospholipase C alpha or Protein disulphide isomerase   —      —      —      —      —      2.03    2.64
Thioredoxin peroxidase                                  —      —      —      —      —      2.14    2.04
Tubulin beta                                            —      —      —      —      —      3.94    3.00
Villin 2                                                1.53   1.02   1.26   1.34   2.50   —       —

Down-regulated
Actin beta                                              0.69   1.05   0.74   1.07   0.44   —       —
AHNAK (Desmoyokin)                                      —      —      —      —      —      0.41    0.15
Cathepsin B                                             0.95   0.96   0.84   1.00   0.47   —       —
Dimethyl arginine dimethyl aminohydrolase               1.13   0.84   1.01   1.29   2.02   —       —
Heterogeneous ribonucleoprotein F                       —      —      —      —      —      2.77    2.30
Superoxide dismutase                                    1.50   3.56   3.77   4.72   1.69   —       —
Tubulin beta                                            —      —      —      —      —      3.94    3.00
Vimentin                                                0.87   0.73   0.81   0.74   0.43   —       —

Table 6.1: The correspondence between gene and protein abundance for HFF cells infected with T. gondii. Column 1 contains the names of proteins identified in the proteome study, which are up or down-regulated during parasite infection. The numerical values are the corresponding gene expression values from the study by Blader [35] from a time course experiment (columns 2-6) and two independent infections at the 24h time point (columns 7 and 8). The values are the ratio of the expression of the gene that corresponds to the protein in column 1, from infected versus non-infected samples. A value greater than 1 indicates the gene is up-regulated during infection, less than 1 indicates that the gene is down-regulated. The — symbol indicates that the value was not present in the Blader study.


Figure 6.8: The top image displays a part of the gel from the infected sample at a higher magnification, and the bottom image is the non-infected sample. Spots matched to vimentin are highlighted. The cluster of spots marked 2 is present on both gels. The cluster of spots marked 1 is only present in non-infected samples. Gel images courtesy of M. Nelson.


transcription and translation. The first column in Table 6.1 displays the proteins that have

been found to be up or down-regulated during infection in the proteomics investigation, and

have been matched to a gene in the Blader study. Proteins are identified as up-regulated

in infected samples if they appear in a larger volume on gels from infected samples, or the

spot is present in the infected sample and absent in the non-infected sample. A protein is

defined as down-regulated if it appears in a larger volume on, or is only present on, gels from

non-infected samples. Columns 2-6 display the expression values at the five time points post-

infection from the Blader study for the gene that corresponds to the protein in column 1.

Columns 7 and 8 display the expression values for genes that have been matched to proteins

in this investigation, from two further independent infections at 24 hours post-infection in

the Blader study. Table 6.1 summarises fairly complex data, as for example vimentin and

actin both appear in multiple copies on gels from infected and non-infected samples. Both

vimentin and actin are defined as down-regulated because there are spots clearly present

across replicates on non-infected samples, which are not present on infected gels. Figure 6.8

displays the spots matched to vimentin from infected and non-infected samples. The spot

cluster 2 is present on both gels in roughly similar volumes. Cluster 1 is only present in non-

infected samples. This indicates that several forms of vimentin with particular modifications

are down-regulated during infection.

The spots matched to actin beta are displayed in Figure 6.9. The pattern of spots indi-

cates that particular forms of actin beta are less abundant during infection, or it may reflect

the fact that the total volume of all spots is reduced, and certain spots cannot be viewed at

very low volumes. Both vimentin and actin are implicated in cytoskeletal development, and

may be down-regulated because Toxoplasma arrests the host’s cell cycle. Tubulin beta and

heterogeneous ribonucleoprotein F appear in both halves of the table because some forms of

the proteins appear in greater volumes in infected samples, and other forms in non-infected

samples. Therefore, there may be a different type of modification that causes spots to shift

positions on the 2-D gel, and it is not possible to state simply whether the proteins are up

or down-regulated.
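
The classification rule described above can be sketched in a few lines. This is only an illustration under stated assumptions and not the RAPAD implementation: the function names, the use of `None` for an absent spot and the `min_fold` cut-off are all hypothetical, since the text does not give a numeric threshold for a "larger volume".

```python
from typing import Optional

def classify_spot(infected: Optional[float],
                  non_infected: Optional[float],
                  min_fold: float = 1.5) -> str:
    """Classify one protein form (spot) from its volume in each condition.

    None means the spot is absent from gels of that condition.
    min_fold is a hypothetical cut-off for calling a volume "larger".
    """
    if infected is None and non_infected is None:
        raise ValueError("spot absent from both conditions")
    if non_infected is None:
        return "up"        # only present in infected samples
    if infected is None:
        return "down"      # only present in non-infected samples
    if infected >= min_fold * non_infected:
        return "up"
    if non_infected >= min_fold * infected:
        return "down"
    return "unchanged"

def classify_protein(spots: list[tuple[Optional[float], Optional[float]]]) -> str:
    """Combine per-spot calls; forms that disagree are reported as 'mixed'."""
    calls = {classify_spot(i, n) for i, n in spots} - {"unchanged"}
    if not calls:
        return "unchanged"
    return calls.pop() if len(calls) == 1 else "mixed"
```

With invented volumes, a vimentin-like pattern (one cluster similar on both gels, one cluster absent from infected gels) is called "down", whereas a protein with forms changing in opposite directions, as described for tubulin beta, is called "mixed".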

Up-regulated proteins

Three genes (HS70kDa protein, protein disulphide isomerase and thioredoxin peroxidase)

are strongly up-regulated in the Blader study at 24 hours, and the corresponding proteins

are also up-regulated in this investigation. HS70kDa is a heat shock protein that is released


Figure 6.9: Spots matched to actin beta from infected (top) and non-infected (bottom) samples. Gel images courtesy of M. Nelson.


when the cell is placed under stress, therefore it may represent a host cell response to infec-

tion. Thioredoxin peroxidase is implicated in oxidative stress and regulation of transcription

factors, and may also be a sign of a host cell response.

The comparison data reveals that both phospholipase C alpha (PCA) and protein disul-

phide isomerase (PDI) are predicted to match the same microarray clone, annotated as a

“glucose regulated protein” (accession R33030). The 2-D gel data also reveals that spots

containing phospholipase C alpha are also predicted to contain PDI, based on MS results.

Further analysis reveals that GenBank contains exactly the same protein sequence for both

PCA (BAA03759) and PDI (JC5704). The Harvester database contains a different, unrelated

protein sequence for PCA (Harvester ID Q15111), but PDI has the same protein sequence in

Harvester and GenBank. This indicates that the PCA record in GenBank contains an incor-

rect protein sequence. It appears that both the proteomics and microarray data agree that

PDI is up-regulated in response to parasite infection. PDI functions to rearrange disulphide

bonds in proteins, and the up-regulation may be due to a general increase in proteins that

must be produced during infection. PCA may not be implicated in this study at all, and

if it is incorrectly annotated in GenBank, the record should be updated. The public access

part of RAPAD, described in Chapter 5, will allow other databases to connect to RAPAD

when the proteome data has been published.

The proteome studies reveal that Annexin-1 is up-regulated during infection at the 24

hour time point. It is interesting to note that the gene expression studies suggest that

Annexin is up-regulated early, and then down-regulated later. This would suggest that

there is a large lag between changes in gene expression and the production of new protein,

however much larger data sets would be required to confirm and quantify this hypothesis.

The record in the SOURCE database [78] for Annexin suggests that it is involved with

exocytosis, membrane fusion and an anti-inflammatory response. The Swiss-Prot database

specifies that Annexin can be phosphorylated, leading to inactivation. The 2-DE data reveals

two adjacent spots that may be the result of differentially phosphorylated forms, which have

been further investigated (Section 6.3.4).

Down-regulated proteins

There are eight proteins that have been classified as down-regulated in the protein investiga-

tion and which have been matched to microarray data points. The apparent down-regulation

of the proteins actin beta and vimentin has been discussed above. The gene expression data


suggest that vimentin is down-regulated as expected, but the results for actin beta are less

clear, although on average the gene for actin beta seems to be down-regulated. The function

of Cathepsin B was discussed in Section 6.3.2, and it appears that the microarray data sug-

gest the gene is slightly down-regulated early in infection, and very strongly down-regulated

late in infection. AHNAK (Desmoyokin) appears to be down-regulated in both the proteome

and microarray investigation. It is believed to have various roles, including signal transduc-

tion and regulation. The GenAtlas entry suggests that AHNAK plays “a regulatory role of

the actin-bound cytoskeleton to the l-type Ca2+ channel”, which would suggest that it may

be down-regulated as part of the inhibition of cell cycle and cytoskeletal development, caused

by the parasite.

There are several forms of the protein heterogeneous ribonucleoprotein F that appear in

higher volumes in infected samples, but other forms appear in lower quantities. The mi-

croarray experiments suggest that the gene is strongly up-regulated. The protein is involved

in RNA processing. It would be expected that more genes are expressed when the cell is

under stress, such as during infection. A general increase in gene expression should correlate

with higher RNA processing, and we might expect that heterogeneous ribonucleoprotein F

would be up-regulated. The finding that there are different variants of this protein may

suggest that an activated form of the protein is present in much higher volumes in infected

samples, and spots that are larger in non-infected cells correspond with a de-activated form

of the protein. Additional investigations into the PTMs of the protein would be required to

confirm this hypothesis.

The protein for dimethyl arginine dimethyl aminohydrolase appears in lower abundance

during infection in the proteomic study but the microarray data suggest that the gene has

fairly stable expression until the 24 hour time point, at which it is strongly up-regulated.

This protein has a catalytic role associated with the generation of nitric oxide.

While nitric oxide is used by macrophage cells to kill engulfed pathogens, nitric oxide is

unlikely to be used in this way in an HFF cell line. It is therefore difficult to hypothesise as

to why the protein appears in lower abundance. The protein superoxide dismutase exhibits

unusual results in this study, and is discussed below.

Superoxide dismutase

In general, there appears to be a reasonable correspondence between gene expression and

protein abundance, because most proteins that are found to be up-regulated in the proteome


Figure 6.10: The top images display the spot identified as superoxide dismutase chain A from the non-infected sample, replicate 11 (left) versus infected (right). The lower image displays superoxide dismutase. A polygon has been drawn on top of the image to display the likely position of the protein in the second gel. Gel images courtesy of M. Nelson.


study have a corresponding gene expression value that is greater than one. In addition, most

of the proteins that are down-regulated have a corresponding gene expression value of less

than one. The one clear exception is superoxide dismutase, which is down-regulated in the

proteome, but strongly up-regulated in the microarray study. The Gene Ontology classifies

the protein as released in response to oxidative stress, which we would predict to be greater

during parasite invasion, therefore the result from 2-DE is surprising.
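
The correspondence test used in this comparison (an expression ratio above one for up-regulated proteins, below one for down-regulated proteins) can be written as a small check. The sketch below is illustrative only: the function names are hypothetical, the records are invented placeholders rather than values from Table 6.1, and `None` stands for a value that was not reported in the Blader study.

```python
from typing import Optional

def concordant(protein_direction: str, expression_ratio: float) -> bool:
    """True when the infected / non-infected expression ratio agrees with the protein call."""
    if protein_direction == "up":
        return expression_ratio > 1.0
    if protein_direction == "down":
        return expression_ratio < 1.0
    raise ValueError(f"unknown direction: {protein_direction!r}")

def discordant_proteins(records: list[tuple[str, str, Optional[float]]]) -> list[str]:
    """Names of proteins whose gene expression ratio contradicts the proteome result."""
    return [
        name
        for name, direction, ratio in records
        if ratio is not None and not concordant(direction, ratio)
    ]

# Placeholder records: (protein name, proteome call, expression ratio at one time point)
records = [
    ("protein A", "up", 2.4),
    ("protein B", "down", 0.6),
    ("protein C", "down", 3.1),   # contradicts the proteome call
    ("protein D", "up", None),    # no microarray value reported
]
print(discordant_proteins(records))
```

A protein flagged by such a check, as superoxide dismutase is in this study, is a candidate for closer inspection rather than proof of post-transcriptional control.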

There are two spots on 2-D gels from non-infected samples, one predicted to match “su-

peroxide dismutase chain A” and another matching “superoxide dismutase”. The automated

comparison predicts that only the latter protein matches the microarray result in the Blader

study. A local alignment of the two protein sequences reveals that they have very low ho-

mology (35% similarity, alignment not shown), indicating that they are not highly related

proteins, even though they have similar names (GenBank accessions gi|515251 and gi|34711).

The diagram in Figure 6.10 displays the positions of the spots on the gels from infected and

non-infected samples. The top image displays superoxide dismutase chain A and the lower

image shows the position of superoxide dismutase, from infected (right) and non-infected

(left) samples. The microarray results demonstrate very strong up-regulation of superoxide

dismutase during infection. It is likely that in the proteome study “superoxide dismutase

chain A” is a different protein, and is not strongly down-regulated. Therefore, considering

only the lower image on Figure 6.10 (superoxide dismutase), there is no clear spot, or only

a spot with a far lower volume, in the infected sample. This result is surprising given the

suggested role of the protein, therefore further analysis is required to verify that superoxide

dismutase is down-regulated during infection in the proteome but up-regulated in the tran-

scriptome. If this proved to be correct, this would demonstrate strong post-transcriptional

control regulating protein abundance, because a large increase in gene expression does not

appear to produce a corresponding change in protein abundance.

In summary, the results of the comparison between microarray and proteomics highlight

the potential for discovery of the relationship between gene expression and protein abun-

dance when larger data sets are assembled. The study reveals several proteins that correlate

well with gene expression values. The examples presented in this section demonstrate that

information about the proteins’ functions can be assimilated easily within RAPAD, due to

the number of links to external databases which are provided.


Figure 6.11: Four spots containing protein disulphide isomerase. The pattern of spots is indicative of different phosphorylated forms of the protein. Gel image courtesy of M. Nelson.

6.3.4 Post-translational modifications

The database query facility and the Gel Viewer enable researchers to find proteins that lo-

calise to the same region on the gel, and share the same name. This can highlight potential

post-translational modifications for further enquiry. An example is shown in Figure 6.11

of four spots matched to protein disulphide isomerase, a protein that catalyses the rearrangement

of disulphide bonds in proteins. The pattern of several spots in a horizontal line

is characteristic of different phosphorylation states, although other types of variable modifi-

cations can produce clusters of spots. It is also possible that differential splicing occurs to

produce various different protein sequences from a single gene.
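
A query for such horizontal spot patterns could look roughly like the following sketch. It assumes each spot is available as a (protein name, x, y) record with gel coordinates in pixels; the function name, the `y_tolerance` value and the example coordinates are hypothetical and are not taken from RAPAD or the Gel Viewer.

```python
from collections import defaultdict

def horizontal_trains(spots, y_tolerance=5.0, min_spots=2):
    """Find, per protein name, the largest group of spots lying on roughly one horizontal line.

    spots: iterable of (protein_name, x, y). Spots are chained together while the
    difference in y between neighbours (after sorting by y) stays within y_tolerance.
    """
    by_name = defaultdict(list)
    for name, x, y in spots:
        by_name[name].append((x, y))

    trains = {}
    for name, coords in by_name.items():
        coords.sort(key=lambda c: c[1])          # order by vertical position
        best, group = [], [coords[0]]
        for c in coords[1:]:
            if c[1] - group[-1][1] <= y_tolerance:
                group.append(c)
            else:
                best = max(best, group, key=len)
                group = [c]
        best = max(best, group, key=len)
        if len(best) >= min_spots:
            trains[name] = sorted(best)          # left to right across the gel
    return trains

# Invented coordinates: four spots in a row plus one unrelated spot of the same name
example = [
    ("PDI", 100, 200), ("PDI", 110, 201), ("PDI", 120, 199), ("PDI", 130, 202),
    ("PDI", 300, 400), ("actin", 50, 50),
]
print(horizontal_trains(example))
```

A horizontal train suggests forms with similar mass but different charge; it does not by itself identify the modification.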

Mass spectrometry data was used primarily to identify proteins; however, a process was

undertaken to search the MS data again, to find variable modifications on the proteins.

The MASCOT software has an option to search for different types of modifications, such

as phosphorylation, acetylation, and others, to find whether the mass of each detected peptide

matches a peptide sequence more closely if one of the residues has a particular modification.

The search was implemented for clusters of spots that match the same protein (Vimentin,

PDI and Annexin). However, the searches revealed little information about modifications.


Start - End   Observed   Mr(expt)   Mr(calc)   Delta   Miss   Sequence          Modifications
 63 -  73     1191.58    1190.57    1190.59    -0.02   0      LAPEYEAAATR
 95 - 104     1084.56    1083.55    1083.56    -0.01   0      YGVSGYPTLK
108 - 119     1236.51    1235.51    1235.51     0.00   0      DGEEAGAYDGPR
259 - 271     1619.78    1618.77    1618.78    -0.01   0      DLLIAYYDVDYEK
336 - 344     1188.53    1187.52    1187.53    -0.01   0      FVMQEEFSR         Oxidation (M)
336 - 347     1652.70    1651.70    1651.66     0.04   1      FVMQEEFSRDGK      Acetyl (K); Acetyl (N-term); Oxidation (M); Phospho (ST)
352 - 362     1359.66    1358.65    1358.65    -0.00   0      FLQDYFDGNLK
352 - 363     1515.75    1514.74    1514.75    -0.01   1      FLQDYFDGNLKR
434 - 448     1680.75    1679.75    1679.75    -0.00   0      MDATANDVPSPYEVR   Oxidation (M)
449 - 460     1341.68    1340.68    1340.68     0.00   0      GFPTIYFSPANK
449 - 461     1469.75    1468.74    1468.77    -0.03   1      GFPTIYFSPANKK
472 - 482     1370.69    1369.68    1369.69    -0.01   0      ELSDFISYLQR

Figure 6.12: The result of a search for potential post-translational modification of protein disulphide isomerase, revealing a peptide that may be acetylated and phosphorylated. The oxidations are caused experimentally and are not biologically relevant.

There are several possible reasons: firstly, the number of peptides detected by MS is usually

far smaller than the total number of peptides in a protein, and only a proportion (10-

40%) of the peptides are actually detected. Therefore, in many cases the modification is

to a peptide that is not detected by MS. Secondly, it is believed that peptides with certain

modifications do not ionise well, and are therefore less likely to be detected than peptides

without additional modifications. Finally, it is possible that the cluster of spots is the result

of several different translations of the same gene, to produce a set of proteins that contain

peptides that still match the sequence entry in the database. The searches revealed a single

possible modification to the PDI spot at the furthest right position in Figure 6.11, which is

predicted to have been acetylated and phosphorylated (Figure 6.12). A phosphorylation to

a protein could be confirmed by a labelling experiment to quantify the number of phosphate

residues per protein, in each spot. There are facilities in RAPAD for the storage and querying

of PTMs after they have been confirmed, as described in the previous chapter.
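
The modification search summarised in Figure 6.12 rests on comparing each measured peptide mass with the mass calculated for a candidate sequence, with and without modifications. The sketch below reproduces that calculation for three peptides from the figure. It is not the MASCOT algorithm or its API: the function name is hypothetical, and the residue and modification masses are standard monoisotopic values supplied for the example rather than figures from this thesis.

```python
# Standard monoisotopic residue masses in Da (assumed reference values)
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per peptide (free N- and C-termini)

# Mass shifts for the modification names that appear in Figure 6.12
MODIFICATION_MASS = {
    "Oxidation (M)": 15.99491,
    "Acetyl (K)": 42.01057,
    "Acetyl (N-term)": 42.01057,
    "Phospho (ST)": 79.96633,
}

def peptide_mass(sequence, modifications=()):
    """Calculated monoisotopic mass, Mr(calc), of a peptide with optional modifications."""
    base = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return base + sum(MODIFICATION_MASS[m] for m in modifications)

unmodified = peptide_mass("LAPEYEAAATR")
oxidised = peptide_mass("FVMQEEFSR", ["Oxidation (M)"])
multiply_modified = peptide_mass(
    "FVMQEEFSRDGK",
    ["Acetyl (K)", "Acetyl (N-term)", "Oxidation (M)", "Phospho (ST)"],
)
print(f"{unmodified:.2f} {oxidised:.2f} {multiply_modified:.2f}")
```

The three values agree with the Mr(calc) column of Figure 6.12 to two decimal places, and the Delta column is the difference between Mr(expt) and Mr(calc). Agreement in mass alone shows only that the combination of modifications is consistent with the measurement, which is why the text proposes a labelling experiment as confirmation.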

6.3.5 Public access to data

An interface has been created that will allow public access to the proteomic data to ac-

company a future journal publication. The opening page loads a general description of the

experiment, a summary of all the gels, a listing of the number of proteins identified on each

gel, and links to the protocols for the protein solubilisation, first and second dimension sep-

aration, staining and scanning (Figure 6.13). There is an option to select particular gels,

and view a table containing the proteins that have been identified. The second page allows

users to select particular proteins, and open the Gel Viewer highlighting the proteins, with

different gels appearing in separate tabbed windows. The security of data is ensured because

a check is made before loading each page that data has been specified as publicly accessible

for every gel and protein (every database table has the attribute OTHER_READ, which is set


from 0 to 1 for public data). At the time of writing, the researchers do not wish to release

the data until it has been published elsewhere, therefore the URL for this part of the inter-

face will accompany the publication of the data. The query interface that forms part of the

RAPAD Study-Annotator, described in the previous chapter, will be linked to the publicly

accessible data sets. This will allow other researchers to verify the findings, and opens the

possibility for new discoveries by allowing complex queries of the data.
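
The access check described above amounts to filtering every query on the public-read flag. The following sketch illustrates the idea with an in-memory SQLite database; the table and column names (`gel`, `gel_id`, `name`, `other_read`) are assumptions chosen for the example and do not reproduce the RAPAD schema or its database system.

```python
import sqlite3

def public_gels(conn):
    """Return only the gels whose public-read flag is set to 1."""
    rows = conn.execute(
        "SELECT gel_id, name FROM gel WHERE other_read = 1 ORDER BY gel_id"
    )
    return rows.fetchall()

# Hypothetical table: the flag defaults to 0 (private) and is set to 1 on release
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gel (gel_id INTEGER PRIMARY KEY, name TEXT, "
    "other_read INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO gel (name, other_read) VALUES (?, ?)",
    [("gel A", 1), ("gel B", 0), ("gel C", 1)],
)
print(public_gels(conn))
```

Applying the filter in the query itself, rather than after the rows have been fetched, means that private records never reach the page-rendering code.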

6.4 Discussion

The experiments described in this chapter present challenges due to the size and complexity

of the data. One of the major challenges is the requirement for summarising data across

replicates, and determining if proteins are differentially expressed during parasite invasion.

The biological goals were to investigate if proteomics experiments confirmed or conflicted

with previous hypotheses regarding the mechanism of parasite invasion, and the continued

survival of the parasites in host cell culture. This has been facilitated by the development of

software for matching spots between gels, and visualising differentially expressed proteins.

The Gel Viewer enables multiple gels to be loaded concurrently, with controls for zooming,

search facilities for highlighting particular spots and links to more detailed information in

the database. There are also query facilities for finding particular proteins in the database

and for summarising all the data across replicates. Software has been written to connect

RAPAD to a number of external databases and analysis applications that summarise func-

tional classifications, to find which classes of proteins change in expression during parasite

invasion. The ability to connect to external software demonstrates the flexibility of RAPAD

which is due to the extensive use of ontologies. External database entries are stored in a

generic table (DatabaseEntry), with a record stored in the OntologyEntry table that holds

sufficient information to describe how the link to the external database should be implemented,

including the database’s URL and version. Therefore, external links to any web

accessible database can be provided.
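
One way to picture this arrangement is sketched below. The class names follow the two tables named in the text, but the fields, the `{accession}` placeholder convention and the example URL are assumptions made for illustration; the actual columns of DatabaseEntry and OntologyEntry are not reproduced here.

```python
from dataclasses import dataclass
from urllib.parse import quote

@dataclass(frozen=True)
class OntologyEntry:
    """Describes an external database: its name, version and how to link to it."""
    database_name: str
    version: str
    url_template: str   # hypothetical convention: contains "{accession}"

@dataclass(frozen=True)
class DatabaseEntry:
    """A generic reference from a local record to an entry in an external database."""
    accession: str
    source: OntologyEntry

    def link(self) -> str:
        # Percent-encode the accession so characters such as "|" are URL-safe
        return self.source.url_template.format(accession=quote(self.accession, safe=""))

# Placeholder database definition; example.org is not a real sequence database
example_db = OntologyEntry("ExampleDB", "1.0", "https://example.org/entry?id={accession}")
entry = DatabaseEntry("gi|34711", example_db)
print(entry.link())
```

Because the link pattern is data rather than code, supporting a further web-accessible database would need only a new OntologyEntry record.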

This investigation allows the general functionality of RAPAD to be assessed in a genuine

research environment. It is common for proteomics investigations to require the display of

differential expression of proteins, and links to external Internet accessible databases. The

data set described in this chapter is fairly large (14 gels, 350 identified proteins), and in the

following chapter there is a description of a different study in which a further 1000 protein

identifications are stored in RAPAD. This demonstrates that RAPAD can scale up to manage


Figure 6.13: A summary page displays all the gels present in the experiment, and a link exists to display the experimental protocols used for each gel.


substantial data sets, and allows them to be queried. The interface code and database schema

are freely available, therefore other developers can re-create their own version of RAPAD to

support a variety of proteome studies.

The integration between the locally generated proteomics data and the previously published

microarray studies was a critical requirement of the project. The results demonstrate

the viability of the approach; however, there are currently few protein records that

are matched to microarray data points. This appears to be a reflection of the proportion of

records present in both studies, rather than a flaw in the methodology. The microarrays used

by Blader contained from 18,000 to 27,000 clones. However, the results were only reported

for those genes that showed a 2-fold difference in fluorescence between scans generated from

infected and non-infected samples, corresponding to approximately 1800 microarray results.

In the proteome study, 130 distinct proteins displaying differential expression have been iden-

tified, of which 14 have been matched to a clone in the microarray study. This is about 1 in

9 proteins that match a microarray clone. It would be expected by chance that a minimum

of 1 in 15 proteins identified by 2-DE should match a clone in the microarray results (1800

results from 27000 clones = 1/15). In reality, we would expect a far higher number to corre-

spond in the two studies because protein spots have been selected for analysis if they appear

in different volumes across the two conditions. It is assumed that if a protein is produced in

much greater abundance, there would be a corresponding increase in the mRNA levels that

would be detected by microarray analysis. Therefore, it might be predicted that most of the

proteins identified in the proteome study should appear in the Blader results.
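
The proportions quoted in this paragraph can be checked directly. The sketch below uses only the figures given in the text (14 matches among 130 proteins, about 1800 reported results, and arrays of 18,000 to 27,000 clones); the variable names are illustrative.

```python
from fractions import Fraction

matched, proteins = 14, 130            # proteins matched to a clone / proteins identified
reported = 1800                        # microarray results reported by Blader
clones_min, clones_max = 18000, 27000  # range of clones on the arrays

observed = Fraction(matched, proteins)          # observed matching rate
chance_low = Fraction(reported, clones_max)     # chance rate for the largest array
chance_high = Fraction(reported, clones_min)    # chance rate for the smallest array
expected_matches = proteins * chance_low        # matches expected by chance alone

print(f"observed: 1 in {float(1 / observed):.1f}")
print(f"chance:   1 in {float(1 / chance_low):.0f} to 1 in {float(1 / chance_high):.0f}")
print(f"expected by chance among {proteins} proteins: {float(expected_matches):.1f}")
```

The observed rate is therefore only modestly above the lowest chance rate, which is consistent with the remark that a far higher correspondence would have been expected.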

There are a large number of the 130 proteins found in this study that do not match any

differentially expressed genes in the microarray study. It is possible that the Blader study

did not have complete coverage of all genes, but the majority of genes were assayed and were

found to have stable gene expression between infected and non-infected samples. Therefore,

the differentially expressed proteins that do not match anything in the Blader study are of

interest, because they demonstrate that there may be post-transcriptional control in response

to parasite invasion. In other words, many proteins are produced in greater or lesser volume

during infection that do not have a measurable difference in their mRNA levels. It is possible

that certain proteins that are required for infectivity would not be highlighted from a gene

expression experiment, and in that case the mechanisms for infectivity cannot be studied

using microarrays only. This finding demonstrates the viability of the 2-DE and MS approach

for hypothesis formation, and it is likely that it will continue to grow as a technology for


functional genomics analysis.

6.5 Summary and conclusions

In this chapter software has been described that enables clustering and visualising spots on

replicate gels that contain the same form of a protein, and the spots that contain variant

forms. This has enabled potential post-translational modifications to be identified for fur-

ther study. When PTMs have been confirmed, RAPAD has facilities for their storage and

querying. The results suggest that different forms of proteins exist in infected and non-

infected samples, although the exact types of the modifications have yet to be confirmed.

The data sets will continue to grow rapidly, and it will be vital to combine information about

modifications with the relative expression values measured by microarrays, 2-DE and other

technologies. RAPAD provides a framework in which this kind of data integration can take

place on a large scale, and it will serve as a repository for the publication of data to accom-

pany journal articles. It is planned that the data from the experiments with Toxoplasma

will be published at some point in the future. RAPAD will provide public access to the

data, using the interface described in the previous chapter, to allow researchers accessing the

article to query the proteome data.

A common type of proteome investigation is the search for differentially expressed pro-

teins, using 2-DE, image analysis and mass spectrometry. The RAPAD system has been

extended to support the experiments presented in this chapter, which compare a human cell

line, invaded with Toxoplasma gondii, with non-invaded cells. RAPAD specifically facilitates

the identification of differential expression by providing a visualisation of clusters of spots

that have been matched to the same protein across a series of replicates. Following the

identification of proteins, a large amount of information must be assimilated from diverse

databases to characterise the proteins. Every protein record in RAPAD has hyperlinks to sev-

eral other databases, using the GenBank identifier or the corresponding gene symbol, which

were obtained for each protein using scripts written by the author. Additional tools were

used to summarise the functions of proteins from the Gene Ontology. An approach has been

presented for matching differentially expressed proteins to the corresponding results from a

previously published microarray experiment. The results of the matching demonstrate some

correspondence between genes that are up-regulated during infection and increased protein

abundance on 2-D gels, but the data sets are not currently large enough to quantify the cor-

relation. The software can be re-used when data sets are larger for determining the global


rate of transcription and translation.

The following chapter outlines a project with a different parasite, Trypanosoma brucei.

RAPAD assists an investigation to catalogue all the proteins that can be found using a

gel-based approach, to improve the functional annotation of the genome, and determine the

dynamic nature of the proteome.

Chapter 7

Software support for a proteome

map of Trypanosoma brucei

7.1 Introduction

The previous chapter focused on the use of proteomics techniques to find differentially ex-

pressed proteins that allow for the formation of new hypotheses about the function of a

system. This chapter outlines the use of proteomics in a different context, where it is used

for cataloguing information about protein expression, to improve the functional annotation

of genes and the search for post-translational modifications. The RAPAD database sup-

ports a proteome map of the parasite Trypanosoma brucei that causes sleeping sickness in

Africa. The genome sequence of T. brucei is nearing completion, and many open

reading frames have been accurately predicted from it, but the functional annotation of the genes

is generally poor. There are many genes that have only been tentatively identified and have

no functional assignment. The proteome data is able to confirm the existence of genes that

encode proteins expressed in the cell line and provide insights into the dynamic nature of

proteins in terms of modifications, and different isoforms that exist. Additional software

has been written to provide a novel visualisation of proteins identified by mass spectrome-

try, and to summarise information within a substantial data set. The analysis presented in

this chapter will improve the naming of certain genes, and provides a potential functional

assignment for several proteins.

7.1.1 The biology of trypanosomes

Trypanosoma brucei is a eukaryotic parasite that causes sleeping sickness in sub-Saharan

Africa, and there have been a number of recent epidemics [294]. Trypanosomes live in

the bloodstream and tissue fluids of mammals, causing a variety of diseases in livestock and


Figure 7.1: The life cycle of Trypanosoma brucei, from DPDx - CDC Parasitology Diagnostic web site, http://www.dpd.cdc.gov/dpdx/HTML/TrypanosomiasisAfrican.asp

mortality in humans. They are transmitted by tsetse flies, and it is predicted that more than

half a billion people live in affected areas, with hundreds of thousands of new cases per year

[26]. The expected outcome, in the absence of chemotherapy, is death. Anti-trypanosomal

drugs have been developed, although drugs are not 100% effective, and resistant strains are

now arising [301].

The prospects for the development of a vaccine are very slim because the parasite evades

the immune response through the process of antigenic variation, first reported by Vickerman

in the 1960s [335]. A set of proteins, known as variant surface glycoproteins (VSG), form

a dense outer layer around the parasite, protecting against recognition from the immune

system. There is one locus from which a single VSG gene is activated at any one time,

with approximately 1000 other VSG genes distributed in different, silenced positions. At

intervals, a rearrangement of the genes occurs, switching the gene that is positioned in the

activated locus. A different protein becomes expressed, forming a new surface coat that will

not be recognised by the immune system (the mechanisms of gene switching are reviewed by

Barry 1997 [27]).

Trypanosomes undergo a complex developmental cycle that is simplified in Figure 7.1.


Figure 7.2: An electron micrograph of the bloodstream form of Trypanosoma brucei, from http://www.ulb.ac.be/sciences/biodic/ImProto0003.html

The regulation of the life-cycle is poorly understood despite its obvious importance to the

parasite. When a fly takes a blood meal from an infected mammalian host, bloodstream

forms (Figure 7.2) differentiate to the procyclic stage of the life cycle in the gut of the fly,

accompanied by alterations in metabolism and morphology caused by changes in expression

of an unknown number of proteins. It is vital these proteins are identified given the severity

of the disease and the unusual biology of trypanosomes, which is discussed in more detail

below. It is also possible that proteins involved in regulating the life-cycle may prove to be

viable drug targets.

7.1.2 Annotating the genome

The genome sequence of T. brucei is nearing completion and the sequence of chromosomes I

and II was reported in 2003 [146, 87]. The genome contains 11 chromosomes in total, and is

27 Megabases in length. Currently, 5500 coding sequences have been conclusively identified

(March 2004) [127], and it is expected that the total gene number will be about 8000. Efforts

are now underway to determine the function of all genes, with particular focus on genes that

cause drug resistance, genes that enable the parasite to evade the immune response and the

proteins that are up-regulated during infection of mammals. Trypanosoma brucei belongs

to a small class of unicellular organisms, the kinetoplastids, which exhibit highly unusual


regulation of gene expression. It seems that these organisms do not regulate transcription

by RNA polymerase II, and large numbers of genes appear to be regulated from a single

transcriptional initiation point. The genes lie adjacent to each other in long runs, interspersed

with almost no introns, similar to bacterial operons [60]. However, unlike operons, the

genes do not encode similar proteins that would be expected to be under a single control

mechanism, but instead contain seemingly unrelated genes. It will therefore be interesting

to discover what functional genomics (FG) experiments can demonstrate about how genetic

regulation is performed in these parasites. Microarray analysis would be expected to reveal

unusual results because transcriptional control may occur only through regulation of the

rate of degradation of mRNA, or the rate of splicing. Therefore, the abundance of mRNA

may have different patterns from organisms with conventional gene regulation. Proteomics

studies aim to determine the level of expressed proteins and therefore may prove vital in

elucidating how post-transcriptional control is exerted.

It is essential that the functional annotation of the T. brucei genome is improved rapidly,

and made widely available, to facilitate the search for new drugs to control sleeping sickness.

There are also several related species that cause serious diseases. One of the closest relatives is

Trypanosoma cruzi that causes Chagas disease in South America. The parasite is transmitted

by triatomine bugs, infects mainly cardiovascular and autonomic nervous tissues, and is fatal

in about a third of all cases [53]. There are several members of the genus Leishmania, which

cause a variety of life-threatening diseases in the third world. Genome sequencing is taking

place for T. cruzi and Leishmania major. Comparative genome studies must be performed

to ensure that any gene annotations for closely related species can be related back to newly

sequenced genes in other organisms.

7.1.3 Database support

RAPAD is supporting a project to generate a catalogue of all the expressed proteins from T.

brucei, which can be separated by two dimensional gel electrophoresis (2-DE) and identified

by mass spectrometry (MS). The experiments are being performed by Anne Faldas and Prof.

Mike Turner in the Institute of Biomedical and Life Sciences at the University of Glasgow,

and the biological data in this chapter is reproduced with their permission.

Many of the 8000 genes in the genome are annotated as “hypothetical proteins” because

they have been identified solely by gene prediction algorithms. A naïve search of the genome

database, GeneDB [127], for the annotation “hypothetical AND protein” in T. brucei pro-


duces a list of 11,999 entries, for which there is little or no further annotation. Several

entries must refer to the same underlying gene, but appear more than once in the database,

because this number is far larger than the expected total number of genes. Clearly, if a

protein is identified conclusively by mass spectrometry, the protein is a real sequence, and

is expressed under the conditions used to generate the sample. This information must feed

back to the genome curators to allow the annotation to change from “hypothetical protein”

to “confirmed protein”. If homologous sequences from other organisms have been found by

similarity searches, the functional assignment of the homologous sequence should also be

added as annotation (described in Section 7.3.2).
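The feedback rule described above can be sketched as a small function. This is only an illustration: the function name, the "similar to" wording for homologue-derived annotation and the default threshold argument are assumptions made for this example, not part of RAPAD or GeneDB.

```python
def updated_annotation(current, p_value, homolog_function=None, alpha=0.05):
    """Suggest an annotation for a gene product after an MS identification.

    current          -- existing annotation text from the genome database
    p_value          -- significance of the MS identification
    homolog_function -- functional assignment of a homologous sequence, if any
    alpha            -- significance threshold (hypothetical default)
    """
    if p_value >= alpha:
        return current  # identification not conclusive: leave annotation alone
    new = "confirmed protein" if current == "hypothetical protein" else current
    if homolog_function:
        # Hypothetical formatting for homologue-derived annotation.
        new = f"{new}; similar to {homolog_function}"
    return new


examples = [
    updated_annotation("hypothetical protein", 0.01),
    updated_annotation("hypothetical protein", 0.20),
    updated_annotation("hypothetical protein", 0.01, "protein kinase"),
]
print(examples)
```

In practice the decision to change an annotation would rest with the genome curators; the sketch only makes the rule explicit.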

RAPAD supports searching and filtering of proteome data, allowing complex Boolean

queries to mine specific information from large data sets. It is also important that protein

data arising from gels with different pH ranges is combined in an intuitive manner, requir-

ing the development of good visualisation tools. This facility in RAPAD was described in

the previous chapter. One of the most important parts of the analysis is to discover the

frequency of post-translational modifications, or other events, which cause multiple spots

matched to the same protein to appear on a gel. Many proteins appear in multiple copies at

different positions on the gel, indicating that some processing or alteration of proteins must

be occurring to change either the charge or mass. For example, 92 distinct spots contain a

tubulin protein (α or β), many of which appear near the base of the gel, indicating small

molecular weight proteins, and the spots are reproducible across replicate gels. This would

suggest that the spots contain only fragments of proteins, the result of degradation. Software

has been developed alongside RAPAD to investigate this phenomenon (Section 7.3.1).

7.1.4 Project status

The current status of the T. brucei data deposited in RAPAD is as follows (June 2004).

There are 955 proteins identified in total, which arise from 619 spots on three gels. The

number of proteins is higher than the number of spots because several different proteins

are frequently identified from a single spot. A database query reveals that 260 proteins

have distinct molecular weights, indicating that this is the approximate number of different

proteins that have been identified. The rest of the analysis has been performed on one single

master gel (pH range 4-7), which contains 879 distinct spots. On the master gel 753 protein

identifications have been made from 460 spots.
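Summary counts of this kind can be produced by a single aggregate query. The sketch below uses an in-memory SQLite table as a stand-in; the table name `identification`, its columns and the sample rows (including the molecular weights) are invented for illustration and do not reflect the actual RAPAD schema or data.

```python
import sqlite3


def summarise(conn):
    """Return (identifications, spots with an identification, distinct molecular weights)."""
    row = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT spot_id), COUNT(DISTINCT mol_weight) "
        "FROM identification"
    ).fetchone()
    return tuple(row)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE identification (spot_id INTEGER, protein_name TEXT, mol_weight REAL)"
)
# Invented rows: spot 1 yields two proteins, and beta-tubulin appears in two spots.
conn.executemany(
    "INSERT INTO identification VALUES (?, ?, ?)",
    [
        (1, "beta-tubulin", 49.7),
        (1, "alpha-tubulin", 49.8),
        (2, "beta-tubulin", 49.7),
        (3, "Hsp70", 71.4),
    ],
)
summary = summarise(conn)
print(summary)
```

Counting distinct molecular weights is only an approximation of the number of different proteins, as noted in the text, because unrelated proteins can share a mass.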

The rest of the chapter is structured as follows: the methods used to capture the project


requirements and to develop the software are discussed in Section 7.2. Section 7.3 describes

the results, in terms of how RAPAD supports the discovery of modifications and aids genome

annotation. An investigation into the causes of multiple spots arising for a single protein is

also described. Discussion is provided in Section 7.4.

7.2 Methods

7.2.1 Generation of samples for proteome analysis

One of the major problems of performing functional genomics analysis on trypanosomes is

the speed with which they evolve, and it has been reported that trypanosome lines can spon-

taneously change their phenotype as a result of laboratory manipulation (see for example

van Deursen et al. 2001 [329]). If researchers perform investigations to characterise the

gene or protein expression of trypanosomes, the results may only have relevance to the exact

laboratory strain on which the experiments were performed. To alleviate these problems, a

reference strain of T. brucei has been generated (TREU 927), which has been used for gen-

erating the genome sequence [329]. The strain has several properties that are representative

of trypanosomes in the wild, and it can be cultured in vitro. Proteins have been extracted

from procyclic forms of the TREU 927 line grown as an in vitro culture for the proteome

study in Glasgow. This is vital because DNA has also been extracted from this line for

microarrays that are being created. Therefore, it will be possible to compare data from

the genome, transcriptome and proteome in the future, and the proteome experiments can

directly contribute to improving the annotation of the genome. The proteomics experiments

in the database comprise three main gels which have been run over different pH ranges (4-7,

6-11, and 4.5-5.5) to achieve a high resolution of proteins. The experimental protocols for

protein solubilisation, the two dimensions of gel separation and staining are all stored in

RAPAD. The details of the experimental procedure are given below.

Procyclic forms of the genome reference strain TREU 927/4 were grown in SDM-79

with 10% foetal calf serum according to [330]. Parasites were purified by washing in PSG

buffer and centrifuged at 13,000g. Approximately 2 × 10^8 trypanosomes (650 µg protein) were

solubilised in 470µl lysis/rehydration buffer (9M urea, 2M thiourea, 2% CHAPS, 65mM DTT

and 0.5% IPG buffer pH4-7, trace bromophenol blue). A protease inhibitor cocktail (5µl,

Roche), at a concentration of 25µg/ml, and 10µl nucleases (2000 units/ml DNase, 1750

units/ml RNase A, 50mM MgCl2) were added to limit proteolysis and digest nucleic acids.


The sample was incubated at room temperature for 1 hour, vortexing every 10 minutes, then

freeze/thawed in liquid nitrogen. The sample (450µl) was loaded on to a 24 cm IPG strip

(Amersham) and isoelectric focusing was performed, reaching more than 70,000Vhrs.

The strips were equilibrated in 100mM DTT for 15 minutes followed by 15 minutes in

250mM α-iodoacetamide before being applied to a 12.5% precast SDS polyacrylamide gel.

Electrophoresis was run overnight at 15°C using the Amersham buffer kit. The gels were stained

using colloidal Coomassie dye and scanned using Image Master (Amersham). Replicate gels

were performed (ten replicates of pH 4-7, five replicates of 4.5-5.5 and 6-11) of which one was

selected for protein identification. The 2D Elite software (Amersham) was used to generate

a picklist, and the gel was transferred to the Amersham robotic workstation, where each gel plug was

digested with trypsin, mixed with a CHCA (α-cyano-4-hydroxy cinnamic acid) matrix,

and spotted on to a MALDI (Matrix Assisted Laser Desorption Ionisation) target plate. A

peptide sample and a gel plug were collected for each sample and stored at −20°C. Analysis

of the peptides was performed using MALDI-TOF (Time Of Flight) with a Voyager system

(Perseptive Biosystems) and tandem MS (AB Q-Star Pulsar). Tandem MS was used for the

majority of protein identifications (approximately 95%). Genome sequence information was

downloaded from GeneDB (Release 3) to a local database that was searched using MASCOT

software [207]. Proteins were positively identified at a significance value of P < 0.05 as

calculated by the software.

7.2.2 Project requirements capture

The first phase of developing an understanding of the problem area involved meetings with

the project leader and researchers working on trypanosomes. The current practice of man-

aging data was observed. This consists of the data from the project being stored in Excel

spreadsheets. Data was entered into the spreadsheet by manual copy and pasting from mass

spectrometry results and database searches that had been performed to characterise pro-

teins. Protein data in the Excel spreadsheet was related back to the spot on the 2-D gel

from which it arose, using the numerical identifier assigned to the spot by the image analysis

application, which was entered in the corresponding row of the table.

The project leader, Prof. Mike Turner, outlined a set of six questions that could poten-

tially be solved by improvements in software:

1. Can the time and labour to identify proteins be reduced?

2. How many different proteins can be identified from 2500 spots?


3. How widespread and common are post-translational modifications?

4. How can we improve the T. brucei genome annotation?

5. Can we build a “point and click” virtual 2D gel?

6. Can we build pages that give original MS data interpretations?

Figure 7.3: The span of peptides that have been matched within a protein sequence is represented by the shaded section of the block, for a cluster of four spots, explained in Section 7.2.3. Diagram labels: folded protein; protein unfolds during 2-DE; digested into peptides; peptides detected by MS; peptide span of whole sequence.

The issues of genome annotation and data integration were discussed in meetings with the

curators of the T. brucei genome database at the Sanger centre, Cambridge UK (December

2003). The web site providing access to the genome is GeneDB, which is supported by the

GUS database system. One of the main goals of the proteome project is to improve genome

annotation. Once the proteome namespace has been added to GUS, as discussed in Chapter

5, the proteomics data can be stored directly within GeneDB. However, it is important that

data produced from the experiments can be linked up with GeneDB in the near future, prior

to the full deployment of a new version of GUS that supports proteomics. Towards this

goal, a new interface has been developed as part of RAPAD for publishing data, with unique

identifiers that can be linked up with GeneDB, when the proteomics data is made public.


7.2.3 Visualisation

There are many different spots that have been identified as the same protein on the master

gel in this investigation. The database can be queried for a particular protein name, and

the results of the query can be visualised in the Gel Viewer. The Gel Viewer provides a link

from each spot to the record for the mass spectrometry results that were used to identify

the protein. However, there are limited facilities for investigating why so many different

spots arise that appear to match the same protein. Therefore, additional software has been

implemented alongside RAPAD for visualising the peptide sequences that have been matched

by MS data, to investigate why certain proteins appear in multiple positions on a 2-D gel.

A piece of text processing software has been written to extract the peptide sequences from

mass spectrometry results. The full length sequence has also been obtained for each protein,

and linked up to the Gel Viewer to provide a visualisation for every spot, displaying the

proportion of the protein sequence that has been matched: the span of peptide hits (Figure

7.3). Each spot is labelled with a white block representing the entire protein sequence, filled

with a shaded section. The left end of the shaded block represents the position of the first

peptide hit in the protein sequence, and the right end of the shaded block represents the last

position of the last hit to the protein sequence. From this information, it is possible to say

that at least this proportion of the protein sequence was present in the spot, assuming correct

identification from MS data. Peptides may not be detected by MS for several reasons: (i)

during MS/MS only a proportion of the peptides most strongly detected in the first stage

are subjected to the second stage of MS, (ii) ionisation is dependent on various properties of

a peptide, such as its charge and (iii) there is technical variability in the efficiency of peptide

ionisation.
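The span of peptide hits described above can be computed directly from the matched peptide sequences and the full length protein sequence. The following is a minimal sketch under stated assumptions: the function name and the example sequence are invented, and matching each peptide at its first occurrence in the protein is a simplification rather than the method used by the software developed alongside RAPAD.

```python
def peptide_span(protein_seq, peptides):
    """Return (start, end, fraction) of the region spanned by matched peptides.

    start and end are 1-based, inclusive positions in protein_seq; fraction is
    the span length divided by the full sequence length. Peptides that do not
    occur in the sequence are ignored. Returns None if nothing matches.
    """
    starts, ends = [], []
    for pep in peptides:
        pos = protein_seq.find(pep)  # first occurrence only (a simplification)
        if pos >= 0:
            starts.append(pos + 1)
            ends.append(pos + len(pep))
    if not starts:
        return None
    start, end = min(starts), max(ends)
    return start, end, (end - start + 1) / len(protein_seq)


protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # invented 33-residue sequence
span = peptide_span(protein, ["TAYIAK", "SHFSR"])
print(span)
```

The shaded block for a spot would then be drawn from `start` to `end` inside a white block of length `len(protein_seq)`. The fraction is a lower bound on the proportion of the sequence present in the spot, since undetected peptides may lie outside the span.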

The genome database contains several different genes that share the same name. An

additional visualisation has been created to summarise where these different forms of the

same protein arise on the master gel. A different colour is used to shade spots that have

been matched to peptides within a specific protein sequence in the database. In this way,

groups of proteins that have the same name but are in fact different, can easily be visualised

on the gel. This allows researchers to verify that clusters of proteins with the same name

have been identified correctly, because it is expected that proteins located in the same region

of the gel will arise from the same gene.
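The colouring scheme can be illustrated with a short sketch that assigns one colour per gene identifier among proteins sharing a name. The palette, the function name and the spot and gene identifiers below are hypothetical and chosen only for this example.

```python
from itertools import cycle

PALETTE = ["blue", "red", "yellow", "orange", "white"]  # arbitrary colour names


def colour_spots_by_gene(spot_hits):
    """Map each spot ID to a colour chosen per matched gene identifier.

    spot_hits: iterable of (spot_id, gene_id) pairs for proteins sharing a name.
    Genes receive colours in order of first appearance; the palette is reused
    if there are more genes than colours.
    """
    colours = cycle(PALETTE)
    gene_colour = {}
    spot_colour = {}
    for spot_id, gene_id in spot_hits:
        if gene_id not in gene_colour:
            gene_colour[gene_id] = next(colours)
        spot_colour[spot_id] = gene_colour[gene_id]
    return spot_colour


# Invented spot and gene identifiers.
hits = [(101, "geneA"), (102, "geneA"), (250, "geneB"), (251, "geneB"), (300, "geneC")]
colouring = colour_spots_by_gene(hits)
print(colouring)
```

A spot whose colour differs from its neighbours in the same region of the gel would be a candidate for closer inspection of the underlying identification.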


7.3 Results

7.3.1 Investigation into multiple protein forms

The proteomics experiments on T. brucei reveal several proteins that appear in multiple

positions on a single gel (the pH 4-7 gel); examples include Heat Shock Protein 70 (62 spots),

α-tubulin (50 spots), β-tubulin (40 spots), Elongation factors (EF 1-α, EF 1-β, EF 1-γ, EF

2; creating 37 spots in total) and Heat Shock Protein 60 (19 proteins). There are several

reasons why proteins may appear in multiple positions. Firstly, chemical modifications, such

as the gain or loss of phosphate groups on the protein, can cause multiple spots to appear in

a localised region. Secondly, a protein may also be fragmented at some point, either in vivo

or during the experimental procedure, therefore peptides measured by mass spectrometry

may not have arisen from the full protein sequence. Protein spots that arise near the bottom

of the gel, indicating low molecular weight proteins (described on page 7 in Chapter 1),

are more likely to contain only fragments of proteins. Thirdly, it is formally possible that

differential splicing causes different proteins to be produced from the same gene, which still

have peptides that match the protein entry in a sequence database, even if the full length

sequence of the protein is different from the predicted form. However, while differential

splicing in higher eukaryotes seems to be a very common phenomenon [215], it has never

been reported in T. brucei because almost all genes comprise a single exon and therefore

are not spliced at all. It is also possible that the proteins which seem to appear in multiple

copies are false positives, arising because the sequences have some characteristic that causes

many incorrect database matches.

Tubulin proteins

α and β-tubulin produce many spots on the 2-D gels for T. brucei, which could be the

result of protein modifications. α and β-tubulin form a heterodimer and are one of the main

components of microtubules that form a layer around the cytoplasm, just beneath the outer

cell membrane [140]. A study by Lubega and colleagues demonstrated that mice can be

immunised against African trypanosomosis by injection with tubulin proteins, raising the

possibility that tubulins could form part of a successful vaccine [197].

It has previously been demonstrated that post-translational modifications (PTMs) of

tubulin are associated with the construction of the cytoskeleton and fall into two categories:

general protein modifications, such as phosphorylation or acetylation, and tubulin-specific,


such as tyrosination. The acetylation of tubulin has previously been identified by 2-DE,

therefore many of the spots observed in this study are likely to correspond to differentially

modified forms of the protein (original experiments are reviewed by Gull [140]).

Figure 7.4: Protein spots matched to β-tubulin, overlaid with a graphic displaying the span of peptide hits (shaded block) as a proportion of the full length sequence (white block). The boxed regions, numbered 1 to 4, are discussed further in the text. Gel image courtesy of A. Faldas.

β-tubulin

The results from the peptide alignment analysis with β-tubulin are displayed in Figure 7.4.

The main cluster of proteins (1 on Figure 7.4) towards the top of the gel is in the position

that would be predicted by the molecular weight of β-tubulin (50 kDa). It is likely that there

are several different types of chemical modifications that occur to β-tubulin, causing the 16

different spots to appear in this region. The spots at the bottom left of the gel (4) have fairly


short spans of peptide hits (less than 10% of the full sequence), therefore are more likely to

be caused by peptide fragments. In the bottom middle range of the gel there are two spots

(3) both with peptides matching a range in the middle of the protein sequence, indicating

these two are caused by two similar protein fragments, possibly with a single modification

causing a localised shift in position.

There is a cluster of several spots in the middle/left of the image (2 on Figure 7.4),

which appear to have very long peptide spans (up to 80%). This result is surprising because

it would not be expected that the full length protein sequence for tubulin would migrate this

far into the gel. Therefore, it is theoretically possible that this protein arises from differential

splicing of gene products to produce a protein that has peptide sequences from the two ends

of the original sequence. It is also possible that the spots contain a different protein that

has peptides that closely match parts of the β-tubulin sequence. However, a BLAST [11]

search of GeneDB with the peptides from these regions reveals that there are no similar

sequences except the other tubulin proteins (BLAST results not shown). The MS data for

the close groupings of three spots (spot ID 677, 664 and 641) have very high MASCOT scores,

indicating that the matches are probably correct, with strong hits to peptides near the start

of the sequence, and other matches to peptides near the end of the protein sequence. GeneDB

contains a cluster of identical genes on chromosome 1, annotated as β-tubulin, although the

exact number of genes is not known because it varies in different cell lines. It is also very

difficult to assemble regions of the genome that contain repetitive identical sequences. There

are no gene sequences deposited in GeneDB that could explain the long span of peptide hits

of this spot cluster.
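The reasoning applied to the numbered regions of Figure 7.4 can be summarised as a simple rule of thumb combining the apparent mass of a spot with its span of peptide hits. The sketch below is an illustration only: the 10% short-span cut-off is taken from the text, while the mass tolerance, the long-span cut-off, the category labels and the example masses are assumptions.

```python
def classify_spot(predicted_kda, observed_kda, span_fraction,
                  mass_tolerance=0.2, short_span=0.10, long_span=0.50):
    """Label a spot from its apparent mass and its span of peptide hits.

    predicted_kda -- mass predicted from the database sequence
    observed_kda  -- apparent mass estimated from the gel position
    span_fraction -- span of peptide hits as a fraction of the full sequence
    """
    if abs(observed_kda - predicted_kda) <= mass_tolerance * predicted_kda:
        return "full-length candidate"
    if observed_kda < predicted_kda:
        if span_fraction < short_span:
            return "likely fragment"
        if span_fraction >= long_span:
            # A long span is not expected for a spot that migrated this far.
            return "anomalous: long span at low mass"
        return "possible fragment"
    return "heavier than predicted"


labels = [
    classify_spot(50, 49, 0.60),  # near the predicted position
    classify_spot(50, 15, 0.08),  # low mass, short span
    classify_spot(50, 20, 0.80),  # low mass, long span
]
print(labels)
```

Spots in the anomalous category are the ones that would merit the follow-up checks described above, such as a BLAST search of the matched peptides.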

A further observation on Figure 7.4 is that the peptides matched tend to cluster at the

N-terminus (left end) and there are no peptides matched to the C-terminus (right end) of

protein sequences. This raises the possibility that there is cleavage of a peptide at the C-

terminus. Alternatively, it is possible that there are modifications that prevent peptides

being ionised in a mass spectrometer. In particular, it is known that the C-terminus of

β-tubulin is extensively glutamylated, which is the addition of up to 20 extra glutamate

residues to a defined glutamate near the C-terminus [283]. This may prevent the peptides

at the C-terminus from being detected by MS.
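The observation about missing C-terminal peptides can be checked systematically across spots once the peptide spans are known. The sketch below assumes spans are held as 1-based (start, end) pairs per spot; the function name, the tail length of 20 residues, the protein length and the spot data are all hypothetical.

```python
def spots_missing_c_terminus(protein_length, spot_spans, tail=20):
    """Return IDs of spots whose peptide span stops short of the last `tail` residues.

    spot_spans maps a spot ID to the (start, end) of its span of peptide hits,
    using 1-based inclusive positions in the protein sequence.
    """
    cutoff = protein_length - tail
    return sorted(spot for spot, (_, end) in spot_spans.items() if end <= cutoff)


# Invented spans for a hypothetical 442-residue protein.
spans = {1: (5, 400), 2: (8, 431), 3: (3, 380)}
missing = spots_missing_c_terminus(442, spans)
print(missing)
```

The same function applied to the start positions, with the comparison reversed, would give the equivalent check for the N-terminus.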


Figure 7.5: Protein spots matched to α-tubulin, overlaid with a graphic displaying the span of peptide hits. There is a correlation between the span of peptide hits and the position of a spot on the gel. The regions discussed in the text are numbered 1 to 5. Gel image courtesy of A. Faldas.

α-tubulin

Figure 7.5 displays the peptide spans for α-tubulin. The cluster of six spots towards the top

of the gel (1) is in the position that would be expected for a protein with the molecular

mass of α-tubulin (50 kDa) and therefore probably contains the full length sequence. There

is a cluster of spots presumably caused by various small modifications to the protein, which

account for the localised shifts in positions. The genome contains a cluster of identical α-

tubulin sequences on chromosome 1, therefore the different spot positions are not due to

differences in gene sequence.

At the bottom of the gel there are a large number of possible fragments, and there appears

to be a fairly strong correlation between spots located in the same region and the span of

peptide hits (see for example 2, 3 and 4 on Figure 7.5). This would suggest that a fragment is


being produced reproducibly with one or two different modifications on the peptides present

in the fragment. The volume of spots in the small molecular weight range also appears to

be reproducible across replicate gels by manual inspection. However, it is not possible to

investigate the peptide spans of all spots from replicate gels by MS due to the cost involved.

It remains to be investigated if these fragments have any biological significance or if they

are experimental artifacts. The correlation between peptide span and spot position may be

related to protein modifications. Modification status affects the ability of a peptide to be

ionised, therefore peptides that have the same set of modifications should have the same

probability of being detected by mass spectrometry. Proteins located in similar regions are

likely to contain many peptides that have been modified in the same way, and these peptides

will share the same likelihood of being detected by mass spectrometry.

There are two spots towards the bottom left that have very long spans of matched

peptides (5). This is similar to the results for β-tubulin, and it is unlikely that a full length

protein could migrate this distance in the gel, therefore these may be the result of differential

splicing. An alternative, although unlikely, possibility is that tubulin fragments from the two

ends of the protein have independently co-migrated and appear as a single spot. It is also

possible that the protein fragmented but the 3-D structure did not completely disassociate

as expected, leaving different parts of the protein bound together, with a small overall mass.

The spots (IDs 741 and 734) both have strong hits to the α-tubulin protein record, matching

peptides near the beginning and end of the protein sequence. Additional experiments could

be performed to further characterise this protein spot, for example performing tandem mass

spectrometry on as many peptides as possible to determine what parts of the protein are

present in the spot.

The same observation about the lack of peptides matched at the C-terminus can be made

for α-tubulin, as well as β-tubulin. This may be due to glutamylation, which has also been

reported for α-tubulin [84], or tyrosination of C-terminal peptides [289]. Modifications of

these kinds are thought to be common on α-tubulin, and may prevent peptides becoming

ionised during MS. The peptide spans on Figure 7.5 also demonstrate that there are no

N-terminal peptides that have been matched. This raises the possibility that PTMs also

occur on N-terminal peptides, which as far as we are aware has not been previously reported.

This demonstrates that the peptide visualisation software has the capacity for hypothesis

generation, which can be confirmed by further experimentation.


Figure 7.6: Protein spots matched to five different Elongation Factors. EF-α (blue); EF-β (red); EF-2 (yellow); EF-γ (orange and boxed); EF (putative) (white). Gel image courtesy of A. Faldas.

Elongation factor proteins

The peptide alignment analysis has also been performed to classify Elongation Factor (EF)

protein spots. Elongation factors function during protein translation, for example controlling

the addition of new amino acids onto a growing peptide chain. It has been suggested that T.

brucei protein abundance is controlled at the level of translation rather than transcription,

therefore any insights into EF proteins could prove important in understanding regulation.

There are at least five different elongation factor genes, with many spots appearing on the

2-D gel (Figure 7.6). Functional annotation for these genes in T. brucei is still at an early

stage, therefore any information from proteomics that can aid annotation will be useful.

An analysis was carried out to determine the peptide spans of EF 1-α, EF 1-β, EF-2, a

sequence annotated in the database solely as EF (putative), and EF-γ (one protein spot -

peptide alignment not shown), to test whether spots have been correctly identified on gels,

and to determine whether sequences have been correctly predicted in the genome database.


Figure 7.7: Protein spots matched to Elongation factor 1-α. Gel image courtesy of A. Faldas.

Elongation factor 1-α

The graphic for Elongation factor 1-α (Figure 7.7) displays a large cluster of spots to the

right of the gel, likely to be caused by multiple differentially modified forms of the proteins.

The post-translational modification of EF 1-α is a common phenomenon in other organisms,

such as plants [265], but as far as we are aware, it has not been investigated in detail for

trypanosomes. The evidence presented here suggests that PTMs to EF 1-α from T. brucei

are also very common. The spots towards the bottom of the gel are likely to be protein

fragments, shown by the very short spans of peptides (less than 5% of the sequence length).

Elongation factor 1-β and EF (putative)

The left gel in Figure 7.8 displays Elongation factor 1-β (EF-β) protein spots. There are

three spots in the middle of the gel, which are likely to result from different modifications,

such as different phosphorylations to the protein. A single spot towards the bottom of the

gel is probably a fragment of the full length sequence. The right image on Figure 7.8 displays

the spots matched to EF (putative). There is probably one match to the full protein, in the

centre of the gel, and two possible fragments at the bottom of the gel. A multiple alignment

has been performed, using ClustalW [318], of the sequences of EF-β and EF (putative) from


T. brucei and from T. cruzi (in the lower part of Figure 7.8). There is a very high degree of

sequence similarity between EF-β and EF (putative), with long stretches of identical residues.

In T. brucei the sequences lie on chromosome 4 and chromosome 10, therefore there is a low

chance that this is an annotation error and they are in fact the same sequence. However, it

is known that contamination has been detected in sequences derived from the chromosome

10 project, and therefore it is not possible to say definitively that the two sequences arise

from different genes.

Gel panels: EF-beta (left); EF (putative) (right)

91.m00148                       MCDHMYSPVFIPFAFFSIVKCHNKCSFVCNRSGKDMSIKDVNVKSGKLEE 50
gi|461992|sp|P34827|EF1B_TRYCR  -----------------------------------MSVKDVNKRSGELEG 15
TRYP_x-70a06.p2kb545_154        ----------------------------------MSSLKEIN-------G 9
gi|310944|gb|AAA30183.1|        ---------------------------------NSARVKDAMTTLKELNG 17
                                                                     :*:

91.m00148                       KLKGKLFLGGVKPSEEDVKAFNDLLGGDNTNVFRWVKNIASFTEAERTAW 100
gi|461992|sp|P34827|EF1B_TRYCR  KLKGKLFLGGTKPSKEDVKLFNDLLGAENTSLYLWVKHMTSFTEAERKAW 65
TRYP_x-70a06.p2kb545_154        RLSAQPYVSGFTPSKEDARIFSEMFG-SNTAVIQWAARMAAYYQAER--- 55
gi|310944|gb|AAA30183.1|        RLSSQPYVSGYCPAR-KTRRYSLRCS---ARLALWLSGPHVWLRTIKR-- 61
                                :*..: ::.*  *:....: :.   . ..: :  *      : .: :

91.m00148                       GAPVKITPPVAAPVAAPAAAPAAAPAATPARKAAEADDDDIDLFGETTEE 150
gi|461992|sp|P34827|EF1B_TRYCR  GAPVKVTATTSA--SAPAKQAPKKAASAPAKQADE--DEEIDLFGEATEE 111
TRYP_x-70a06.p2kb545_154        ---VQLTK---------GATASKTSATTKAAAGDD---DDIDLFGEATEE 90
gi|310944|gb|AAA30183.1|        --TEQILK-------------GTASSSKKAAAAED---EDIDLFGEATEE 93
                                    ::           .  .   .::  *  . :   ::******:***

91.m00148                       ELAALEAKKKKDAAAKSTKKVIIAKSSILFDIKPWDDTVDLQKLATELHA 200
gi|461992|sp|P34827|EF1B_TRYCR  ETAALEAKKKKDTDAKKAKKEVIAKSSILFDVKPWDDTVDLQALANKLHA 161
TRYP_x-70a06.p2kb545_154        ELAALEAKKKKDAAAKSSKKVIIAKSSILFDIKPWDDTVDLDGLAQKLHA 140
gi|310944|gb|AAA30183.1|        ETAALEAKKKKDADAKKAKKEVIAKSSILFDVKPWDDTVDLQALADKLHA 143
                                * **********: **.:** :*********:*********: ** :***

91.m00148                       IKRDGLLWGDHKLVPIAFGVKKLQQLVVIEDDKVSGDDLEEMIMSFGDAV 250
gi|461992|sp|P34827|EF1B_TRYCR  VKRDGLLWGDHKLVPVAFGVKKLQQLIVIEDDKVLSDDLEELIMSFEDEV 211
TRYP_x-70a06.p2kb545_154        IKRDGLLWGDHKLVPIAFGVKKLQQLVVIEDDKVSGDDLEEMIMSFGDDV 190
gi|310944|gb|AAA30183.1|        VKRDGLLWGDHKLVPVAFGVKKLQQLIVIEDDKVSSDDLEELIMSFEDEV 193
                                :**************:**********:******* .*****:**** * *

91.m00148                       QSMDIVAWNKI 261
gi|461992|sp|P34827|EF1B_TRYCR  QSMDIVAWNKI 222
TRYP_x-70a06.p2kb545_154        QSMDIVAWNKI 201
gi|310944|gb|AAA30183.1|        QSMDIVAWNKI 204
                                ***********

Key

>91.m00148 |||25 kDa elongation factor 1-beta, putative|t_brucei|chr_4|RPCI93|26G5|91
>gi|461992|sp|P34827|EF1B_TRYCR 25 KD ELONGATION FACTOR 1-BETA (EF-1-BETA) (T. cruzi)
>TRYP_x-70a06.p2kb545_154 |||elongation factor, putative|Trypanosoma brucei||chr 10|||Manual
>gi|310944|gb|AAA30183.1| elongation factor (T. cruzi)

Figure 7.8: Protein spots matched to EF-β and EF (putative) are displayed with the corresponding span of peptide hits. The boxed regions mark a spot that contains peptides that match both EF-β and EF (putative). A multiple alignment is also displayed of EF-β from T. brucei and T. cruzi, with EF (putative) from T. brucei and EF T. cruzi. The boxed region of the alignment shows that the starting codon of EF-β from T. brucei may have been wrongly predicted. Gel images courtesy of A. Faldas.

The alignment shows that the N-terminus of EF-β may have been incorrectly predicted

because the first 30 or 40 residues align poorly, and there is a region 37 residues downstream,

which matches the start of the other EF sequences. It is also worth noting that the first

residue of the T. cruzi EF sequence is not a methionine and may also have been incorrectly

predicted. There is a methionine nine residues downstream that aligns very well with the

start codon of EF (putative) from T. brucei, which is more likely to be the correct start

position.

The alignment of peptide sequences against proteins reveals a single spot that contains a

peptide that exactly matches the protein sequence of both EF-β and EF (putative), towards

the bottom left corner of the gel (boxed in Figure 7.8). This finding, and the high sequence

similarity on the multiple alignment, demonstrates that mass spectrometry results for EF

(putative) and EF-β cannot always conclusively distinguish between these two proteins.

However, the spots in the middle of the gel have long peptide spans that cover the N-terminus of

the protein sequence, which is more divergent than the C-terminus of the sequence between

EF-β and EF (putative). Therefore, these spots are likely to have been correctly identified.
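The ambiguity described above arises whenever a matched peptide occurs in more than one database sequence, which can be tested by exact substring search. In the sketch below the two sequences are contiguous excerpts copied from the alignment in Figure 7.8, used as toy data; the function names are hypothetical, and the two query peptides were chosen for illustration and are not claimed to be the peptides actually observed in the boxed spot.

```python
# Partial sequences: contiguous excerpts from the Figure 7.8 alignment.
PROTEINS = {
    "EF-beta": "ELAALEAKKKKDAAAKSTKKVIIAKSSILFDIKPWDDTVDLQKLATELHAIKRDGLLWGDHK",
    "EF (putative)": "ELAALEAKKKKDAAAKSSKKVIIAKSSILFDIKPWDDTVDLDGLAQKLHAIKRDGLLWGDHK",
}


def proteins_containing(peptide, proteins):
    """Return the sorted names of all proteins whose sequence contains the peptide exactly."""
    return sorted(name for name, seq in proteins.items() if peptide in seq)


def is_ambiguous(peptide, proteins):
    """A peptide is ambiguous if it occurs in more than one protein sequence."""
    return len(proteins_containing(peptide, proteins)) > 1


shared = proteins_containing("DGLLWGDHK", PROTEINS)   # present in both excerpts
unique = proteins_containing("LATELHAIK", PROTEINS)   # present in one excerpt only
print(shared, unique)
```

A spot identified only through ambiguous peptides cannot be assigned to a single gene, whereas one peptide that is unique to a sequence is enough to separate the two candidates.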

Elongation factor 2

The image in Figure 7.9 displays the peptide spans of proteins matched to EF-2. There are

eight spots near the top of the gel which are probably differentially modified forms of the

complete protein, and the spots at the bottom of the gel are likely to be protein fragments.

There is no T. brucei EF-2 sequence deposited in GenBank as of May 2004, but there is an

EF-2 gene in GeneDB. The closest match in GenBank is Elongation Factor 2 from T. cruzi.

A sequence alignment reveals that Elongation Factor 2 is almost identical between T. brucei

and T. cruzi, indicating that the sequence has been correctly named. The last part of the

alignment is displayed in the lower part of Figure 7.9, and it appears that the end point of

the T. brucei sequence may have been incorrectly predicted.


TRYP_x-70a06.p2kb545_355    AIHRGGGQIIPTARRVFYACCLTATPRLMEPMFQVDIQTVEHAMGGIYGV 750
gi|1800107|dbj|BAA09433.1|  AIHRGGGQIIPTARRVFYACCLTAAPRLMEPMFQVDIQTVEHAMGGIYGV 721
                            ************************:*************************

TRYP_x-70a06.p2kb545_355    LTRRRGVIIGEENRPGTPIYNVRAYLPVAESFGFTADLRAGTGGQAFPQC 800
gi|1800107|dbj|BAA09433.1|  LTRRRGVIIGEENRPGTPIYNVRAYLPVAESFGFTADLRAGTGGQAFPQC 771
                            **************************************************

TRYP_x-70a06.p2kb545_355    VFDHWQQYPGDPLDPKSQANTLVLSIRQRKGLKPDIPGLDTFLDKL 846
gi|1800107|dbj|BAA09433.1|  VFDHW----------------------------------------- 776
                            *****

TRYP_x-70a06.p2kb545_355 (T. brucei); gi|1800107|dbj|BAA09433.1| (T. cruzi)

Figure 7.9: The span of peptide hits for protein spots matched to Elongation Factor 2. The alignment shows the 150 residues at the C-terminus of the EF-2 sequences from T. brucei and T. cruzi. The boxed region shows that the end point of one of the sequences may not have been predicted correctly, given the overall similarity between the two sequences is so high. Gel image courtesy of A. Faldas.


Elongation factor γ

There is a single protein spot matched to EF-γ, near the bottom of the gel (orange and

boxed on Figure 7.6). A BLAST search of the EF-γ gene sequence hits only other EF-γ

sequences, and not the other EF genes, therefore this match is probably correct. However,

the spot is positioned near the base of the gel, indicating that this may only be a protein

fragment, therefore it is not definitive that the full protein of EF-γ is present on the gel.

Summary of elongation factor results

In summary, the results demonstrate that there are at least five genes encoding elongation

factors in T. brucei and many different protein spots appear on the 2-D gel, raising the

possibility that protein modifications are common. Modifications could regulate the activity

of elongation factor proteins, to achieve control over the translation of proteins. This is an

interesting area for further research because T. brucei does not modulate the rate of tran-

scriptional initiation, and it is likely that control over protein expression occurs downstream,

perhaps by regulating the rate of translation.

Heat shock proteins

The heat shock proteins (Hsp) are conserved across virtually all organisms, and are often

expressed in response to environmental stress. It has been shown that Hsps are up-regulated

when the temperature of the parasite’s environment is rapidly increased, for example during

transfer from the tsetse fly (25 °C) to the mammalian host (37 °C). At this time there are

extensive changes in morphology and metabolism of the parasite as it switches from the

procyclic form to the bloodstream form. It is thought that the expression of Hsp genes at

this time is crucial. It has been demonstrated that post-transcriptional control is exerted

to regulate the expression of Hsp70, and this control may be exerted at the level of mRNA

stability [193]. The proteome map of T. brucei suggests that many different protein forms

exist due to the large number of distinct spots that have been matched to Hsp70, therefore

post-translational modifications may also be common.

The current level of annotation for T. brucei heat shock proteins is fairly poor, and

many spots on a single gel match Hsp70, although it is possible that in fact there are several

closely related genes, rather than the 62 distinct protein spots arising from one gene. An

analysis was carried out to identify how many distinct genes coded for the 62 protein spots

observed. Five distinct protein sequences were obtained that had been matched by mass


Figure 7.10: A multiple alignment of five Hsp70 protein sequences from T. brucei: a) = TRYPtp2h24gd03.q1k 1, b) = TRYPtp30n4hh05.p1k 3, c) = TRYP xi-1015g04.q1k 13, d) = 125.m00218, e) = 92.m00252.

spectrometry data, all predicted to be Hsp70 by BLAST searches. A multiple alignment of

the five sequences has been performed (Figure 7.10). All five sequences are highly related

but no two appear similar enough to have arisen from an incorrect prediction of a single

gene, therefore there appear to be at least five distinct Hsp70 genes that exist in T. brucei.

The first sequence in the alignment is significantly shorter than the other four, possibly

indicating that the start of this gene has been incorrectly predicted, or it is a pseudogene.

The similarity between all sequences raises the possibility that mass spectrometry matches

to these proteins could be incorrect, however there are few long stretches in any sequence

that are identical to a different sequence, therefore it is likely that most peptide matches

will be made correctly. A study by Lee in 1998 suggested that there is an Hsp70 locus in

T. brucei containing 6 identical genes [193]. A search of GeneDB (May 2004) for the text

queries “*heat shock protein*” and “*hsp*” finds seven proteins that are predicted to be

Hsp70. Four of these are clustered on chromosome 11, which may be the locus reported by

Lee; one is on chromosome 9 and two are on chromosome 7.

A multiple alignment has been performed of the sequences retrieved from the current

release of GeneDB against those from the MS analysis, which come from GenBank and older


Figure 7.11: Protein spots matched to five different Hsp70 protein sequences. 125.m00218 = blue; 92.m00252 = red; TRYP xi-1015g04.q1k 13 = yellow; TRYPtp30n4hh05.p1k 3 = white; all spots marked as cyan contain peptides that hit both TRYPtp2h24gd03.q1k 1 (cyan) and TRYPtp30n4hh05.p1k 3 (white). Gel image courtesy of A. Faldas.

downloads of GeneDB that were used for the original MS analysis over the last year. The

alignment is displayed at the end of the chapter. Ten out of the twelve sequences appear

to be distinct, and two of the sequences from the MS analysis are identical to sequences

on chromosome 11. It is possible that these are the same genes however it is not possible

to verify, as there is no correspondence between different versions of sequence identifiers in

GeneDB. Two sequences in GeneDB, Tb09.160.3090 and Tb07.29K4.60, are annotated as

Hsp70 and contain several motifs that are highly similar to other Hsp70 sequences. Over

the full length, however, they are more divergent and are 25% longer than the other Hsp70

sequences. They would therefore be predicted to have a higher molecular weight and may in

fact be a closer match to a different heat shock protein.

Figure 7.11 displays which protein spot matches which sequence in the genome database.

There are distinct clusters of spots that match the sequence 92.m00252 (red) and 125.m00218

(blue). Only one spot matches TRYP xi-1015g04.q1k 13 (yellow), at the bottom of the

gel, so this may only be a protein fragment. There is a cluster of spots predicted

to match TRYPtp30n4hh05.p1k 3 (white) and TRYPtp2h24gd03.q1k 1 (cyan), however all


those coloured cyan are matched to peptides that also exactly match TRYPtp30n4hh05.p1k 3

(white) therefore it is not possible to say from this analysis which is the correct protein

identification. It is possible that the MS results have incorrectly predicted the identity of

the proteins coloured cyan or white. It is not possible to say definitively that the protein

TRYPtp2h24gd03.q1k 1 (cyan), which has a very short sequence, is expressed in this sample,

and it may be a pseudogene.

7.3.2 Using data in RAPAD to improve genome annotation

An interface has been developed which allows external databases to link to protein records

in RAPAD. Unique ID numbers have been assigned to proteins that identify the database

version (v. 1) so that in future database versions, a link can be provided to the most recent

records. A record displays the protein name, has a link to the corresponding gel with the

spot highlighted, and provides evidence about the quality of the match to MS data. When

the data is released to the public, the web page for each protein can be referenced from other

databases. Alternatively, a more robust approach would be for other databases to store the

unique ID number that has been assigned to each protein, and maintain a single URL to

where the current implementation of the database is located. This feature will be used by

the genome database, when the existence of a protein has been verified by the proteome

map, as discussed in Chapter 5. The interface that allows public access to T. brucei data in

RAPAD is displayed in Figure 7.12.

Hypothetical proteins

An analysis has been performed to find the number of distinct proteins stored in RAPAD

which are named as a “hypothetical protein”. A simple search of RAPAD for the word

hypothetical in the protein name reveals 100 matching entries that arise from 47 distinct

spots on the master gel. It is therefore likely that the actual number of proteins that are

annotated as hypothetical on the master gel, is somewhere between 47 and 100 because it is

possible that there is more than one distinct protein annotated as hypothetical in a single

spot. However, given that many sequences have not been manually curated, the genome

database may contain a large number of open reading frames that have been incorrectly

predicted, and the sequence may have been hit by chance. A further database search reveals

that 24 out of the 100 proteins are matched with a sequence coverage of less than 5%, therefore

these may not be true matches.
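The three counts used in this analysis (matching entries, distinct spots, and weakly supported matches) can be sketched as a single filter over identification records. The code below is illustrative only: the Identification record, its field names and the example values are assumptions, not the RAPAD schema or data; only the keyword and the 5% coverage threshold are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identification:
    """One protein identification attached to a gel spot (hypothetical layout)."""
    spot_id: int
    protein_name: str
    coverage_percent: float


def summarise(identifications, keyword="hypothetical", min_coverage=5.0):
    """Count entries whose name contains the keyword, the distinct spots they
    come from, and how many fall below the sequence coverage threshold."""
    hits = [i for i in identifications if keyword in i.protein_name.lower()]
    return {
        "entries": len(hits),
        "distinct_spots": len({i.spot_id for i in hits}),
        "low_coverage": sum(1 for i in hits if i.coverage_percent < min_coverage),
    }


# Made-up records, for illustration only.
EXAMPLE = [
    Identification(1, "Conserved hypothetical protein", 12.0),
    Identification(1, "Hypothetical protein, conserved", 3.5),
    Identification(2, "Hypothetical protein", 4.0),
    Identification(2, "Hsp70", 40.0),
    Identification(3, "alpha-tubulin", 35.0),
]

if __name__ == "__main__":
    print(summarise(EXAMPLE))
```

In this toy data set there are three matching entries from two spots, so the number of distinct hypothetical proteins lies between two and three, mirroring the reasoning applied to the 100 entries and 47 spots on the master gel.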


Figure 7.12: The interface for publishing T. brucei proteome data. The initial page displays images of gels that are stored and the number of identified proteins on each gel. A list of proteins can be generated and individual records can be displayed.


Figure 7.13: A search using the Gel Viewer reveals 100 proteins, annotated as “hypothetical”. Gel image courtesy of A. Faldas.

The protein sequences hit by MS data were obtained for 88 out of the 100 sequences. The

other 12 sequences could not be obtained because the ID numbers that are stored in RAPAD

have changed in GeneDB, and there is no link to the current record. This is a major problem

for researchers working with genome sequences before they are complete because temporary

ID numbers are assigned to proteins that are later deleted, and databases often do not

maintain an archive of previous identifiers. For the spots that do not link to a current

GeneDB record, the MS searches must be repeated, which is time consuming. There will

be an option in the next release of RAPAD to perform repeated MS searches automatically.

Of the 88 sequences, there were 57 distinct protein sequences that have been matched. A

piece of software was written that matches the peptide sequences hit by mass spectrometry

results to the protein sequences, to determine which spots on the gel matched which protein

sequence. It was discovered that there are ten proteins that have been matched to more than

one spot, in total matched to 33 spots. The diagram in Figure 7.13 displays all the spots

that have been annotated as matching a protein whose name contains the word hypothetical.

Many of the proteins lie at the bottom of the gel, indicating possible protein fragments that

may have been matched to short protein sequences in the database. Several of the database

sequences annotated as hypothetical are very short, of which the shortest contains only 39


Figure 7.14: The protein spots that have been matched to different hypothetical proteins. The spots with the same colour label have been matched to the same database sequence. The three boxed regions (1, 2 and 3) are discussed further in the text. Gel image courtesy of A. Faldas.

amino acids, and is very unlikely to be a correctly predicted protein.
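The matching of peptide sequences back to protein sequences, described above, can be sketched as follows. This is not the software written for this work: the function names, the data layout and the toy sequences are assumptions made for illustration. The sketch locates each peptide by exact substring match, reports the span of hits, and groups spots by the protein sequence they match.

```python
from collections import defaultdict


def peptide_span(protein_seq, peptides):
    """Return (start, end) of the region covered by matched peptides, using
    1-based inclusive residue positions, or None if no peptide is found."""
    starts, ends = [], []
    for peptide in peptides:
        position = protein_seq.find(peptide)  # first exact occurrence only
        if position >= 0:
            starts.append(position + 1)
            ends.append(position + len(peptide))
    if not starts:
        return None
    return min(starts), max(ends)


def spots_per_protein(spot_peptides, proteins):
    """Map each protein ID to the sorted spot IDs with at least one peptide
    that occurs in that protein sequence."""
    matches = defaultdict(set)
    for spot_id, peptides in spot_peptides.items():
        for protein_id, sequence in proteins.items():
            if peptide_span(sequence, peptides) is not None:
                matches[protein_id].add(spot_id)
    return {pid: sorted(spots) for pid, spots in matches.items()}


# Toy data, for illustration only.
PROTEINS = {"protA": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "protB": "MGDVEKGKKIFVQK"}
SPOT_PEPTIDES = {10: ["AYIAK", "SHFSR"], 11: ["QLEER"], 12: ["KIFVQK"]}

if __name__ == "__main__":
    print(peptide_span(PROTEINS["protA"], SPOT_PEPTIDES[10]))
    print(spots_per_protein(SPOT_PEPTIDES, PROTEINS))
```

Proteins that map to more than one spot are then simply the entries of the returned dictionary whose list has length greater than one. A fuller implementation would need to handle peptides that occur more than once in a sequence, which this sketch ignores.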

There are ten hypothetical proteins that have been matched to more than one gel spot.

Groups of spots matched to the same protein are displayed in a particular colour on Figure

7.14. Three pairs of spots that have been matched to three different hypothetical proteins

have been highlighted for further study because they reside in the middle of the gel, therefore

are unlikely to be protein fragments. Furthermore, two spots matched to one protein, located

next to each other, are unlikely to be incorrect matches because the probability of two

adjacent spots independently matching the same sequence is low. However, it is still possible

that an incorrect match could be made to a short “hypothetical” protein sequence in the

database if there were two spots containing the same protein that had a peptide that matched

the hypothetical protein by chance.

Spot group 1

The spots marked 1 in Figure 7.14 (Spot IDs 313 and 275) are both fairly strongly matched

to a 438 amino acid protein, annotated as “Conserved hypothetical protein”. The left spot

contains only this protein, the right spot is predicted to match five different proteins: ATPase


beta subunit; Lipophosphoglycan biosynthetic protein; α-tubulin; Conserved hypothetical

protein; and Hsp83-1. A BLAST search of the hypothetical sequence against the non-

redundant (NR) database at GenBank reveals a top hit matching the following entry, with

an e-value of 0.15 (not highly significant):

NP_937883 1392 aa DEFINITION restin isoform b; cytoplasmic linker 1;

Reed-Steinberg cell-expressed intermediate filament-associated protein

[Homo sapiens].

The finding that the protein does not strongly match any annotated entries in the genome

databases for other organisms indicates that the protein may be specific to T. brucei and

its close relatives. It may therefore be a good candidate for further functional analysis to

determine if it is essential for the life cycle of trypanosomes.

Spot group 2

The spots marked 2 are both matched to several different proteins. The spot on the left

(ID 330) matches: Hsp81-2; S-adenosylmethionine synthetase; Hypothetical protein,

conserved; Hypothetical protein, conserved (possible RNA binding protein); and β-tubulin. The

right spot (ID 323) is annotated as: Elongation factor 2; Conserved hypothetical protein;

Hsp81-2; α-tubulin; Hsp70; and S-adenosylmethionine synthetase. Both spots are matched to

S-adenosylmethionine synthetase and the hypothetical protein, therefore it is possible that

the hypothetical protein is highly similar to S-adenosylmethionine synthetase and has been

matched by chance. However, the match has been made based on tandem MS data, using

peptide sequence information and manual inspection of the results demonstrates that the

two matches are to peptides of different sequences.

A BLAST search of the sequence for the conserved hypothetical protein matches several

sequences. The strongest match is to an ATPase from a blue-green alga, with an e-value of

1e−88, which is highly significant. The mass spectrometry data for the spots matches three

and four peptides respectively to the database sequence, therefore there is a good chance

that the match was made correctly. The evidence suggests that this “Conserved hypothetical

protein” is an ATPase.

LOCUS ZP_00327657 609 aa linear BCT 17-JUN-2004

DEFINITION COG3044: Predicted ATPase of the ABC class

[Trichodesmium erythraeum IMS101].


Spot group 3

The two spots marked 3 in Figure 7.14 (IDs 543 and 548) are annotated as a hypothetical

protein on chromosome 4, and have been matched at a 5% sequence coverage. This value

is fairly low, but given that two proximally located spots have been matched to the same

sequence independently, it is reasonable to assume that a correct match has been made. A

BLAST search of the sequence matches several proteins from other organisms at a fairly

high degree of significance, however all are annotated only as hypothetical proteins. The

degree of similarity (e-value 5e−05) to the top matching sequence (GenBank ID NP_522045)

indicates that the sequence is likely to be a real protein, but more work is required to assign

a function. There are also six other spots that have been matched to this protein, located at

the base of the gel. These spots probably contain fragments of the protein but this indicates

that the protein is fairly abundantly expressed.

7.3.3 Search for post-translational modifications

The method of searching MS data for possible modifications to peptides, which was detailed

in the last chapter, was repeated for the T. brucei data. A manual search was performed to

investigate if peptides have an altered mass resulting from phosphorylation, deamidation or

acetylation. Several clusters of spots that appeared to result from PTMs were investigated

but this method of searching for modifications has several major limitations and therefore a

large scale approach has not been undertaken at the present time. The results from two of

the searches are presented below.

Arginine kinase

There is a cluster of four spots in the middle of the gel which have been matched to arginine

kinase (Figure 7.15). Arginine kinase is thought to be important in protozoans because

it is up-regulated in response to cell stress, fulfilling the same role as creatine kinase in

multicellular eukaryotes, and is a possible target for chemotherapy [244].

The spots marked 575 and 554 are predicted to have undergone deamidation. Deamida-

tion is the conversion of a glutamine residue to glutamate, or asparagine to aspartate, and

it is known to occur during the degradation of proteins [282]. The oxidation positions are

caused by the experimental process and are not indicative of a protein’s status in vivo. A

phosphorylation and a deamidation have been detected on the same peptide for spot 535; however,

the e-value is high (expect = 31), so the match may represent an artifact.
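The mass shifts involved can be checked with a short calculation. The sketch below uses standard monoisotopic residue masses and modification mass differences (deamidation +0.98402 Da, oxidation +15.99491 Da, phosphorylation +79.96633 Da). The function names are assumptions made for illustration, and the code is not part of MASCOT or RAPAD.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free termini
PROTON = 1.00728   # removed per charge when converting m/z to neutral mass

# Mass differences of the modifications discussed in the text.
MODIFICATION_DELTA = {
    "Deamidation": 0.98402,
    "Oxidation": 15.99491,
    "Phospho": 79.96633,
}


def peptide_mass(sequence, modifications=()):
    """Calculated neutral monoisotopic mass of a peptide, Mr(calc)."""
    mass = sum(RESIDUE_MASS[residue] for residue in sequence) + WATER
    return mass + sum(MODIFICATION_DELTA[name] for name in modifications)


def neutral_mass(observed_mz, charge):
    """Experimental neutral mass, Mr(expt), from an observed m/z and charge."""
    return (observed_mz - PROTON) * charge


if __name__ == "__main__":
    calc = peptide_mass("AVNTIEK", ["Deamidation", "Phospho"])
    expt = neutral_mass(427.74, 2)
    print(f"Mr(calc) = {calc:.2f}, Mr(expt) = {expt:.2f}, delta = {expt - calc:.2f}")
```

For the peptide AVNTIEK carrying one deamidation and one phosphorylation, the calculated mass is 854.38 Da, and an observed m/z of 427.74 at charge 2+ corresponds to 853.47 Da, a difference of about −0.91 Da. A deamidation alone shifts the mass by less than 1 Da, which is why such assignments are sensitive to mass accuracy.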


Spot labels on the gel image: 528, 535, 571, 575.

Spot 575
Start-End  Observed  Mr(expt)  Mr(calc)  Delta  Miss  Sequence
 99-118    728.74    2183.19   2183.07    0.12  1     QPPKDFGDLNTLVDVDPEGK
143-151    608.77    1215.53   1215.47    0.06  0     EQYEEMESR Oxidation (M)
152-175    922.80    2765.37   2766.27   -0.90  1     VREQLSTMTDDLQGTYYPLSGMTK Deamidation (NQ); 2 Oxidation (M)
154-175    837.75    2510.21   2510.12    0.10  0     EQLSTMTDDLQGTYYPLSGMTK 2 Oxidation (M)
176-189    588.00    1760.96   1760.87    0.09  0     ETQQQLIDDHFLFK

Spot 535
Query  Obs     Mr(expt)  Mr(calc)  Delta  Miss  Score  Expect  Rank  Peptide
27     608.74  1215.46   1215.47   -0.01  0     30     0.66    1     EQYEEMESR + Oxidation (M)
30     734.36  1466.70   1466.72   -0.03  0     26     1.9     1     SLAGYPFNPCLTK
33     867.40  1732.79   1732.82   -0.03  0     59     0.0013  1     DFGDLNTLVDVDPEGK
35     587.96  1760.87   1760.87   -0.00  0     33     0.46    1     ETQQQLIDDHFLFK
40     728.69  2183.06   2183.07   -0.02  1     29     1.5     1     QPPKDFGDLNTLVDVDPEGK
46     837.69  2510.05   2510.12   -0.06  0     25     4.7     1     EQLSTMTDDLQGTYYPLSGMTK + 2 Oxidation (M)
47     657.58  2626.31   2626.31   -0.00  2     37     0.33    1     VTDKQPPKDFGDLNTLVDVDPEGK
49     922.76  2765.25   2765.29   -0.04  1     37     0.31    1     VREQLSTMTDDLQGTYYPLSGMTK + 2 Oxidation (M)
12     427.74  853.47    854.38    -0.91  0     10     31      5     AVNTIEK + Deamidation (NQ); Phospho (ST)

Figure 7.15: Four spots containing arginine kinase. The MS results for spots 575 and 535 reveal possible modifications. Gel image courtesy of A. Faldas.

Initiation factor

There are four spots that have been strongly matched to eukaryotic initiation factor 5 (Fig-

ure 7.16). Of these, Spot 575 contains both initiation factor and arginine kinase by chance.

A deamidation has been observed for the match to initiation factor protein. Spot 554 has

also been predicted to have undergone deamidation. The spots are all likely to have slight

differences in the chemical sidechains, causing the four different spots to appear. A deami-

dation causes a slight change in mass and an alteration in the charge of the protein but it is

likely that there are other modifications that are not observed in the MS data, which cause

the different spots to appear.

7.3.4 Results Summary

The investigations into multiple protein products demonstrate the core functionality of RA-

PAD. RAPAD supports the finding and visualisation of spots that have been identified as the

same protein. Additional software was developed alongside RAPAD to determine the range

of peptides that were matched in mass spectrometry results, and to provide a visualisation

of the clusters. The visualisation software highlighted some unusual results for the tubulin


Spot labels in the figure: 554, 557, 571, 575.

Query  Observed  Mr(expt)  Mr(calc)  Delta  Miss  Score  Expect   Rank  Peptide
8      430.80    859.59    859.50     0.09  0     56     0.0012   1     VIDLSVSK
14     688.92    1375.82   1375.77    0.05  0     81     5e-06    1     VSIVALDIFTGNK
21     648.69    1943.05   1943.89   -0.85  0     60     0.0011   1     MEDQAPSTHNVEVPFVK + Deamidation (NQ); Oxidation (M)
31     826.21    3300.83   3301.65   -0.82  1     64     0.00077  1     VSIVALDIFTGNKMEDQAPSTHNVEVPFVK + Deamidation (NQ); Oxidation (M)

Query  Observed  Mr(expt)  Mr(calc)  Delta  Miss  Score  Expect   Rank  Peptide
17     608.77    1215.53   1215.47    0.06  0     30     0.63     1     EQYEEMESR + Oxidation (M)
25     588.00    1760.96   1760.87    0.09  0     50     0.0087   1     ETQQQLIDDHFLFK
27     728.74    2183.19   2183.07    0.12  1     34     0.54     1     QPPKDFGDLNTLVDVDPEGK
35     837.75    2510.21   2510.12    0.10  0     26     3.8      1     EQLSTMTDDLQGTYYPLSGMTK + 2 Oxidation (M)
38     922.80    2765.37   2766.27   -0.90  1     51     0.014    1     VREQLSTMTDDLQGTYYPLSGMTK + Deamidation (NQ); 2 Oxidation (M)

Figure 7.16: There are four spots that match initiation factor 5, of which possible modifications were found for spots 554 and 575. Gel image courtesy of A. Faldas.

proteins and, coupled with the multiple alignments, should improve annotation of Elongation

Factor sequences. The visualisation of heat shock protein 70 results indicates that there are

at least five different gene sequences from which Hsp70 proteins are expressed in the sample.

The visualisation makes it clear that only very short spans of peptides are present in spots

at the base of the gel, indicating that they are protein fragments. It is an area for future

investigation to determine if these are biologically meaningful, or experimental artifacts.

The analysis demonstrates a strong correlation between spots that are proximally located

and the span of peptide hits, even for spots that are not fragments but probably contain full

length proteins. An investigation was also carried out to verify that sequences annotated as

hypothetical proteins in the genome database were real proteins identified in the proteome

study. Three proteins were analysed in detail, of which two of the sequences are likely to be

real proteins, but a definitive function cannot be assigned at this time. The other protein

appears to be an ATPase in T. brucei, and the next version of the genome database should

update this annotation. Finally, a search for PTMs within MS data was undertaken and

several potential sites were found. There are major limitations with the method of searching

for PTMs and therefore other experiments are required to confirm modifications. The issues

raised by the results are discussed in Section 7.4.1.


7.4 Discussion

The annotation of an organism’s genome is a major challenge once sequencing is nearing

completion. The usual method to assign functions to newly sequenced genes is to apply

computational methods to find proteins in other organisms that are homologous and have

a functional assignment. After this initial stage, the slow process begins of performing

laboratory investigations to determine the mechanism of action of proteins, and to search

for the proteins that are important in disease. The biological goals of the trypanosome

project are to catalogue all the expressed proteins that can be found by various methods.

In particular, the proteome study is able to verify that genes annotated as “hypothetical

proteins” are expressed in the particular cell line. The project also aims to shed light on the

number of different forms of proteins that are found.

The core functionality of RAPAD has aided the management of large volumes of data

for the proteome investigation. This has been facilitated by the feature that allows bulk

uploads of data, enabling the protein identifications to be moved easily from the previously

used method (spreadsheets). This reduces the overhead of manual data entry, which is time

consuming and error prone. The database query facilities allow the data set to be searched

and filtered, which is important for large data sets. In Section 7.2, a series of questions was

outlined that RAPAD may be able to address; these are answered here.

Q. Can the time and labour to identify proteins be reduced?

The RAPAD Querier allows researchers to verify which proteins have been strongly or weakly

matched, and there is a facility for loading very large amounts of protein data in bulk, into the

system. However, in the current implementation there is no automated pipeline for moving

raw mass spectrometry data to the MASCOT server, and placing the results of searches in

RAPAD. This feature will be considered for the next version of the database (Section 7.4.4).

Q. How many different proteins can be identified from 2500 spots?

In the system at the present time there are almost 1000 identified proteins for 650 spots,

across three gels. In the previous chapter, the combination of data across replicate gels was

discussed, therefore the system should easily scale up to 2500 spots and many more.

Q. How widespread and common are post-translational modifications?

The additional investigations into the causes of multiple spots that match the same protein


demonstrate that post-translational modifications (PTMs) are very common for some pro-

teins. The software was also able to demonstrate that many of the spots near the base of the

gel are almost certainly fragments of proteins, and are not caused by PTMs. A search of MS

results to confirm types of modification did not reveal any significant results, demonstrating

that more biological investigations are required.

Q. How can we improve the T. brucei genome annotation?

The interface for publishing data allows the genome database to connect to records in RA-

PAD, which verify the existence of proteins. The analysis reported in this chapter will aid

the annotation of several groups of genes, summarised in Section 7.4.1.

Q. Can we build a “point and click” virtual 2D gel?

The Gel Viewer provides this facility by dynamically linking the spots on the gel to individual

records for each protein. The results of complex queries in RAPAD can also be visualised in

the Gel Viewer, providing a system for data analysis and management that is more powerful

than the facilities offered by commercial image analysis applications.

Q. Can we build pages that give original MS data interpretations?

One feature that has not been employed at this time in RAPAD is to automate repeated

searching of MS data, for example to search for different types of PTM that could be found

within the data. A number of searches have been performed manually in MASCOT to find

modifications on peptides, however very few positive results have been obtained. Therefore,

there may be limited benefits in implementing an automated search at this time. The graphic

showing peptide hits within protein sequences is a novel visualisation of MS results, and is

discussed further in Section 7.4.2.

The database does not store raw MS data in the present implementation, for two reasons: the

files are large, and the raw data is in a proprietary format that can only be interpreted

with software installed on a few terminals. This is a major drawback for re-analysis

of data. The next version of RAPAD may include an automated system for analysing MS

data, similar to the SASHIMI software [278] developed at the Institute for Systems Biology,

in the proteomics group headed by Ruedi Aebersold. SASHIMI is open source software that

aims to improve the downstream analysis of MS data. It comprises an application that

converts raw MS data, from any of the instruments that are available, into a single XML-


based format that can be analysed with a number of software packages to standardise the

identification of proteins.

7.4.1 Improving the annotation of genes

The additional investigations identified several sequences in the genome database that may

have been incorrectly predicted. The study also discovered that there are several different

proteins with highly related, but not identical sequences, which have the same protein name.

It is likely that the protein families were formed by relatively recent gene duplication events,

and the function of these protein families may be redundant. However, it is also possible that

different members of the family perform slightly different roles. For example, the finding that

up to five different proteins, annotated as heat shock protein 70, are strongly expressed raises

the possibility that all of the different forms are functionally significant. It is believed that

Hsps may be important when trypanosomes infect the mammalian host, therefore clearly

the current naming strategy for these proteins is inadequate. Gene annotation is not yet

finalised for T. brucei; we therefore believe that the Hsp genes should have a suffix

on the name that uniquely identifies each one, for example the chromosome position of each

gene, plus a letter if more than one sequence resides on the same arm of a chromosome, e.g.

Hsp70 (11 p a).
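The proposed suffix scheme can be stated precisely with a small sketch. The function name, the input layout and the gene identifiers below are hypothetical, and the arm assignments are invented for illustration; only the naming pattern, as in Hsp70 (11 p a), is taken from the proposal above.

```python
from collections import defaultdict
from string import ascii_lowercase


def suffixed_names(base_name, genes):
    """Build a unique name per gene from its chromosome and arm.

    `genes` is a list of (gene_id, chromosome, arm) tuples. A letter is
    appended only when several genes share the same chromosome arm; letters
    follow the input order, so the sketch supports up to 26 genes per arm.
    """
    by_arm = defaultdict(list)
    for gene_id, chromosome, arm in genes:
        by_arm[(chromosome, arm)].append(gene_id)

    names = {}
    for (chromosome, arm), gene_ids in by_arm.items():
        for letter, gene_id in zip(ascii_lowercase, gene_ids):
            suffix = f"{chromosome} {arm}"
            if len(gene_ids) > 1:
                suffix += f" {letter}"
            names[gene_id] = f"{base_name} ({suffix})"
    return names


# Hypothetical gene identifiers and positions, for illustration only.
EXAMPLE_GENES = [("geneA", 11, "p"), ("geneB", 11, "p"), ("geneC", 9, "q")]

if __name__ == "__main__":
    for gene_id, name in suffixed_names("Hsp70", EXAMPLE_GENES).items():
        print(gene_id, "->", name)
```

Because the letters depend on the order in which genes are listed, a real scheme would need a fixed ordering rule, such as position along the chromosome arm, so that names stay stable between database releases.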

The analysis reveals that most proteins near the bottom of the gel, in the small molecular

weight range, have very short spans of peptides matched, indicating that these proteins are

likely to be fragments produced either experimentally or in vivo. It is possible that these

spots do not have great biological significance. The visualisation software highlighted an

unexpected result for both β-tubulin and α-tubulin. It was observed that several spots near

the base of the gel, which would be predicted to have a low molecular weight, matched

peptides from the two ends of the protein sequence. There are several possible explanations

for this result, one of which is that splicing occurs at the level of mRNA, resulting in a

protein that is formed from the two ends of the gene sequence. The evidence presented

here is far from sufficient to confirm this hypothesis, and how these spots arose remains open

to discussion. Additional experiments are required to investigate the result, for example

by performing MS/MS to sequence as many peptides as possible to determine the exact

constituents of the protein spot.

An interesting finding from the visualisation is that proteins of the same name, in the

same region of the gel, tend to have a similar span of peptide hits. It might be expected that


the distribution of peptides matched from the same protein would be fairly similar for all

spots regardless of their position on the gel, only subject to random variation in ionisation

and detection of peptides. It is known that certain chemical modifications, such as

phosphorylation, cause peptides to ionise less well in MS. It could therefore be expected that

spots that have the same span of peptide hits share a set of modifications to the peptides

that are detected by MS. Small differences in the range of peptides matched between spots

located near each other could indicate the loss or gain of a modification. For example, if

protein A matches peptides covering the range 50-80 amino acids in the sequence, and protein

B matches peptides covering 50-95, this may indicate that protein A has an additional

phosphate group on the peptide from position 81-95, preventing its detection by MS. However,

there is also a technical explanation for the correlation between peptide span and spot

position. Spots closely co-located are more likely to have been included in the same MS run and,

as ionisation efficiency is highly variable, spots on the same MALDI plate may be subject to

more similar ionisation conditions. This area requires further investigation, for example by

performing experiments with radioactive or fluorescently labelled phosphates, coupled with

the visualisation software, to determine the phosphorylation status of protein spots and the

span of peptide hits, and so to verify whether the peptides matched are related to modification status. Our

results indicate that there is a high correlation between the peptides detected by MS and a

protein’s position on a gel, and as far as we are aware, this has not been previously reported.
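
The comparison of spans for protein A and protein B described above can be expressed as a small calculation. This is an illustrative sketch, assuming that each span is held as a pair of 1-based, inclusive residue positions; the function name `unmatched_regions` is hypothetical and is not part of the visualisation software.

```python
def unmatched_regions(span_a, span_b):
    """Return the residue ranges covered by span_b but not by span_a.

    Each span is a (start, end) pair of 1-based, inclusive positions.
    """
    a_start, a_end = span_a
    b_start, b_end = span_b
    regions = []
    if b_start < a_start:
        # Part of span_b lies before span_a begins.
        regions.append((b_start, min(b_end, a_start - 1)))
    if b_end > a_end:
        # Part of span_b lies after span_a ends.
        regions.append((max(b_start, a_end + 1), b_end))
    return regions


# Protein A spans 50-80 and protein B spans 50-95, as in the example in the text.
print(unmatched_regions((50, 80), (50, 95)))  # [(81, 95)]
```

A region reported in this way is only a candidate for a modification; as noted above, differences in ionisation conditions between MS runs could produce the same pattern.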

7.4.2 Visualisation issues in the life sciences

In general, the visualisation of life sciences data requires significant further research, and

there are few examples of published work concerning investigations into best practice for

visualising large data sets. Software for biomedical applications is often created without

developers applying standard guidelines for graphical user interface design, leading to the

generation of systems that are not intuitive for users.

The visualisation of the span of peptide hits is a new method for viewing mass spectrometry

results on a 2-D gel. A similar approach could be adopted to view microarray results,

such as displaying the extent of the hybridization signal for different probes within each

feature on the array. The other use of the Gel Viewer reported in Section 7.3.2, in which

different colours are used to display the clusters of spots that match the same protein, is a

standard method for summarising complex data, and could potentially be used to display a

variety of functional genomics data.


The visualisation software displaying the span of peptide hits will be included in the next

release of RAPAD and could be adapted to show other facets of the mass spectrometry

results. The height of the bar could be used to indicate the e-value or score assigned by the

search software. The software could also be adapted to display different proteins that have

been matched to the same gel spot, using different shading of bands on the spot label.

7.4.3 Analysis of modifications

On the 2-D gel, there are several clusters of spots that match the same protein, which are

likely to be the result of different PTMs causing slight changes in the mass or charge of the

protein. A search was performed on MS data to confirm the modifications, but this revealed

only a few results, which are not highly significant. The main problem is that many

proteins are identified by only a small proportion of the total peptides in the sequence, so

the majority of the modifications will not be observed. This issue was discussed in more

detail in the previous chapter, but the additional visualisations presented in this chapter

could ultimately help to find and display modifications. The graphic showing the peptides

that have been matched could be modified to display more detailed information. For example,

the labelling bars next to spots could display the peptides that have been matched along the

length of the protein, with a graphic showing possible modification sites along the protein

sequence, obtaining the sites from an in silico analysis of the protein sequence, or a database

of known modifications. If a particular peptide that was detected in one spot and not

another had a known modification site, this could provide evidence that the peptide had

been modified in one of the spots. The major hindrance to this effort is that there are

no major databases of modifications, even though a very large amount of research identifying

modifications has been performed over several decades. It is hoped that the

future integration of RAPAD into GUS will allow researchers to publish and distribute data

about modifications to the wider research community.
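
The idea of combining matched peptides with known modification sites could be prototyped as follows. This is a sketch under stated assumptions: peptides are represented by their (start, end) positions in the protein sequence, the list of known sites would come from an in silico prediction or a modification database, and the function name and example values are invented for illustration.

```python
def candidate_modified_peptides(peptides_spot_a, peptides_spot_b, known_sites):
    """Find peptides matched in spot B but not in spot A that contain a known site.

    peptides_spot_a, peptides_spot_b: sets of (start, end) residue ranges.
    known_sites: iterable of residue positions with a known or predicted modification.
    Returns a dict mapping each such peptide to the sites it contains.
    """
    only_in_b = set(peptides_spot_b) - set(peptides_spot_a)
    sites = sorted(known_sites)
    candidates = {}
    for start, end in sorted(only_in_b):
        contained = [site for site in sites if start <= site <= end]
        if contained:
            candidates[(start, end)] = contained
    return candidates


# Invented example: the peptide 81-95 is seen only in spot B and holds site 88.
spot_a = {(50, 65), (66, 80)}
spot_b = {(50, 65), (66, 80), (81, 95)}
print(candidate_modified_peptides(spot_a, spot_b, known_sites=[88, 120]))
```

A peptide reported here is consistent with having been modified in spot A, but the result is suggestive rather than conclusive, because a peptide may also be missed for purely technical reasons.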

7.4.4 Future work

The proteome map of T. brucei comprises 2-DE and MS derived data. It is planned that

other techniques, such as LC-MS (reported in Chapter 1), will be used to generate even

greater volumes of protein data. The RAPAD database schema has capabilities for storing

this kind of information but the web pages have not yet been created for data entry or the

visualisation of results. A major issue will be the integration of this data with the gel based


studies. In the near future the data must also be made available to the curators of GeneDB

to enable improvements to the annotation of genes. The long-term goal is to integrate

the proteome part of RAPAD into GUS, which will enable the proteome data to be stored

directly within GeneDB.

7.5 Conclusions

The core functionality of RAPAD has greatly improved the data management facilities for

the Trypanosoma brucei proteome project by enabling queries over the large data set to

find proteins of interest. Additional investigations have been performed on several groups

of proteins that appear abundantly on 2-D gels, for which the genome annotation is poor.

The results demonstrate one way in which experimental data, coupled with bioinformatics

analysis, can find protein sequences that have been incorrectly predicted. The visualisation

of results in new ways could be applied to proteome data from any organism and would aid

the annotation of newly sequenced genomes. The large data set generated by the T. brucei

investigation also demonstrates the scalability of the current implementation of RAPAD.

Appendix: Alignment of Hsp70 sequences

A multiple alignment has been performed with ClustalW on twelve sequences predicted to

match heat shock protein 70. Five sequences are from the MS results matched by proteins

in the T. brucei proteome map, and seven sequences are from the current version of the T.

brucei genome database (Section 7.3.1). Tb09.160.3090 and Tb07.29K4.60 are considerably

longer than the other sequences, and align poorly. Therefore, they may have been incorrectly

predicted or, if correctly predicted, should be named as a different heat shock protein, e.g.

Hsp80.

Tb11.01.3110 -------------------------------MTYEGAIGIDLGTTYSCVG 19

TRYPtp2h24gd03.q1k_1 --------------------------------------------------


TRYPtp30n4hh05.p1k_3 -------------------------------MTYEGAIGIDLGTTYSCVG 19


TRYP_xi-1015g04.q1k_13 ---MSRMWLTTAAVFLTVTVAAVSAAPESGGKVEAPCVGIDLGTTYSVVG 47

Tb07.29K4.620 --------------------------------MPAPAIGIDLGTTYSCVG 18

125.m00218 --------------------------------MPAPAIGIDLGTTYSCVG 18


92.m00252 --------MLARRVCAPMCLASAPFARWQSSKVTGDVIGIDLGTTYSCVA 42

Tb07.29K4.60 MQHAVEIEAKRRVELDEATRARYVVVKEETRASGDRVIGIDLGTTNSCIS 50

Tb09.160.3090 -----MLCLAQWALLLVLCLVGCCCTVSGGSEVLAVDIGADWAKGATRVI 45

Tb11.01.3110 VWQN--ERVEIIANDQGNRTTPSYVAFTDSE-----------RLIGDAAK 56


TRYPtp2h24gd03.q1k_1 --------------------------------------------------


TRYPtp30n4hh05.p1k_3 VWQN--ERVEIIANDQGNRTTPSYVAFTDSE-----------RLIGDAAK 56


TRYP_xi-1015g04.q1k_13 VWQK--GDVHIIPNEMGNRITPSVVAFTDTE-----------RLIGDGAK 84

Tb07.29K4.620 VFKN--DQVEIVANDQGNRTTPSYVSFSETE-----------RLVGDAAK 55

125.m00218 VFKN--DQVEIVANDQGNRTTPSYVSFSETE-----------RLVGDAAK 55

Tb11.01.3080 VWQN--ERVEIIANDQGNRTTPSYVAFVNNE-----------VLVGDAAK 56

92.m00252 VMEG--DRPRVLENTEGFRTTPSVVAFKGQE-----------KLVGLAAK 79

Tb07.29K4.60 YIDKKTNRPKIIPSPTGSWVFPTAITFDKSHKV---------RLYGEEAR 91

Tb09.160.3090 GGST-APRASIVLNDQTNRKSPQCIAFRIVPNAGNDTLRSVERLFAEEAR 94

Tb11.01.3110 NQVAMNPTNTVFDAKRLIGRKFSDSVVQ---------------------S 85

TRYPtp2h24gd03.q1k_1 --------------------------------------------------


TRYPtp30n4hh05.p1k_3 NQVAMNPTNTVFDAKRLIGRKFSDSVVQ---------------------S 85


TRYP_xi-1015g04.q1k_13 NQLPQNPHNTIYTIKRLIGRKYTDAAVQ---------------------A 113

Tb07.29K4.620 NQVAMNPTNTVFDAKRIIGRKYDDPDLQ---------------------A 84

125.m00218 NQVAMNPTNTVFDAKRIIGRKYDDPDLQ---------------------A 84

Tb11.01.3080 NHAARGSNGVIFDAKRLIGRKFSDSVVQ---------------------S 85

92.m00252 RQAITNPQSTFFAVKRLIGRRFDDEHIQ---------------------H 108

Tb07.29K4.60 ACVRTSASATLCSGKRLIGRGVGELGRV---------------------Q 120

Tb09.160.3090 SLEPRFPQQSICGPSLLAGLIVSKEISAGQKHHEQTGNQRSEREGVISFS 144

Tb11.01.3110 DMKHWPFKVVTKGDDKPVIQVQFRG--------ETKTFNPEEISSMVLLK 127

TRYPtp2h24gd03.q1k_1 --------------------------------------------------


TRYPtp30n4hh05.p1k_3 DMKHWPFKVVTKGDDKPVIQVQFRG--------ETKTFNPEEISSMVLLK 127


TRYP_xi-1015g04.q1k_13 DKKLLSYEVIADRDGKPKVQVMVGG--------KKKQFTPEEISAMVLQK 155

Tb07.29K4.620 DMKHWPFKVTVK-EGKPVVEVEYQG--------ERRTFFPEEISAMVLQK 125

125.m00218 DMKHWPFKVTVK-EGKPVVEVEYQG--------ERRTFFPEEISAMVLQK 125

Tb11.01.3080 DMKHWPFKVEEGEKGGAVMRVEHLG--------EGMLLQPEQISARVLAY 127

92.m00252 DIKNVPYKIIRSNNGDAWVQ---DG--------NGKQYSPSQVGAFVLEK 147

Tb07.29K4.60 SQLHKTNMVTLNERGEVAVEIM------------GRTYTVTHIIAMFLRY 158

Tb09.160.3090 DTDRFTYVVVPQIRRKSAVVRITPGGSSEGTTTAPIEFTVEELIGMILGH 194

Tb11.01.3110 MKEVAESYLG-KQVAKAVVTVPAYFNDSQRQATKDAGTIAGLEVLRIINE 176

TRYPtp2h24gd03.q1k_1 --------------------------------------------------


TRYPtp30n4hh05.p1k_3 MKEVAESYLG-KQVAKAVVTVPAYFNDSQRQATKDAGTIAGLEVLRIINE 176


TRYP_xi-1015g04.q1k_13 MKEIAETYLG-EKVKNAVVTVPAYFNDAQRQSTKDAGTIAGLNVVRIINE 204

Tb07.29K4.620 MKEIAESYLG-EKVSKAVVTVPAYFNDSQRQATKDAGSIAGLEVLRIVNE 174

125.m00218 MKEIAESYLG-EKVSKAVVTVPAYFNDSQRQATKDAGSIAGLEVLRIVNE 174

Tb11.01.3080 LKSCAESYLG-KQVAKAVVTVPAYFNDSQRQATKDAGTIAGLEVLRIINE 176

92.m00252 MKETAENFLG-RKVSNAVVTCPAYFNDAQRQATKDAGTIAGLNVIRVVNE 196

Tb07.29K4.60 LKKEAEKFLK-EPVNAVVVSVPAFFTPQQKVATEDAALAAGFDVLEVIDE 207

Tb09.160.3090 MKRSAERSLDGAPVRHLVLVVPTSSSLAYRQAMVDAAAVVGLRTIRLVHG 244

Tb11.01.3110 PTAAAIAYGLDK----------ADEGKERNVLIFDLGGGTFDVTLLTIDG 216

TRYPtp2h24gd03.q1k_1 --------------------------------------------------


TRYPtp30n4hh05.p1k_3 PTAAAIAYGLDK----------ADEGKERNVLIFDLGGGTFDVTLLTIDG 216


TRYP_xi-1015g04.q1k_13 PTAAAIAYGLNK----------AGE---KNILVFDLGGGTFDVSLLTIDE 241

Tb07.29K4.620 PTAAAIAYGMDR----------SSEGAMKTVLIFDLGGGTFDVTLLNIDG 214

125.m00218 PTAAAIAYGMDR----------SSEGAMKTVLIFDLGGGTFDVTLLNIDG 214

Tb11.01.3080 PTAAAIAYGLDK----------ADEGKERNVLVFDFGGGTFDVSIISVSG 216


92.m00252 PTAAALAYGLDK----------TKDS---LIAVYDLGGGTFDISVLEIAG 233

Tb07.29K4.60 PSAACLAHTVLQPSNASSREHLSGSKRIVRSLVFDLGGGTLDCAVMENDR 257

Tb09.160.3090 SAAAATQLAHLNTETLFRG-HPSNTTERKYAMIYDMGSSKTEVAVFRFTP 293

Tb11.01.3110 -------GIFEVKATNGDTHLGGEDFDNRLVAHFTEEFKRKN-------- 251

TRYPtp2h24gd03.q1k_1 --------------------------------------------------


TRYPtp30n4hh05.p1k_3 -------GIFEVKATNGDTHLGGEDFDNRLVAHFTEEFKRKN-------- 251


TRYP_xi-1015g04.q1k_13 -------GFFEVVATNGDTHLGGEDFDNNMMRHFVDMLKKK--------- 275

Tb07.29K4.620 -------GLFEVRATAGDTHLGGEDFDSRLVDYFATEFRTR--------- 248

125.m00218 -------GLFEVRATAGDTHLGGEDFDSRLVDYFATEFRTR--------- 248

Tb11.01.3080 -------GVFEVKATNGDTHLGGEDVDAALLEHALADIRNRY-------- 251

92.m00252 -------GVFEVKATNGDTHLGGEDFDLCLSDHILEEFRKT--------- 267

Tb07.29K4.60 R-----RGTFTLVATHGDPLLGGNDWDAVLSQHFSDQFERKWR----VPL 298

Tb09.160.3090 ATARDDFGTVTLVASATNHTLGGRSFDRCLARYVERNLFPAAKPTPVTPV 343

Tb11.01.3110 --KGKDLSSNLRALRRLRTACERAKRTLSSAAQATIEIDALF-------E 292

TRYPtp2h24gd03.q1k_1 --------------------------------------------------


TRYPtp30n4hh05.p1k_3 --KGKDLSSNLRALRRLRTACERAKRTLSSAAQATIEIDALF-------E 292


TRYP_xi-1015g04.q1k_13 --KNVDISKDQKALARLRKACEAAKRQLSSHPEARVEVDSLT-------E 316

Tb07.29K4.620 --TGKDLRGNARAMRRLRTACERVKRTLSSSASTNIEIDALY-------E 289

125.m00218 --TGKDLRGNARAMRRLRTACERVKRTLSSSASTNIEIDALY-------E 289

Tb11.01.3080 --GIEQGSLSQKMLSKLRSRCEEVKRVLSHSTVGEIALDGLLP------D 293

92.m00252 --SGIDLSKERMALQRIREAAEKAKCELSTTMETEVNLPFITAN---QDG 312

Tb07.29K4.60 EDAEGNVGQGVATYRQLLLEAEKAKIHFTHSTEPYYGYNRAFHFSEKLRD 348

Tb09.160.3090 LDRKPVTATTRRAVVSLLRAVNAARERLSVNQNVPFVVPGVRE------D 387

Tb11.01.3110 NIDFQATITRARFEELCGDLFRGTLQPVERVLQDAKMDKRAVHDVV---L 339

TRYPtp2h24gd03.q1k_1 --------------------------------------------------


TRYPtp30n4hh05.p1k_3 NIDFQATITRARFEELCGDLFRGTLQPVERVLQDAKMDKRAVHDVV---L 339


TRYP_xi-1015g04.q1k_13 GFDFSEKITRAKFEELNMDLFKGTLVPVQRVLEDAKLKKSDIHEIV---L 363

Tb07.29K4.620 GFDFFSKITRARFEEMCRDQFERCLEPVRKVLKDAEVDASAVDDVV---L 336

125.m00218 GFDFFSKITRARFEEMCRDQFERCLEPVRKVLKDAEVDASAVDDVV---L 336

Tb11.01.3080 GEEYVLKLTRARLEELCTKIFARCLSVVQRALKDASMKVEDIEDVV---L 340

92.m00252 AQHVQMMVSRSKFESLADKLVQRSLGPCKQCIKDAAVDLKEISEVV---L 359

Tb07.29K4.60 IVPLEATLTLEEYIELTRPLRVRCVECLNKLFDHTSIRPADIDNVL---L 395

Tb09.160.3090 GGDFIANISRAQFEEACGELFNEAVRLRDHAITQTNGTVRSLNELVRLEL 437

Tb11.01.3110 VGGSTRIPKVMQLVSDFFGGKELNKSINPDE-AVAYGAAVQAFILTGG-- 386

TRYPtp2h24gd03.q1k_1 --------------------------------------------------


TRYPtp30n4hh05.p1k_3 VGGSTRIPKVMQLVSDFFGGKELNKSINPDE-AVAYGAAVQAFILTGG-- 386


TRYP_xi-1015g04.q1k_13 VGGSTRVPKVQQLISDFFGGKELNRGINPDE-AVAYGAAVQAAVLTG--- 409

Tb07.29K4.620 VGGSTRIPRVQQLVQNFFNGKEPNRSINPDE-AVAYGAAVQAHIVSGG-- 383

125.m00218 VGGSTRIPRVQQLVQNFFNGKEPNRSINPDE-AVAYGAAVQAHIVSGG-- 383

Tb11.01.3080 VGGSSRIPAVQAQLRELFRGKQLCSSVHPDE-AVAYGAAVQAHVLSGGYG 389

92.m00252 VGGMTRMPKVVEAVKQFFG-REPFRGVNPDE-AVALGAATLGGVLRGD-- 405

Tb07.29K4.60 VGAMTRDPPIRHLLTEYFGRHVESEASCPADYAVAIGAAVRGAMLQGGFD 445

Tb09.160.3090 IGGATRMPKLQERLSEGYG-KPADRTLNSDEAVVSGAALMIHDTLSRIRV 486

Tb11.01.3110 KSKQTEGLLLLDVAPLTLG------IETAGGVMTALIKRNTTIPTKKSQI 430

TRYPtp2h24gd03.q1k_1 --------------------------------------------------



TRYPtp30n4hh05.p1k_3 KSKQTEGLLLLDVAPLTLG------IETAGGVMTALIKRNTTIPTKKSQI 430


TRYP_xi-1015g04.q1k_13 ESEVGGRVVLVDVIPLSLG------IETVGGVMTKLIERNTQIPTKKSQV 453

Tb07.29K4.620 KSKQTKDLLLVDVTPLSLG------VETAGGVMSVLIPRNTSVPAQKSQT 427

125.m00218 KSKQTKDLLLVDVTPLSLG------VETAGGVMSVLIPRNTSVPAQKSQT 427

Tb11.01.3080 ESSRTAGIVLLDVVPLSIG------VEVDDGKFDVIIRRNTTIPYLATKE 433

92.m00252 ----VKGLVLLDVTPLSLG------IETLGGVFTRMIPKNTTIPTKKSQT 445

Tb07.29K4.60 DLLSNTRFVTGTAQALKQGGFLRRCCNRIGSLVSSSVNPNAIGQRWRGRA 495

Tb09.160.3090 MESLTNDIYFTASPPIKES------NETKPHRNLLFAKRNTTVPAARSLI 530

Tb11.01.3110 FS--------TYSD-----NQPGVHIQVFEGERTMTKDCHLLGTFDLSGI 467

TRYPtp2h24gd03.q1k_1 -----------------------------------------MGTFDLSGI 9


TRYPtp30n4hh05.p1k_3 FS--------TYSD-----NQPGVHIQVFEGERTMTKDCHLLGTFDLSGI 467


TRYP_xi-1015g04.q1k_13 FS--------THAD-----NQPGVLIQVYEGERQLTKDNRLLGKFELSGI 490

Tb07.29K4.620 FS--------TNAD-----NQRSVEIKVYEGERPLVSQCQCLGTFTLTDI 464

125.m00218 FS--------TNAD-----NQRSVEIKVYEGERPLVSQCQCLGTFTLTDI 464

Tb11.01.3080 YS--------TVDD-----NQSEVEIQVFEGERPLTRHNHRLGSFVLDGI 470

92.m00252 FS--------TAAD-----NQTQVGIKVFQGEREMASDNQMMGQFDLVGI 482

Tb07.29K4.60 KG--------LSDEEIANYAKELVEFEAACDRRLLLERAENDANFVMRRV 537

Tb09.160.3090 FPNRTADFTLTLHDGNGRYSRSVLVSGVSGSMNAAREKEKEMSTERANKV 580

. :

Tb11.01.3110 PP------------------------------------------------ 469

TRYPtp2h24gd03.q1k_1 PP------------------------------------------------ 11

Tb11.01.3130 PP------------------------------------------------ 469

TRYPtp30n4hh05.p1k_3 PP------------------------------------------------ 469

Tb11.01.3120 PP------------------------------------------------ 469

TRYP_xi-1015g04.q1k_13 PP------------------------------------------------ 492

Tb07.29K4.620 PP------------------------------------------------ 466

125.m00218 PP------------------------------------------------ 466

Tb11.01.3080 TP------------------------------------------------ 472

92.m00252 PP------------------------------------------------ 484

Tb07.29K4.60 TADSSKRQGMQEKRVRQLSEQLKFWQYMVHNFHDHEDELLRTVRELEQAL 587

Tb09.160.3090 TKTS---------------------------------------------- 584

.

Tb11.01.3110 ------APRGVPQIEVTFDLDANGILSVSAEEKGTGKRNQIVITNDKGRL 513

TRYPtp2h24gd03.q1k_1 ------APRGVPQIEVTFDLDANGILSVSAEEKGTGKRNQIVITNDKGRL 55


TRYPtp30n4hh05.p1k_3 ------APRGVPQIEVTFDLDANGILSVSAEEKGTGKRNQIVITNDKGRL 513


TRYP_xi-1015g04.q1k_13 ------AARGVPQIEVTFDVDENSILQVSAMDKSSGKKEEITITNDKGRL 536

Tb07.29K4.620 ------APRGKPRITVSFDVNVDGILVVTAVEETAGKTQAITISNDKGRL 510

125.m00218 ------APRGKPRITVSFDVNVDGILVVTAVEETAGKTQAITISNDKGRL 510

Tb11.01.3080 ------AKHGEPTITVTFSVDADGILTVTAAEELGSVTKTLVVENSE-RL 515

92.m00252 ------APRGVPQIEVTFDIDANGICHVTAKDKATGKTQNITITAHG-GL 527

Tb07.29K4.60 DELEGLAEDNTSGLTTAGTVDFSSVTPVNHCEEEERDCSSVSAASRSAQL 637

Tb09.160.3090 ------VVLRQVEVVVEVVLSRSGLPYVAGSYVHARYAEQVTVLPSVKKT 628

. : . :. ..: * . :

Tb11.01.3110 SKADIERMVSDAAKYEAEDKAQ------------------RERIDAKNGL 545

TRYPtp2h24gd03.q1k_1 SKADIERMVSDAAKYEAEDKAQ------------------RERIDAKNGL 87


TRYPtp30n4hh05.p1k_3 SKADIERMVSDAAKYEAEDKAQ------------------RERIDAKNGL 545


TRYP_xi-1015g04.q1k_13 SEEEIERMVREAAEFEDEDRKV------------------RERVDARNSL 568

Tb07.29K4.620 SREQIDKMVAEAEKFAEEDRAN------------------AEKIEARNSV 542

125.m00218 SREQIDKMVAEAEKFAEEDRAN------------------AEKIEARNSV 542

Tb11.01.3080 TSEEVQKMIEVAQKFALTDATA------------------LARMEATERL 547

92.m00252 TKEQIENMIRDSEMHAEADRVK------------------RELVEVRNNA 559

Tb07.29K4.60 RTAHGDGKLKERTQDEEGEKPKGRKIMRRAVPLPRASAEAQELVEAGHPA 687


Tb09.160.3090 GDNETTAQKDENNNPSQNETDTTSTIS-----------PGREKRSGGSPS 667

. : .

Tb11.01.3110 ENYAFSMKNTINDPN-VAGKLDDADKNAVTTAVEEALR------------ 582

TRYPtp2h24gd03.q1k_1 ENYAFSMKNTINDPN-VAGKLDDADKNAVTTAVEEALR------------ 124


TRYPtp30n4hh05.p1k_3 ENYAFSMKNTINDPN-VAGKLDDADKNAVTTAVEEALR------------ 582


TRYP_xi-1015g04.q1k_13 ESVAYSLRNQVNDKDKLGGKLDPNDKAAVETAVAEAIR------------ 606

Tb07.29K4.620 ENYTFSLRSTLSDPD-VQQNISQEDQQKIQTVVNAVVN------------ 579

125.m00218 ENYTFSLRSTLSDPD-VQQNISQEDQQKIQTVVNAVVN------------ 579

Tb11.01.3080 TQWFDRLEAVMETVPQPYSEKLQKRIAFLPHGKEWVGT------------ 585

92.m00252 ETQANTAERQLTEWK----YVTDAEKENVRTLLAELRK------------ 593

Tb07.29K4.60 LRGADVSMTESTRSAFFEAQVEERAWREPPTPPGEHGS------------ 725

Tb09.160.3090 AANSNSAKMQNSRADEAKENETPTGDEILEVNERDAGTGGKNNNAKVRHF 717

Tb11.01.3110 --WLNDNQEASLDEYNHRQKE--LEGVCAPILSKMYQGMGGGDAAGGMPG 628

TRYPtp2h24gd03.q1k_1 --WLNDNQEASLDEYNHRQKE--LEGVCAPILSKMYQGMGGGDAAG---- 166

Tb11.01.3130 --WLNDNQEASLDEYNHRQKE--LEGVCAPILSKMYQGMGGGDAAGGMPG 628

TRYPtp30n4hh05.p1k_3 --WLNDNQEASLDEYNHRQKE--LEGVCAPILSKMYQGMGGGDAAGGMPG 628

Tb11.01.3120 --WLNDNQEASLDEYNHRQKE--LEGVCAPILSKMYQGMGGGDAAG---- 624

TRYP_xi-1015g04.q1k_13 --FLDENPNAEKEEYKTALET--LQSVTNPIIQKTYQSAGGGDKPQ---- 648

Tb07.29K4.620 --WLDENRDATKEEYDAKNKE--IEQVAHPILSAYYVKRAMEQAPP---- 621

125.m00218 --WLDENRDATKEEYDAKNKE--IEQVAHPILSAYYVKRAMEQAPP---- 621

Tb11.01.3080 --QLHTYTDAASIEAKVAKIERLAKRALKSARREGKDGWAPGNEDNGSGD 633

92.m00252 --VME-NPNVTKDELSASTDK--LQKAVMECGRTEYQQAAAANSGS---- 634

Tb07.29K4.60 --WQEVKRAVDAGEPVGSPIG--LQELQRPMTHEEMLQVLNNIAPIDDPV 771

Tb09.160.3090 ALRFPLNNTPAPSSTSQGGVNMNKEEALAARNRLRALQRLDDERLRRSGL 767

. :

Tb11.01.3110 GMPGGMPGGMPGGMGGGMGGAAASSGPKVEEVD----------------- 661

TRYPtp2h24gd03.q1k_1 GMPGGMPGGMPGGMGGGMGGAAASSGPKVEEVD----------------- 199

Tb11.01.3130 GMPGGMPGGMPGGMGGGMGGAAASSGPKVEEVD----------------- 661

TRYPtp30n4hh05.p1k_3 GMPGGMPGGMPGGMGGGMGGAAASSGPKVEEVD----------------- 661

Tb11.01.3120 ----GMPGRYARWYARRNGWWDGRRCGIVRA------------------- 651

TRYP_xi-1015g04.q1k_13 ---------------------------PMDDL------------------ 653

Tb07.29K4.620 ---------------APPSGEGEGNAPVPDDVD----------------- 639

125.m00218 ---------------APPSGEGEGNAPVPDDVD----------------- 639

Tb11.01.3080 DNDGDDNSDEDDELQRGRGVTEGSGRPPIRKRDRIEAINANTE------- 676

92.m00252 ---------------SGSSSTEGQGEQQQQQASGEKKE------------ 657

Tb07.29K4.60 SEEHARKRDHSIDMRTMTIVEGAVDMVALQELLEEEAKRAEELQRAQKKG 821

Tb09.160.3090 LNDVESLLLHYKSLDAWSAQQSDDNSNDWRSVVKDVSRWFEEVGGDVNVT 817

Tb11.01.3110 ----------------

TRYPtp2h24gd03.q1k_1 ----------------

Tb11.01.3130 ----------------

TRYPtp30n4hh05.p1k_3 ----------------

Tb11.01.3120 ----------------

TRYP_xi-1015g04.q1k_13 ----------------

Tb07.29K4.620 ----------------

125.m00218 ----------------

Tb11.01.3080 ----------------

92.m00252 ----------------

Tb07.29K4.60 EKQLVADSSAKLFAMD 837

Tb09.160.3090 ELQKQYQRLKDLKLGE 833
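
The statement that two sequences align poorly can be quantified from aligned rows such as those above. The sketch below computes the fraction of identical residues over the columns in which neither row has a gap; it is one common definition of pairwise identity and is an assumption made here, not a value reported by ClustalW. The example uses only the first 50-column block for Tb11.01.3110 and Tb07.29K4.620, so the result describes that block and not the full-length sequences.

```python
def pairwise_identity(row_a, row_b, gap="-"):
    """Fraction of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    matches = 0
    for residue_a, residue_b in zip(row_a, row_b):
        if residue_a == gap or residue_b == gap:
            continue
        compared += 1
        if residue_a == residue_b:
            matches += 1
    return matches / compared if compared else 0.0


# First alignment block only: leading gaps followed by the N-terminal residues.
tb11_01_3110 = "-" * 31 + "MTYEGAIGIDLGTTYSCVG"
tb07_29k4_620 = "-" * 32 + "MPAPAIGIDLGTTYSCVG"
print(round(pairwise_identity(tb11_01_3110, tb07_29k4_620), 3))
```

Applying the same function to concatenated full-length rows would give a single figure per pair of sequences, which could be used to support the suggestion that the two longer sequences belong to a different family.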

Chapter 8

Future work, discussion and conclusions

This chapter summarises the contents of the thesis (Section 8.1) and extends our arguments

on data sharing. It compares the approach we have taken with possible

alternatives (Section 8.2.1), and discusses the future of digital archiving for

biomedical data (Section 8.2.2), the role of standards (Section 8.2.3), and the immediate

future work that leads on from our research (Section 8.3). Finally, it summarises

the contribution this research makes to the field of functional genomics (Section 8.4).

8.1 Summary of thesis

Chapter 1 introduced the concept of a functional genomics investigation and described the

main experimental techniques that encompass transcriptomics, proteomics and other new

developments. Chapter 2 covered the computational techniques used to address the challenges

presented by the experiments. The availability of databases, the creation of ontologies and

data standards were discussed. The development of a proteome standard was discussed in

Chapter 3, An object model for proteomics, which covered the past efforts (PEDRo), our

own object model Gla-PSI, and gave a snapshot of the next version, PSI-OM.

It is our view that gene and protein expression experiments should ultimately be described

in the same format. Chapter 4 described the integration of the object models for

microarrays and proteomics to create a new model, FGE-OM. The chapter also contained

a description of how the future development of standards should take place, and of the role

of ontologies. The RAPAD database was described in Chapter 5. The database has several

functions. Firstly, it demonstrates that an established microarray database can be extended

to support proteomics, and that integration of results from gel electrophoresis and microarrays

is possible. Secondly, the database serves as a prototype for a public repository of

proteomics data, and components from RAPAD are currently being incorporated into the

GUS platform for functional genomics. Thirdly, the relational implementation verifies that

the object model, FGE-OM, correctly models proteomic experiments. Finally, the database

has been evaluated with two projects at the University of Glasgow. One project demonstrates the

differential expression of proteins, comparing a human cell line invaded with Toxoplasma

gondii with non-invaded cells, described in Chapter 6. The second project catalogues the

proteome of the African parasite Trypanosoma brucei, described in Chapter 7. Additional

software was specifically tailored to provide novel visualisations of data, and to summarise

complex information.

8.2 Discussion

This section compares our approach with possible alternatives, and discusses the contribution

the thesis makes in digital archiving, publication of data, and data standards.

8.2.1 Alternative approaches

The initial goal of our research was to utilise software engineering techniques to improve

the database facilities for data storage and querying. There was also a wider requirement

to support future re-analysis and publication of proteome data. In this section, there is a

critical analysis of our approach, and a description of alternative methods that could have

been employed.

Extending existing technology

Our approach has been to use a combination of object modelling to describe the biological

workflow (Gla-PSI and FGE-OM), and relational technology (RAPAD) to store and query

proteome data. There is a third component, a data exchange format expressed in XML,

that has been discussed, but has not yet been implemented. We have re-used and extended

well established technology (MAGE-OM and RAD), which we believe is advantageous for

the following reasons. In general, the time to develop a large system should be greatly

reduced if previously existing technology is extended, such as RAD into RAPAD, instead

of developing from scratch. This claim is difficult to quantify, but the RAPAD schema

and graphical interface were developed in approximately seven or eight months, primarily

by the author, in consultation with developers at the University of Pennsylvania. We do

not believe that a comparable system could have been developed entirely from scratch in a


similar length of time. Another reason for extending a microarray database into proteomics

is that, because the two technologies share parts of a database schema and have a similar

user interface, integration across the two domains is facilitated. For example, biological

samples need only be described once, even if both microarray and proteomics experiments

have been performed. This allows for the execution of queries with regard to the samples,

and for the retrieval of relevant microarray and proteome studies.
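
The benefit of describing a sample once can be illustrated with a small relational example. This is a minimal sketch using an in-memory SQLite database; the table and column names (`sample`, `microarray_study`, `proteomics_study`) and the inserted rows are hypothetical and do not reproduce the RAD or RAPAD schema, which runs on Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One shared description of each biological sample.
    CREATE TABLE sample (
        sample_id   INTEGER PRIMARY KEY,
        description TEXT NOT NULL
    );
    -- Studies from both domains refer to the same sample table.
    CREATE TABLE microarray_study (
        study_id  INTEGER PRIMARY KEY,
        sample_id INTEGER NOT NULL REFERENCES sample(sample_id),
        title     TEXT NOT NULL
    );
    CREATE TABLE proteomics_study (
        study_id  INTEGER PRIMARY KEY,
        sample_id INTEGER NOT NULL REFERENCES sample(sample_id),
        title     TEXT NOT NULL
    );
    INSERT INTO sample VALUES (1, 'hypothetical cell line, infected');
    INSERT INTO microarray_study VALUES (1, 1, 'array study');
    INSERT INTO proteomics_study VALUES (1, 1, '2-DE study');
""")


def studies_for_sample(connection, sample_id):
    """Return (kind, title) rows for every study performed on one sample."""
    query = """
        SELECT 'microarray' AS kind, title
          FROM microarray_study WHERE sample_id = ?
        UNION ALL
        SELECT 'proteomics' AS kind, title
          FROM proteomics_study WHERE sample_id = ?
        ORDER BY 1, 2
    """
    return connection.execute(query, (sample_id, sample_id)).fetchall()


print(studies_for_sample(conn, 1))
```

Because both study tables reference the same `sample` row, a single query on the sample retrieves the relevant studies from both domains without the sample having been entered twice.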

However, there are dangers in re-using technology. In general, if a design or programming

error originates in one system, it may be inherited in another system and not detected. It is

also possible that adapting existing technology leads to the creation of a system that is not

optimal for the users. A hypothetical example would be that a system tailored to capture

microarray protocols could be adapted adequately to record proteomics experiments, but the

interface is not intuitive for the user. We believe that we have avoided these problems during

the development of RAPAD because of the extensive interaction that took place between the

author and the experimentalists.

The use of object models

We have taken the approach that data standards should be developed in UML (Unified

Modeling Language) for object modelling, and that an XML representation can be derived

from the model. This means that ultimately there are three parallel technologies that must

be managed: an object model, an XML Schema, and a relational database. Software is

required to create XML from the object representation, and to process data for database

entry. An alternative approach would be to model the experiments only at the level of XML,

by writing an XML Schema. This approach has the advantage that fewer technologies have

to be managed because the object model would not be required. In addition, any developer

can read and edit an XML Schema without the kind of specialist software that is required

for editing object models. However, it is generally believed that object models for complex

domains are easier to understand and develop than XML Schemas [47]. If a developer has a

basic knowledge of UML, the contents of an object model can usually be understood fairly

easily, because class diagrams and use-case scenarios give a graphical representation of the

system. The only way to comprehend an XML Schema is by reading hundreds of lines of

text, which for a large, complex domain is not feasible. An XML Schema is designed for

machine processing, and while in theory it may be human readable, this is not the intended

usage. There are tools for automatically creating an XML Schema from UML [96], although


software is still required to create correctly formatted data in XML files.
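
The step of creating XML from an object representation can be sketched in a few lines. The example below is illustrative only: the classes `Gel` and `Spot`, their attributes and the element names are hypothetical simplifications, and they are not the classes of Gla-PSI or FGE-OM, nor the elements of any published exchange format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Spot:
    spot_id: str
    x: float  # horizontal position on the gel image
    y: float  # vertical position on the gel image


@dataclass
class Gel:
    gel_id: str
    spots: List[Spot] = field(default_factory=list)


def gel_to_xml(gel):
    """Serialise a Gel object and its spots to an XML string."""
    root = ET.Element("Gel", {"id": gel.gel_id})
    for spot in gel.spots:
        ET.SubElement(
            root,
            "Spot",
            {"id": spot.spot_id, "x": str(spot.x), "y": str(spot.y)},
        )
    return ET.tostring(root, encoding="unicode")


example = Gel("gel1", [Spot("s1", 10.5, 20.0), Spot("s2", 33.0, 47.5)])
print(gel_to_xml(example))
```

In a full system this mapping would normally be generated from the object model, rather than written by hand for each class, so that the XML representation cannot drift away from the model.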

A data standard could alternatively be expressed as objects and classes in a software

system. This approach has the disadvantage that the data standard is tied to a particular

programming language that can only be fully understood by an expert in that language.

Furthermore, programming code cannot be represented graphically unless it is converted

into a UML representation and imported into an object modelling tool. In this case, it

is often beneficial to start with an object model and derive code, rather than developing

a system and then reverse engineering. We believe that the complexity of the proteomics

domain means that the advantages of utilising an object model for representing the standard

far outweigh the disadvantages.

Database management systems

RAPAD is a relational system that uses Oracle for database management. We have proposed

that XML will be used for data transfer, and that software will be created for processing the

exchange language, such as PSI-ML, for entry of data into the relational database. The use

of a relational database management system (RDMS) has considerable advantages. There

has been substantial research into improving query performance for relational databases

over the last three decades, and an RDMS is more secure and less likely to lose data than

alternative solutions [89]. However, there is a growing trend towards the storage of XML in

its native format, rather than converting XML formatted data into a relational representation

[104]. There are a number of arguments that are too detailed to address satisfactorily here,

but the main point is that an XML Schema can be evolved and changed fairly easily. XML

files formatted in the new version can be stored immediately in the native XML store with

minimal additional effort. In contrast, relational database schemas must remain stable for significant

lengths of time. Furthermore, the hierarchical tree structure of XML may represent some

concepts more naturally than the tabular representation in a relational database. There

are a number of XML database management systems that can either be purchased [313],

or that are freely available [14], although none are likely to offer the query performance or

security features of an RDMS. These issues are discussed in more detail in the report on

XML indexing in Appendix A.
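
The processing step that moves data from an XML exchange document into relational tables can be sketched as follows. This is a simplified illustration under stated assumptions: the document structure (`Gel` and `Spot` elements), the `spot` table and the function `load_gel` are hypothetical, SQLite stands in for Oracle, and the XML shown is not PSI-ML.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A hypothetical exchange document, invented for this example.
EXCHANGE_DOC = """
<Gel id="gel1">
  <Spot id="s1" x="10.5" y="20.0"/>
  <Spot id="s2" x="33.0" y="47.5"/>
</Gel>
""".strip()


def load_gel(connection, xml_text):
    """Parse one Gel document and insert its spots; return the number of rows."""
    root = ET.fromstring(xml_text)
    connection.execute(
        """CREATE TABLE IF NOT EXISTS spot (
               gel_id  TEXT NOT NULL,
               spot_id TEXT NOT NULL,
               x       REAL NOT NULL,
               y       REAL NOT NULL,
               PRIMARY KEY (gel_id, spot_id)
           )"""
    )
    rows = [
        (root.get("id"), spot.get("id"), float(spot.get("x")), float(spot.get("y")))
        for spot in root.iter("Spot")
    ]
    with connection:  # commit on success, roll back if an insert fails
        connection.executemany("INSERT INTO spot VALUES (?, ?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
print(load_gel(conn, EXCHANGE_DOC))
```

The sketch also shows the cost discussed above: if the exchange format gains a new element or attribute, both the table definition and the loading code must change, whereas a native XML store could accept the new documents directly.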


8.2.2 Digital archiving and publication of life science data

It is our view that the model of publishing data only in journals is no longer sufficient

in the post-genome era. As little as 20 years ago, the scientific record consisted of libraries

containing journals that had to be searched manually, with some form of index. The situation

has now improved, as most journals are published electronically, and can be searched online

using information retrieval techniques. This model is still far from ideal because data sets, if

present at all, are embedded within the text, or in a tabular format, and the context of the

results can only be interpreted by reading the article and understanding the methods used.

Furthermore, data may be presented as an image that cannot be searched at all, and rarely

can text be extracted from images that are published online. This is not to say that journal

articles will become redundant. The publication of textual descriptions of an experiment will

always be required to offer an interpretation of the results and to position the work within

its context, but new methods are required to disseminate a digital record of the experiments.

The policy of journals

A functional genomics data set contains far more information than is conveyed in the original

publication. Different statistical models can be applied to mine new information from data,

and importantly, the results may be useful to research groups who were not aware of the

original publication. For example, the differential expression of proteins presented in Chapter

6 may be published in a parasitology journal. However, one of the study conditions is the

expression of proteins in a non-infected human cell line, which could be useful to other

researchers using the same cell line in any other field of research. Bioinformatics developers

must create systems that ensure data sets are available for a considerable length of time.

Public databases must be created for large scale experiments that are capable of being

queried for not only the results, but also the protocols, and structured descriptions of the

sample used. The editors of journals should employ a policy that a publication will only be

accepted once an electronic record of the experiment has been deposited in a public database.

The journal Molecular and Cellular Proteomics has recently released a set of guidelines for

authors wishing to publish articles in which a large number of proteins have been identified

by MS [48]. Authors must make available all the information that allows others to evaluate

the probability that proteins have been correctly identified. The journal also encourages

authors to deposit the raw spectra as supplementary material. However, the journal’s policy

is that they will not offer a database facility, and currently there are no public repositories


that are widely used. It is therefore left to authors to place spectra on their own web sites,

although the spectra will be in a format that cannot be interpreted by most other researchers.

This exemplifies the problems that hinder standardisation efforts at present. The problems

will only be solved when a standard format is agreed and public databases are developed.

The RAPAD prototype presented in Chapter 5 demonstrates that deposition and publi-

cation of complex proteomic data sets is possible. The integration with microarray results

has also been demonstrated, and will lead to the creation of a proteomics namespace in GUS.

GUS will provide access to a wide range of functional genomics data, to accompany journal

publications, and will give added value to data because it can be analysed within the context

of all the other results stored within the system. The current RAPAD implementation is a

significant intermediate step towards a public data repository for proteomics.

Archival of raw data

The software for normalising microarray data sets, gel spot detection and quantification of

proteins from gel images, continues to improve. It is worth noting that very few public

databases store raw data: neither the original scans of microarrays and gels, nor the coordi-

nates from a mass spectrometry trace. Instead, a processed version of data is stored, which

may have undergone several different statistical or software manipulations that cannot be

reversed. The processed version of the data will be sufficient for the majority of users, but

this will prevent any future developments in the algorithms for processing raw data being

applied across massive data sets. One exception is that the original DNA traces from genome

sequencing projects are stored (Ensembl [95] and the NCBI [225] both have a server that

allows downloads of sequencing traces), although the traces can only be browsed by species name

and downloaded in bulk, which leaves the user with a large data handling task. The cost of

data storage continues to decrease rapidly, while bandwidth continues to increase. Furthermore,

the emergence of Grid technology means that high-performance computing will soon

be available to many bench researchers. Therefore, it is worth considering whether public

databases should store raw data from all published studies, even if it cannot be queried

at present. The alternative is that researchers who wish to publish must “guarantee” to

make raw data available on request, but this is very difficult to enforce, and prevents any

automatic assembly of data sets. The first version of MIAME stated that the storage of raw

image data was not an absolute requirement. The next versions of MAGE-OM and PSI-OM

are currently in development, and this issue should be discussed. It could be argued that the

cost to perform microarray experiments continues to fall, and therefore data sets could be

re-obtained if required. There are also several different array platforms that cannot be directly

compared, and therefore raw data may be re-evaluated infrequently. In the proteomics field, it

has yet to be demonstrated whether it is feasible to compare large numbers of gel images

produced with different protocols, although this is being investigated in the ProteomeGRID

project [256]. However, we believe that the potential benefits of being able to assemble very

large data sets in the future could be great, and standards organisations should reconsider

whether storage of raw data should be a prerequisite for publication. If this policy is to be

realistically enforced, the public databases must develop facilities for archiving large files,

and query facilities for finding the correct files when required.

If raw data is to be made widely available, there must be a significant change in philosophy

for bench researchers. Many researchers are wary of releasing entire raw data sets because

in the initial publication they may not have covered all the possible results that could be

deduced, and they are not willing to lose ownership of the data and any future publications

that result. Furthermore, the release of raw data to the public allows other groups to verify

the correctness of the entire analysis, which may not be welcomed by all, although in the long

term should benefit the life sciences. The release of raw data does not initially benefit the

research group that owns the information. Therefore, data sharing will only be encouraged

if journals prevent authors from publishing unless data sets have been made available. It is

crucial that research is not hindered by these requirements, and significant efforts must be

made by computing science to aid the archival of data.

8.2.3 The role of data standards

The goals of standards organisations fall into different categories depending on the kinds

of data that are represented. The protein-protein interaction community required a format

that enables data transfer between the major databases, similar to the agreements between

GenBank, DDBJ and EMBL for genomic data, and therefore PSI rapidly created a format

that is agreed upon by all major parties [150]. In this case, standardisation is fairly simple,

because there is an obvious, immediate need for a solution that may not be found without

the intervention of an official body. The situation for other functional genomics experiments

is more complex. PSI and MGED are attempting to generate data formats that are a

digital representation of the experiment: results, methods, and analysis. For most users a

data format that facilitates deposition in public repositories is required. The format should


capture three vital components:

1. The data from the experiment, stored in a way that can be easily retrieved and ma-

nipulated.

2. Information about the purpose (hypothesis) and overview of the experiment.

3. The methods used that allow the experiment to be fully understood or notionally

reproduced, although functional genomics experiments are rarely reproduced due to

cost and lack of identical biological samples.

We believe there are clear advantages of storing a structured description of 1 and 2. The

data from the experiment (1) is required for most users, and due to the size of data sets,

it must be in a format that can be processed easily. Furthermore, querying a free text de-

scription of the purpose of the experiment (2) is far from sufficient, and therefore it must be

well structured, preferably using ontologies. The overview of the experiment must include

the samples used, because it is usually only in the context of the sample that the results

have any meaning. It is generally agreed that a structured description of 1 and 2 is a re-

quirement for the next versions of MAGE-OM and PSI-OM. MAGE-OM also attempts to

capture all parts of an experiment, including those that fall into category 3, the methods,

in a structured format. One of the reasons is that MAGE is also used for sample tracking

in some laboratories, and that by storing a structured description of the entire experiment,

automated comparison between different experiments is possible, for instance to isolate a

batch of reagents that is contaminated if a particular assay fails. However, while MAGE is

useful in this context, it could be argued that a complete breakdown of the protocol is not

required for public databases, because it is very unlikely that it will ever be queried. Further-

more, proteomics experiments are complex, and new technologies are frequently developed,

therefore structuring entire protocols is even harder, and may hinder the development of a

finalised format. Therefore, the initial development of the standard for proteomics should

focus on describing the experimental overview and the core data.

8.2.4 A functional genomics standard

The design of FGE-OM makes use of concepts from PEDRo and MAGE, and for certain

parts of experimental protocols, information can be stored either using the well defined

PEDRo classes, or using generic MAGE derived classes. A finalised standard should not be

ambiguous about how concepts can be represented, otherwise it will become more difficult to


process and query the format. The creation of an integrated functional genomics format is

in development, and these issues are still open for discussion. One of the main criticisms of

MAGE is that it is “over-engineered”, and too complex for many developers to use. PEDRo

can be understood fairly easily by both developers and experimentalists, but in practice

real experiments cannot always be captured adequately. Our view is that a compromise is

needed. It is vital that a well structured description of data, the starting sample and the

experimental overview is created. In terms of general experimental protocols, there is a need

for some large laboratories to have a structured description of how protocols were performed,

for auditing and sample tracking, but it may not be essential for public databases. A data

format should not necessarily be an electronic version of the entire experiment; for example,

it is very unlikely that users will ever need to query a public database for the voltage applied

during the second stage of electrophoresis. In contrast, users may want to query a database

for the pH range of a gel. The protocol must be deposited in the database, and there should

be no additional overhead of doing so, because methods are required in a journal publication

anyway, but for some parts of the experiment, simply depositing the text of the protocol

within the database will be sufficient. One of the goals of the standards bodies will be to

decide which parts of protocols should be well structured using ontologies, and which parts

can be described in plain text.
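
The distinction between structured and plain-text parts of a protocol can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the `GelRecord` class, its field names and the example values are hypothetical and are not part of PEDRo, MAGE-OM or FGE-OM. The pH range is held in typed fields that a query can use, whereas the remainder of the protocol is deposited as free text.

```python
from dataclasses import dataclass


@dataclass
class GelRecord:
    gel_id: str
    ph_min: float        # structured: likely to be queried
    ph_max: float        # structured: likely to be queried
    protocol_text: str   # free text: deposited, but not broken down further


def gels_covering_ph(records, ph):
    """Return the identifiers of gels whose pH range includes the given value."""
    return [r.gel_id for r in records if r.ph_min <= ph <= r.ph_max]


# Hypothetical records for illustration only.
records = [
    GelRecord("gel-1", 3.0, 10.0, "Free-text description of the electrophoresis steps."),
    GelRecord("gel-2", 4.0, 7.0, "Free-text description of the electrophoresis steps."),
]

matches = gels_covering_ph(records, 8.5)
```

A detail such as the voltage applied during the second stage of electrophoresis would remain inside `protocol_text` in this sketch, so it is still archived without adding structure that is unlikely to be queried.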

FGE-OM is not intended to be a finalised working object model, but was released to

instigate more collaboration between PSI and MGED, and to provide a framework for an

integrated data standard for functional genomics. This goal has been achieved as it is

now planned that the proteomics standard will incorporate a “core” module, based on the

BioOM namespace in FGE-OM, which is being created by MGED, to capture an experimen-

tal overview and descriptions of biological samples.

8.2.5 Proteomics standards

The official standard, PSI-OM, is currently in development. The author has contributed to

this development at meetings of the PSI [257], and directly through the design of a model

of gel data. The Gla-PSI model, described in Chapter 3, was released to incorporate a more

detailed data model into the PEDRo proposals, as 2-DE is widely used, and a structured

representation of the data, and the methods used to obtain it, is essential to allow for future

re-analysis. It is vital for developing a community standard that a wide range of users and

developers are consulted, to allow for continual refinement and improvements to the object


model.

8.2.6 A vision for future data sharing

The diagram in Figure 8.1 displays a possible organisation of future data publication and

sharing. The main concept is that a generic model will be defined, similar to FGE-OM.

Software will be developed that can create FGE-ML, a mark-up language based on the

format, used for formatting experiments that cover a wide range of technologies. The main

benefit is that biological samples need only be described once, regardless of the down-stream

processing and analysis. For many users, this amount of complexity will be unnecessary, and

smaller subsets of the formats can be created for single technologies, MAGE-ML (version 2)

and PSI-ML (version 1). There will be various databases that can accept either the single

technology formats, or the wider FGE-ML specification, depending on their scope. FGE-

ML, or the derived formats, can also be used for transferring information between databases.

Ontologies will be used for populating the data format and the databases, improving the

facilities for querying, and asking questions of data semantics (reasoning). The ontologies

must be programmatically accessible, and date stamped, to ensure that there is an

exact specification of which version of a term is being used. It is vital that ontologies remain

accessible in the future, to ensure that URLs referencing the source of a term do not become

out of date, causing a data set to become incomplete.
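
One way to picture a date-stamped term reference is sketched below in Python. The ontology name `ExampleOnt`, the accession `EX:0001`, the labels and the release dates are all hypothetical, and the `resolve` function is an illustrative assumption rather than an interface offered by any existing ontology service.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TermReference:
    ontology: str
    accession: str
    version_date: date  # the release of the ontology the data set was annotated with


# Hypothetical archive of ontology releases, keyed by (ontology, release date).
RELEASES = {
    ("ExampleOnt", date(2004, 1, 15)): {"EX:0001": "gel electrophoresis"},
    ("ExampleOnt", date(2004, 6, 30)): {"EX:0001": "two-dimensional gel electrophoresis"},
}


def resolve(term, releases):
    """Return the label of a term in the exact release it was stamped with.

    Returns None when that release is no longer available, which is the
    situation in which a data set becomes incomplete.
    """
    release = releases.get((term.ontology, term.version_date))
    if release is None:
        return None
    return release.get(term.accession)


old_ref = TermReference("ExampleOnt", "EX:0001", date(2004, 1, 15))
new_ref = TermReference("ExampleOnt", "EX:0001", date(2004, 6, 30))
lost_ref = TermReference("ExampleOnt", "EX:0001", date(2003, 1, 1))
```

Because the reference carries its own date stamp, the same accession resolves to the meaning it had when the data set was annotated, even if the term is later redefined.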

8.3 Future work

There are several areas in which efforts are continuing that follow on from the work pre-

sented in the thesis. These include proteomics standards, a functional genomics standard,

and the development of a public repository. The first version of the official standard of the

Proteomics Standards Initiative, PSI-OM, must be defined in the near future. PSI-OM will

be presented and discussed at meetings for computer systems developers, and at biological

conferences, to ensure that it has wide community input, and that it adequately covers all

the technologies that are currently used. The standard should be designed to accommodate

future developments. The object model will only be one component of a successful data

standard, and it is also vital that an ontology is developed in tandem. It is planned that

there will be a PSI endorsed ontology, PSI-Ont, that will contain terms describing parts of

experimental techniques that are difficult to represent in the object model. PSI-Ont will

also contain certain data values, such as types of units, the names of instrument manufacturers

and so on. This part of the vocabulary is required because the usage of these terms

should be controlled, to allow future querying of the format, and there are many components

that should not be “hard-coded” in the object model. Detailed discussions are required to

converge on a data standard that captures the requirements of research as it is now, and

ensures that future re-analysis of archived data is possible. Another consideration is that the

developers of an object model should not attempt to create a format that is highly descriptive

of every possible technology but is too complex to use. The development of PSI-OM

and PSI-Ont is well under way, and it is likely that a finalised data standard will be agreed

within the next year.

Figure 8.1: A possible model for future data sharing and exchange. Example databases are in italics. (Diagram labels: a laboratory LIMS for capturing data; locally installed software for creating and interpreting FGE-ML; MAGE-ML and PSI-ML derived from the main model; FGE-ML transfer of complete datasets; ontologies, including the MGED + PSI Ontology, the Gene Ontology and the NCBI Taxonomy, whose terms are used for instantiating the model, querying the databases and populating drop-down menus for data entry; and the databases ArrayExpress, RAPAD, PlasmoDB and CEBS.)

We envisage that the fields of proteomics, microarray technology, metabolomics, and

other functional genomic techniques should converge on a single data standard. We offer

FGE-OM as a framework from which future development can start. Of primary importance

is the definition of a notation that describes biological samples from any type of experiment in

a structured format. At this stage it may be too ambitious to design an all encompassing data

standard before a complete format has been widely accepted for proteomics or metabolomics.

However, the development of standards should take place with the future integration with

MAGE-OM in mind. In particular, there is limited benefit attempting to develop different

structures for describing the experimental overview and biological samples. It is essential that

PSI members continue to attend MAGE meetings, and that MAGE developers contribute

to the on-going discussions on PSI-OM.

The RAPAD database supports the projects outlined in Chapters 6 and 7. However,

it is still a prototype of a public repository, as it has not undergone sufficient testing to

ensure that the code is free from error, and that data is completely secure for external public

access. One of the goals of developing RAPAD was to investigate the possibility of using

tables from the RAD schema to store proteomics experimental protocols. This facility has

been demonstrated, and in the near future a proteomics extension to GUS will be developed.

At the University of Glasgow, RAPAD will be extended to support the requirements of the

Functional Genomics Facility [293]. An extension for laboratory sample tracking, similar to

a LIMS (Laboratory Information Management System), and for customer billing is currently

in development by Dr Giorgia Riboldi-Tunnicliffe at Glasgow University.

The current implementation of RAPAD can record data from 2-D electrophoresis and

mass spectrometry (MS). The integration of microarray and proteomics results was demonstrated

in Chapter 6; however, the facility for the complete storage of microarray experiments


has not yet been incorporated, as this was not a primary objective of our development. The

next round of development will ensure that microarray protocols and data can be recorded

in RAPAD. There are several additions to the proteome component that will be required in

the future. The database schema contains tables to store column separations and labelling

experiments, for capturing LC-MS or ICAT technologies (described in Chapter 1), but the

user interface has not yet been developed. Web pages will be created for entry of LC-MS

and ICAT protocols, and a separate interface will be required for the visualisation of the

results, and the querying of data. The interface will be an important component because

very large volumes of data are produced by LC-MS, which must be open to various types

of query. The code for the interface, and the next version of the database schema, will be

released via the web site at regular intervals, to allow other developers to test the code, and

to create similar systems in different locations.

It was reported in Chapter 5 that the MGED ontology has been stored and extended

in RAPAD. The ontology can be used for capturing types of experimental protocols and

details of biological samples; however, the use of controlled vocabularies could be extended

further. For example, the complete Gene Ontology (GO) could be stored in RAPAD and

used to find the correspondence between different RNA sequences (from gene expression

studies), proteins (from proteome studies) and the functional annotations in GO. Chapter 2

described the vision of the Semantic Web and Grid based services for integrating different

types of applications. RAPAD could be described using ontologies and made available on the

Grid as a web service to allow applications to query the database for proteomic experiments,

in conjunction with other Internet accessible resources. This functionality is already being

investigated for RAPAD as part of a PhD project by Frank Gibson in the Department of

Computing Science at the University of Newcastle.

The development of technology to aid proteome research is likely to be mirrored by other

research communities, such as metabolomics. Various types of data must be capable of being

queried in parallel, so that the data can feed into models at the level of whole biological systems.

It has already been demonstrated that microarray results can be used to hypothesise about

the components that interact within metabolic pathways [201]. It is likely that protein

abundance data and metabolic profiles will also contribute to the models, although this

will require considerable efforts from database engineers who will have to provide integrated

query systems for all the information. The systems biology approach, which aims to gain

insights into the functioning of biological processes, will only produce significant results when


databases that provide access to very large volumes of data are in place. It is our view

that the development of centralised databases will be greatly hindered unless the efforts to

define standard formats are successful, and gain widespread acceptance.

8.4 Summary

The work presented in the thesis moves the field closer to the vision presented in Figure

8.1 in the following ways. The object model for proteomics (Gla-PSI) has contributed to

the community efforts that will lead to the creation of the first version of the official PSI

standard within the next year. In particular, 2-DE data, which PEDRo does not cover adequately,

will now be covered in much greater detail. Biological samples were also not described

in sufficient detail in PEDRo, and will now be captured using classes from MAGE-OM, as

proposed in FGE-OM. FGE-OM can be viewed as an intermediate step towards a shared

data format for functional genomics, which will improve the facilities for integrating data

from the different techniques. The RAPAD system has a close correspondence with FGE-

OM, therefore the relational implementation demonstrates that the object model captures

proteomic workflows in practice.

RAPAD is a working database for proteomics, as demonstrated by the on-going investi-

gations reported in Chapters 6 and 7. RAPAD has aided the research process for these two

projects whose requirements are common to many proteome investigations. Therefore, the

software could be used in a variety of different domains. The thesis has reported that the

current facilities for releasing proteomics data sets on the Internet are not sufficient. The

RAPAD schema and interface code are freely available for download by other developers who

can implement and extend the system, to support the requirements of their own laboratory

data. RAPAD also serves as a prototype of a public repository of data, and the future in-

tegration of a proteomics namespace into GUS will greatly extend the facilities of the web

sites supported by the platform.

We believe that efforts to improve facilities for archiving and querying functional genomics

data will bring significant benefit to the biological research community. Unless there are

continued developments in bioinformatics, we risk losing important information, as the

computational technology lags behind laboratory innovation. Our

research will lead to greatly improved accessibility of proteome data, thereby maximising

the knowledge that can be gained from the experiments, and enabling new discoveries.

Appendix A

An XML indexing solution for data

integration

A.1 Introduction

Chapter 2 introduced the challenges in data integration for the life sciences, and there was a

brief description of the concept that an XML representation of data could facilitate integra-

tion across different databases. In this section there is a brief report of work undertaken by

the author in 2001. The work comprises a new method of indexing large volumes of biological

data represented in XML. The index was designed with three goals in mind: first, to

develop a prototype system storing local versions of various biological databases, to improve

the information retrieval task for researchers; second, to assess the feasibility of using a

persistent programming language to implement a large index, and to compare the performance

of the index against previously published studies; and third, to test the capabilities of

a native XML store compared with a relational representation of biological data.

The first goal has been partially completed: local versions of the DBLP bibliography

database [71] (100 MB) and PIR [254] (Protein Information Resource, 800 MB) were indexed,

and fast queries have been demonstrated. A graphical query interface was also created. A

complete prototype system has now been fully realised in the Xtect project [359] by our

collaborators at the University of Glasgow and the University of Strathclyde, which was

published at the DILS (Data Integration in Life Sciences) workshop in 2004 [159]. The

initial prototype demonstrates that a persistent programming language can be used for

implementing a large index; however, performance difficulties are encountered as the size

of the index grows, due to the method of caching (the loading of frequently accessed objects

into main memory) employed by the persistent store. The RAPAD database (Chapter 5)

utilises a relational representation of proteomics data, which

allows complex queries and provides a robust security model. However, the object models

for data standards continue to evolve, and it is not possible to make major changes to a

relational database after it has been deployed, without significant efforts. It is possible to

evolve the XML representation of the data standard with no significant effort, and therefore

a native XML storage solution would be advantageous because it could be tested with data

expressed in the latest version of the standard with minimal refactoring required.

The rest of this report is structured as follows. There is a brief introduction to previous

work on XML indexing, native XML databases and query languages in Section A.2. The

implementation of the index structure and the results of performance tests are described in

Section A.3 and discussion is in Section A.4.

A.2 Previous work

XML was designed to act as a format to exchange data over the Internet, and large volumes of

XML are produced in a number of different domains. XML can be automatically processed,

and has been a W3C standard for a number of years [101]. XML is now being used as a data

storage format, and a number of biomedical databases can be downloaded in XML. In the

initial specification, XML data was validated using a document type definition (DTD) that

specifies the tags that are allowed in a particular data set, to ensure that data is uniformly

formatted and can be processed. DTD has been superseded by XML Schema [341], which

incorporates the ability to specify the type of data (string, integer or floating point number),

along with several other changes.

The most advanced proposal for a query language for XML is XQuery [357]. XQuery is

based upon a previous proposal known as XPath that enables paths in XML (the hierarchy of

tags) to be specified in a formal manner. Other proposals include a graphical query language

[226], which enables users to build queries, developed from a graphical representation of the

tree structures of the source XML.
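
To illustrate what specifying a path through the hierarchy of tags looks like in practice, the following Python sketch uses the limited XPath subset supported by the standard library module `xml.etree.ElementTree` (it is not an XQuery engine). The tag names follow the PIR example path used later in this appendix; the `id` attribute and the second entry are assumptions added purely for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny document in the style of the PIR example; the id attribute is hypothetical.
DOC = """<PIR_Database>
  <PROTEIN_ENTRY id="e1"><PROTEIN_NAME>p53</PROTEIN_NAME></PROTEIN_ENTRY>
  <PROTEIN_ENTRY id="e2"><PROTEIN_NAME>actin</PROTEIN_NAME></PROTEIN_ENTRY>
</PIR_Database>"""

root = ET.fromstring(DOC)

# A path expression: every PROTEIN_NAME that is a child of a PROTEIN_ENTRY.
names = [el.text for el in root.findall("./PROTEIN_ENTRY/PROTEIN_NAME")]

# A path with a predicate: entries whose PROTEIN_NAME child has the text 'p53'.
p53_entries = [
    el.get("id") for el in root.findall("./PROTEIN_ENTRY[PROTEIN_NAME='p53']")
]
```

Here the path is evaluated by walking the parsed document in memory, which is adequate for small files but does not scale to the data volumes that motivate the index structures described in this report.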

Several proposals have been made regarding storage of XML within databases. The

proposals include databases that store the native structure of XML, such as Tamino [313]

or Xindice [14]; and techniques for representing the structure of XML within tables in a

relational database [111, 288, 77, 364]. An investigation by Cooper et al. in 2001 [62]

encoded XML within a tree structure designed for fast retrieval of strings. The tag names

comprising the paths to data were stored as compressed strings, with textual data stored

in the same structure. The entire index, and the source data, were stored in a relational


database. The conclusion was that the new index structure showed significant improvements

in performance, and utilised a larger data collection, compared with the previous approaches.

The first version of our index structure is loosely based on the work of Cooper.

A recent proposal that indexes XML has been made by Buneman et al. 2003 [43].

This uses a representation that separates the textual data from the tree structure of XML

(the nested organisation of tags). Processing queries over the tree structure (referred to

as the Skeleton) takes up a large proportion of the time required to answer a query. The

authors show that by querying a compressed representation of the Skeleton in main memory,

significant improvements in query performance can be made. Tests have been performed

over collections of XML data, including Swiss-Prot, demonstrating effective query processing

against large collections of biologically relevant data.
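
The separation of textual data from tree structure can be pictured with the toy Python sketch below. It is not the compression scheme of Buneman et al.; it only assumes a simple nested-tuple representation for the structure and a dictionary of text values per path, and the function name `split_document` and the example document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def split_document(xml_text):
    """Split XML into (skeleton, text_by_path).

    skeleton is a nested tuple (tag, children) holding only the structure;
    text_by_path maps each path to the list of text values found under it.
    """
    root = ET.fromstring(xml_text)
    text_by_path = defaultdict(list)

    def build(element, prefix):
        path = prefix + "/" + element.tag
        text = (element.text or "").strip()
        if text:
            text_by_path[path].append(text)
        return (element.tag, tuple(build(child, path) for child in element))

    return build(root, ""), dict(text_by_path)


DOC = (
    "<db>"
    "<entry><name>p53</name></entry>"
    "<entry><name>actin</name></entry>"
    "</db>"
)

skeleton, texts = split_document(DOC)
first_entry, second_entry = skeleton[1]
```

In this sketch the two `entry` subtrees are structurally identical once the text has been removed, which hints at why a structure-only representation of repetitive biological records can be made compact enough to hold in main memory.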

A.3 Results

We have designed new structures to index XML, which have been implemented using Persis-

tent Java (PJama) [19]. PJama was developed at the University of Glasgow in collaboration

with Sun Microsystems [307], and allows objects, represented in Java, to be written directly

to disk. The objects can be accessed when required without needing specific methods to

read the objects from disk, allowing the encoding of very large structures that behave as

if they were represented in main memory. Two index structures have been created, Index

A and Index B, for XML versions of the bibliography database, DBLP [71] (100 MB), and

the PIR database (800 MB). The performance of the indexes created for DBLP was tested

for simple queries, and more complex Boolean queries in which a join was required. The

structure of the indexes is as follows.

A.3.1 Index A

Index A comprises four components: a Data Path Tree, many Data Stores, an XML Dictio-

nary and many XML Locater Lists (Figure A.1).

Figure A.1: Index A has four components: the Data Path Tree, Data Stores, XML Locater Lists and an XML Dictionary. (In the example shown, the XML Dictionary assigns the codes 1 Protein_Entry, 2 Reference, 3 Authors, 4 Author, 5 Volume, 6 Title, 7 Organism, 8 Common and 9 Formal; the Data Path Tree links to one Data Store for path 1_2_3_4 (Protein_Entry:Reference:Authors:Author) and one for path 1_7_9 (Protein_Entry:Organism:Formal); and the string “Homo sapiens” in the Data Store for path 1_7_9 links to an XML Locater List of record identifiers: ID 1378.1, ID 2356.9 and ID 1356.4.)

• The XML Dictionary stores the names of the tags (elements) encountered in the XML

document, assigning each one a unique integer code that is used in the rest of the index

for the purpose of compression.

• The Data Path Tree stores a summary of all the different types of XML paths encountered

in the database. The following is an example of one data path (without closing

tags).

<PIR_Database><PROTEIN_ENTRY><PROTEIN_NAME>p53

The Data Path Tree contains a node for each element on the path, storing the integer

corresponding to the element’s name and a reference to the child node, if one exists.

• There is one Data Store for each path in the Data Path Tree. Each Data Store contains

all the textual data, found in the entire collection of XML documents, which can be

reached by a particular Data Path. For example, all the protein names would be stored

in one Data Store. The structure of the Data Store is a digital trie [295] that allows

fast searching of strings. Each node of the trie stores one character of the string and a

pointer to the child node (the next character in the string).

• XML Locater Lists contain a set of identifiers for the records from the source XML

dataset that contain a specific string in a particular Data Store. Hence, the identifiers

for all the database records that contain the protein name, p53, are stored in one XML

Locater List. The last node in the string (the “3” of p53) has a link to the XML

Locater List.
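The relationship between a Data Store and its XML Locater Lists can be illustrated with a minimal sketch. It is written in Python for brevity and is not the original Persistent Java implementation; the class names `TrieNode` and `DataStore`, and the integer record identifiers, are assumptions made for illustration only.

```python
class TrieNode:
    """One character of a stored string."""

    def __init__(self):
        self.children = {}        # next character -> child TrieNode
        self.locater_list = None  # set of record IDs, present only where a string ends


class DataStore:
    """All textual values reachable by one data path (e.g. all protein names)."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, value, record_id):
        # Walk (or create) one node per character of the value.
        node = self.root
        for ch in value:
            node = node.children.setdefault(ch, TrieNode())
        # The node for the final character links to the XML Locater List.
        if node.locater_list is None:
            node.locater_list = set()
        node.locater_list.add(record_id)

    def find(self, value):
        # Follow one child reference per character; stop early on a mismatch.
        node = self.root
        for ch in value:
            node = node.children.get(ch)
            if node is None:
                return set()
        return set(node.locater_list or ())


# Hypothetical record identifiers, for illustration only.
protein_names = DataStore()
protein_names.add("p53", 1)
protein_names.add("p53", 2)
protein_names.add("p21", 3)
print(sorted(protein_names.find("p53")))
```

Because strings with a common prefix share nodes, adding "p21" after "p53" creates only two new nodes; the "p" node is reused.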


Index A is designed to support very rapid retrieval of simple query terms against an XML

record. For example, to find all the documents that contain the protein name p53, the

following algorithm is executed:

1. The user must know which path corresponds to the data to be searched for; in this case it is PIR_Database/Protein_Entry/Protein_Name. A fully functional application would achieve this using a graphical user interface.

2. Look up the integer codes for each element in the XML Dictionary.

3. Search the Data Path Tree for a path that corresponds to the elements in the query.

4. When the leaf node is reached in the Data Path Tree, follow the link to the Data Store.

5. Match each character in the query string (p, then 5, then 3) in the Data Store.

6. Correct matches are obtained if there is an XML Locater List referenced from the “3”

node (of “p53”) in the Data Store. If it exists, the identifiers in the XML Locater List

can be used to find records in a database, or on the file system, which correctly answer

the query.
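The six steps above can be sketched as follows. This is a simplified illustration in Python, not the thesis implementation: the names `IndexA`, `PathNode` and `LOCATERS` are hypothetical, the Data Store trie is represented by nested dictionaries, and the record identifiers are invented.

```python
LOCATERS = "$locaters"  # hypothetical key marking the end of a stored string


class PathNode:
    """One element on a data path in the Data Path Tree."""

    def __init__(self, code):
        self.code = code
        self.children = {}      # element code -> PathNode
        self.data_store = None  # nested-dict trie, created when text is first stored


class IndexA:
    def __init__(self):
        self.dictionary = {}     # XML Dictionary: element name -> integer code
        self.root = PathNode(0)  # root of the Data Path Tree

    def add(self, path, value, record_id):
        node = self.root
        for name in path:
            code = self.dictionary.setdefault(name, len(self.dictionary) + 1)
            node = node.children.setdefault(code, PathNode(code))
        if node.data_store is None:
            node.data_store = {}
        trie = node.data_store
        for ch in value:
            trie = trie.setdefault(ch, {})
        trie.setdefault(LOCATERS, set()).add(record_id)

    def lookup(self, path, value):
        # Step 2: look up the integer code of each element name.
        try:
            codes = [self.dictionary[name] for name in path]
        except KeyError:
            return set()
        # Step 3: search the Data Path Tree for the path.
        node = self.root
        for code in codes:
            node = node.children.get(code)
            if node is None:
                return set()
        # Step 4: follow the link from the last path node to its Data Store.
        trie = node.data_store
        if trie is None:
            return set()
        # Step 5: match each character of the query string.
        for ch in value:
            trie = trie.get(ch)
            if trie is None:
                return set()
        # Step 6: a match exists only if a locater list hangs off the last character.
        return set(trie.get(LOCATERS, ()))


index = IndexA()
name_path = ["PIR_Database", "Protein_Entry", "Protein_Name"]
index.add(name_path, "p53", 101)
index.add(name_path, "p53", 102)
index.add(name_path, "p21", 103)
index.add(["PIR_Database", "Protein_Entry", "Organism", "Formal"], "Homo sapiens", 101)
print(sorted(index.lookup(name_path, "p53")))
```

Step 1 (choosing the path) is left to the caller here, as it is in the text.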

This is an extremely efficient structure for performing simple queries of this type because, once a query has been formulated, the number of object references that must be followed is almost the minimum possible. However, there are two problems with this index structure, which limit its usefulness for many queries. The first is that the order of elements within an XML document can be crucial, and it is not possible to specify a query of the type “find leaf node A followed by leaf node B” using this structure. Secondly, more complex queries of the type “find a record containing leaf node A with value X AND leaf node B with value Y” are performed inefficiently. Such a query can only be answered by performing two runs through the process described above and combining the two sets of results to find their intersection (the results present in both sets). If the two sets of results are large, the combination part of the query takes a considerable length of time (discussed further in Section A.3.4). An attempt to alleviate these two problems, without reducing the performance of simple queries, resulted in the design of Index B.

A.3.2 Index B

Index B retains the Data Path Tree, the Data Stores and the XML Dictionary from Index A but includes an additional component, the Structure Container, which stores a compressed representation of the structure of every XML record (Figure A.2). An XML record is represented in the Structure Container by a set of nodes that contain the integer code of the element, and pointers to child and sibling nodes. The order of elements in the Structure Container matches the original XML record. A leaf in the Structure Container has a pointer to a node in the corresponding Data Store, which represents the final character of the textual string. An entry in the Structure Container is effectively a compressed version of an XML record and can be used to reconstruct the source text. The textual values in the Data Store can be obtained by reading backward from the leaf node to the first node (e.g. reading 3, then 5, then p for p53). The Structure Container also stores the ID number of the record, which is referenced directly from the leaf nodes in the Data Store. The XML Locater Lists are not required and the records do not need to be stored in another file system or database because the Structure Container can be used to reconstruct an exact copy of the source XML.

Figure A.2: Index B has four components: the Data Path Tree, Data Stores, the Structure Container and the XML Dictionary (not shown).

[Figure legend: Data Pointer, Locator Pointer, Locater ID.]
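The backward read from a Data Store leaf, and its use in rebuilding a record, can be sketched as below. The sketch is in Python rather than Persistent Java and makes several assumptions that are not in the source: each trie node keeps a `parent` reference, an ordered list of children stands in for the child and sibling pointers, the dictionary code 10 for Protein_Name is invented, and XML escaping is ignored.

```python
class TrieNode:
    """One character in a Data Store, with a link back towards the root."""

    def __init__(self, char=None, parent=None):
        self.char = char
        self.parent = parent
        self.children = {}


def insert(root, text):
    """Store text in the trie and return the node of its final character."""
    node = root
    for ch in text:
        if ch not in node.children:
            node.children[ch] = TrieNode(ch, node)
        node = node.children[ch]
    return node


def read_back(node):
    """Recover a string by reading from its final character back to the root."""
    chars = []
    while node.parent is not None:
        chars.append(node.char)  # e.g. "3", then "5", then "p"
        node = node.parent
    return "".join(reversed(chars))


class StructureNode:
    """One element of a record in the Structure Container."""

    def __init__(self, code, data_pointer=None):
        self.code = code                  # integer code from the XML Dictionary
        self.data_pointer = data_pointer  # final-character trie node, for leaves
        self.children = []                # kept in document order


def to_xml(node, names):
    """Rebuild the source text of a record from its compressed structure."""
    tag = names[node.code]
    if node.data_pointer is not None:
        body = read_back(node.data_pointer)
    else:
        body = "".join(to_xml(child, names) for child in node.children)
    return "<%s>%s</%s>" % (tag, body, tag)


names = {1: "Protein_Entry", 7: "Organism", 9: "Formal", 10: "Protein_Name"}
name_store, organism_store = TrieNode(), TrieNode()

record = StructureNode(1)
record.children.append(StructureNode(10, insert(name_store, "p53")))
organism = StructureNode(7)
organism.children.append(StructureNode(9, insert(organism_store, "Homo sapiens")))
record.children.append(organism)

print(to_xml(record, names))
```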

A query of the type “find all the documents that contain the protein name p53” is performed in exactly the same way as for Index A. A Boolean “AND” query is performed using a more complex algorithm that is more efficient than performing two individual queries and finding the intersection of the results, as was required for Index A. The basic concept is that the first term is evaluated via the Data Path Tree as before. Records matched by the first term are retrieved from the Structure Container, and only those that contain the path corresponding to the second term of the query are evaluated further; the rest can be discarded. The value of the second query term is then searched for in the corresponding Data Store, as displayed in Figure A.3.

Figure A.3: A join query in Index B is implemented as a six-stage algorithm: i) search for the data path for the structure of the query; ii) search the Data Store; iii) follow pointers to structures; iv) search the Structure Containers; v) for matched structures, follow pointers back to the Data Stores; vi) match data in the Store.
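The idea behind the Index B join can be illustrated with a deliberately simplified sketch. It is in Python, not the original implementation, and it replaces the tries and pointers with plain dictionaries: `data_stores` maps a path to its values and their record IDs, and `structure_container` maps a record ID to its ordered (path, value) leaves. The function names, paths and records are hypothetical.

```python
def build(records):
    """Build simplified Data Stores and a Structure Container from records."""
    data_stores = {}          # path -> {value -> set of record IDs}
    structure_container = {}  # record ID -> ordered list of (path, value) leaves
    for record_id, leaves in records.items():
        structure_container[record_id] = list(leaves)
        for path, value in leaves:
            data_stores.setdefault(path, {}).setdefault(value, set()).add(record_id)
    return data_stores, structure_container


def and_query(data_stores, structure_container, first, second):
    """Find records matching both (path, value) terms, driven by the first term."""
    first_path, first_value = first
    second_path, second_value = second
    # Stages i-ii: evaluate the first term through its Data Store.
    candidates = data_stores.get(first_path, {}).get(first_value, set())
    matches = set()
    # Stages iii-vi: inspect only the records matched by the first term.
    for record_id in candidates:
        for path, value in structure_container[record_id]:
            # Records lacking the second path never reach the value comparison.
            if path == second_path and value == second_value:
                matches.add(record_id)
                break
    return matches


# Invented bibliographic records, for illustration only.
records = {
    1: [("article/author", "A. Smith"), ("article/author", "B. Jones"), ("article/year", "2001")],
    2: [("article/author", "A. Smith"), ("article/year", "2003")],
    3: [("article/author", "C. Brown"), ("article/year", "2001")],
}
data_stores, structure_container = build(records)
result = and_query(data_stores, structure_container,
                   ("article/author", "A. Smith"), ("article/year", "2001"))
print(sorted(result))
```

In this form the work is proportional to the number of records matched by the first term, so the large list of records for a given year is never materialised.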

A.3.3 Index creation

The time to construct Index A and B is given below:

Total records        1000    5000    64,000    240,000
Size of XML (MB)     0.4     2.1     25        100
Index A (s)          11.2    57.8    742       3240
Index B (s)          12.6    71.8    833       3521
Store size (MB)      17      40      281       1014

Table A.1: Build times in seconds for Index A and B for four different sizes of data set. The size of the persistent store for Index B is given in MB.


Table A.1 displays the time in seconds to build Index A and B on disk. Index B takes between roughly 9% and 24% more time to build than Index A, depending on the data set. The build times for Index A and B appear to grow approximately linearly with the size of the data. The store size grows less than linearly with collection size because, as the data size becomes large, the tries in the Data Stores become saturated. In other words, when data is added to a larger store, it is more likely that the same string of characters is already entirely (or partially) present in the store, and only a new XML location object has to be added. Secondly, the XML Dictionary and Data Path Tree grow very rapidly at the start of a build but subsequently grow only when new types of documents are encountered. A version of Index B was also created for the PIR database in XML as a single 800 MB file, giving rise to a persistent store of size 5.7 GB.
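The trends described for Table A.1 can be checked with a short calculation. The sketch below uses only the figures from the table; the variable names are illustrative.

```python
# Figures copied from Table A.1.
xml_mb = [0.4, 2.1, 25, 100]
index_a_s = [11.2, 57.8, 742, 3240]
index_b_s = [12.6, 71.8, 833, 3521]
store_mb = [17, 40, 281, 1014]  # persistent store size for Index B

# Extra build time of Index B relative to Index A, as a percentage.
overhead_pct = [round(100 * (b - a) / a, 1) for a, b in zip(index_a_s, index_b_s)]

# Build time per MB of XML for Index A; a roughly constant value indicates linear growth.
index_a_s_per_mb = [round(a / mb, 1) for a, mb in zip(index_a_s, xml_mb)]

# Store size divided by source size; a falling value indicates sub-linear growth.
expansion = [round(s / mb, 1) for s, mb in zip(store_mb, xml_mb)]

print(overhead_pct)
print(index_a_s_per_mb)
print(expansion)
```

The build time per megabyte stays within a narrow band, while the ratio of store size to source size falls steadily as the collection grows.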

A.3.4 Queries

The following searches were carried out over the indexes stored persistently on disk, and in

main memory, for the DBLP database.

1. Search for 10,000 authors’ names.

2. Search for 10,000 sets of two authors’ names in a single bibliographic reference.

3. Search for 10,000 sets of the author’s name and the year of publication.

Query sets 1 and 3 were obtained by selecting authors’ names (and the author and year

pairs) at random from a query retrieving all records from the collections. Set 2 was obtained

by selecting records at random from a query that retrieved all records containing two or

more authors. The searches are similar to the queries in the publication by Cooper et al.

[62]. While several parameters are not the same between these tests and those discussed by

Cooper, the tests are intended to give an initial benchmark, against which future comparisons

could be made. Queries were carried out on a four-processor Sun Enterprise 450 server running Solaris, with a clock speed of 300 MHz. The Java heap was set to a minimum size of 400 MB and a maximum size of 800 MB.

A.3.5 Index A Results

Table A.2 displays the results of timings for Index A for queries 1, 2 and 3. The timings demonstrate that Index A performs efficiently when retrieving single items of data, because query 1 is completed in a small amount of time both in memory and with a persistent index. Query 2 is also returned within a fairly short time frame; however, query 3 performs poorly. Measurements were made to determine the proportion of time required by different parts of the query. For query 3, it was observed that the vast majority of the time was spent not in the retrieval of data but in carrying out the join operation (join times are shown in parentheses in Table A.2). In this query, the search for the year of publication returned extremely large result sets. For every query, the list of identifiers for documents that contained the correct year was very large, and had to be compared with the identifiers returned for a particular author’s name, in a search taking O(nm) time (where n and m are the lengths of the two identifier lists), which is inefficient. An unexpected result was that query 3 takes longer to complete in memory than with a persistent store. The result may be explained by the lengthy join operations performing more slowly as the limits of memory are reached, with the entire index loaded into memory.

Query                1      2 (join time)    3 (join time)
In memory (s)        3.5    4.1 (1.9)        2194 (2032)
Persistent (s)       31     91 (41)          1473 (1353)

Table A.2: Summary of query timings for Index A; values are times in seconds, with the time spent on the join given in parentheses. Persistent results are from a test with a cold cache.
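The cost of the join can be made concrete with a small sketch. The first function below compares every identifier in one list with every identifier in the other, which is one way an O(nm) join can arise; the source does not give the actual join code, so this is an assumption. The second function is a hash-based alternative that is not part of the original work and is shown only for comparison. The identifier values are invented.

```python
def intersect_nested(author_ids, year_ids):
    """O(n*m): compare each author hit against each year hit."""
    matches = []
    for a in author_ids:
        for y in year_ids:
            if a == y:
                matches.append(a)
                break
    return matches


def intersect_hashed(author_ids, year_ids):
    """Expected O(n + m): build a hash set once, then probe it."""
    year_set = set(year_ids)
    return [a for a in author_ids if a in year_set]


# A short author list joined against a much longer list for one year.
author_ids = [3, 17, 42]
year_ids = list(range(0, 3000, 3))

print(intersect_nested(author_ids, year_ids))
print(intersect_hashed(author_ids, year_ids))
```

Both functions return the same identifiers; they differ only in how the work grows as the year list becomes large.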

A.3.6 Index B Results

Query               1       2       3
Cold cache (s)      450     1298    1692
Caching on (s)      0.78    2.1     2.4

Table A.3: Summary of query timings for Index B, with different caching procedures.

The results for Index B are fairly complex but a summary of the main results is shown

in Table A.3. The results for “Caching On” refer to timing the queries after the index has

already been accessed for other tests in the same run, and “Cold cache” refers to performing

the queries immediately after starting Java. The method used to carry out join queries did

not allow the timing of the separate stages of the query to be assessed, therefore an exact

measure of the length of the join stage is not given. In theory, Index B carries out query

joins in a more efficient manner than Index A. The small time to retrieve query 3 with

“Caching On” demonstrates that query joins are carried out very efficiently in Index B. An

unexpected finding was that Index B retrieves single items of data significantly more slowly than

Index A when run with a cold cache (31 s for Index A, 450 s for Index B). The result for


query 1 is particularly unexpected because the Structure Container is the only difference

between Index A and B, yet the XML structure objects within the Structure Container are

not accessed to complete query 1. Query 1 is completed by the author’s name being matched

in the correct Data Store and pointers followed to the locater objects. The locater objects are

referenced from the Data Stores in Index A, but are stored alongside the root of objects in

the Structure Container in Index B. The exact same number of operations will be performed

to complete query 1 on Index A and B. Therefore, the slow-down that is observed must be

accounted for by a difference in the overall size of the persistent store, or by differences in

the way in which objects are physically located on disk. The overall size of Index B is approximately one third

greater than that of Index A (1 GB vs 750 MB), but retrieval times differ by more than

10-fold. Therefore, it is likely that the containers in Index B are stored differently across

disk partitions, compared to Index A. Future work is required in this area to understand the

effects of different caching procedures and different methods of making containers persistent.

A set of similar queries was attempted for the PIR database, stored using the Index B

structure. A query to find 10,000 author names, retrieving 4.5 million records in total, takes

244 seconds from a cold cache, and only 3 seconds if the search is repeated with a warm

cache. A series of Boolean “AND” queries was attempted but errors were encountered due

to the limits of the PJama technology being reached.

A.3.7 Visualisation

One facet of querying an XML representation is enabling the user to visualise the structure

of the data and find the leaf nodes that correspond to the semantics of the data they wish

to find. An interface was created displaying the Data Path Tree (Figure A.4) enabling the

user to browse the structure of the data. Boolean queries can be formulated, and records

returned by the search can be viewed.

A.4 Discussion

The results for querying the DBLP bibliography demonstrate that rapid queries can be

performed against fairly large collections of data. The index returned results very slowly

from a cold cache, but rapidly if searches were repeated and data were already cached in

main memory. Unexpectedly, much better results were observed for a simpler index structure that did not have the XML Structure Container, even for comparisons over queries that did not utilise the Structure Container in the search. The problems observed may be related to the manner in which PJama caches objects in main memory, with the more complex index structure being accessed much less efficiently. The caching policy of PJama is difficult to analyse, and therefore future improvements to the index structure would require implementing the structure in a different technology, such as the native disk I/O (input/output) methods offered by Java v1.4 [167].

Figure A.4: A prototype interface for querying an indexed store of XML data.

The investigation demonstrates that indexing XML is a viable approach for storage of

local versions of biological databases, without requiring the overhead of implementing and

maintaining a relational database management system. The Xtect project has extended

this work at Glasgow University, and software has been developed to store local versions of

several different biological databases, and in addition to indexing, to perform matching across

different databases to find data paths that are semantically equivalent. Redundant paths

are removed from the index and a digest of the information that can be found about each

gene or protein can be presented to the user. This represents an advance in data integration

utilising XML, which is likely to be extended further in the near future.

The indexing investigation also demonstrates that a feasible solution for storage of bio-

logical data may be to use a native XML representation, although the technology employed

in this investigation is not robust enough for a large-scale system. There are several commercial

database systems for XML, which may be a viable alternative to relational database

storage for biological data, especially for cases where there are frequent changes to the data

that must be stored.

Appendix B

Detailed diagrams of FGE-OM

The diagrams in this section display the contents of the six packages within the Proteomic-

sOM namespace of FGE-OM, which are described in detail in Chapter 4.


Figure B.1: The ProteinSeparation package of FGE-OM.


Figure B.2: The ProteomeBioAssay package.


Figure B.3: The ProteinData package.


Figure B.4: The ProteinRecord package.


Figure B.5: The MassSpecProtocol package.


Figure B.6: The MassSpecData package.

Appendix C

Database schema for RAPAD

/*==============================================================*/

/* Table: ACQUISITION */

/*==============================================================*/

create table ACQUISITION (

ACQUISITION_ID NUMBER(8) not null,

ASSAY_ID NUMBER(8) not null,

PROTOCOL_ID NUMBER(10),

CHANNEL_ID NUMBER(4),

ACQUISITION_DATE DATE,

NAME VARCHAR2(100),

URI VARCHAR2(255),

MODIFICATION_DATE DATE not null,

USER_READ NUMBER(1) not null,

USER_WRITE NUMBER(1) not null,

GROUP_READ NUMBER(1) not null,

GROUP_WRITE NUMBER(1) not null,

OTHER_READ NUMBER(1) not null,

OTHER_WRITE NUMBER(1) not null,

ROW_USER_ID NUMBER(12) not null,

ROW_GROUP_ID NUMBER(12) not null,

ROW_PROJECT_ID NUMBER(12) not null,

ROW_ALG_INVOCATION_ID NUMBER(12) not null,

constraint PK_ACQUISITION primary key (ACQUISITION_ID)

)

/

/*==============================================================*/

/* Index: ACQUISITION_IND05 */

/*==============================================================*/

create index ACQUISITION_IND05 on ACQUISITION (

ASSAY_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


CHANNEL_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


PROTOCOL_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


NAME ASC

)

/

/*==============================================================*/

/* Table: ACQUISITIONPARAM */

/*==============================================================*/

create table ACQUISITIONPARAM (

ACQUISITION_PARAM_ID NUMBER(5) not null,


PROTOCOL_PARAM_ID NUMBER(10),

NAME VARCHAR2(100) not null,

VALUE VARCHAR2(50) not null,












constraint PK_ACQUISITIONPARAM primary key


(ACQUISITION_PARAM_ID)

)

/

/*==============================================================*/

/* Index: ACQPARAM_AK01 */

/*==============================================================*/

create unique index ACQPARAM_AK01 on ACQUISITIONPARAM (

ACQUISITION_ID ASC,

NAME ASC

)

/

/*==============================================================*/

/* Table: ANALYSIS */

/*==============================================================*/

create table ANALYSIS (

ANALYSIS_ID NUMBER(5) not null,


DESCRIPTION VARCHAR2(500),












constraint PK_ANALYSIS primary key (ANALYSIS_ID)

)

/

/*==============================================================*/

/* Table: ANALYSISIMPLEMENTATION */

/*==============================================================*/

create table ANALYSISIMPLEMENTATION (

ANALYSIS_IMPLEMENTATION_ID NUMBER(5) not null,

ANALYSIS_ID NUMBER(5) not null,














constraint PK_ANALYSISIMPLEMENTATION3 primary key

(ANALYSIS_IMPLEMENTATION_ID)

)

/

/*==============================================================*/

/* Index: ANALYSISIMPLEMENTATION_IND01 */

/*==============================================================*/

create index ANALYSISIMPLEMENTATION_IND01 on

ANALYSISIMPLEMENTATION (

ANALYSIS_ID ASC

)

/

/*==============================================================*/

/* Table: ANALYSISIMPLEMENTATIONPARAM */

/*==============================================================*/

create table ANALYSISIMPLEMENTATIONPARAM (

ANALYSIS_IMP_PARAM_ID NUMBER(5) not null,















constraint PK_ANALYSISIMPLPARAM primary key

(ANALYSIS_IMP_PARAM_ID)

)

/

/*==============================================================*/

/* Index: ANALYSISIMPPARAM_IND01 */

/*==============================================================*/

create index ANALYSISIMPPARAM_IND01 on ANALYSISIMPLEMENTATIONPARAM

(

ANALYSIS_IMPLEMENTATION_ID ASC

)

/

/*==============================================================*/

/* Table: ANALYSISINPUT */

/*==============================================================*/

create table ANALYSISINPUT (

ANALYSIS_INPUT_ID NUMBER(5) not null,

ANALYSIS_INVOCATION_ID NUMBER(5) not null,

TABLE_ID NUMBER(5),

INPUT_ROW_ID NUMBER(10),

INPUT_VALUE VARCHAR2(50),













constraint PK_ANALYSISINPUT primary key (ANALYSIS_INPUT_ID)

)

/

/*==============================================================*/

/* Index: ANALYSISINPUT_IND01 */

/*==============================================================*/

create index ANALYSISINPUT_IND01 on ANALYSISINPUT (

ANALYSIS_INVOCATION_ID ASC

)

/

/*==============================================================*/

/* Index: ANALYSISINPUT_IND02 */

/*==============================================================*/

create index ANALYSISINPUT_IND02 on ANALYSISINPUT (

TABLE_ID ASC

)

/

/*==============================================================*/

/* Table: ANALYSISINVOCATION */

/*==============================================================*/

create table ANALYSISINVOCATION (
















constraint PK_ANALYSISINVOCATION primary key

(ANALYSIS_INVOCATION_ID)

)

/

/*==============================================================*/

/* Index: ANALYSISINVOCATION_IND01 */

/*==============================================================*/

create index ANALYSISINVOCATION_IND01 on ANALYSISINVOCATION (

ANALYSIS_IMPLEMENTATION_ID ASC

)

/

/*==============================================================*/

/* Table: ANALYSISINVOCATIONPARAM */

/*==============================================================*/

create table ANALYSISINVOCATIONPARAM (

ANALYSIS_INVOCATION_PARAM_ID NUMBER(5) not null,















constraint PK_ANAYLSISINVOCATIONPARAM primary key

(ANALYSIS_INVOCATION_PARAM_ID)

)

/

/*==============================================================*/

/* Index: ANALYSISINVOCATIONPARAM_IND01 */

/*==============================================================*/

create index ANALYSISINVOCATIONPARAM_IND01 on

ANALYSISINVOCATIONPARAM (


)

/

/*==============================================================*/

/* Table: ANALYSISOUTPUT */

/*==============================================================*/

create table ANALYSISOUTPUT (

ANALYSIS_OUTPUT_ID NUMBER(10) not null,



TYPE VARCHAR2(50) not null,

VALUE NUMBER(5) not null,












constraint PK_ANALYSISOUTPUT primary key (ANALYSIS_OUTPUT_ID)

)


/

/*==============================================================*/

/* Index: ANALYSISOUTPUT_IND01 */

/*==============================================================*/

create index ANALYSISOUTPUT_IND01 on ANALYSISOUTPUT (


)

/

/*==============================================================*/

/* Table: ANALYTEMEASUREMENT */

/*==============================================================*/

create table ANALYTEMEASUREMENT (

BIO_MATERIAL_MEASUREMENT_ID NUMBER(10) not null,

BIO_MATERIAL_ID NUMBER(8) not null,

BIOASSAY_TREATMENT_ID NUMBER(8),

VALUE FLOAT,

UNIT_TYPE_ID NUMBER(5),

MEASUREMENT_DESCRIPTION VARCHAR2(300),












constraint PK_ANALYTE_MEASUREMENT primary key

(BIO_MATERIAL_MEASUREMENT_ID)

)

/

/*==============================================================*/

/* Index: ANALYTE_MEASUREMENT_IND04 */

/*==============================================================*/

create unique index ANALYTE_MEASUREMENT_IND04 on

ANALYTEMEASUREMENT (

BIO_MATERIAL_ID ASC

)

/

/*==============================================================*/

/* Table: ARRAY */

/*==============================================================*/

create table ARRAY (

ARRAY_ID NUMBER(4) not null,

MANUFACTURER_ID NUMBER(12) not null,

PLATFORM_TYPE_ID NUMBER(10) not null,

SUBSTRATE_TYPE_ID NUMBER(10),


EXTERNAL_DATABASE_RELEASE_ID NUMBER(4),

SOURCE_ID VARCHAR2(100),


VERSION VARCHAR2(50) not null,


ARRAY_DIMENSIONS VARCHAR2(50),

ELEMENT_DIMENSIONS VARCHAR2(50),

NUMBER_OF_ELEMENTS NUMBER(10),

NUM_ARRAY_COLUMNS NUMBER(3),

NUM_ARRAY_ROWS NUMBER(3),

NUM_GRID_COLUMNS NUMBER(3),

NUM_GRID_ROWS NUMBER(3),

NUM_SUB_COLUMNS NUMBER(6),

NUM_SUB_ROWS NUMBER(6),












constraint PK_ARRAY primary key (ARRAY_ID)

)

/

/*==============================================================*/

/* Index: ARRAY_AK01 */

/*==============================================================*/

create unique index ARRAY_AK01 on ARRAY (

NAME ASC,

VERSION ASC

)

/

/*==============================================================*/

/* Index: ARRAY_IND02 */

/*==============================================================*/

create index ARRAY_IND02 on ARRAY (

EXTERNAL_DATABASE_RELEASE_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


PLATFORM_TYPE_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


SUBSTRATE_TYPE_ID ASC

)

/


/*==============================================================*/


/*==============================================================*/


PROTOCOL_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


MANUFACTURER_ID ASC

)

/

/*==============================================================*/

/* Table: ARRAYANNOTATION */

/*==============================================================*/

create table ARRAYANNOTATION (

ARRAY_ANNOTATION_ID NUMBER(5) not null,















constraint PK_ARRAYANNOTATION primary key (ARRAY_ANNOTATION_ID)

)

/

/*==============================================================*/

/* Index: ARRAYANNOTATION_IND01 */

/*==============================================================*/

create index ARRAYANNOTATION_IND01 on ARRAYANNOTATION (

ARRAY_ID ASC

)

/

/*==============================================================*/

/* Table: ASSAY */

/*==============================================================*/

create table ASSAY (




ASSAY_DATE DATE,

ARRAY_IDENTIFIER VARCHAR2(100),

ARRAY_BATCH_IDENTIFIER VARCHAR2(100),

OPERATOR_ID NUMBER(10) not null,



NAME VARCHAR2(100),













constraint PK_ASSAY primary key (ASSAY_ID)

)

/

/*==============================================================*/

/* Index: ASSAY_INDEX */

/*==============================================================*/

create index ASSAY_INDEX on ASSAY (

ARRAY_ID ASC

)

/

/*==============================================================*/

/* Index: ASSAY_IND02 */

/*==============================================================*/

create index ASSAY_IND02 on ASSAY (

OPERATOR_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


PROTOCOL_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/



)

/

/*==============================================================*/


/*==============================================================*/


NAME ASC


)

/

/*==============================================================*/

/* Table: ASSAYBIOMATERIAL */

/*==============================================================*/

create table ASSAYBIOMATERIAL (

ASSAY_BIO_MATERIAL_ID NUMBER(5) not null,














constraint PK_ASSAYBIOMATERIAL primary key

(ASSAY_BIO_MATERIAL_ID)

)

/

/*==============================================================*/

/* Index: ASSAYBIOMATERIAL_IND01 */

/*==============================================================*/

create index ASSAYBIOMATERIAL_IND01 on ASSAYBIOMATERIAL (

BIO_MATERIAL_ID ASC

)

/

/*==============================================================*/

/* Index: ASSAYBIOMATERIAL_IND02 */

/*==============================================================*/

create index ASSAYBIOMATERIAL_IND02 on ASSAYBIOMATERIAL (

ASSAY_ID ASC

)

/

/*==============================================================*/

/* Table: ASSAYDATAPOINT */

/*==============================================================*/

create table ASSAYDATAPOINT (

id NUMBER(8) not null,

time float not null,

protein_assay float not null,

lc_column NUMBER(8) not null,

constraint PK_ASSAYDATAPOINT primary key (lc_column, id)

)

/

/*==============================================================*/

/* Table: ASSAYGROUP */

/*==============================================================*/

create table ASSAYGROUP (

STUDY_ID NUMBER(4) not null,


STUDY_DESIGN_ID NUMBER(5) not null,

FACTOR_VALUE VARCHAR2(100) not null,

STUDY_FACTOR_VALUE_ID NUMBER(8),

constraint PK_ASSAYGROUP primary key

(STUDY_ID, ASSAY_ID, STUDY_DESIGN_ID)

)

/

/*==============================================================*/

/* Table: ASSAYLABELEDEXTRACT */

/*==============================================================*/

create table ASSAYLABELEDEXTRACT (

ASSAY_LABELED_EXTRACT_ID NUMBER(8) not null,


LABELED_EXTRACT_ID NUMBER(8) not null,

CHANNEL_ID NUMBER(4) not null,












constraint PK_ASSAYLABELEDEXTRACT primary key

(ASSAY_LABELED_EXTRACT_ID)

)

/

/*==============================================================*/

/* Index: ASSAYLABELEDEXTRACT_IND01 */

/*==============================================================*/

create index ASSAYLABELEDEXTRACT_IND01 on ASSAYLABELEDEXTRACT (

ASSAY_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


CHANNEL_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/



LABELED_EXTRACT_ID ASC

)

/

/*==============================================================*/

/* Table: ASSAYPARAM */

/*==============================================================*/

create table ASSAYPARAM (

ASSAY_PARAM_ID NUMBER(10) not null,


PROTOCOL_PARAM_ID NUMBER(10) not null,













constraint PK_ASSAYPARAM primary key (ASSAY_PARAM_ID)

)

/

/*==============================================================*/

/* Index: ASSAYPARAM_IND01 */

/*==============================================================*/

create index ASSAYPARAM_IND01 on ASSAYPARAM (

ASSAY_ID ASC

)

/

/*==============================================================*/

/* Index: ASSAYPARAM_IND02 */

/*==============================================================*/

create index ASSAYPARAM_IND02 on ASSAYPARAM (

PROTOCOL_PARAM_ID ASC

)

/

/*==============================================================*/

/* Table: ASSAYPARAMPROT */

/*==============================================================*/

create table ASSAYPARAMPROT (

ASSAY_PARAM_ID NUMBER(10) not null,

PROTEOME_ASSAY_ID NUMBER(8),














constraint PK_ASSAYPARAM_PROT primary key (ASSAY_PARAM_ID)

)

/

/*==============================================================*/

/* Index: ASSAYPARAM_PROT_IND02 */

/*==============================================================*/

create index ASSAYPARAM_PROT_IND02 on ASSAYPARAMPROT (


)

/

/*==============================================================*/

/* Table: BAND */

/*==============================================================*/

create table BAND (


area float,

intensity float,

local_background float,

annotation varchar(200),

annotation_source varchar(200),

volume float,

pixel_x_coord float,

pixel_y_coord float,

pixel_radius float,

normalisation varchar(200),

normalised_volume float,

lane_number float not null,

apparent_mass float not null,

gel_1d NUMBER(8) not null,

physicalGelSpot_ID NUMBER(8),












constraint PK_BAND primary key (gel_1d, id)

)

/

/*==============================================================*/

/* Table: BIOASSAYTREATMENT */

/*==============================================================*/


create table BIOASSAYTREATMENT (

BIOASSAY_TREATMENT_ID NUMBER(8) not null,

ORDER_NUM NUMBER(3) not null,



TREATMENT_TYPE_ID NUMBER(10),












constraint PK_BIOASSAYTREATMENT primary key

(BIOASSAY_TREATMENT_ID)

)

/

/*==============================================================*/

/* Table: BIOMATERIALCHARACTERISTIC */

/*==============================================================*/

create table BIOMATERIALCHARACTERISTIC (

BIO_MATERIAL_CHARACTERISTIC_ID NUMBER(5) not null,


ONTOLOGY_ENTRY_ID NUMBER(10) not null,

VALUE VARCHAR2(100),












constraint PK_BIOAMATCHARACTERISTIC primary key

(BIO_MATERIAL_CHARACTERISTIC_ID)

)

/

/*==============================================================*/

/* Index: BIOMATCHARACTERISTIC_IND01 */

/*==============================================================*/

create index BIOMATCHARACTERISTIC_IND01 on

BIOMATERIALCHARACTERISTIC (

BIO_MATERIAL_ID ASC

)

/

/*==============================================================*/

/* Index: BIOMATCHARACTERISTIC_IND02 */

/*==============================================================*/

create index BIOMATCHARACTERISTIC_IND02 on

BIOMATERIALCHARACTERISTIC (

ONTOLOGY_ENTRY_ID ASC

)

/

/*==============================================================*/

/* Table: BIOMATERIALIMP */

/*==============================================================*/

create table BIOMATERIALIMP (


LABEL_METHOD_ID NUMBER(4),

TAXON_ID NUMBER(10),

BIO_SOURCE_PROVIDER_ID NUMBER(12),

BIO_MATERIAL_TYPE_ID NUMBER(10),

SUBCLASS_VIEW VARCHAR2(27) not null,



STRING1 VARCHAR2(100),













constraint PK_BIOMATERIALIMP primary key (BIO_MATERIAL_ID)

)

/

/*==============================================================*/

/* Index: BIOMATERIALIMP_IND01 */

/*==============================================================*/

create index BIOMATERIALIMP_IND01 on BIOMATERIALIMP (

LABEL_METHOD_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


TAXON_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


BIO_MATERIAL_TYPE_ID ASC

)

/


/*==============================================================*/


/*==============================================================*/


BIO_SOURCE_PROVIDER_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/



)

/

/*==============================================================*/

/* Table: BIOMATERIALMEASUREMENT */

/*==============================================================*/

create table BIOMATERIALMEASUREMENT (

BIO_MATERIAL_MEASUREMENT_ID NUMBER(10) not null,

TREATMENT_ID NUMBER(10) not null,


VALUE FLOAT,













constraint PK_BIOMATERIALMEASUREMENT primary key

(BIO_MATERIAL_MEASUREMENT_ID)

)

/

/*==============================================================*/

/* Index: BIOMATERIALMEASUREMENT_IND03 */

/*==============================================================*/

create index BIOMATERIALMEASUREMENT_IND03 on

BIOMATERIALMEASUREMENT (

TREATMENT_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/



BIO_MATERIAL_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/



UNIT_TYPE_ID ASC

)

/

/*==============================================================*/

/* Table: BOUNDARYPOINT */

/*==============================================================*/

create table BOUNDARYPOINT (


pixel_x_coord float not null,

pixel_y_coord float not null,

spot_gel_2d NUMBER(8) not null,

spot_id NUMBER(8) not null,

physicalGelItem_ID NUMBER(8) not null,

gel2D NUMBER(8) not null,












constraint PK_BOUNDARYPOINT primary key

(spot_gel_2d, spot_id, physicalGelItem_ID, gel2D, id)

)

/

/*==============================================================*/

/* Table: CHANNEL */

/*==============================================================*/

create table CHANNEL (



DEFINITION VARCHAR2(500) not null,












constraint PK_CHANNEL primary key (CHANNEL_ID)


)

/

/*==============================================================*/

/* Index: CHANNEL_AK01 */

/*==============================================================*/

create unique index CHANNEL_AK01 on CHANNEL (

NAME ASC

)

/

/*==============================================================*/

/* Table: CHEMICALTREATMENT */

/*==============================================================*/

create table CHEMICALTREATMENT (

chemical_treatment_ID NUMBER(8) not null,


treatment_type NUMBER(10),

digestion varchar(200) not null,

derivatisations varchar(200) not null,












constraint PK_CHEMICALTREATMENT primary key

(chemical_treatment_ID)

)

/

/*==============================================================*/

/* Table: COLLISIONCELL */

/*==============================================================*/

create table COLLISIONCELL (

collision_cellID NUMBER(8) not null,

mz_analysis_ID NUMBER(8),

gas_type varchar(100) not null,

gas_pressure float not null,

collision_offset float not null,












constraint PK_COLLISIONCELL primary key (collision_cellID)

)

/

/*==============================================================*/

/* Table: COMPOSITEELEMENTANNOTATION */

/*==============================================================*/

create table COMPOSITEELEMENTANNOTATION (

COMPOSITE_ELEMENT_ANNOT_ID NUMBER(12) not null,

COMPOSITE_ELEMENT_ID NUMBER(10) not null,














constraint PK_COMPOSITEELEMENTANNOTATION primary key

(COMPOSITE_ELEMENT_ANNOT_ID)

)

/

/*==============================================================*/

/* Index: COMPELEMENTANNOT_IND01 */

/*==============================================================*/

create index COMPELEMENTANNOT_IND01 on COMPOSITEELEMENTANNOTATION

(

COMPOSITE_ELEMENT_ID ASC

)

/

/*==============================================================*/

/* Table: COMPOSITEELEMENTGUS */

/*==============================================================*/

create table COMPOSITEELEMENTGUS (

COMPOSITE_ELEMENT_GUS_ID NUMBER(12) not null,

COMPOSITE_ELEMENT_ID NUMBER(10),

TABLE_ID NUMBER(5) not null,

ROW_ID NUMBER(12) not null,












constraint PK_COMPOSITEELEMENTGUS primary key

(COMPOSITE_ELEMENT_GUS_ID)


)

/

/*==============================================================*/

/* Table: COMPOSITEELEMENTIMP */

/*==============================================================*/

create table COMPOSITEELEMENTIMP (


PARENT_ID NUMBER(10),





TINYINT1 NUMBER(3),

SMALLINT1 NUMBER(5),


CHAR1 VARCHAR2(5),

CHAR2 VARCHAR2(5),

TINYSTRING1 VARCHAR2(50),


SMALLSTRING1 VARCHAR2(100),















constraint PK_COMPOSITEELEMENTIMP primary key

(COMPOSITE_ELEMENT_ID)

)

/

/*==============================================================*/

/* Index: RAD3_SPOTFAMILY_IND01 */

/*==============================================================*/

create index RAD3_SPOTFAMILY_IND01 on COMPOSITEELEMENTIMP (

COMPOSITE_ELEMENT_ID ASC,

SMALLSTRING1 ASC,

SMALLSTRING2 ASC

)

/

/*==============================================================*/

/* Index: RAD3_SPOTFAMILY_IND02 */

/*==============================================================*/

create index RAD3_SPOTFAMILY_IND02 on COMPOSITEELEMENTIMP (

ARRAY_ID ASC,

EXTERNAL_DATABASE_RELEASE_ID ASC,

SOURCE_ID ASC

)

/

/*==============================================================*/

/* Index: SAGETAG_IND01 */

/*==============================================================*/

create index SAGETAG_IND01 on COMPOSITEELEMENTIMP (

ARRAY_ID ASC,

TINYSTRING1 ASC

)

/

/*==============================================================*/


/*==============================================================*/


PARENT_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/



)

/

/*==============================================================*/

/* Table: COMPOSITEELEMENTRESULTIMP */

/*==============================================================*/

create table COMPOSITEELEMENTRESULTIMP (

COMPOSITE_ELEMENT_RESULT_ID NUMBER(10) not null,


QUANTIFICATION_ID NUMBER(8) not null,


FLOAT1 FLOAT,

FLOAT2 FLOAT,

FLOAT3 FLOAT,

FLOAT4 FLOAT,

INT1 NUMBER(12),




TINYINT1 NUMBER(3),

TINYINT2 NUMBER(3),

TINYINT3 NUMBER(3),

CHAR1 VARCHAR2(5),

CHAR2 VARCHAR2(5),

CHAR3 VARCHAR2(5),















constraint PK_SFMRES primary key (COMPOSITE_ELEMENT_RESULT_ID)

)

/

/*==============================================================*/

/* Index: COMPELEMENTRESULTIMP_IND01 */

/*==============================================================*/

create index COMPELEMENTRESULTIMP_IND01 on

COMPOSITEELEMENTRESULTIMP (


SUBCLASS_VIEW ASC

)

/

/*==============================================================*/

/* Index: COMPELEMENTRESULTIMP_IND02 */

/*==============================================================*/

create index COMPELEMENTRESULTIMP_IND02 on

COMPOSITEELEMENTRESULTIMP (


QUANTIFICATION_ID ASC,

SUBCLASS_VIEW ASC

)

/

/*==============================================================*/

/* Table: CONTROL */

/*==============================================================*/

create table CONTROL (

CONTROL_ID NUMBER(5) not null,

CONTROL_TYPE_ID NUMBER(10) not null,




NAME VARCHAR2(100),













constraint PK_CONTROL primary key (CONTROL_ID)

)

/

/*==============================================================*/

/* Index: CONTROL_IND01 */

/*==============================================================*/

create index CONTROL_IND01 on CONTROL (

TABLE_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


ASSAY_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


CONTROL_TYPE_ID ASC

)

/

/*==============================================================*/

/* Table: DATABASEENTRY */

/*==============================================================*/

create table DATABASEENTRY (

database_name NUMBER(10),

database_version NUMBER(10),

database_uri NUMBER(10),

database_entry_ID NUMBER(8) not null,

accession VARCHAR(500),












constraint PK_DATABASE_ENTRY primary key (database_entry_ID)

)

/

/*==============================================================*/

/* Table: DBSEARCH */

/*==============================================================*/

create table DBSEARCH (

db_search_ID NUMBER(8) not null,

peak_list_ID NUMBER(8),

db_search_parameters_ID NUMBER(8),

username varchar(100) not null,

id_date date not null,

n_terminal_aa varchar(100),

c_terminal_aa varchar(100),

count_of_specific_aa NUMBER(8),


name_of_counted_aa varchar(100),

regex_pattern varchar(100),












search_file_uri VARCHAR2(300),

constraint PK_DBSEARCH primary key (db_search_ID)

)

/

/*==============================================================*/

/* Table: DBSEARCHPARAMETERS */

/*==============================================================*/

create table DBSEARCHPARAMETERS (

db_search_parameters_ID NUMBER(8) not null,


program varchar(100) not null,

database varchar(100) not null,

database_date date not null,

taxonomical_filter NUMBER(8),

fixed_modifications varchar(100),

variable_modifications varchar(100),

max_missed_cleavages NUMBER(8),

mass_value_type varchar(100),

fragment_ion_tolerance float,

peptide_mass_tolerance float,

accurate_mass_mode NUMBER(8),

mass_error_type varchar(100),

mass_error float,

protonated NUMBER(8),

icat_option NUMBER(8),












constraint PK_DBSEARCHPARAMETERS primary key

(db_search_parameters_ID)

)

/

/*==============================================================*/

/* Table: DETECTION */

/*==============================================================*/

create table DETECTION (

detection_ID NUMBER(8) not null,

type varchar(9)

constraint CKC_TYPE_DETECTIO check (type is null or (

type in ('photomultiplier','electron multiplier',

'micro-channel plate','ICR') )),

constraint PK_DETECTION primary key (detection_ID)

)

/

/*==============================================================*/

/* Table: DIGESINGLESPOT */

/*==============================================================*/

create table DIGESINGLESPOT (

identified_spot_ID NUMBER(8),

DIGESingleSpot_ID NUMBER(8) not null,

GEL_IMAGE_ANALYSIS_ID NUMBER(8) not null,

SPOT_MEASURES_ID NUMBER(10),












constraint PK_DIGESINGLESPOT primary key (DIGESingleSpot_ID)

)

/

/*==============================================================*/

/* Table: ELECTROSPRAY */

/*==============================================================*/

create table ELECTROSPRAY (

electrospray_ID NUMBER(8) not null,

ion_source_ID NUMBER(8),

spray_tip_voltage float,

spray_tip_diameter float not null,

solution_voltage float,

cone_voltage float not null,

loading_type varchar(2)

constraint CKC_LOADING_TYPE_ELECTROS check

(loading_type is null or ( loading_type in ('LC','DI') )),

solvent varchar(100) not null,

interface_manufacturer varchar(200) not null,

spray_tip_manufacturer varchar(200) not null,

ion_source NUMBER(8),













constraint PK_ELECTROSPRAY primary key (electrospray_ID)

)

/

/*==============================================================*/

/* Table: ELEMENTANNOTATION */

/*==============================================================*/

create table ELEMENTANNOTATION (

ELEMENT_ANNOTATION_ID NUMBER(10) not null,

ELEMENT_ID NUMBER(10) not null,














constraint PK_ELEMENTANNOTATION primary key

(ELEMENT_ANNOTATION_ID)

)

/

/*==============================================================*/

/* Index: ELEMENTANNOTATION_IND01 */

/*==============================================================*/

create index ELEMENTANNOTATION_IND01 on ELEMENTANNOTATION (

ELEMENT_ID ASC

)

/

/*==============================================================*/

/* Table: ELEMENTIMP */

/*==============================================================*/

create table ELEMENTIMP (


COMPOSITE_ELEMENT_ID NUMBER(10),


ELEMENT_TYPE_ID NUMBER(10),




TINYINT1 NUMBER(3),


CHAR1 VARCHAR2(5),

CHAR2 VARCHAR2(5),

CHAR3 VARCHAR2(5),

CHAR4 VARCHAR2(5),

CHAR5 VARCHAR2(5),

CHAR6 VARCHAR2(5),

CHAR7 VARCHAR2(5),


















constraint PK_ELEMENT primary key (ELEMENT_ID)

)

/

/*==============================================================*/


/*==============================================================*/

create index ARRAY_IND01 on ELEMENTIMP (

ARRAY_ID ASC

)

/

/*==============================================================*/

/* Index: RAD3_SPOT_IND02 */

/*==============================================================*/

create index RAD3_SPOT_IND02 on ELEMENTIMP (

COMPOSITE_ELEMENT_ID ASC

)

/

/*==============================================================*/

/* Index: SHORTOLIGO_IND02 */

/*==============================================================*/

create index SHORTOLIGO_IND02 on ELEMENTIMP (

ELEMENT_TYPE_ID ASC

)

/

/*==============================================================*/

/* Index: SHORTOLIGO_IND03 */

/*==============================================================*/

create index SHORTOLIGO_IND03 on ELEMENTIMP (


)

/

/*==============================================================*/

/* Table: ELEMENTRESULTIMP */

/*==============================================================*/


create table ELEMENTRESULTIMP (

ELEMENT_RESULT_ID NUMBER(10) not null,


COMPOSITE_ELEMENT_RESULT_ID NUMBER(10),



FOREGROUND FLOAT,

BACKGROUND FLOAT,

FOREGROUND_SD FLOAT,

BACKGROUND_SD FLOAT,

FLOAT1 FLOAT,

FLOAT2 FLOAT,

FLOAT3 FLOAT,

FLOAT4 FLOAT,

FLOAT5 FLOAT,

FLOAT6 FLOAT,

FLOAT7 FLOAT,

FLOAT8 FLOAT,

FLOAT9 FLOAT,

FLOAT10 FLOAT,

FLOAT11 FLOAT,

FLOAT12 FLOAT,

FLOAT13 FLOAT,

FLOAT14 FLOAT,

INT1 NUMBER(12),

INT2 NUMBER(12),

INT3 NUMBER(12),

INT4 NUMBER(12),

INT5 NUMBER(12),

INT6 NUMBER(12),

INT7 NUMBER(12),

INT8 NUMBER(12),

INT9 NUMBER(12),

INT10 NUMBER(12),

INT11 NUMBER(12),

INT12 NUMBER(12),

INT13 NUMBER(12),

INT14 NUMBER(12),

INT15 NUMBER(12),

TINYINT1 NUMBER(3),

TINYINT2 NUMBER(3),

TINYINT3 NUMBER(3),




CHAR1 VARCHAR2(5),

CHAR2 VARCHAR2(5),

CHAR3 VARCHAR2(5),

CHAR4 VARCHAR2(5),



















constraint PK_ELEMENTRESULT_N primary key (ELEMENT_RESULT_ID)

)

/

/*==============================================================*/

/* Index: ELEMENTRESULTIMP_IND01 */

/*==============================================================*/

create index ELEMENTRESULTIMP_IND01 on ELEMENTRESULTIMP (

ELEMENT_ID ASC,

SUBCLASS_VIEW ASC

)

/

/*==============================================================*/


/*==============================================================*/


ELEMENT_ID ASC,

QUANTIFICATION_ID ASC,

SUBCLASS_VIEW ASC

)

/

/*==============================================================*/


/*==============================================================*/


COMPOSITE_ELEMENT_RESULT_ID ASC

)

/

/*==============================================================*/

/* Table: FRACTION */

/*==============================================================*/

create table FRACTION (

Fraction_ID NUMBER(8) not null,

start_point float not null,

end_point float not null,

protein_assay float,

LCColumn_ID NUMBER(8) not null,

BIO_MATERIAL_ID NUMBER(8),












constraint PK_FRACTION primary key (Fraction_ID, LCColumn_ID)


)

/

/*==============================================================*/

/* Table: GEL1D */

/*==============================================================*/

create table GEL1D (

Gel1D_ID NUMBER(8) not null,


description varchar(100) not null,

equipment varchar(200) not null,

percent_acrylamide float not null,

solubilization_buffer varchar(100) not null,

stain_details varchar(200) not null,


in_gel_digestion varchar(100),

background varchar(100),

pixel_size_x varchar(100),

pixel_size_y varchar(100),

denaturing_agent varchar(100),

mass_start float,

mass_end float,

run_details varchar(300),












constraint PK_GEL1D primary key (Gel1D_ID)

)

/

/*==============================================================*/

/* Table: GEL2D */

/*==============================================================*/

create table GEL2D (

Gel2D_ID NUMBER(8) not null,


description varchar(500),

equipment varchar(200),

percent_acrylamide float,

solubilization_buffer varchar(100),

stain_details varchar(200),


in_gel_digestion varchar(100),

background varchar(100),

pixel_size_x varchar(100),

pixel_size_y varchar(100),

pi_start float,

pi_end float,

mass_start float,

mass_end float,

first_dim_details varchar(200),

second_dim_details varchar(200),

dimensionX NUMBER(8),

dimensionY NUMBER(8),

dimensionZ NUMBER(8),












constraint PK_GEL2D primary key (Gel2D_ID)

)

/

/*==============================================================*/

/* Table: GELIMAGEANALYSIS */

/*==============================================================*/

create table GELIMAGEANALYSIS (



GEL_IMAGE_ANALYSIS_ID NUMBER(8) not null,

processing_description VARCHAR(300),

warped_image VARCHAR(100),

warping_map VARCHAR(100),












name VARCHAR(100),

IMAGEANALYSIS_DATE DATE,

constraint PK_GEL_IMAGE_ANALYSIS primary key

(GEL_IMAGE_ANALYSIS_ID)

)

/

/*==============================================================*/

/* Table: GRADIENTSTEP */

/*==============================================================*/

create table GRADIENTSTEP (

GradientStep_ID NUMBER(8) not null,

step_time float not null,














constraint PK_GRADIENTSTEP primary key

(lc_column, GradientStep_ID)

)

/

/*==============================================================*/

/* Table: HEXAPOLE */

/*==============================================================*/

create table HEXAPOLE (

hexapole_ID NUMBER(8) not null,














constraint PK_HEXAPOLE primary key (hexapole_ID)

)

/

/*==============================================================*/

/* Table: IDENTIFIEDSPOT */

/*==============================================================*/

create table IDENTIFIEDSPOT (

identified_spot_ID NUMBER(8) not null,

area float,

intensity float,

local_background float,

annotation varchar(200),

annotation_source varchar(200),

volume float,

pixel_x_coord float,

pixel_y_coord float,

pixel_radius float,

normalisation varchar(200),

normalised_volume float,

apparent_pi float,

apparent_mass float,


GEL_IMAGE_ANALYSIS_ID NUMBER(8),

SPOT_MEASURES_ID NUMBER(10),

peakHeight NUMBER(8),












constraint PK_IDENTIFIEDSPOT primary key (identified_spot_ID)

)

/

/*==============================================================*/

/* Table: IMAGEACQUISITION */

/*==============================================================*/

create table IMAGEACQUISITION (

IMAGE_ACQUISITION_ID NUMBER(8) not null,


CHANNEL_ID NUMBER(4),


ACQUISITION_DATE DATE,

NAME VARCHAR2(100),

URI VARCHAR2(255),












constraint PK_IMAGE_ACQUISITION primary key

(IMAGE_ACQUISITION_ID)

)

/

/*==============================================================*/


/*==============================================================*/

create index ACQUISITION_IND02 on IMAGEACQUISITION (

CHANNEL_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


PROTOCOL_ID ASC

)

/

/*==============================================================*/



/*==============================================================*/


NAME ASC

)

/

/*==============================================================*/

/* Table: INTEGRITYSTATINPUT */

/*==============================================================*/

create table INTEGRITYSTATINPUT (

INTEGRITY_STAT_INPUT_ID NUMBER(10) not null,

INTEGRITY_STATISTIC_ID NUMBER(8) not null,

INPUT_TABLE_ID NUMBER(10) not null,

INPUT_ROW_ID NUMBER(10) not null,

ROW_DESIGNATION VARCHAR2(200),

IS_TRUSTED_INPUT NUMBER(1),












constraint PK_INTEGRITYSTATINPUT primary key

(INTEGRITY_STAT_INPUT_ID)

)

/

/*==============================================================*/

/* Index: INTEGRITYSTATINPUT_IND01 */

/*==============================================================*/

create index INTEGRITYSTATINPUT_IND01 on INTEGRITYSTATINPUT (

INTEGRITY_STATISTIC_ID ASC

)

/

/*==============================================================*/

/* Index: INTEGRITYSTATINPUT_IND02 */

/*==============================================================*/

create index INTEGRITYSTATINPUT_IND02 on INTEGRITYSTATINPUT (

INPUT_TABLE_ID ASC

)

/

/*==============================================================*/

/* Table: INTEGRITYSTATISTIC */

/*==============================================================*/

create table INTEGRITYSTATISTIC (

INTEGRITY_STATISTIC_ID NUMBER(8) not null,

STATISTIC_METHOD VARCHAR2(200) not null,

TRUSTED_INPUT_FORMULA VARCHAR2(200),













constraint PK_INTEGRITYSTATISTIC primary key

(INTEGRITY_STATISTIC_ID)

)

/

/*==============================================================*/

/* Table: IONSOURCE */

/*==============================================================*/

create table IONSOURCE (

ion_source_ID NUMBER(8) not null,

collision_energy float,

type varchar(12)

constraint CKC_TYPE_IONSOURC check (type is null or

( type in ('MALDI','Electrospray','OtherIonisation') )),

mz_analysis NUMBER(8),












constraint PK_IONSOURCE primary key (ion_source_ID)

)

/

/*==============================================================*/

/* Table: IONTRAP */

/*==============================================================*/

create table IONTRAP (

ion_trap_ID NUMBER(8) not null,


gas_type varchar(100) not null,

gas_pressure float not null,

rf_frequency float,

excitation_amplitude float,

isolation_centre float not null,

isolation_width float not null,

final_ms_level float,













constraint PK_IONTRAP primary key (ion_trap_ID)

)

/

/*==============================================================*/

/* Table: LABELMETHOD */

/*==============================================================*/

create table LABELMETHOD (

LABEL_METHOD_ID NUMBER(4) not null,

PROTOCOL_ID NUMBER(10) not null,


LABEL_USED VARCHAR2(50),

LABEL_METHOD VARCHAR2(1000),












constraint PK_LABEL primary key (LABEL_METHOD_ID)

)

/

/*==============================================================*/

/* Index: LABELMETHOD_IND01 */

/*==============================================================*/

create index LABELMETHOD_IND01 on LABELMETHOD (

PROTOCOL_ID ASC

)

/

/*==============================================================*/

/* Index: LABELMETHOD_IND02 */

/*==============================================================*/

create index LABELMETHOD_IND02 on LABELMETHOD (

CHANNEL_ID ASC

)

/

/*==============================================================*/

/* Table: LCCOLUMN */

/*==============================================================*/

create table LCCOLUMN (

LCColumn_ID NUMBER(8) not null,

BIOASSAY_TREATMENT_ID NUMBER(8),


manufacturer varchar(100) not null,

part_number varchar(50) not null,

batch_number varchar(50) not null,

internal_length float not null,

internal_diameter float not null,

stationary_phase varchar(200) not null,

bead_size float,

pore_size float,

temperature float not null,

flow_rate float,

injection_volume float not null,

parameters_file varchar(100) not null,

lc_column NUMBER(8),












constraint PK_LCCOLUMN primary key (LCColumn_ID)

)

/

/*==============================================================*/

/* Table: LISTPROCESSING */

/*==============================================================*/

create table LISTPROCESSING (

list_processing_ID NUMBER(8) not null,

smoothing_process varchar(100) not null,

background_threshold float not null,













constraint PK_LISTPROCESSING primary key (list_processing_ID)

)

/

/*==============================================================*/

/* Table: MAGEDOCUMENTATION */

/*==============================================================*/


create table MAGEDOCUMENTATION (

MAGE_DOCUMENTATION_ID NUMBER(5) not null,

MAGE_ML_ID NUMBER(8) not null,



MAGE_IDENTIFIER VARCHAR2(100) not null,












constraint PK_MAGEDOCUMENTATION primary key

(MAGE_DOCUMENTATION_ID)

)

/

/*==============================================================*/

/* Index: MAGEDOCUMENTATION_IND01 */

/*==============================================================*/

create index MAGEDOCUMENTATION_IND01 on MAGEDOCUMENTATION (

TABLE_ID ASC

)

/

/*==============================================================*/

/* Index: MAGEDOCUMENTATION_IND02 */

/*==============================================================*/

create index MAGEDOCUMENTATION_IND02 on MAGEDOCUMENTATION (

MAGE_ML_ID ASC

)

/

/*==============================================================*/

/* Table: MAGEML */

/*==============================================================*/

create table MAGEML (

MAGE_ML_ID NUMBER(8) not null,

MAGE_PACKAGE VARCHAR2(100) not null,

MAGE_ML CLOB not null,












constraint PK_MAGEML primary key (MAGE_ML_ID)

)

/

/*==============================================================*/

/* Table: MALDI */

/*==============================================================*/

create table MALDI (

MALDI_ID NUMBER(8) not null,


laser_wavelength float not null,

laser_power float,

matrix_type varchar(100),

grid_voltage float not null,

acceleration_voltage float not null,

ion_mode varchar(50) not null,













constraint PK_MALDI primary key (MALDI_ID)

)

/

/*==============================================================*/

/* Table: MASSSPECEXPERIMENT */

/*==============================================================*/

create table MASSSPECEXPERIMENT (

mass_spec_experiment_ID NUMBER(8) not null,

BIOASSAY_TREATMENT_ID NUMBER(8) not null,

MSMachineID NUMBER(8),


parameters_file varchar(200),












constraint PK_MASSSPECEXPERIMENT primary key

(mass_spec_experiment_ID)

)

/


/*==============================================================*/

/* Table: MASSSPECMACHINE */

/*==============================================================*/

create table MASSSPECMACHINE (

mass_spec_machine_ID NUMBER(8) not null,


manufacturer varchar(200) not null,

model_name varchar(200) not null,

software_version varchar(200) not null,












constraint PK_MASSSPECMACHINE primary key

(mass_spec_machine_ID)

)

/

/*==============================================================*/

/* Table: MATCHEDSPOTS */

/*==============================================================*/

create table MATCHEDSPOTS (

matched_spots_ID NUMBER(8) not null,

multiple_analysis_ID NUMBER(8) not null,

identified_spot_ID NUMBER(8) not null,












constraint PK_MATCHED_SPOTS primary key

(matched_Spots_ID, identified_spot_ID)

)

/

/*==============================================================*/

/* Table: MOBILEPHASECOMPONENT */

/*==============================================================*/

create table MOBILEPHASECOMPONENT (



concentration float not null,













constraint PK_MOBILEPHASECOMPONENT primary key (id)

)

/

/*==============================================================*/

/* Index: MPC_IND */

/*==============================================================*/

create index MPC_IND on MOBILEPHASECOMPONENT (

lc_column ASC

)

/

/*==============================================================*/

/* Table: MSMSFRACTION */

/*==============================================================*/

create table MSMSFRACTION (


msms_fraction_ID NUMBER(8) not null,

target_m_to_z float not null,

plus_or_minus float not null,












constraint PK_MSMSFRACTION primary key (msms_fraction_ID)

)

/

/*==============================================================*/

/* Table: MULTIPLEANALYSIS */

/*==============================================================*/

create table MULTIPLEANALYSIS (

multiple_analysis_ID NUMBER(8) not null,

analysis_type NUMBER(10),


description VARCHAR(300),













constraint PK_MULTIPLE_ANALYSIS primary key

(multiple_analysis_ID)

)

/

/*==============================================================*/

/* Table: MULTIPLEANALYSISGELIA */

/*==============================================================*/

create table MULTIPLEANALYSISGELIA (

MULT_ANALYSIS_GELIA_ID NUMBER(8) NOT NULL,

GEL_IMAGE_ANALYSIS_ID NUMBER(8) NOT NULL,

MULTIPLE_ANALYSIS_ID NUMBER(8) NOT NULL,












constraint PK_MULTIPLE_ANALYSIS_GELIA primary key

(MULT_ANALYSIS_GELIA_ID)

)

/

/*==============================================================*/

/* Table: MZANALYSIS */

/*==============================================================*/

create table MZANALYSIS (

mz_analysis_ID NUMBER(8) not null,

detection_ID NUMBER(8),

type varchar(14)

constraint CKC_TYPE_MZANALYS check (type is null or (

type in ('Quadrupole','Hexapole','IonTrap','CollisionCell',

'ToF','OthermzAnalysis') )),












constraint PK_MZANALYSIS primary key (mz_analysis_ID)

)

/

/*==============================================================*/

/* Table: ONTOLOGYENTRY */

/*==============================================================*/

create table ONTOLOGYENTRY (


PARENT_ID NUMBER(10),

TABLE_ID NUMBER(8),

ROW_ID NUMBER(12),



URI VARCHAR2(500),

NAME VARCHAR2(100),

CATEGORY VARCHAR2(100) not null,


DEFINITION VARCHAR2(500),












constraint PK_ONTOLOGYENTRY primary key (ONTOLOGY_ENTRY_ID)

)

/

/*==============================================================*/

/* Index: ONTOLOGYENTRY_AK01 */

/*==============================================================*/

create unique index ONTOLOGYENTRY_AK01 on ONTOLOGYENTRY (

CATEGORY ASC,

VALUE ASC

)

/

/*==============================================================*/

/* Index: ONTOLOGYENTRY_IND01 */

/*==============================================================*/

create index ONTOLOGYENTRY_IND01 on ONTOLOGYENTRY (

PARENT_ID ASC

)

/
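
/*==============================================================*/

/* Example (illustrative only, not part of the generated script): */

/* an Oracle hierarchical query that walks the ontology tree.     */

/* Assumes PARENT_ID holds the ONTOLOGY_ENTRY_ID of the parent    */

/* term and that root terms have a null PARENT_ID; neither is     */

/* enforced by a constraint in this listing. The PARENT_ID index  */

/* above may help this kind of traversal.                         */

/*==============================================================*/

select LEVEL, ONTOLOGY_ENTRY_ID, CATEGORY, NAME

from ONTOLOGYENTRY

start with PARENT_ID is null

connect by prior ONTOLOGY_ENTRY_ID = PARENT_ID

/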

/*==============================================================*/


/*==============================================================*/


TABLE_ID ASC


)

/

/*==============================================================*/


/*==============================================================*/



)

/

/*==============================================================*/

/* Table: OTHERIONISATION */

/*==============================================================*/

create table OTHERIONISATION (

otherIonisation_ID NUMBER(8) not null,


ONTOLOGY_ENTRY_ID NUMBER(10),

name varchar(100) not null,













constraint PK_OTHERIONISATION primary key (otherIonisation_ID)

)

/

/*==============================================================*/

/* Table: OTHERMZANALYSIS */

/*==============================================================*/

create table OTHERMZANALYSIS (

other_mz_analysis_ID NUMBER(8) not null,


ONTOLOGY_ENTRY_ID NUMBER(10),

name varchar(50) not null,













constraint PK_OTHERMZANALYSIS primary key

(other_mz_analysis_ID)

)

/

/*==============================================================*/

/* Table: PEAK */

/*==============================================================*/

create table PEAK (

peak_ID NUMBER(8) not null,

m_to_z float not null,

abundance float,

multiplicity NUMBER(8),

peak_list_ID NUMBER(8) not null,












constraint PK_PEAK primary key (peak_ID, peak_list_ID)

)

/

/*==============================================================*/

/* Table: PEAKLIST */

/*==============================================================*/

create table PEAKLIST (

peak_list_ID NUMBER(8) not null,

mass_spec_experiment_ID NUMBER(8),

list_type varchar(11) not null

constraint CKC_LIST_TYPE_PEAKLIST check (list_type in

('Full List','Edited List','MSMS Result')),


mass_value_type varchar(50),












constraint PK_PEAKLIST primary key (peak_list_ID)

)

/
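
/*==============================================================*/

/* Example (illustrative only, not part of the generated script): */

/* listing the peaks of every MS/MS result peak list. Assumes     */

/* PEAK.peak_list_ID refers to PEAKLIST.peak_list_ID; no foreign  */

/* key is declared in this part of the listing.                   */

/*==============================================================*/

select pl.peak_list_ID, p.peak_ID, p.m_to_z, p.abundance

from PEAKLIST pl, PEAK p

where p.peak_list_ID = pl.peak_list_ID

and pl.list_type = 'MSMS Result'

order by pl.peak_list_ID, p.m_to_z

/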

/*==============================================================*/

/* Table: PEPTIDEHIT */

/*==============================================================*/


create table PEPTIDEHIT (

peptide_hit_ID NUMBER(8) not null,

score float not null,

score_type varchar(100) not null,

sequence varchar(100) not null,

information varchar(100),

probability float,

db_search_ID NUMBER(8),

database_entry_ID NUMBER(8),

constraint PK_PEPTIDEHIT primary key (peptide_hit_ID)

)

/

/*==============================================================*/

/* Table: PERCENTX */

/*==============================================================*/

create table PERCENTX (

Percent_ID NUMBER(8) not null,

lc_column NUMBER(8),

GradientStep_ID NUMBER(8),

percentage float not null,

mobile_phase_component NUMBER(8) not null,

gradient_step_lc_column NUMBER(8) not null,

gradient_step_id NUMBER(8) not null,












constraint PK_PERCENTX primary key (Percent_ID)

)

/

/*==============================================================*/

/* Table: PHYSICALGELITEM */

/*==============================================================*/

create table PHYSICALGELITEM (


gel2D NUMBER(8),

Gel1D_ID NUMBER(8),


ProteinRecord_ID NUMBER(8),













constraint PK_PHYSICALGELITEM primary key (physicalGelItem_ID)

)

/

/*==============================================================*/

/* Table: PROCESSIMPLEMENTATION */

/*==============================================================*/

create table PROCESSIMPLEMENTATION (

PROCESS_IMPLEMENTATION_ID NUMBER(5) not null,

PROCESS_TYPE_ID NUMBER(10) not null,

NAME VARCHAR2(100),












constraint PK_PROCESSIMPLEMENTATION primary key

(PROCESS_IMPLEMENTATION_ID)

)

/

/*==============================================================*/

/* Index: PROCESSIMPLEMENTATION_IND_01 */

/*==============================================================*/

create index PROCESSIMPLEMENTATION_IND_01 on PROCESSIMPLEMENTATION

(

PROCESS_TYPE_ID ASC

)

/

/*==============================================================*/

/* Table: PROCESSIMPLEMENTATIONPARAM */

/*==============================================================*/

create table PROCESSIMPLEMENTATIONPARAM (

PROCESS_IMPLEMETATION_PARAM_ID NUMBER(5) not null,
















constraint PK_PROCESSIMPPARAM primary key

(PROCESS_IMPLEMETATION_PARAM_ID)

)

/

/*==============================================================*/

/* Index: PROCESSIMPPARAM_IND01 */

/*==============================================================*/

create index PROCESSIMPPARAM_IND01 on PROCESSIMPLEMENTATIONPARAM (

PROCESS_IMPLEMENTATION_ID ASC

)

/

/*==============================================================*/

/* Table: PROCESSINVOCATION */

/*==============================================================*/

create table PROCESSINVOCATION (

PROCESS_INVOCATION_ID NUMBER(5) not null,


PROCESS_INVOCATION_DATE DATE not null,













constraint PK_PROCESSINV primary key (PROCESS_INVOCATION_ID)

)

/

/*==============================================================*/

/* Index: PROCESSINVOCATION_IND01 */

/*==============================================================*/

create index PROCESSINVOCATION_IND01 on PROCESSINVOCATION (

PROCESS_IMPLEMENTATION_ID ASC

)

/

/*==============================================================*/

/* Table: PROCESSINVOCATIONPARAM */

/*==============================================================*/

create table PROCESSINVOCATIONPARAM (

PROCESS_INVOCATION_PARAM_ID NUMBER(8) not null,















constraint PK_PROCESSINVOCATIONPARAM primary key

(PROCESS_INVOCATION_PARAM_ID)

)

/

/*==============================================================*/

/* Index: PROCESSINVOCATIONPARAM_IND01 */

/*==============================================================*/

create index PROCESSINVOCATIONPARAM_IND01 on

PROCESSINVOCATIONPARAM (

PROCESS_INVOCATION_ID ASC

)

/

/*==============================================================*/

/* Table: PROCESSINVQUANTIFICATION */

/*==============================================================*/

create table PROCESSINVQUANTIFICATION (

PROCESS_INV_QUANTIFICATION_ID NUMBER(8) not null,














constraint PK_PROCESSINVQUANT primary key

(PROCESS_INV_QUANTIFICATION_ID)

)

/

/*==============================================================*/

/* Index: PROCESSINVQUANTIFICATION_IND01 */

/*==============================================================*/

create index PROCESSINVQUANTIFICATION_IND01 on

PROCESSINVQUANTIFICATION (


)

/

/*==============================================================*/

/* Index: PROCESSINVQUANTIFICATION_IND02 */

/*==============================================================*/

create index PROCESSINVQUANTIFICATION_IND02 on


PROCESSINVQUANTIFICATION (

QUANTIFICATION_ID ASC

)

/

/*==============================================================*/

/* Table: PROCESSIO */

/*==============================================================*/

create table PROCESSIO (

PROCESS_IO_ID NUMBER(12) not null,



INPUT_RESULT_ID NUMBER(12) not null,

INPUT_ROLE VARCHAR2(50),

OUTPUT_RESULT_ID NUMBER(12) not null,












constraint PK_PROCESSIO primary key (PROCESS_IO_ID)

)

/

/*==============================================================*/

/* Index: PROCESSIO_IND01 */

/*==============================================================*/

create index PROCESSIO_IND01 on PROCESSIO (

TABLE_ID ASC

)

/

/*==============================================================*/

/* Index: PROCESSIO_IND_01 */

/*==============================================================*/

create index PROCESSIO_IND_01 on PROCESSIO (

OUTPUT_RESULT_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/



)

/

/*==============================================================*/


/*==============================================================*/


INPUT_RESULT_ID ASC,

OUTPUT_RESULT_ID ASC,

TABLE_ID ASC

)

/

/*==============================================================*/

/* Table: PROCESSIOELEMENT */

/*==============================================================*/

create table PROCESSIOELEMENT (

PROCESS_IO_ELEMENT_ID NUMBER(10) not null,

PROCESS_IO_ID NUMBER(12) not null,













constraint PK_PROCESSIOELEMENT primary key

(PROCESS_IO_ELEMENT_ID)

)

/

/*==============================================================*/

/* Index: PROCESSIOELEMENT_IND01 */

/*==============================================================*/

create index PROCESSIOELEMENT_IND01 on PROCESSIOELEMENT (

PROCESS_IO_ID ASC

)

/

/*==============================================================*/

/* Table: PROCESSRESULT */

/*==============================================================*/

create table PROCESSRESULT (

PROCESS_RESULT_ID NUMBER(12) not null,

VALUE FLOAT not null,













constraint PK_RESULT5 primary key (PROCESS_RESULT_ID)


)

/

/*==============================================================*/

/* Index: PROCESSRESULT_IND01 */

/*==============================================================*/

create index PROCESSRESULT_IND01 on PROCESSRESULT (

PROCESS_RESULT_ID ASC,

VALUE ASC

)

/

/*==============================================================*/

/* Index: PROCESSRESULT_IND02 */

/*==============================================================*/

create index PROCESSRESULT_IND02 on PROCESSRESULT (

UNIT_TYPE_ID ASC

)

/

/*==============================================================*/

/* Table: PROTEINHIT */

/*==============================================================*/

create table PROTEINHIT (

protein_hit_ID NUMBER(8) not null,

ProteinRecord_ID NUMBER(8),

peptide_hit_ID NUMBER(8),













score float,

percent_seq_coverage float,

peptides_matched VARCHAR2(100),

e_value float,

db_search_ID NUMBER(8),

constraint PK_PROTEINHIT primary key (protein_hit_ID)

)

/
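
/*==============================================================*/

/* Example (illustrative only, not part of the generated script): */

/* pairing each protein hit with its supporting peptide hit.      */

/* Assumes PROTEINHIT.peptide_hit_ID refers to                    */

/* PEPTIDEHIT.peptide_hit_ID; the relationship is inferred from   */

/* the column names, not from a declared foreign key.             */

/*==============================================================*/

select ph.protein_hit_ID, ph.score as protein_score,

ph.percent_seq_coverage, pep.sequence,

pep.score as peptide_score, pep.score_type

from PROTEINHIT ph, PEPTIDEHIT pep

where pep.peptide_hit_ID = ph.peptide_hit_ID

order by ph.protein_hit_ID, pep.score desc

/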

/*==============================================================*/

/* Table: PROTEINMODIFICATION */

/*==============================================================*/

create table PROTEINMODIFICATION (

protein_modification_ID NUMBER(8) not null,

modification_type NUMBER(10),

protein_record_ID NUMBER(8),

start_pos NUMBER(8),

end_pos NUMBER(8),













constraint PK_PROTEINMODIFICATION primary key

(protein_modification_ID)

)

/

/*==============================================================*/

/* Table: PROTEINRECORD */

/*==============================================================*/

create table PROTEINRECORD (

protein_record_ID NUMBER(8) not null,












TAXON_NAME_ID NUMBER(10),

protein_name VARCHAR2(100),

pI FLOAT(126),

mW FLOAT(126),

constraint PK_PROTEINRECORD primary key (protein_record_ID)

)

/
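
/*==============================================================*/

/* Example (illustrative only, not part of the generated script): */

/* proteins with at least one recorded modification, with the     */

/* position of each modification. Assumes                         */

/* PROTEINMODIFICATION.protein_record_ID refers to                */

/* PROTEINRECORD.protein_record_ID.                               */

/*==============================================================*/

select r.protein_record_ID, r.protein_name, r.pI, r.mW,

m.modification_type, m.start_pos, m.end_pos

from PROTEINRECORD r, PROTEINMODIFICATION m

where m.protein_record_ID = r.protein_record_ID

order by r.protein_record_ID, m.start_pos

/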

/*==============================================================*/

/* Table: PROTEINRECORDENTRY */

/*==============================================================*/

create table PROTEINRECORDENTRY (

pr_record_entry_id NUMBER(8) not null,

protein_record_ID NUMBER(8) not null,

database_entry_ID NUMBER(8) not null,













constraint PK_PROTEINRECORDENTRY primary key

(pr_record_entry_id)

)

/

/*==============================================================*/

/* Table: PROTEOMEASSAY */

/*==============================================================*/

create table PROTEOMEASSAY (

PROTEOME_ASSAY_ID NUMBER(8) not null,


ASSAY_DATE DATE,

OPERATOR_ID NUMBER(10) not null,



NAME VARCHAR2(100),













constraint PK_PROTEOME_ASSAY primary key (PROTEOME_ASSAY_ID)

)

/

/*==============================================================*/

/* Index: PROTEOME_ASSAY_IND02 */

/*==============================================================*/

create index PROTEOME_ASSAY_IND02 on PROTEOMEASSAY (

OPERATOR_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


PROTOCOL_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/



)

/

/*==============================================================*/


/*==============================================================*/


NAME ASC

)

/

/*==============================================================*/

/* Table: PROTOCOL */

/*==============================================================*/

create table PROTOCOL (


PROTOCOL_TYPE_ID NUMBER(10) not null,

SOFTWARE_TYPE_ID NUMBER(10),

HARDWARE_TYPE_ID NUMBER(10),

BIBLIOGRAPHIC_REFERENCE_ID NUMBER(10),




URI VARCHAR2(100),

PROTOCOL_DESCRIPTION VARCHAR2(4000),

HARDWARE_DESCRIPTION VARCHAR2(500),

SOFTWARE_DESCRIPTION VARCHAR2(500),












constraint PK_PROTOCOL primary key (PROTOCOL_ID)

)

/

/*==============================================================*/

/* Index: PROTOCOL_IND01 */

/*==============================================================*/

create index PROTOCOL_IND01 on PROTOCOL (

PROTOCOL_TYPE_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


SOFTWARE_TYPE_ID ASC

)

/

/*==============================================================*/



/*==============================================================*/


HARDWARE_TYPE_ID ASC

)

/

/*==============================================================*/

/* Table: PROTOCOLPARAM */

/*==============================================================*/

create table PROTOCOLPARAM (




DATA_TYPE_ID NUMBER(10),














constraint PK_PROTOCOLPARAM primary key (PROTOCOL_PARAM_ID)

)

/

/*==============================================================*/

/* Index: PROTOCOLPARAM_IND01 */

/*==============================================================*/

create index PROTOCOLPARAM_IND01 on PROTOCOLPARAM (

DATA_TYPE_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


UNIT_TYPE_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


PROTOCOL_ID ASC

)

/

/*==============================================================*/

/* Table: QUADRUPOLE */

/*==============================================================*/

create table QUADRUPOLE (

quadrupole_ID NUMBER(8) not null,















constraint PK_QUADRUPOLE primary key (quadrupole_ID)

)

/

/*==============================================================*/

/* Table: QUANTIFICATION */

/*==============================================================*/

create table QUANTIFICATION (



OPERATOR_ID NUMBER(10),


RESULT_TABLE_ID NUMBER(5),

QUANTIFICATION_DATE DATE,

NAME VARCHAR2(100),

URI VARCHAR2(500),












constraint PK_QUANTIFICATION primary key (QUANTIFICATION_ID)

)

/

/*==============================================================*/

/* Index: QUANTIFICATION_IND01 */

/*==============================================================*/

create index QUANTIFICATION_IND01 on QUANTIFICATION (

ACQUISITION_ID ASC

)

/


/*==============================================================*/


/*==============================================================*/


OPERATOR_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


PROTOCOL_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


RESULT_TABLE_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


NAME ASC

)

/

/*==============================================================*/

/* Table: QUANTIFICATIONPARAM */

/*==============================================================*/

create table QUANTIFICATIONPARAM (

QUANTIFICATION_PARAM_ID NUMBER(5) not null,


PROTOCOL_PARAM_ID NUMBER(10),














constraint PK_QUANTIFICATIONPARAM primary key

(QUANTIFICATION_PARAM_ID)

)

/

/*==============================================================*/

/* Index: QUANTPARAM_AK01 */

/*==============================================================*/

create unique index QUANTPARAM_AK01 on QUANTIFICATIONPARAM (

NAME ASC,


)

/

/*==============================================================*/

/* Table: RELATEDACQUISITION */

/*==============================================================*/

create table RELATEDACQUISITION (

RELATED_ACQUISITION_ID NUMBER(4) not null,


ASSOCIATED_ACQUISITION_ID NUMBER(8) not null,

NAME VARCHAR2(100),

DESIGNATION VARCHAR2(50),

ASSOCIATED_DESIGNATION VARCHAR2(50),












constraint PK_RELASSAY primary key (RELATED_ACQUISITION_ID)

)

/

/*==============================================================*/

/* Index: RELATEDACQUISITION_IND01 */

/*==============================================================*/

create index RELATEDACQUISITION_IND01 on RELATEDACQUISITION (

ACQUISITION_ID ASC

)

/

/*==============================================================*/

/* Index: RELATEDACQUISITION_IND02 */

/*==============================================================*/

create index RELATEDACQUISITION_IND02 on RELATEDACQUISITION (

ASSOCIATED_ACQUISITION_ID ASC

)

/

/*==============================================================*/

/* Table: RELATEDQUANTIFICATION */

/*==============================================================*/

create table RELATEDQUANTIFICATION (

RELATED_QUANTIFICATION_ID NUMBER(4) not null,



ASSOCIATED_QUANTIFICATION_ID NUMBER(8) not null,

NAME VARCHAR2(100),

DESIGNATION VARCHAR2(50),

ASSOCIATED_DESIGNATION VARCHAR2(50),












constraint PK_RELATEDQUANTIFICATION primary key

(RELATED_QUANTIFICATION_ID)

)

/

/*==============================================================*/

/* Index: RELATEDQUANTIFICATION_IND01 */

/*==============================================================*/

create index RELATEDQUANTIFICATION_IND01 on RELATEDQUANTIFICATION

(


)

/

/*==============================================================*/

/* Index: RELATEDQUANTIFICATION_IND02 */

/*==============================================================*/

create index RELATEDQUANTIFICATION_IND02 on RELATEDQUANTIFICATION

(

ASSOCIATED_QUANTIFICATION_ID ASC

)

/

/*==============================================================*/

/* Table: SPOTMEASURESIMP */

/*==============================================================*/

create table SPOTMEASURESIMP (

SPOT_MEASURES_ID NUMBER(10) not null,


FLOAT1 FLOAT,

FLOAT2 FLOAT,

FLOAT3 FLOAT,

FLOAT4 FLOAT,

FLOAT5 FLOAT,

FLOAT6 FLOAT,

FLOAT7 FLOAT,

FLOAT8 FLOAT,

FLOAT9 FLOAT,

FLOAT10 FLOAT,

FLOAT11 FLOAT,

FLOAT12 FLOAT,

FLOAT13 FLOAT,

FLOAT14 FLOAT,

INT1 NUMBER(12),

INT2 NUMBER(12),

INT3 NUMBER(12),

INT4 NUMBER(12),

INT5 NUMBER(12),

INT6 NUMBER(12),

INT7 NUMBER(12),

INT8 NUMBER(12),

INT9 NUMBER(12),

INT10 NUMBER(12),

INT11 NUMBER(12),

INT12 NUMBER(12),

INT13 NUMBER(12),

INT14 NUMBER(12),

INT15 NUMBER(12),

TINYINT1 NUMBER(3),

TINYINT2 NUMBER(3),

TINYINT3 NUMBER(3),




CHAR1 VARCHAR2(5),

CHAR2 VARCHAR2(5),

CHAR3 VARCHAR2(5),

CHAR4 VARCHAR2(5),



















constraint PK_SPOT_MEASURES_IMP primary key (SPOT_MEASURES_ID)

)

/

/*==============================================================*/

/* Table: SPOTRATIO */

/*==============================================================*/

create table SPOTRATIO (

spotRatio_ID NUMBER(8) not null,

first_DIGESingleSpot_ID NUMBER(8) not null,

second_DIGESingleSpot_ID NUMBER(8) not null,













constraint PK_SPOTRATIO primary key (spotRatio_ID)

)

/

/*==============================================================*/

/* Table: STUDY */

/*==============================================================*/

create table STUDY (


CONTACT_ID NUMBER(12) not null,

BIBLIOGRAPHIC_REFERENCE_ID NUMBER(10),
















constraint PK_STUDY primary key (STUDY_ID)

)

/

/*==============================================================*/

/* Index: STUDY_AK01 */

/*==============================================================*/

create unique index STUDY_AK01 on STUDY (

NAME ASC

)

/

/*==============================================================*/

/* Index: STUDY_IND01 */

/*==============================================================*/

create index STUDY_IND01 on STUDY (

BIBLIOGRAPHIC_REFERENCE_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


CONTACT_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/



)

/

/*==============================================================*/

/* Table: STUDYASSAY */

/*==============================================================*/

create table STUDYASSAY (

STUDY_ASSAY_ID NUMBER(8) not null,














constraint PK_STUDYASSAY primary key (STUDY_ASSAY_ID)

)

/

/*==============================================================*/

/* Index: STUDYASSAY_IND01 */

/*==============================================================*/

create index STUDYASSAY_IND01 on STUDYASSAY (

ASSAY_ID ASC

)

/

/*==============================================================*/

/* Index: STUDYASSAY_IND02 */

/*==============================================================*/

create index STUDYASSAY_IND02 on STUDYASSAY (

STUDY_ID ASC

)

/

/*==============================================================*/

/* Table: STUDYASSAYPROT */

/*==============================================================*/

create table STUDYASSAYPROT (

STUDY_ASSAY_ID NUMBER(8) not null,















constraint PK_STUDYASSAY_PROT primary key (STUDY_ASSAY_ID)

)

/

/*==============================================================*/

/* Index: STUDYASSAY_PROT_IND02 */

/*==============================================================*/

create index STUDYASSAY_PROT_IND02 on STUDYASSAYPROT (

STUDY_ID ASC

)

/

/*==============================================================*/

/* Table: STUDYBIOMATERIAL */

/*==============================================================*/

create table STUDYBIOMATERIAL (

STUDY_BIO_MATERIAL_ID NUMBER(10) not null,














constraint PK_STUDYBIOMATERIAL primary key

(STUDY_BIO_MATERIAL_ID)

)

/

/*==============================================================*/

/* Index: STUDYBIOMATERIAL_IND01 */

/*==============================================================*/

create index STUDYBIOMATERIAL_IND01 on STUDYBIOMATERIAL (

STUDY_ID ASC

)

/

/*==============================================================*/

/* Index: STUDYBIOMATERIAL_IND02 */

/*==============================================================*/

create index STUDYBIOMATERIAL_IND02 on STUDYBIOMATERIAL (

BIO_MATERIAL_ID ASC

)

/

/*==============================================================*/

/* Table: STUDYDESIGN */

/*==============================================================*/

create table STUDYDESIGN (
















constraint PK_STUDYDESIGN primary key (STUDY_DESIGN_ID)

)

/

/*==============================================================*/

/* Index: STUDYDESIGN_AK01 */

/*==============================================================*/

create unique index STUDYDESIGN_AK01 on STUDYDESIGN (

NAME ASC,

STUDY_ID ASC

)

/

/*==============================================================*/

/* Index: STUDYDESIGN_IND01 */

/*==============================================================*/

create index STUDYDESIGN_IND01 on STUDYDESIGN (

STUDY_ID ASC

)

/

/*==============================================================*/

/* Table: STUDYDESIGNASSAY */

/*==============================================================*/

create table STUDYDESIGNASSAY (

STUDY_DESIGN_ASSAY_ID NUMBER(8) not null,















constraint PK_STUDYDESIGNASSAY primary key

(STUDY_DESIGN_ASSAY_ID)

)

/

/*==============================================================*/

/* Index: STUDYDESIGNASSAY_IND01 */

/*==============================================================*/

create index STUDYDESIGNASSAY_IND01 on STUDYDESIGNASSAY (

ASSAY_ID ASC

)

/

/*==============================================================*/

/* Index: STUDYDESIGNASSAY_IND02 */

/*==============================================================*/

create index STUDYDESIGNASSAY_IND02 on STUDYDESIGNASSAY (

STUDY_DESIGN_ID ASC

)

/

/*==============================================================*/

/* Table: STUDYDESIGNASSAYPROT */

/*==============================================================*/

create table STUDYDESIGNASSAYPROT (

STUDY_DESIGN_ASSAY_ID NUMBER(8) not null,














constraint PK_STUDYDESIGNASSAYPROT primary key

(STUDY_DESIGN_ASSAY_ID)

)

/

/*==============================================================*/

/* Index: STUDYDESIGNASSAYPROT_IND02 */

/*==============================================================*/

create index STUDYDESIGNASSAYPROT_IND02 on STUDYDESIGNASSAYPROT (

STUDY_DESIGN_ID ASC

)

/

/*==============================================================*/

/* Table: STUDYDESIGNDESCRIPTION */

/*==============================================================*/

create table STUDYDESIGNDESCRIPTION (

STUDY_DESIGN_DESCRIPTION_ID NUMBER(5) not null,


DESCRIPTION_TYPE VARCHAR2(100) not null,

DESCRIPTION VARCHAR2(4000) not null,












constraint PK_STUDYDESIGNDESCR primary key

(STUDY_DESIGN_DESCRIPTION_ID)

)

/

/*==============================================================*/

/* Index: STUDYDESIGNDESCRIPTION_IND01 */

/*==============================================================*/

create index STUDYDESIGNDESCRIPTION_IND01 on

STUDYDESIGNDESCRIPTION (

STUDY_DESIGN_ID ASC

)

/

/*==============================================================*/

/* Table: STUDYDESIGNTYPE */

/*==============================================================*/

create table STUDYDESIGNTYPE (

STUDY_DESIGN_TYPE_ID NUMBER(6) not null,














constraint PK_STUDYDESIGNTYPE primary key

(STUDY_DESIGN_TYPE_ID)

)

/

/*==============================================================*/

/* Index: STUDYDESIGNTYPE_IND01 */

/*==============================================================*/


create index STUDYDESIGNTYPE_IND01 on STUDYDESIGNTYPE (

ONTOLOGY_ENTRY_ID ASC

)

/

/*==============================================================*/

/* Index: STUDYDESIGNTYPE_IND02 */

/*==============================================================*/

create index STUDYDESIGNTYPE_IND02 on STUDYDESIGNTYPE (

STUDY_DESIGN_ID ASC

)

/

/*==============================================================*/

/* Table: STUDYFACTOR */

/*==============================================================*/

create table STUDYFACTOR (

STUDY_FACTOR_ID NUMBER(5) not null,


STUDY_FACTOR_TYPE_ID NUMBER(10),














constraint PK_STUDYFACTOR primary key (STUDY_FACTOR_ID)

)

/

/*==============================================================*/

/* Index: STUDYFACTOR_AK01 */

/*==============================================================*/

create unique index STUDYFACTOR_AK01 on STUDYFACTOR (

NAME ASC,

STUDY_DESIGN_ID ASC

)

/

/*==============================================================*/

/* Index: STUDYFACTOR_IND01 */

/*==============================================================*/

create index STUDYFACTOR_IND01 on STUDYFACTOR (

STUDY_FACTOR_TYPE_ID ASC

)

/

/*==============================================================*/

/* Table: STUDYFACTORVALUE */

/*==============================================================*/

create table STUDYFACTORVALUE (

STUDY_FACTOR_VALUE_ID NUMBER(8) not null,



VALUE_ONTOLOGY_ENTRY_ID NUMBER(10),

STRING_VALUE VARCHAR2(100),

MEASUREMENT_UNIT_OE_ID NUMBER(10),

MEASUREMENT_TYPE VARCHAR2(10),

MEASUREMENT_KIND VARCHAR2(20),












constraint PK_STUDYFACTORVAL primary key

(STUDY_FACTOR_VALUE_ID)

)

/

/*==============================================================*/

/* Index: STUDYFACTORVALUE_IND01 */

/*==============================================================*/

create index STUDYFACTORVALUE_IND01 on STUDYFACTORVALUE (

ASSAY_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


STUDY_FACTOR_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


VALUE_ONTOLOGY_ENTRY_ID ASC

)

/

/*==============================================================*/


/*==============================================================*/


MEASUREMENT_UNIT_OE_ID ASC

)

/


/*==============================================================*/

/* Table: STUDYFACTORVALUEPROT */

/*==============================================================*/

create table STUDYFACTORVALUEPROT (

STUDY_FACTOR_VALUE_ID NUMBER(8) not null,


PROTEOME_ASSAY_ID NUMBER(8) not null,

VALUE_ONTOLOGY_ENTRY_ID NUMBER(10),

STRING_VALUE VARCHAR2(100),

MEASUREMENT_UNIT_OE_ID NUMBER(10),

MEASUREMENT_TYPE VARCHAR2(10),

MEASUREMENT_KIND VARCHAR2(20),












constraint PK_STUDYFACVAL_PROT primary key

(STUDY_FACTOR_VALUE_ID)

)

/

/*==============================================================*/

/* Table: TANDEMSEQUENCEDATA */

/*==============================================================*/

create table TANDEMSEQUENCEDATA (

tandem_sequence_ID NUMBER(8) not null,

db_search_parameters_ID NUMBER(8),

source_type varchar(100) not null,

sequence varchar(100) not null,












constraint PK_TANDEMSEQUENCEDATA primary key

(tandem_sequence_ID)

)

/

/*==============================================================*/

/* Table: TOF */

/*==============================================================*/

create table TOF (

TOF_ID NUMBER(8) not null,


reflectron_state varchar(4) not null

constraint CKC_REFLECTRON_STATE_TOF check

(reflectron_state in ('On','Off','None')),

internal_length float not null,












constraint PK_TOF primary key (TOF_ID)

)

/

/*==============================================================*/

/* Table: TREATEDANALYTE */

/*==============================================================*/

create table TREATEDANALYTE (

chemical_treatment_ID NUMBER(8),

treated_analyte_ID NUMBER(8) not null,













constraint PK_TREATEDANALYTE primary key (treated_analyte_ID)

)

/

/*==============================================================*/

/* Table: TREATMENT */

/*==============================================================*/

create table TREATMENT (


ORDER_NUM NUMBER(3) not null,


TREATMENT_TYPE_ID NUMBER(10) not null,


NAME VARCHAR2(100),













constraint PK_TREATMENT primary key (TREATMENT_ID)

)

/

/*==============================================================*/

/* Index: TREATMENT_IND01 */

/*==============================================================*/

create index TREATMENT_IND01 on TREATMENT (

BIO_MATERIAL_ID ASC

)

/

/*==============================================================*/

/* Index: TREATMENT_IND02 */

/*==============================================================*/

create index TREATMENT_IND02 on TREATMENT (

TREATMENT_TYPE_ID ASC

)

/

/*==============================================================*/

/* Index: TREATMENT_IND03 */

/*==============================================================*/

create index TREATMENT_IND03 on TREATMENT (

PROTOCOL_ID ASC

)

/

/*==============================================================*/

/* Table: TREATMENTPARAM */

/*==============================================================*/

create table TREATMENTPARAM (

TREATMENT_PARAM_ID NUMBER(10) not null,















constraint PK_TREATMENTPARAM primary key (TREATMENT_PARAM_ID)

)

/

/*==============================================================*/

/* Index: TREATMENTPARAM_IND01 */

/*==============================================================*/

create index TREATMENTPARAM_IND01 on TREATMENTPARAM (


)

/

/*==============================================================*/

/* Index: TREATMENTPARAM_IND02 */

/*==============================================================*/

create index TREATMENTPARAM_IND02 on TREATMENTPARAM (

TREATMENT_ID ASC

)

/

/*==============================================================*/

/* View: AFFYMETRIXCEL */

/*==============================================================*/

create or replace view AFFYMETRIXCEL(ELEMENT_RESULT_ID,

ELEMENT_ID,

COMPOSITE_ELEMENT_RESULT_ID, QUANTIFICATION_ID, SUBCLASS_VIEW,

MEAN, STDV, NPIXELS, MODIFICATION_DATE, USER_READ, USER_WRITE,

GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,

ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as

SELECT

element_result_id,

element_id,

composite_element_result_id,

quantification_id,

subclass_view,

foreground AS mean,

float1 AS stdv,

int3 AS npixels,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM ElementResultImp

WHERE subclass_view = 'AffymetrixCEL'

with check option

/

/*==============================================================*/

/* View: AFFYMETRIXMAS4 */

/*==============================================================*/

create or replace view AFFYMETRIXMAS4(COMPOSITE_ELEMENT_RESULT_ID,

COMPOSITE_ELEMENT_ID, QUANTIFICATION_ID, SUBCLASS_VIEW,

POSITIVE_PROBE_PAIRS, NEGATIVE_PROBE_PAIRS, NUM_PROBE_PAIRS_USED,

PAIRS_IN_AVERAGE, LOG_AVERAGE_RATIO, AVERAGE_DIFFERENCE,

ABSOLUTE_CALL, MODIFICATION_DATE, USER_READ, USER_WRITE,

GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,

ROW_GROUP_ID, ROW_PROJECT_ID,

ROW_ALG_INVOCATION_ID) as


SELECT

composite_element_result_id,

composite_element_id,

quantification_id,

subclass_view,

tinyint1 AS positive_probe_pairs,

tinyint2 AS negative_probe_pairs,

tinyint3 AS num_probe_pairs_used,

smallint1 AS pairs_in_average,

float1 AS log_average_ratio,

float2 AS average_difference,

string1 AS absolute_call,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM CompositeElementResultImp

WHERE subclass_view = 'AffymetrixMAS4'

with check option

/

/*==============================================================*/

/* View: AFFYMETRIXMAS5 */

/*==============================================================*/

create or replace view AFFYMETRIXMAS5(COMPOSITE_ELEMENT_RESULT_ID,

SUBCLASS_VIEW, COMPOSITE_ELEMENT_ID, QUANTIFICATION_ID, SIGNAL,

DETECTION, DETECTION_P_VALUE, STAT_PAIRS, STAT_PAIRS_USED,

MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,

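GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,

ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as

SELECT

composite_element_result_id,

subclass_view,

composite_element_id,

quantification_id,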

float1 AS signal,

char1 AS detection,

float2 AS detection_p_value,

smallint1 AS stat_pairs,

smallint2 AS stat_pairs_used,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM CompositeElementResultImp

WHERE SUBCLASS_VIEW = 'AffymetrixMAS5'

with check option

/

/*==============================================================*/

/* View: ARRAYVISIONELEMENTRESULT */

/*==============================================================*/

create or replace view ARRAYVISIONELEMENTRESULT(ELEMENT_RESULT_ID,

SUBCLASS_VIEW, ELEMENT_ID, COMPOSITE_ELEMENT_RESULT_ID,

QUANTIFICATION_ID, FOREGROUND, BACKGROUND, SD, MAD,

SIGNAL_TO_NOISE, PERCENT_REMOVED, PERCENT_REPLACED,

PERCENT_AT_FLOOR, PERCENT_AT_CEILING, BKG_PERCENT_AT_FLOOR,

BKG_PERCENT_AT_CEILING, X, Y, AREA, FLAG, MODIFICATION_DATE,

USER_READ, USER_WRITE, GROUP_READ, GROUP_WRITE,

OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,

ROW_PROJECT_ID,

ROW_ALG_INVOCATION_ID) as

SELECT

element_result_id,

subclass_view,

element_id,

composite_element_result_id,

quantification_id,

foreground,

background,

foreground_sd AS sd,

float1 AS mad,

float2 AS signal_to_noise,

float3 AS percent_removed,

float4 AS percent_replaced,

float5 AS percent_at_floor,

float6 AS percent_at_ceiling,

float7 AS bkg_percent_at_floor,

float8 AS bkg_percent_at_ceiling,

tinystring1 AS x,

tinystring2 AS y,

tinystring3 AS area,

tinyint1 AS flag,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM ElementResultImp

WHERE subclass_view = 'ArrayVisionElementResult'

with check option

/

/*==============================================================*/

/* View: BIOMATERIAL */

/*==============================================================*/

create or replace view BIOMATERIAL(BIO_MATERIAL_ID, SUBCLASS_VIEW,

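BIO_MATERIAL_TYPE_ID, EXTERNAL_DATABASE_RELEASE_ID, SOURCE_ID,

MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,

GROUP_WRITE,

OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,

ROW_PROJECT_ID,

ROW_ALG_INVOCATION_ID) as

SELECT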

bio_material_id,


subclass_view,

bio_material_type_id,

external_database_release_id,

source_id,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM BioMaterialImp

WHERE subclass_view = 'BioMaterial'

with check option

/

/*==============================================================*/

/* View: BIOSAMPLE */

/*==============================================================*/

create or replace view BIOSAMPLE(BIO_MATERIAL_ID, SUBCLASS_VIEW,

BIO_MATERIAL_TYPE_ID, EXTERNAL_DATABASE_RELEASE_ID, SOURCE_ID,

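NAME, DESCRIPTION, MODIFICATION_DATE, USER_READ, USER_WRITE,

GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,

ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as

SELECT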

bio_material_id,

subclass_view,

bio_material_type_id,

external_database_release_id,

source_id,

string1 AS name,

string2 AS description,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM BioMaterialImp

WHERE subclass_view = 'BioSample'

with check option

/

/*==============================================================*/

/* View: BIOSOURCE */

/*==============================================================*/

create or replace view BIOSOURCE(BIO_MATERIAL_ID, SUBCLASS_VIEW,

TAXON_ID, BIO_MATERIAL_TYPE_ID, BIO_SOURCE_PROVIDER_ID,

EXTERNAL_DATABASE_RELEASE_ID, SOURCE_ID, NAME, DESCRIPTION,

MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,

GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,

ROW_GROUP_ID,

ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as

SELECT BioMaterialImp.bio_material_id,

BioMaterialImp.subclass_view,

BioMaterialImp.taxon_id,

BioMaterialImp.bio_material_type_id,

BioMaterialImp.bio_source_provider_id,

BioMaterialImp.external_database_release_id,

BioMaterialImp.source_id,

BioMaterialImp.string1 AS name,

BioMaterialImp.string2 AS description,

BioMaterialImp.modification_date,

BioMaterialImp.user_read,

BioMaterialImp.user_write,

BioMaterialImp.group_read,

BioMaterialImp.group_write,

BioMaterialImp.other_read,

BioMaterialImp.other_write,

BioMaterialImp.row_user_id,

BioMaterialImp.row_group_id,

BioMaterialImp.row_project_id,

BioMaterialImp.row_alg_invocation_id

FROM BioMaterialImp

where subclass_view='BioSource'

with check option

/

/*==============================================================*/

/* View: COMPOSITEELEMENT */

/*==============================================================*/

create or replace view COMPOSITEELEMENT(COMPOSITE_ELEMENT_ID,

SUBCLASS_VIEW, PARENT_ID, ARRAY_ID, EXTERNAL_DATABASE_RELEASE_ID,

SOURCE_ID, MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,

GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,

ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as

SELECT

composite_element_id,

subclass_view,

parent_id,

array_id,

external_database_release_id,

source_id,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM CompositeElementImp

WHERE subclass_view = 'CompositeElement'

with check option

/

/*==============================================================*/

/* View: COMPOSITEELEMENTRESULT */

/*==============================================================*/

create or replace view COMPOSITEELEMENTRESULT(

COMPOSITE_ELEMENT_RESULT_ID, SUBCLASS_VIEW, COMPOSITE_ELEMENT_ID,

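QUANTIFICATION_ID, MODIFICATION_DATE, USER_READ, USER_WRITE,

GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,

ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as

SELECT

composite_element_result_id,

subclass_view,

composite_element_id,

quantification_id,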

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM CompositeElementResultImp

WHERE subclass_view = 'CompositeElementResult'

with check option

/

/*==============================================================*/

/* View: ELEMENT */

/*==============================================================*/

create or replace view ELEMENT(ELEMENT_ID, SUBCLASS_VIEW,

ELEMENT_TYPE_ID, COMPOSITE_ELEMENT_ID,

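ARRAY_ID, EXTERNAL_DATABASE_RELEASE_ID, SOURCE_ID,

MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ, GROUP_WRITE,

OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID, ROW_PROJECT_ID,

ROW_ALG_INVOCATION_ID) as

SELECT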

element_id,

subclass_view,

element_type_id,

composite_element_id,

array_id,

external_database_release_id,

source_id,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM ElementImp

WHERE subclass_view = 'Element'

with check option

/

/*==============================================================*/

/* View: ELEMENTRESULT */

/*==============================================================*/

create or replace view ELEMENTRESULT(ELEMENT_RESULT_ID,

SUBCLASS_VIEW, ELEMENT_ID, COMPOSITE_ELEMENT_RESULT_ID,

QUANTIFICATION_ID, FOREGROUND, BACKGROUND, FOREGROUND_SD,

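BACKGROUND_SD, MODIFICATION_DATE, USER_READ, USER_WRITE,

GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,

ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as

SELECT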

element_result_id,

subclass_view,

element_id,

composite_element_result_id,

quantification_id,

foreground,

background,

foreground_sd,

background_sd,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM ElementResultImp

WHERE subclass_view = 'ElementResult'

with check option

/

/*==============================================================*/

/* View: GEMTOOLSELEMENTRESULT */

/*==============================================================*/

create or replace view GEMTOOLSELEMENTRESULT(ELEMENT_RESULT_ID,

ELEMENT_ID, COMPOSITE_ELEMENT_RESULT_ID, QUANTIFICATION_ID,

SUBCLASS_VIEW, SIGNAL, SIGNAL_TO_BACKGROUND, AREA_PERCENTAGE,

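VISUAL_FLAG, MODIFICATION_DATE, USER_READ, USER_WRITE,

GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,

ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as

SELECT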

element_result_id,

element_id,

composite_element_result_id,

quantification_id,

subclass_view,

float1 AS signal,

float2 AS signal_to_background,

float3 AS area_percentage,

tinyint1 AS visual_flag,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM ElementResultImp WHERE subclass_view =

'GEMToolsElementResult'

with check option

/


/*==============================================================*/

/* View: GENEPIXELEMENTRESULT */

/*==============================================================*/

create or replace view GENEPIXELEMENTRESULT(ELEMENT_RESULT_ID,

ELEMENT_ID, COMPOSITE_ELEMENT_RESULT_ID, QUANTIFICATION_ID,

SUBCLASS_VIEW, FOREGROUND_SD, BACKGROUND_SD, SPOT_DIAMETER,

FOREGROUND_MEAN, FOREGROUND_MEDIAN, BACKGROUND_MEAN,

BACKGROUND_MEDIAN, PERCENT_OVER_BG_PLUS_ONE_SD,

PERCENT_OVER_BG_PLUS_TWO_SDS, PERCENT_FOREGROUND_SATURATED,

MEAN_OF_RATIOS, MEDIAN_OF_RATIOS, RATIOS_SD, RGN_RATIO,

RGN_R_SQUARED, NUM_FOREGROUND_PIXELS, NUM_BACKGROUND_PIXELS,

FLAG, MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,

GROUP_WRITE, OTHER_READ, OTHER_WRITE,

ROW_USER_ID, ROW_GROUP_ID, ROW_PROJECT_ID,

ROW_ALG_INVOCATION_ID) as

SELECT element_result_id,

element_id,

composite_element_result_id,

quantification_id,

subclass_view,

foreground_sd,

background_sd,

float1 AS spot_diameter,

float2 AS foreground_mean,

float3 AS foreground_median,

float4 AS background_mean,

float5 AS background_median,

float6 AS percent_over_bg_plus_one_sd,

float7 AS percent_over_bg_plus_two_sds,

float8 AS percent_foreground_saturated,

float9 AS mean_of_ratios,

float10 AS median_of_ratios,

float11 AS ratios_sd,

float12 as rgn_ratio,

float13 as rgn_r_squared,

smallint1 AS num_foreground_pixels,

smallint2 AS num_background_pixels,

tinyint1 AS flag,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM ElementResultImp

WHERE subclass_view = 'GenePixElementResult'

with check option

/

/*==============================================================*/

/* View: LABELEDEXTRACT */

/*==============================================================*/

create or replace view LABELEDEXTRACT(BIO_MATERIAL_ID,

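SUBCLASS_VIEW, BIO_MATERIAL_TYPE_ID, LABEL_METHOD_ID,

EXTERNAL_DATABASE_RELEASE_ID, SOURCE_ID, NAME, DESCRIPTION,

MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ, GROUP_WRITE,

OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID, ROW_PROJECT_ID,

ROW_ALG_INVOCATION_ID) as

SELECT

bio_material_id,

subclass_view,

bio_material_type_id,

label_method_id,

external_database_release_id,

source_id,

string1 AS name,

string2 AS description,

modification_date,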

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM BioMaterialImp

WHERE LABEL_METHOD_ID is not null

AND SUBCLASS_VIEW = 'LabeledExtract'

with check option

/

/*==============================================================*/

/* View: MOIDRESULT */

/*==============================================================*/

create or replace view MOIDRESULT(COMPOSITE_ELEMENT_RESULT_ID,

COMPOSITE_ELEMENT_ID,

QUANTIFICATION_ID,

SUBCLASS_VIEW, EXPRESSION, LOWER_BOUND, UPPER_BOUND, LOG_P,

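MODIFICATION_DATE, USER_READ, USER_WRITE,

GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,

ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as

SELECT

composite_element_result_id,

composite_element_id,

quantification_id,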

subclass_view,

float1 AS expression,

float2 AS lower_bound,

float3 AS upper_bound,

float4 AS log_p,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM CompositeElementResultImp

WHERE subclass_view = 'MOIDResult'

with check option

/


drop view OTHERSPOTMEASURES

/

/*==============================================================*/

/* View: OTHERSPOTMEASURES */

/*==============================================================*/

create or replace view OTHERSPOTMEASURES as

select SPOTMEASURESIMP.SPOT_MEASURES_ID,

SPOTMEASURESIMP.SUBCLASS_VIEW,

SPOTMEASURESIMP.FLOAT1, SPOTMEASURESIMP.FLOAT2,

SPOTMEASURESIMP.FLOAT3,








SPOTMEASURESIMP.INT1,

SPOTMEASURESIMP.INT2, SPOTMEASURESIMP.INT3, SPOTMEASURESIMP.INT4,

SPOTMEASURESIMP.INT5,

SPOTMEASURESIMP.INT6, SPOTMEASURESIMP.INT7, SPOTMEASURESIMP.INT8,

SPOTMEASURESIMP.INT9,

SPOTMEASURESIMP.INT10, SPOTMEASURESIMP.INT11,

SPOTMEASURESIMP.INT12,

SPOTMEASURESIMP.INT13, SPOTMEASURESIMP.INT14,


SPOTMEASURESIMP.TINYINT1, SPOTMEASURESIMP.TINYINT2,

SPOTMEASURESIMP.TINYINT3

, SPOTMEASURESIMP.SMALLINT1, SPOTMEASURESIMP.SMALLINT2,

SPOTMEASURESIMP.SMALLINT3,

SPOTMEASURESIMP.CHAR1, SPOTMEASURESIMP.CHAR2,

SPOTMEASURESIMP.CHAR3,

SPOTMEASURESIMP.CHAR4, SPOTMEASURESIMP.TINYSTRING1,

SPOTMEASURESIMP.TINYSTRING2,

SPOTMEASURESIMP.TINYSTRING3, SPOTMEASURESIMP.SMALLSTRING1,

SPOTMEASURESIMP.SMALLSTRING2, SPOTMEASURESIMP.STRING1,

SPOTMEASURESIMP.STRING2, SPOTMEASURESIMP.MODIFICATION_DATE,

SPOTMEASURESIMP.USER_READ, SPOTMEASURESIMP.USER_WRITE,

SPOTMEASURESIMP.GROUP_READ,

SPOTMEASURESIMP.GROUP_WRITE, SPOTMEASURESIMP.OTHER_READ,

SPOTMEASURESIMP.OTHER_WRITE, SPOTMEASURESIMP.ROW_USER_ID,

SPOTMEASURESIMP.ROW_GROUP_ID, SPOTMEASURESIMP.ROW_PROJECT_ID,

SPOTMEASURESIMP.ROW_ALG_INVOCATION_ID

from SPOTMEASURESIMP

/

/*==============================================================*/

/* View: SAGETAG */

/*==============================================================*/

create or replace view SAGETAG(COMPOSITE_ELEMENT_ID,

SUBCLASS_VIEW,

PARENT_ID, ARRAY_ID, TAG, MODIFICATION_DATE, USER_READ,

USER_WRITE, GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE,

ROW_USER_ID, ROW_GROUP_ID, ROW_PROJECT_ID,

ROW_ALG_INVOCATION_ID)

as

SELECT

composite_element_id,

subclass_view,

parent_id,

array_id,

tinystring1 AS tag,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM CompositeElementImp

WHERE subclass_view = 'SAGETag'

with check option

/

/*==============================================================*/

/* View: SAGETAGMAPPING */

/*==============================================================*/

create or replace view SAGETAGMAPPING(ELEMENT_ID, SUBCLASS_VIEW,

ARRAY_ID, COMPOSITE_ELEMENT_ID, EXTERNAL_DATABASE_RELEASE_ID,

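SOURCE_ID, MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,

GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,

ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as

SELECT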

element_id,

subclass_view,

array_id,

composite_element_id,

external_database_release_id,

source_id,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM ElementImp

WHERE subclass_view = 'SAGETagMapping'

with check option

/

/*==============================================================*/

/* View: SAGETAGRESULT */

/*==============================================================*/

create or replace view SAGETAGRESULT(COMPOSITE_ELEMENT_RESULT_ID,

COMPOSITE_ELEMENT_ID, QUANTIFICATION_ID, SUBCLASS_VIEW,

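TAG_COUNT, MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ,

GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,

ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as

SELECT

composite_element_result_id,

composite_element_id,

quantification_id,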

subclass_view,

int1 AS tag_count,

modification_date,


user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM CompositeElementResultImp

WHERE subclass_view = 'SAGETagResult'

with check option

/

/*==============================================================*/

/* View: SCANALYZEELEMENTRESULT */

/*==============================================================*/

create or replace view SCANALYZEELEMENTRESULT(ELEMENT_RESULT_ID,

ELEMENT_ID, COMPOSITE_ELEMENT_RESULT_ID, QUANTIFICATION_ID,

SUBCLASS_VIEW, I, B, BA, SPIX, BGPIX, TOP, LEFT, BOT, RIGHT,

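FLAG, MRAT, REGR, CORR, LFRAT, GTB1, GTB2, EDGEA, KSD, KSP,

MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ, GROUP_WRITE,

OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,

ROW_PROJECT_ID,

ROW_ALG_INVOCATION_ID) as

SELECT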

element_result_id,

element_id,

composite_element_result_id,

quantification_id,

subclass_view,

foreground AS i,

background AS b,

float1 AS ba,

int1 AS spix,

int2 AS bgpix,

int3 AS top,

int4 AS left,

int5 AS bot,

int6 AS right,

tinyint1 AS flag,

float2 AS mrat,

float3 AS regr,

float4 AS corr,

float5 AS lfrat,

float6 AS gtb1,

float7 AS gtb2,

float8 AS edgea,

float9 AS ksd,

float10 AS ksp,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM ElementResultImp

WHERE subclass_view = 'ScanAlyzeElementResult'

with check option

/

/*==============================================================*/

/* View: SHORTOLIGO */

/*==============================================================*/

create or replace view SHORTOLIGO(ELEMENT_ID, SUBCLASS_VIEW,

ARRAY_ID, COMPOSITE_ELEMENT_ID, NAME,

X_POSITION, Y_POSITION, SEQUENCE, DESCRIPTION, MODIFICATION_DATE,

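USER_READ, USER_WRITE, GROUP_READ,

GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID,

ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as

SELECT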

element_id,

subclass_view,

array_id,

composite_element_id,

smallstring2 AS name,

tinystring1 AS x_position,

tinystring2 AS y_position,

smallstring1 AS sequence,


modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM ElementImp

WHERE subclass_view = 'ShortOligo'

with check option

/

/*==============================================================*/

/* View: SHORTOLIGOFAMILY */

/*==============================================================*/

create or replace view SHORTOLIGOFAMILY(COMPOSITE_ELEMENT_ID,

SUBCLASS_VIEW, PARENT_ID, ARRAY_ID,


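MODIFICATION_DATE, USER_READ, USER_WRITE,

GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE, ROW_USER_ID,

ROW_GROUP_ID, ROW_PROJECT_ID,

ROW_ALG_INVOCATION_ID) as

SELECT

composite_element_id,

subclass_view,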

parent_id,

array_id,

external_database_release_id,

source_id,

smallstring1 AS name,


modification_date,

user_read,

user_write,

group_read,


group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM CompositeElementImp

WHERE subclass_view = 'ShortOligoFamily'

with check option

/

/*==============================================================*/

/* View: SPOT */

/*==============================================================*/

create or replace view SPOT(ELEMENT_ID, SUBCLASS_VIEW, ARRAY_ID,

ELEMENT_TYPE_ID,

COMPOSITE_ELEMENT_ID,

EXTERNAL_DATABASE_RELEASE_ID, SOURCE_ID, ARRAY_ROW, ARRAY_COLUMN,

GRID_ROW, GRID_COLUMN, SUB_ROW,

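SUB_COLUMN, SEQUENCE_VERIFIED, NAME, DESCRIPTION,

MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ, GROUP_WRITE,

OTHER_READ, OTHER_WRITE, ROW_USER_ID, ROW_GROUP_ID, ROW_PROJECT_ID,

ROW_ALG_INVOCATION_ID) as

SELECT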

element_id,

subclass_view,

array_id,

element_type_id,

composite_element_id,

external_database_release_id,

source_id,

char1 AS array_row,

char2 AS array_column,

char3 AS grid_row,

char4 AS grid_column,

char5 AS sub_row,

char6 AS sub_column,

tinyint1 AS sequence_verified,

tinystring1 AS name,


modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM ElementImp

WHERE subclass_view = 'Spot'

with check option

/

/*==============================================================*/

/* View: SPOTELEMENTRESULT */

/*==============================================================*/

create or replace view SPOTELEMENTRESULT(ELEMENT_RESULT_ID,

ELEMENT_ID, COMPOSITE_ELEMENT_RESULT_ID,

QUANTIFICATION_ID, SUBCLASS_VIEW, MEDIAN, MORPH, IQR, MEAN,

BG_MEDIAN, BG_MEAN, BG_SD, VALLEY, MORPH_ERODE,

MORPH_CLOSE_OPEN, AREA, PERIMETER, CIRCULARITY, BADSPOT,

VISUAL_FLAG, MODIFICATION_DATE, USER_READ,

USER_WRITE, GROUP_READ, GROUP_WRITE, OTHER_READ, OTHER_WRITE,

ROW_USER_ID, ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as

SELECT

element_result_id,

element_id,

composite_element_result_id,

quantification_id,

subclass_view,

foreground AS median,

background AS morph,

foreground_sd AS iqr,

float1 AS mean,

float2 AS bg_median,

float3 AS bg_mean,

float4 AS bg_sd,

float5 AS valley,

float6 AS morph_erode,

float7 AS morph_close_open,

int1 AS area,

int2 AS perimeter,

float8 AS circularity,

tinyint1 AS badspot,

tinyint2 AS visual_flag,

modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM ElementResultImp WHERE subclass_view = 'SpotElementResult'

with check option

/

/*==============================================================*/

/* View: SPOTFAMILY */

/*==============================================================*/

create or replace view SPOTFAMILY(COMPOSITE_ELEMENT_ID,

SUBCLASS_VIEW, PARENT_ID, ARRAY_ID,

EXTERNAL_DATABASE_RELEASE_ID, SOURCE_ID, PLATE_NAME,

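WELL_LOCATION, PCR_FAILURE_FLAG, NAME, DESCRIPTION,

MODIFICATION_DATE, USER_READ, USER_WRITE, GROUP_READ, GROUP_WRITE,

OTHER_READ, OTHER_WRITE, ROW_USER_ID,

ROW_GROUP_ID, ROW_PROJECT_ID, ROW_ALG_INVOCATION_ID) as

SELECT

composite_element_id,

subclass_view,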

parent_id,

array_id,

external_database_release_id,

source_id,

smallstring1 AS plate_name,

smallstring2 AS well_location,

tinyint1 AS pcr_failure_flag,


string2 AS name,


modification_date,

user_read,

user_write,

group_read,

group_write,

other_read,

other_write,

row_user_id,

row_group_id,

row_project_id,

row_alg_invocation_id

FROM CompositeElementImp

where subclass_view = 'SpotFamily'

with check option

/

alter table ACQUISITION

add constraint FK_ACQ_ASSAY foreign key (ASSAY_ID)

references ASSAY (ASSAY_ID) not deferrable

/


alter table ACQUISITION

add constraint FK_ACQ_CHANNEL foreign key (CHANNEL_ID)

references CHANNEL (CHANNEL_ID) not deferrable

/


alter table ACQUISITION

add constraint FK_ACQ_PRTCL foreign key (PROTOCOL_ID)

references PROTOCOL (PROTOCOL_ID) not deferrable

/

alter table ACQUISITIONPARAM

add constraint FK_ACQPARAM_ACQ foreign key (ACQUISITION_ID)

references ACQUISITION (ACQUISITION_ID) not deferrable

/

alter table ACQUISITIONPARAM

add constraint FK_ACQPARAM_PRTPRM foreign key

(PROTOCOL_PARAM_ID)

references PROTOCOLPARAM (PROTOCOL_PARAM_ID) not deferrable

/

alter table ANALYSISIMPLEMENTATION

add constraint FK_ANLIMP_ANL foreign key (ANALYSIS_ID)

references ANALYSIS (ANALYSIS_ID) not deferrable

/

alter table ANALYSISIMPLEMENTATIONPARAM

add constraint FK_ANLIMPPARAM_ANLIMP foreign key

(ANALYSIS_IMPLEMENTATION_ID)

references ANALYSISIMPLEMENTATION

(ANALYSIS_IMPLEMENTATION_ID) not deferrable

/

alter table ANALYSISINPUT

add constraint FK_ANLINPUT_ANALYSISINV foreign key

(ANALYSIS_INVOCATION_ID)

references ANALYSISINVOCATION (ANALYSIS_INVOCATION_ID) not

deferrable

/

alter table ANALYSISINVOCATION

add constraint FK_ANLINV_ANLIMP foreign key

(ANALYSIS_IMPLEMENTATION_ID)

references ANALYSISIMPLEMENTATION

(ANALYSIS_IMPLEMENTATION_ID) not deferrable

/

alter table ANALYSISINVOCATIONPARAM

add constraint FK_ANLPARAM_ANLINV foreign key

(ANALYSIS_INVOCATION_ID)

references ANALYSISINVOCATION (ANALYSIS_INVOCATION_ID) not

deferrable

/

alter table ANALYSISOUTPUT

add constraint FK_ANALYSISOUTPUT4 foreign key



deferrable

/

alter table ANALYTEMEASUREMENT

add constraint FK_ANALYTE__REFERENCE_BIOASSAY foreign key

(BIOASSAY_TREATMENT_ID)

references BIOASSAYTREATMENT (BIOASSAY_TREATMENT_ID)

/

alter table ANALYTEMEASUREMENT

add constraint FK_ANALYTE__REFERENCE_BIOMATER foreign key

(BIO_MATERIAL_ID)

references BIOMATERIALIMP (BIO_MATERIAL_ID)

/

alter table ARRAY

add constraint FK_ARRAY_ONTO01 foreign key (PLATFORM_TYPE_ID)

references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable

/

alter table ARRAY

add constraint FK_ARRAY_ONTO02 foreign key (SUBSTRATE_TYPE_ID)

references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable

/

alter table ARRAY

add constraint FK_ARRAY_PROTOCOL foreign key (PROTOCOL_ID)

references PROTOCOL (PROTOCOL_ID) not deferrable

/


alter table ARRAYANNOTATION

add constraint FK_ARRAYANN_ARRAY foreign key (ARRAY_ID)

references ARRAY (ARRAY_ID) not deferrable

/

alter table ASSAY

add constraint FK_ASSAY_ARRAY foreign key (ARRAY_ID)

references ARRAY (ARRAY_ID) not deferrable

/

alter table ASSAY

add constraint FK_ASSAY_PRTCL foreign key (PROTOCOL_ID)

references PROTOCOL (PROTOCOL_ID) not deferrable

/

alter table ASSAYBIOMATERIAL

add constraint FK_ASSAYBIOMATERIAL15 foreign key

(BIO_MATERIAL_ID)

references BIOMATERIALIMP (BIO_MATERIAL_ID) not deferrable

/

alter table ASSAYBIOMATERIAL

add constraint FK_ASSAYBIOSOURCE13 foreign key (ASSAY_ID)

references ASSAY (ASSAY_ID) not deferrable

/

alter table ASSAYDATAPOINT

add constraint FK_ASSAYDAT_REFERENCE_LCCOLUMN foreign key (id)

references LCCOLUMN (LCColumn_ID)

/

alter table ASSAYGROUP

add constraint FK_ASSAYGRO_REFERENCE_ASSAY foreign key

(ASSAY_ID)

references ASSAY (ASSAY_ID)

/


alter table ASSAYGROUP

add constraint FK_ASSAYGRO_REFERENCE_STUDY foreign key

(STUDY_ID)

references STUDY (STUDY_ID)

/


alter table ASSAYGROUP

add constraint FK_ASSAYGRO_REFERENCE_STUDYDES foreign key

(STUDY_DESIGN_ID)

references STUDYDESIGN (STUDY_DESIGN_ID)

/


alter table ASSAYGROUP

add constraint FK_ASSAYGRO_REFERENCE_STUDYFAC foreign key

(STUDY_FACTOR_VALUE_ID)

references STUDYFACTORVALUE (STUDY_FACTOR_VALUE_ID)

/

alter table ASSAYLABELEDEXTRACT

add constraint FK_ASSAYLAB_ASSAY foreign key (ASSAY_ID)

references ASSAY (ASSAY_ID) not deferrable

/


alter table ASSAYLABELEDEXTRACT

add constraint FK_ASSAYLAB_CHANNEL foreign key (CHANNEL_ID)

references CHANNEL (CHANNEL_ID) not deferrable

/


alter table ASSAYLABELEDEXTRACT

add constraint FK_ASSAYLAB_LEX foreign key (LABELED_EXTRACT_ID)


/

alter table ASSAYPARAM

add constraint FK_ASSAYPARAM_ASSAY foreign key (ASSAY_ID)

references ASSAY (ASSAY_ID) not deferrable

/

alter table ASSAYPARAM

add constraint FK_ASSAYPARAM_PRTOPRM foreign key

(PROTOCOL_PARAM_ID)

references PROTOCOLPARAM (PROTOCOL_PARAM_ID) not deferrable

/

alter table ASSAYPARAMPROT

add constraint FK_ASSAYPAR_REFERENCE_PROTEOME foreign key

(PROTEOME_ASSAY_ID)

references PROTEOMEASSAY (PROTEOME_ASSAY_ID)

/

alter table ASSAYPARAMPROT

add constraint FK_ASSAYPAR_REFERENCE_PROTOCOL foreign key

(PROTOCOL_PARAM_ID)

references PROTOCOLPARAM (PROTOCOL_PARAM_ID)

/

alter table BAND

add constraint FK_BAND_REFERENCE_PHYSICAL foreign key

(physicalGelSpot_ID)

references PHYSICALGELITEM (physicalGelItem_ID)

/

alter table BIOASSAYTREATMENT

add constraint FK_BIOASSAY_REFERENCE_ONTOLOGY foreign key

(TREATMENT_TYPE_ID)

references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)

/


alter table BIOASSAYTREATMENT

add constraint FK_BIOASSAY_REFERENCE_PROTEOME foreign key

(PROTEOME_ASSAY_ID)

references PROTEOMEASSAY (PROTEOME_ASSAY_ID)

/


alter table BIOASSAYTREATMENT

add constraint FK_BIOASSAY_REFERENCE_PROTOCOL foreign key

(PROTOCOL_ID)

references PROTOCOL (PROTOCOL_ID)

/

alter table BIOMATERIALCHARACTERISTIC

add constraint FK_BMCHARAC_BIOMAT foreign key (BIO_MATERIAL_ID)

references BIOMATERIALIMP (BIO_MATERIAL_ID) not deferrable

/

alter table BIOMATERIALCHARACTERISTIC

add constraint FK_BMCHARAC_ONTOLOGY foreign key

(ONTOLOGY_ENTRY_ID)

references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable

/

alter table BIOMATERIALIMP

add constraint FK_BIOMATERIALIMP15 foreign key

(LABEL_METHOD_ID)

references LABELMETHOD (LABEL_METHOD_ID) not deferrable

/

alter table BIOMATERIALIMP

add constraint FK_BIOMATTYPE_OE foreign key

(BIO_MATERIAL_TYPE_ID)

references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable

/

alter table BIOMATERIALMEASUREMENT

add constraint FK_BMM_BIOMATERIAL foreign key (BIO_MATERIAL_ID)

references BIOMATERIALIMP (BIO_MATERIAL_ID) not deferrable

/


alter table BIOMATERIALMEASUREMENT

add constraint FK_BMM_ONTO foreign key (UNIT_TYPE_ID)

references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable

/


alter table BIOMATERIALMEASUREMENT

add constraint FK_BMM_TREATMENT foreign key (TREATMENT_ID)

references TREATMENT (TREATMENT_ID) not deferrable

/

alter table BOUNDARYPOINT

add constraint FK_BOUNDARY_REFERENCE_IDENTIFI foreign key

(spot_id)

references IDENTIFIEDSPOT (identified_spot_ID)

on delete cascade

/

alter table CHEMICALTREATMENT

add constraint FK_CHEMICAL_REFERENCE_BIOASSAY foreign key



/

alter table CHEMICALTREATMENT

add constraint FK_CHEMICAL_REFERENCE_ONTOLOGY foreign key

(treatment_type)

references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)

/

alter table COLLISIONCELL

add constraint FK_COLLISIO_REFERENCE_MZANALYS foreign key

(mz_analysis_ID)

references MZANALYSIS (mz_analysis_ID)

/

alter table COMPOSITEELEMENTANNOTATION

add constraint FK_CEANNOT_CE foreign key (COMPOSITE_ELEMENT_ID)

references COMPOSITEELEMENTIMP (COMPOSITE_ELEMENT_ID) not

deferrable

/

alter table COMPOSITEELEMENTGUS

add constraint FK_CEG_CE foreign key (COMPOSITE_ELEMENT_ID)

references COMPOSITEELEMENTIMP (COMPOSITE_ELEMENT_ID) not

deferrable

/

alter table COMPOSITEELEMENTIMP

add constraint FK_CE_ARRAY foreign key (ARRAY_ID)

references ARRAY (ARRAY_ID) not deferrable

/

alter table COMPOSITEELEMENTIMP

add constraint FK_CE_CE foreign key (PARENT_ID)

references COMPOSITEELEMENTIMP (COMPOSITE_ELEMENT_ID) not

deferrable

/

alter table COMPOSITEELEMENTRESULTIMP

add constraint FK_CERESULT_CELEMENT foreign key

(COMPOSITE_ELEMENT_ID)

references COMPOSITEELEMENTIMP (COMPOSITE_ELEMENT_ID) not

deferrable

/

alter table COMPOSITEELEMENTRESULTIMP

add constraint FK_CERESULT_QUANT foreign key

(QUANTIFICATION_ID)

references QUANTIFICATION (QUANTIFICATION_ID) not deferrable

/

alter table CONTROL

add constraint FK_CONTROL_ASSAY foreign key (ASSAY_ID)

references ASSAY (ASSAY_ID) not deferrable

/

alter table CONTROL

add constraint FK_CONTROL_ONTO foreign key (CONTROL_TYPE_ID)

references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable

/

alter table DATABASEENTRY

add constraint FK_DATABASE_NAME_FK_DATABA_ONT foreign key

(database_name)

references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)

/


alter table DATABASEENTRY

add constraint FK_DATABASE_URI_FK_DATABA_ONTO foreign key

(database_uri)

references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)

/


alter table DATABASEENTRY

add constraint FK_DATABASE_VERS_FK_DATABA_ONT foreign key

(database_version)

references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID)

/

alter table DBSEARCH

add constraint FK_DBSEARCH_REFERENCE_DBSEARCH foreign key

(db_search_parameters_ID)

references DBSEARCHPARAMETERS (db_search_parameters_ID)

/

alter table DBSEARCH

add constraint FK_DBSEARCH_REFERENCE_PEAKLIST foreign key

(peak_list_ID)

references PEAKLIST (peak_list_ID)

/

alter table DBSEARCHPARAMETERS

add constraint FK_DBSEARCH_REFERENCE_PROTOCOL foreign key

(PROTOCOL_ID)

references PROTOCOL (PROTOCOL_ID)

/

alter table DIGESINGLESPOT

add constraint FK_DIGESING_REFERENCE_IDENTIFI foreign key

(identified_spot_ID)

references IDENTIFIEDSPOT (identified_spot_ID)

/


alter table DIGESINGLESPOT

add constraint FK_DIGESING_REFERENCE_IMAGE_AN foreign key


references GELIMAGEANALYSIS (GEL_IMAGE_ANALYSIS_ID)

/


alter table DIGESINGLESPOT

add constraint FK_DIGESING_REFERENCE_SPOT_MEA foreign key

(SPOT_MEASURES_ID)

references SPOTMEASURESIMP (SPOT_MEASURES_ID)

/

alter table ELECTROSPRAY

add constraint FK_ELECTROS_REFERENCE_IONSOURC foreign key

(ion_source_ID)

references IONSOURCE (ion_source_ID)

/

alter table ELEMENTANNOTATION

add constraint FK_ELEANNOT_ELEMENTIMP foreign key (ELEMENT_ID)

references ELEMENTIMP (ELEMENT_ID) not deferrable

/

alter table ELEMENTIMP

add constraint FK_ELEMENT_ARRAY foreign key (ARRAY_ID)

references ARRAY (ARRAY_ID) not deferrable

/


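alter table ELEMENTIMP

add constraint FK_ELEMENT_COMPELEFAM foreign key

(COMPOSITE_ELEMENT_ID)

references COMPOSITEELEMENTIMP (COMPOSITE_ELEMENT_ID) not

deferrable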

/


alter table ELEMENTIMP

add constraint FK_ELEMENT_ONTO foreign key (ELEMENT_TYPE_ID)

references ONTOLOGYENTRY (ONTOLOGY_ENTRY_ID) not deferrable

/

alter table ELEMENTRESULTIMP

add constraint FK_ELEMENTRESULT_ELEMENTIMP foreign key

(ELEMENT_ID)

references ELEMENTIMP (ELEMENT_ID) not deferrable

/


alter table ELEMENTRESULTIMP

add constraint FK_ELEMENTRESU_QUANT foreign key

(QUANTIFICATION_ID)

references QUANTIFICATION (QUANTIFICATION_ID) not deferrable

/


alter table ELEMENTRESULTIMP

add constraint FK_ELEMENTRES_SFR foreign key

(COMPOSITE_ELEMENT_RESULT_ID)

references COMPOSITEELEMENTRESULTIMP

(COMPOSITE_ELEMENT_RESULT_ID) not deferrable

/


alter table FRACTION

add constraint FK_FRACTION_REFERENCE_BIOMATER foreign key

(BIO_MATERIAL_ID)

references BIOMATERIALIMP (BIO_MATERIAL_ID)

/

alter table FRACTION

add constraint FK_FRACTION_REFERENCE_LCCOLUMN foreign key

(LCColumn_ID)

references LCCOLUMN (LCColumn_ID)

/

alter table GEL1D

add constraint FK_GEL1D_REFERENCE_BIOASSAY foreign key



/

alter table GEL2D

add constraint FK_GEL2D_REFERENCE_BIOASSAY foreign key



/

alter table GRADIENTSTEP

add constraint FK_GRADIENT_REFERENCE_LCCOLUMN foreign key

(GradientStep_ID)


/

alter table HEXAPOLE

add constraint FK_HEXAPOLE_REFERENCE_MZANALYS foreign key

(mz_analysis_ID)


/

alter table IDENTIFIEDSPOT

add constraint FK_IDENTIFI_REFERENCE_IMAGE_AN foreign key



/


add constraint FK_IDENTIFI_REFERENCE_PHYSICAL foreign key

(physicalGelItem_ID)

references PHYSICALGELITEM (physicalGelItem_ID)

/


add constraint FK_IDENTIFI_REFERENCE_SPOT_MEA foreign key

(SPOT_MEASURES_ID)

references SPOTMEASURESIMP (SPOT_MEASURES_ID)

/

alter table IMAGEACQUISITION

add constraint FK_IMAGE_AC_REFERENCE_CHANNEL foreign key

(CHANNEL_ID)

references CHANNEL (CHANNEL_ID)

/

alter table IMAGEACQUISITION

add constraint FK_IMAGE_AC_REFERENCE_PROTEOME foreign key

(PROTEOME_ASSAY_ID)


/

alter table GELIMAGEANALYSIS

add constraint FK_IMAGE_AN_REFERENCE_IMAGE_AC foreign key

(ACQUISITION_ID)

references IMAGEACQUISITION (IMAGE_ACQUISITION_ID)

/

alter table GELIMAGEANALYSIS

add constraint FK_IMAGE_AN_REFERENCE_PROTOCOL foreign key

(PROTOCOL_ID)


/

alter table INTEGRITYSTATINPUT

add constraint INTEGRITYSTATINPUT_FK01 foreign key

(INTEGRITY_STATISTIC_ID)

references INTEGRITYSTATISTIC (INTEGRITY_STATISTIC_ID) not

deferrable

/

alter table IONTRAP

add constraint FK_IONTRAP_REFERENCE_MZANALYS foreign key

(mz_analysis_ID)


/

alter table LABELMETHOD

add constraint FK_LABELEDMETHOD_PROTO foreign key (PROTOCOL_ID)


/

alter table LABELMETHOD

add constraint FK_LABELMETHOD_CHANNEL foreign key (CHANNEL_ID)


/

alter table LCCOLUMN

add constraint FK_LCCOLUMN_REFERENCE_BIOASSAY foreign key



/

alter table LISTPROCESSING

add constraint FK_LISTPROC_REFERENCE_PEAKLIST foreign key

(peak_list_ID)



/

alter table MAGEDOCUMENTATION

add constraint FK_MDOC_MAGEML foreign key (MAGE_ML_ID)

references MAGEML (MAGE_ML_ID) not deferrable

/

alter table MALDI

add constraint FK_MALDI_REFERENCE_IONSOURC foreign key

(ion_source_ID)


/

alter table MASSSPECEXPERIMENT

add constraint FK_MASSSPEC_REFERENCE_BIOASSAY foreign key



/

alter table MASSSPECEXPERIMENT

add constraint FK_MASSSPEC_REFERENCE_MASSSPEC foreign key

(MSMachineID)

references MASSSPECMACHINE (mass_spec_machine_ID)

/

alter table MASSSPECMACHINE

add constraint FK_MASSSPEC_REFERENCE_IONSOURC foreign key

(ion_source_ID)


/

alter table MATCHEDSPOTS

add constraint FK_MATCHED__REFERENCE_MULTIPLE foreign key

(multiple_analysis_ID)

references MULTIPLEANALYSIS (multiple_analysis_ID)

/

alter table MATCHEDSPOTS

add constraint FK_MATCHED__REFERENCE_IDENTIFI foreign key

(identified_spot_ID)


/

alter table MOBILEPHASECOMPONENT

add constraint FK_MOBILEPH_REFERENCE_LCCOLUMN foreign key

(lc_column)


/

alter table MSMSFRACTION

add constraint FK_MSMSFRAC_REFERENCE_PEAKLIST foreign key

(peak_list_ID)


/

alter table MULTIPLEANALYSIS

add constraint FK_MULTIPLE_REFERENCE_ONTOLOGY foreign key

(analysis_type)


/

alter table MULTIPLEANALYSIS

add constraint FK_MULTIPLE_REFERENCE_PROTOCOL foreign key

(PROTOCOL_ID)


/

alter table MULTIPLEANALYSISGELIA

add constraint FK_MULTIPLE_REFERENCE_MULTAG foreign key

(MULTIPLE_ANALYSIS_ID)

references MULTIPLEANALYSIS (MULTIPLE_ANALYSIS_ID)

/

alter table MULTIPLEANALYSISGELIA

add constraint FK_MULTIPLE_REFERENCE_MULTGIA foreign key



/

alter table MZANALYSIS

add constraint FK_MZANALYS_REFERENCE_DETECTIO foreign key

(detection_ID)

references DETECTION (detection_ID)

/

alter table ONTOLOGYENTRY

add constraint FK_ONTOLOGYENTRY_PARENT foreign key (PARENT_ID)


/

alter table OTHERIONISATION

add constraint FK_OTHERION_REFERENCE_IONSOURC foreign key

(ion_source_ID)


/

alter table OTHERIONISATION

add constraint FK_OTHERION_REFERENCE_ONTOLOGY foreign key

(ONTOLOGY_ENTRY_ID)


/

alter table OTHERMZANALYSIS

add constraint FK_OTHERMZA_REFERENCE_MZANALYS foreign key

(mz_analysis_ID)


/

alter table OTHERMZANALYSIS

add constraint FK_OTHERMZA_REFERENCE_ONTOLOGY foreign key


(ONTOLOGY_ENTRY_ID)


/

alter table PEAK

add constraint FK_PEAK_REFERENCE_PEAKLIST foreign key

(peak_list_ID)


/

alter table PEAKLIST

add constraint FK_PEAKLIST_REFERENCE_MASSSPEC foreign key

(mass_spec_experiment_ID)

references MASSSPECEXPERIMENT (mass_spec_experiment_ID)

/

alter table PEPTIDEHIT

add constraint FK_PEPTIDEH_REFERENCE_DATABASE foreign key

(database_entry_ID)

references DATABASEENTRY (database_entry_ID)

/

alter table PEPTIDEHIT

add constraint FK_PEPTIDEH_REFERENCE_DBSEARCH foreign key

(db_search_ID)

references DBSEARCH (db_search_ID)

/

alter table PERCENTX

add constraint FK_PERCENTX_REFERENCE_GRADIENT foreign key

(lc_column, GradientStep_ID)

references GRADIENTSTEP (lc_column, GradientStep_ID)

/

alter table PERCENTX

add constraint FK_PERCENTX_REFERENCE_MOBILEPH foreign key

(Percent_ID)

references MOBILEPHASECOMPONENT (id)

/

alter table PHYSICALGELITEM

add constraint FK_PHYSICAL_REFERENCE_BIOMATER foreign key

(BIO_MATERIAL_ID)


/


add constraint FK_PHYSICAL_REFERENCE_GEL1D foreign key

(Gel1D_ID)

references GEL1D (Gel1D_ID)

/


add constraint FK_PHYSICAL_REFERENCE_GEL2D foreign key (gel2D)

references GEL2D (Gel2D_ID)

/


add constraint FK_PHYSICAL_REFERENCE_PROTEINR foreign key

(ProteinRecord_ID)

references PROTEINRECORD (protein_record_ID)

/

alter table PROCESSIMPLEMENTATION

add constraint FK_PROCESSIMP_ONTO foreign key (PROCESS_TYPE_ID)


/

alter table PROCESSIMPLEMENTATIONPARAM

add constraint FK_PRCSIMPPARAM_PRCSIMP foreign key


references PROCESSIMPLEMENTATION (PROCESS_IMPLEMENTATION_ID)

not deferrable

/

alter table PROCESSINVOCATION

add constraint FK_PROCESS_PROCIMP foreign key


references PROCESSIMPLEMENTATION (PROCESS_IMPLEMENTATION_ID)

not deferrable

/

alter table PROCESSINVOCATIONPARAM

add constraint FK_PROCESSINVPARAM_PROCESSINV foreign key

(PROCESS_INVOCATION_ID)

references PROCESSINVOCATION (PROCESS_INVOCATION_ID) not

deferrable

/

alter table PROCESSINVQUANTIFICATION

add constraint FK_PROCESSINQUANT_P foreign key



deferrable

/

alter table PROCESSINVQUANTIFICATION

add constraint FK_PROCESSINQUANT_Q foreign key

(QUANTIFICATION_ID)


/

alter table PROCESSIO

add constraint FK_PROCESSEDRESULT21 foreign key

(OUTPUT_RESULT_ID)

references PROCESSRESULT (PROCESS_RESULT_ID) not deferrable

/

alter table PROCESSIO

add constraint FK_PROCESSIO_PROCESSINV foreign key




deferrable

/

alter table PROCESSIOELEMENT

add constraint FK_PROCESSIO foreign key (PROCESS_IO_ID)

references PROCESSIO (PROCESS_IO_ID) not deferrable

/

alter table PROCESSRESULT

add constraint FK_PROCESSRESULT_ONTO foreign key (UNIT_TYPE_ID)


/

alter table PROTEINHIT

add constraint FK_PROTEINH_REFERENCE_PEPTIDEH foreign key

(peptide_hit_ID)

references PEPTIDEHIT (peptide_hit_ID)

/

alter table PROTEINHIT

add constraint FK_PROTEINH_REFERENCE_PROTEINR foreign key

(ProteinRecord_ID)


/

ALTER TABLE PROTEINHIT ADD constraint FK_db_search_id foreign key

(db_search_ID)

references DBSEARCH (db_search_ID)

/

alter table PROTEINMODIFICATION

add constraint FK_PROTEINM_REFERENCE_ONTOLOGY foreign key

(modification_type)


/

alter table PROTEINMODIFICATION

add constraint FK_PROTEINM_REFERENCE_PROTEINR foreign key

(protein_record_ID)


/

alter table PROTEINRECORD

add constraint FK_PROTEINTAXONNAME foreign key (TAXON_NAME_ID)

references SRES_TAXONNAME (TAXON_NAME_ID)

/

alter table PROTEINRECORDENTRY

add constraint FK_PROTENTRYREC foreign key (protein_record_ID)


/

alter table PROTEINRECORDENTRY

add constraint FK_PROTENTRYDB foreign key (database_entry_ID)

references DATABASEENTRY (database_entry_ID)

/

alter table PROTEOMEASSAY

add constraint FK_PROTEOME_REFERENCE_PROTOCOL foreign key

(PROTOCOL_ID)


/

alter table PROTOCOL

add constraint FK_PROTOCOL_HDWTYPE_OE foreign key

(HARDWARE_TYPE_ID)


/


add constraint FK_PROTOCOL_PRTTYPE_OE foreign key

(PROTOCOL_TYPE_ID)


/


add constraint FK_PROTOCOL_SFWTYPE_OE foreign key

(SOFTWARE_TYPE_ID)


/

alter table PROTOCOLPARAM

add constraint FK_PROTOCOLPARAM_ONTO1 foreign key

(DATA_TYPE_ID)


/


add constraint FK_PROTOCOLPARAM_ONTO2 foreign key

(UNIT_TYPE_ID)


/


add constraint FK_PROTOCOLPARAM_PROTO foreign key (PROTOCOL_ID)


/

alter table QUADRUPOLE

add constraint FK_QUADRUPO_REFERENCE_MZANALYS foreign key

(mz_analysis_ID)


/

alter table QUANTIFICATION

add constraint FK_QUANT_ACQ foreign key (ACQUISITION_ID)


/

alter table QUANTIFICATION


add constraint FK_QUANT_PROTOCOL foreign key (PROTOCOL_ID)


/

alter table QUANTIFICATIONPARAM

add constraint FK_QUANTPARAM_PRTPRM foreign key

(PROTOCOL_PARAM_ID)


/

alter table QUANTIFICATIONPARAM

add constraint FK_QUANTPARAM_QUANT foreign key

(QUANTIFICATION_ID)


/

alter table RELATEDACQUISITION

add constraint FK_RELACQ_ACQ01 foreign key (ACQUISITION_ID)


/

alter table RELATEDACQUISITION

add constraint FK_RELACQ_ACQ02 foreign key

(ASSOCIATED_ACQUISITION_ID)


/

alter table RELATEDQUANTIFICATION

add constraint FK_RELQUANT_QUANT01 foreign key

(QUANTIFICATION_ID)


/

alter table RELATEDQUANTIFICATION

add constraint FK_RELQUANT_QUANT02 foreign key

(ASSOCIATED_QUANTIFICATION_ID)


/

alter table SPOTRATIO

add constraint FK_SPOTRATI_2_FK_DIGE_S_DIGES2 foreign key

(second_DIGESingleSpot_ID)

references DIGESINGLESPOT (DIGESingleSpot_ID)

/

alter table SPOTRATIO

add constraint FK_SPOTRATI_FK_DIGE_S_DIGESING foreign key

(first_DIGESingleSpot_ID)

references DIGESINGLESPOT (DIGESingleSpot_ID)

/

alter table STUDYASSAY

add constraint FK_STDYASSAY_ASSAY foreign key (ASSAY_ID)


/

alter table STUDYASSAY

add constraint FK_STDYASSAY_STDY foreign key (STUDY_ID)

references STUDY (STUDY_ID) not deferrable

/

alter table STUDYASSAYPROT

add constraint FK_STUDYASS_REFERENCE_PROTEOME foreign key

(PROTEOME_ASSAY_ID)


/

alter table STUDYASSAYPROT

add constraint FK_STUDYASS_REFERENCE_STUDY foreign key

(STUDY_ID)

references STUDY (STUDY_ID)

/

alter table STUDYBIOMATERIAL

add constraint FORMHELP_FK01 foreign key (STUDY_ID)


/

alter table STUDYBIOMATERIAL

add constraint FORMHELP_FK02 foreign key (BIO_MATERIAL_ID)


/

alter table STUDYDESIGN

add constraint FK_STDYDES_STDY foreign key (STUDY_ID)


/

alter table STUDYDESIGNASSAY

add constraint FK_STDYDESASSAY_ASSAY foreign key (ASSAY_ID)


/

alter table STUDYDESIGNASSAY

add constraint FK_STDYDESASSAY_STDYDES foreign key

(STUDY_DESIGN_ID)

references STUDYDESIGN (STUDY_DESIGN_ID) not deferrable

/

alter table STUDYDESIGNASSAYPROT

add constraint FK_STUDYDES_REFERENCE_STUDYDES foreign key

(STUDY_DESIGN_ID)

references STUDYDESIGN (STUDY_DESIGN_ID)

/

alter table STUDYDESIGNASSAYPROT

add constraint FK_STUDYDES_REFERENCE_PROTEOME foreign key

(PROTEOME_ASSAY_ID)



/

alter table STUDYDESIGNDESCRIPTION

add constraint FK_STDYDESDCR_STDYDES foreign key

(STUDY_DESIGN_ID)


/

alter table STUDYDESIGNTYPE

add constraint FK_STDYDESTYPE_ONTO foreign key

(ONTOLOGY_ENTRY_ID)


/

alter table STUDYDESIGNTYPE

add constraint FK_STDYDESTYPE_STDYDES foreign key

(STUDY_DESIGN_ID)


/

alter table STUDYFACTOR

add constraint FK_STDYFCTR_ONTO foreign key

(STUDY_FACTOR_TYPE_ID)


/

alter table STUDYFACTOR

add constraint FK_STDYFCTR_STDYDES foreign key

(STUDY_DESIGN_ID)


/

alter table STUDYFACTORVALUE

add constraint FK_STDYFCTRVAL_ASSAY foreign key (ASSAY_ID)


/


add constraint FK_STDYFCTRVAL_OEUNIT foreign key

(MEASUREMENT_UNIT_OE_ID)


/


add constraint FK_STDYFCTRVAL_OEVALUE foreign key

(VALUE_ONTOLOGY_ENTRY_ID)


/


add constraint FK_STDYFCTRVAL_STDYFCTR foreign key

(STUDY_FACTOR_ID)

references STUDYFACTOR (STUDY_FACTOR_ID) not deferrable

/

alter table STUDYFACTORVALUEPROT

add constraint FK_STUDYFAC_FK_SFV_ME_ONTOLOGY foreign key

(MEASUREMENT_UNIT_OE_ID)


/


add constraint FK_STUDYFAC_FK_STF_VA_ONTOLOGY foreign key

(VALUE_ONTOLOGY_ENTRY_ID)


/


add constraint FK_STUDYFAC_REFERENCE_PROTEOME foreign key

(PROTEOME_ASSAY_ID)


/


add constraint FK_STUDYFAC_REFERENCE_STUDYFAC foreign key

(STUDY_FACTOR_ID)

references STUDYFACTOR (STUDY_FACTOR_ID)

/

alter table TANDEMSEQUENCEDATA

add constraint FK_TANDEMSE_REFERENCE_DBSEARCH foreign key


references DBSEARCHPARAMETERS (db_search_parameters_ID)

/

alter table TOF

add constraint FK_TOF_REFERENCE_MZANALYS foreign key

(mz_analysis_ID)


/

alter table TREATEDANALYTE

add constraint FK_TREATEDA_REFERENCE_BIOMATER foreign key

(BIO_MATERIAL_ID)


/

alter table TREATEDANALYTE

add constraint FK_TREATEDA_REFERENCE_CHEMICAL foreign key

(chemical_treatment_ID)

references CHEMICALTREATMENT (chemical_treatment_ID)

/

alter table TREATMENT

add constraint FK_TREATMENT6 foreign key (BIO_MATERIAL_ID)


/

alter table TREATMENT


add constraint FK_TREATMENT7 foreign key (TREATMENT_TYPE_ID)


/

alter table TREATMENTPARAM

add constraint FK_TREATMENTPARAM_PRTOPRM foreign key

(PROTOCOL_PARAM_ID)


/

alter table TREATMENTPARAM

add constraint FK_TREATMENTPARAM_TREATMENT foreign key

(TREATMENT_ID)

references TREATMENT (TREATMENT_ID) not deferrable

/

Appendix D

Modelling and database storage of

difference gel data

D.1 Introduction

The focus of the thesis is to improve technology for the management and sharing of proteome

data arising from 2-DE and MS. Chapter 1 reports three case studies: (i) a host-parasite

interaction study, (ii) the study of changes in the proteome of cell culture with a knock-

out of the gene Raf-1, and (iii) the determination of the proteome of Trypanosoma brucei.

The three case studies were used to inform the development of Gla-PSI and the data from

case studies (i) and (iii) are stored in RAPAD, as reported in Chapters 6 and 7. Case

study (ii), performed at the Beatson Institute, focused on a difference gel electrophoresis

(DIGE) experiment to find differentially expressed proteins. However, the data from the

DIGE study did not become available for inclusion in RAPAD due to technical difficulties

with the experimental setup. DIGE is becoming a major technique in proteomic analysis

because it allows more accurate determination of relative protein volume between two or

more study groups than standard gel electrophoresis. As case studies (i) and (iii) utilised

standard 2-DE analysis, the main data sets used for testing the technology we developed

did not include DIGE data. The purpose of this Appendix is to demonstrate that Gla-PSI,

FGE-OM and RAPAD are capable of representing DIGE data.

Chapter 6 describes a study of the proteome of host cells when invaded with a parasite compared with non-invaded host cells, measured using standard 2-DE. The experiments have recently been extended to study the proteome using the DIGE technique. The following section (Section D.1.1) briefly describes the experimental methodology and Section D.2 illustrates how such DIGE data can be represented in Gla-PSI. Section D.3 describes how the same experiment can be captured in FGE-OM. The data has recently been added to RAPAD, as described in Section D.4, and can be viewed within the Gel Viewer.

Replicate   Cy2   Cy3    Cy5
1           S     Inf1   Non1
2           S     Inf2   Non2
3           S     Non3   Inf3
4           S     Non4   Inf4

Table D.1: Experimental plan for Cy labelling of proteins in the DIGE experiment with Toxoplasma gondii. S = pooled sample from all eight replicates, Inf1 = Infected sample replicate 1, Non1 = Non-infected sample replicate 1.

D.1.1 Host-parasite responses

This section briefly outlines a study to elucidate the changes in the proteome of a human cell culture when invaded with a parasite, compared with non-invaded cells. The study was performed by Morag Nelson, a PhD student, in the laboratory of Dr Jonathan Wastling at the Institute of Biomedical and Life Sciences, University of Glasgow. The DIGE investigation accompanies the standard 2-DE studies described in Chapter 6. RAPAD aids the information retrieval task, the combination of data across replicate gels and the comparison with microarray results. The hypothesis of the investigation and the generation of samples are detailed in Chapter 6; the following experimental procedure was used for the DIGE analysis.

Four biological replicates were performed (four infected HFF samples versus four non-

infected). The samples were labelled with Cy dyes as shown in Table D.1. A fifth gel was

run with pooled material from the non-infected and infected samples. The fifth gel was used

for generating samples for mass spectrometry (MS) to identify the proteins. The gels were

scanned and the gel images loaded into DeCyder™ software [74]. The software performs spot

matching across a series of gels and quantifies the difference in fluorescence, corresponding to

the relative abundance of a particular protein between the infected and non-infected samples.
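The labelling plan in Table D.1, and the kind of ratio that image analysis derives from it, can be sketched in a few lines of Python. This is an illustration only: the dictionary layout, the function names `dye_for` and `standardised_log_ratio`, and the example volumes are assumptions made here. The calculation is a generic normalisation against the pooled standard, not the algorithm implemented in DeCyder, and the sign convention (positive means higher in non-infected cells) is chosen for the example.

```python
import math

# Labelling plan from Table D.1: dye -> sample loaded on each replicate gel.
# "S" is the pooled standard; Inf/Non are infected/non-infected samples.
LABELLING_PLAN = {
    1: {"Cy2": "S", "Cy3": "Inf1", "Cy5": "Non1"},
    2: {"Cy2": "S", "Cy3": "Inf2", "Cy5": "Non2"},
    3: {"Cy2": "S", "Cy3": "Non3", "Cy5": "Inf3"},
    4: {"Cy2": "S", "Cy3": "Non4", "Cy5": "Inf4"},
}


def dye_for(replicate, condition):
    """Return the dye carrying the given condition ('Inf' or 'Non') on a gel."""
    for dye, sample in LABELLING_PLAN[replicate].items():
        if sample.startswith(condition):
            return dye
    raise KeyError(f"{condition} not found on replicate {replicate}")


def standardised_log_ratio(replicate, volumes):
    """log2(non-infected / infected), each divided by the pooled standard first.

    `volumes` maps dye name -> spot volume for one matched spot on one gel.
    Hypothetical helper: a generic normalisation, not DeCyder's algorithm.
    """
    standard = volumes["Cy2"]
    non = volumes[dye_for(replicate, "Non")] / standard
    inf = volumes[dye_for(replicate, "Inf")] / standard
    return math.log2(non / inf)


# Made-up volumes for one spot on replicate 1 (Cy3 = infected, Cy5 = non-infected).
example = {"Cy2": 1000.0, "Cy3": 500.0, "Cy5": 2000.0}
print(standardised_log_ratio(1, example))  # 2.0, i.e. four-fold higher in non-infected
```

Looking up the dye per replicate, rather than hard-coding Cy3 and Cy5, is what lets the same function handle replicates 3 and 4, where the dyes are swapped between the two conditions.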

D.2 Gla-PSI

[Figure D.1: UML class diagram, not reproduced here. Notes on the diagram: the stages preceding image analysis have been presented in the models MAGE (http://www.mged.org) and PEDRo (http://pedro.man.ac.uk). The legend distinguishes new classes in the model from classes derived from MAGE or PEDRo. All classes are subclasses of Identifiable and Describable (not shown); therefore, all classes can have an identifier attached and be linked to annotation classes. Attributes shown, all of type String: ExternalReference (exportedFromServer, exportedFromDB, exportID, exportName); OntologyEntry (category, value, description); StatisticalAnalysis (software, version, algorithm, dataFile, analysisType); Parameter (parameterType, parameterValue, parameterUnit); MultipleAnalysis (analysisType); SpotRefs (spotID).]

Figure D.1: The Gla-PSI model. The boxed classes are discussed in the text.

Gla-PSI is shown in Figure D.1 and the main classes that are used to store DIGE data are boxed. Figure D.2 demonstrates how classes in Gla-PSI capture the parasitology experiment described above. ExperimentDesign describes the purpose of the experiment (infected versus non-infected samples) and ExperimentParameters captures the replicates described in Table D.1 and the gel used for spot picking. ProteinPreparation captures the method of protein extraction, solubilisation and the attachment of Cy labels. Gla-PSI does not include attributes for ExperimentDesign, ExperimentParameters or ProteinPreparation; it is therefore left to individual systems designers to develop structures that adequately store the information from their domain of interest. 2D-PAGE has attributes for storing the details of gel separation. There are four gels described in this experiment that are part of the DIGE analysis. Within the DIGE analysis there are three scanned images from each 2-D gel, corresponding to scanning the gel at three fluorescence wavelengths (DIGESingleImage). Each image produces a set of spots stored in DIGESingleSpot. DeCyder software can also produce a composite image by combining the three images, and data from spots measured on the composite image are stored in Spot. The composite image is stored in ScannedImage (to the right of DIGESingleSpot in Figure D.2). Spots that have been matched across the replicates are captured in the classes MultipleAnalysis, MatchedSpots and SpotRefs. All classes can be attached to Parameter for capturing attributes not represented explicitly in the model. MultipleAnalysis also links to a separate instance of 2D-PAGE, which captures the gel used to generate a picklist for MS analysis. The database accession numbers for the proteins identified can be stored in Protein, and the model allows for single spots to be matched to more than one protein, as multiple proteins migrating to the same spot is a common phenomenon in 2-DE. The image analysis software can match spots across the set of DIGE gels and the picklist gel. The ID numbers of matched spots are captured in SpotRefs, which links composite spots on DIGE gels to spots from the picklist gel, thereby associating the protein identifications with the relative abundance values for each spot on the DIGE gels.
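The way SpotRefs ties protein identifications on the picklist gel to composite spots on the DIGE gels can be illustrated with a small Python sketch. The class names follow Gla-PSI, and `dyeLabel`, `volume` and `spotID` appear in the figures, but everything else is an assumption made for illustration: the field names, the helper `proteins_for`, the spot identifiers and the placeholder accession strings. The three volumes are the example values shown in Figure D.2; their pairing with particular dyes is assumed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DIGESingleSpot:
    volume: float  # spot volume measured in one dye channel


@dataclass
class DIGESingleImage:
    dye_label: str  # "Cy2", "Cy3" or "Cy5"
    spots: List[DIGESingleSpot] = field(default_factory=list)


@dataclass
class Protein:
    accession: str  # database accession number (placeholders below)


@dataclass
class Spot:
    spot_id: str
    proteins: List[Protein] = field(default_factory=list)  # may hold several proteins


@dataclass
class DIGEAnalysis:
    images: List[DIGESingleImage] = field(default_factory=list)  # three per 2-D gel
    composite_spots: List[Spot] = field(default_factory=list)


@dataclass
class SpotRefs:
    dige_spot_id: str      # composite spot on a DIGE gel
    picklist_spot_id: str  # matched spot on the picklist gel


# One DIGE gel scanned at three wavelengths, with one spot per image.
gel = DIGEAnalysis(
    images=[
        DIGESingleImage("Cy2", [DIGESingleSpot(12643)]),
        DIGESingleImage("Cy3", [DIGESingleSpot(3456)]),
        DIGESingleImage("Cy5", [DIGESingleSpot(17006)]),
    ],
    composite_spots=[Spot("dige-1")],
)
# A picklist-gel spot in which two proteins co-migrated (hypothetical accessions).
picklist_spot = Spot("pick-7", [Protein("ACC_A"), Protein("ACC_B")])
match = SpotRefs(dige_spot_id="dige-1", picklist_spot_id="pick-7")


def proteins_for(dige_spot_id, refs, picklist_spots):
    """Follow SpotRefs from a DIGE composite spot to the identified proteins."""
    by_id = {spot.spot_id: spot for spot in picklist_spots}
    return [
        protein.accession
        for ref in refs
        if ref.dige_spot_id == dige_spot_id
        for protein in by_id[ref.picklist_spot_id].proteins
    ]


print(proteins_for("dige-1", [match], [picklist_spot]))  # ['ACC_A', 'ACC_B']
```

Keeping the match in a separate SpotRefs object, rather than as a field on Spot, mirrors the model's allowance for one DIGE spot to be matched to several picklist spots and vice versa.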

D.3 FGE-OM

[Figure D.2: instance diagram, not reproduced here. Annotations recoverable from the figure: ExperimentDesign, ExperimentParameters and ProteinPreparation capture details of samples and the labelling system; 2D-PAGE holds separation details; ScannedImage holds scanning and image details (attached to all images) and also the composite image; DIGEAnalysis has three DIGESingleImage instances (dyeLabel = Cy2, Cy3 and Cy5), with example DIGESingleSpot volumes of 12643, 3456 and 17006; Spot holds coordinates and volume; SpotSets is a collection of composite spots from one 2-D gel; ImageAnalysis holds software, version and parameters; MatchedSpots holds clusters of spots matched across gels and SpotRefs allows for multiple matches; a separate 2D-PAGE, ScannedImage and Spot, with Protein (database ids), capture the gel used for spot picking and identification of proteins.]

Figure D.2: A DIGE experiment represented in Gla-PSI. The boxes represent classes in the model and the text in each box is a comment to describe the purpose of the class or example attributes and values. The lines indicate relationships between classes and the numbers represent the relative number of classes (cardinality) that participate in the relationship.

The workflow in Figure D.3 demonstrates the representation in FGE-OM of DIGE data from the parasitology study described above. At the top level the experimental hypothesis is captured in the class Experiment, which is attached to ExperimentFactor (not shown) that represents the difference between the infected and the non-infected cell lines. One instance of Experiment is indirectly related (via BioAssay) to two instances of BioMaterial (one for infected, one for non-infected). A series of treatments is performed on the BioMaterial, including pooling samples after they have been labelled with the Cy dyes. The output of the set of treatments is the four samples shown in Table D.1 and the sample for the prep gel, which are represented in BioMaterial, and each BioMaterial is associated with an instance of BioAssayTreatment. BioAssayTreatment is a superclass from which specific types of treatment can inherit relationships. In this case, an instance of Gel2D is used to capture the details of the two-dimensional separation. BioAssayTreatment is associated with the PhysicalBioAssay class that is used for linking various classes together. ImageAcquisition (scanning) is linked to a source PhysicalBioAssay and an output PhysicalBioAssay that is related to the four images: three from scanning at the three fluorescence wavelengths and one composite image (captured in Image, Channel and the image format in OntologyEntry). The gel that is used for spot picking is modelled in the same way but the Channel class is not required as the gel does not contain fluorescent labels and has not been scanned at a particular wavelength. Gel image analysis is modelled by the general MAGE-OM derived class FeatureExtraction and a more specific class GelImageAnalysis. The five different gels are related to each other through the class MultipleAnalysis. MeasuredBioAssay relates the image analysis event to the data via MeasuredBioAssayData (not shown). There is a relationship to the class BioDataTuples that stores rows of gel spot data, with relationships to IdentifiedSpot and DIGESingleSpot. IdentifiedSpot represents composite spot information (from the image combined across the three channels) and DIGESingleSpot stores attributes of spots in the single channel images. Spots that are matched across more than one gel, for example matched between the standard gel used for MS analysis and the DIGE gels, are stored in MatchedSpots, which is related to the class MultipleAnalysis. IdentifiedSpot and DIGESingleSpot have various attributes that are measured by image analysis, such as relative volumes, ratios between the different channels and a spot's coordinates. Complete class diagrams showing all the attributes are displayed in Appendix B.
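The image bookkeeping in this workflow (four images per DIGE gel, one image for the spot-picking gel, and a Channel only where a fluorescence wavelength applies) can be checked with a short Python sketch. The `Image` record and the two helper functions are hypothetical constructs for this illustration, not FGE-OM classes in full, and treating the composite image as having no Channel is an assumption made here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Image:
    gel: str
    channel: Optional[str]   # None when no Channel instance is attached
    composite: bool = False  # True for the image combined across channels


def images_for_dige_gel(gel):
    """Three single-wavelength images plus one composite image."""
    single = [Image(gel, dye) for dye in ("Cy2", "Cy3", "Cy5")]
    return single + [Image(gel, None, composite=True)]


def images_for_picklist_gel(gel):
    """The spot-picking gel carries no fluorescent label, so no Channel."""
    return [Image(gel, None)]


images = []
for number in range(1, 5):  # the four DIGE gels of Table D.1
    images += images_for_dige_gel(f"DIGE gel {number}")
images += images_for_picklist_gel("picklist gel")

print(len(images))                                       # 17
print(sum(1 for i in images if i.channel is not None))   # 12
```

The total of 17 agrees with the image count shown against Image in Figure D.3.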

D.4 RAPAD

[Figure D.3: workflow diagram, not reproduced here. Annotations recoverable from the figure: Experiment (hypothesis and parameters); BioMaterial and Treatment (protein solubilisation and labelling; treatments produce 5 BioMaterials, 4 for DIGE and 1 for the standard gel); Gel2D and BioAssayTreatment (separation details); PhysicalBioAssay; ImageAcquisition (scanning protocol); Image, with Channel (scanning wavelength) and a link to OntologyEntry for image format; FeatureExtraction and GelImageAnalysis (protocol for gel image analysis); MultipleAnalysis and MatchedSpots (spots matched across gels); MeasuredBioAssay; BioDataTuples; IdentifiedSpot (composite spot information); DIGESingleSpot. The picklist gel (non-DIGE) is stored in the same structures but does not require the Channel class.]

Figure D.3: A DIGE study represented in FGE-OM. The boxes represent classes in the model and the text in each box is a comment to describe the purpose of the class. The lines indicate relationships between classes and the numbers represent the relative number of classes (cardinality) that participate in the relationship for this experiment.

The parasitology study described above has recently been entered into RAPAD. There are 17 gel images in total in the study, corresponding to four images from each of the four DIGE gels (three different wavelengths and a composite image) plus a single gel image from the prep gel used for MS identification. Figure D.4 displays screenshots of the prep gel visualised in the Gel Viewer. The DeCyder™ software calculates the relative volume ratio between the two study conditions (infected versus non-infected) across all four DIGE gels. The ratio of volumes is stored in the IdentifiedSpot table, linked to the prep gel (stored in Gel2D, ProteomeAssay, ImageAcquisition and GelImageAnalysis, described in Chapter 5). Organising the data in this way allows a simple visualisation of the up- or down-regulated proteins without needing to examine the entire series of images, because the Gel Viewer allows the user to search on spot volume. All the DIGE gel images are stored in RAPAD within the same study and can also be loaded concurrently with the prep gel in the Gel Viewer. Proteins with a volume ratio greater than zero are present in higher abundance in non-infected cells, and those with a ratio less than zero are in higher abundance in infected cells.
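The search on stored volume ratios described above amounts to a sign test on one column. The following Python sketch uses an in-memory SQLite database to show the idea. It is not the RAPAD schema: the table `identified_spot`, its columns `spot_label` and `volume_ratio`, and the four rows of data are all hypothetical, and RAPAD itself is defined with Oracle DDL rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified stand-in for the IdentifiedSpot table.
conn.execute(
    """
    CREATE TABLE identified_spot (
        identified_spot_id INTEGER PRIMARY KEY,
        spot_label         TEXT NOT NULL,
        volume_ratio       REAL NOT NULL
    )
    """
)
# Made-up ratios for four spots on the prep gel.
conn.executemany(
    "INSERT INTO identified_spot (spot_label, volume_ratio) VALUES (?, ?)",
    [("spot A", 2.4), ("spot B", -1.8), ("spot C", 1.1), ("spot D", -3.2)],
)


def labels(query):
    """Run a query returning one column and flatten the rows to a list."""
    return [label for (label,) in conn.execute(query)]


# Ratio > 0: higher abundance in non-infected cells (convention from the text).
higher_in_non_infected = labels(
    "SELECT spot_label FROM identified_spot "
    "WHERE volume_ratio > 0 ORDER BY spot_label"
)
# Ratio < 0: higher abundance in infected cells.
higher_in_infected = labels(
    "SELECT spot_label FROM identified_spot "
    "WHERE volume_ratio < 0 ORDER BY spot_label"
)

print(higher_in_non_infected)  # ['spot A', 'spot C']
print(higher_in_infected)      # ['spot B', 'spot D']
```

Because the ratio is stored once against the prep gel spot, both queries touch a single table and never need to read the 16 DIGE images.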

RAPAD contains microarray, standard 2-DE and DIGE data for HFF cells invaded with

T. gondii. This means that comparisons can be made between the level of gene expression

and protein abundance as measured by more than one technique. This allows for validation

of the experimental methodology, and the derivation of significant biological information

about the proteins modulated in response to parasite invasion of host cells.


Figure D.4: Relative protein abundance data calculated from DIGE can be viewed in the Gel Viewer via the gel used for protein identification by MS. The user can query for proteins down-regulated (panel A) or proteins up-regulated (panel B) in the Gel Viewer.

Bibliography

[1] S. Abiteboul, S. Cluet, V. Christophides, T. Milo, G. Moerkotte, and J. Simeon. Querying Documents in Object Databases. Int. J. on Digital Libraries, 1:5–19, 1997.

[2] F. Achard, G. Vaysseix, and E. Barillot. XML, bioinformatics and data integration. Bioinformatics, 17:115–125, 2001.

[3] C. Adessi, C. Miege, C. Albrieux, and T. Rabilloud. Two-dimensional electrophoresis of membrane proteins: A current challenge for immobilized pH gradients. Electrophoresis, 18:127–135, 1997.

[4] R. Aebersold and M. Mann. Mass spectrometry-based proteomics. Nature, 422:198–207, 2003.

[5] Affymetrix. http://www.affymetrix.com/.

[6] J. W. Ajioka, J. M. Fitzpatrick, and C. P. Reitter. Toxoplasma gondii genomics: shedding light on pathogenesis and chemotherapy. Expert Rev Mol Med., 2001:1–19, 2001.

[7] F. Al-Shahrour, R. Diaz-Uriarte, and J. Dopazo. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 20:578–580, 2004.

[8] A. Alban, S. O. David, L. Bjorkesten, C. Andersson, E. Sloge, S. Lewis, and I. Currie. A novel experimental design for comparative two-dimensional gel analysis: two-dimensional difference gel electrophoresis incorporating a pooled internal standard. Proteomics, 3:36–44, 2003.

[9] J. Allen, H. M. Davey, D. Broadhurst, J. K. Heald, J. J. Rowland, S. G. Oliver, and D. B. Kell. High-throughput classification of yeast mutants for functional genomics using metabolic footprinting. Nat Biotechnol., 21:692–696, 2003.

[10] AllGenes: a web site providing access to an integrated database of known and predicted human and mouse genes. (version 6.0, 2003) Center for Bioinformatics, University of Pennsylvania. http://www.allgenes.org.

[11] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25:3389–3402, 1997.

[12] AmiGO. http://www.godatabase.org/.

[13] Analytical Information Markup Language (AnIML). http://animl.sourceforge.net/.


[14] Apache Xindice. http://xml.apache.org/xindice/.

[15] Applied Biosystems. http://www.appliedbiosystems.com.

[16] ArrayExpress at the EBI. http://www.ebi.ac.uk/arrayexpress/.

[17] G. Arrizabalaga and J. C. Boothroyd. Role of calcium during Toxoplasma gondiiinvasion and egress. Int J Parasitol., 34:361–368, 2004.

[18] ASTM International. http://www.astm.org.

[19] M. P. Atkinson, L. Daynes, M. J. Jordan, T. Printezis, and S. Spence. An OrthogonallyPersistent Java. SIGMOD Record, 25(4):68–75, 1996.

[20] G. Babnigg and C. S. Giometti. GELBANK: a database of annotated two-dimensionalgel electrophoresis patterns of biological systems with completed genomes. NucleicAcids Res., 32:D582–D585, 2004.

[21] A. Bahl, B. Brunk, R. L. Coppel, J. Crabtree, S. J. Diskin, M. J. Fraunholz, et al.PlasmoDB: The Plasmodium Genome Resource. An integrated database providingtools for accessing and analyzing mapping, expression, and sequence data (both finishedand unfinished). Nucleic Acids Res., 30:87–90, 2002.

[22] P. G. Baker, C. A. Goble, S. Bechhofer, N. W. Paton, R. Stevens, and A. Brass. AnOntology for Bioinformatics Applications. Bioinformatics, 15:510–520, 1999.

[23] C. A. Ball, G. Sherlock, and H. Parkinson. An open letter to the scientific journals.Science, 298:539, 2002.

[24] C. A. Ball, G. Sherlock, and H. Parkinson. An open letter to the scientific journals. Bioinformatics, 18:1409, 2002.

[25] C. A. Ball, G. Sherlock, and H. Parkinson. An open letter to the scientific journals. The Lancet, 360:1019, 2002.

[26] M. P. Barrett. The fall and rise of sleeping sickness. The Lancet, 353:1113–1114, 1999.

[27] J. D. Barry. The relative significance of mechanisms of antigenic variation in African trypanosomes. Parasitology Today, 13:203–244, 1997.

[28] S. Bechhofer, I. Horrocks, C. Goble, and R. Stevens. OilEd: a reason-able ontology editor for the semantic web. In Proceedings of KI2001, Joint German/Austrian conference on Artificial Intelligence, pages 396–408, 2001.

[29] C. J. Beckers, J. F. Dubremetz, O. Mercereau-Puijalon, and K. A. Joiner. The Toxoplasma gondii rhoptry protein ROP 2 is inserted into the parasitophorous vacuole membrane, surrounding the intracellular parasite, and is exposed to the host cell cytoplasm. J Cell Biol., 127:947–961, 1994.

[30] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler. GenBank. Nucleic Acids Res., 31:23–27, 2003.

[31] T. Berners-Lee and J. Hendler. Nature Debates: Scientific publishing on the ‘semantic web’. http://www.nature.com/nature/debates/e-access/Articles/bernerslee.htm.

[32] BIND at Blueprint. http://www.blueprint.org/bind/bind.php.

[33] Bioinformatic Harvester, Collection of all human (non fragmented) SWALL proteins and their cross references to the major bioinformatic databases. http://harvester.embl.de/.

[34] BioJava. http://www.biojava.org.

[35] I. J. Blader, I. D. Manger, and J. C. Boothroyd. Microarray analysis reveals previously unknown changes in Toxoplasma gondii-infected human cells. J Biol Chem., 276:24223–24231, 2001.

[36] B. Blagoev, I. Kratchmarova, S. E. Ong, M. Nielsen, L. J. Foster, and M. Mann. A proteomics strategy to elucidate functional protein-protein interactions applied to EGF signaling. Nat Biotechnol., 21:315–318, 2003.

[37] B. Boeckmann, A. Bairoch, R. Apweiler, M. C. Blatter, A. Estreicher, E. Gasteiger, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31:365–370, 2003.

[38] S. Boldt, U. H. Weidle, and W. Kolch. The role of MAPK pathways in the action of chemotherapeutic drugs. Carcinogenesis, 23:1831–1838, 2002.

[39] S. Bowers and B. Ludascher. An Ontology-Driven Framework for Data Transformation in Scientific Workflows. In Proceedings of the International Workshop on Data Integration in Life Sciences, Lecture Notes in Computer Science, volume 2994, pages 1–16, 2004.

[40] Tim Bray. What is RDF? http://www.xml.com/pub/a/2001/01/24/rdf.html.

[41] A. Brazma, P. Hingamp, J. Quackenbush, G. Sherlock, P. Spellman, C. Stoeckert, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Genet., 29:365–71, 2001.

[42] A. Brazma, A. Robinson, G. Cameron, and M. Ashburner. One-stop shop for microarray data - Is a universal, public DNA-microarray database a realistic goal? Nature, 403:699–700, 2000.

[43] P. Buneman, M. Grohe, and C. Koch. Path Queries on Compressed XML. In Proceedings of 29th International Conference on Very Large Data Bases, Berlin, Germany, pages 141–152, 2003.

[44] Peter Buneman. Semistructured data. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 117–121, 1997.

[45] A. Burger, D. Davidson, and R. Baldock. Formalization of Mouse Embryo Anatomy. Bioinformatics, 20:259–267, 2004.

[46] E. Camon, M. Magrane, D. Barrell, D. Binns, W. Fleischmann, P. Kersey, et al. The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res., 13:662–672, 2003.

[47] D. Carlson. Modeling XML Applications with UML: Practical e-Business Applications. Addison-Wesley, 2001.

[48] S. Carr, R. Aebersold, M. Baldwin, A. Burlingame, K. Clauser, and A. Nesvizhskii. The need for guidelines in publication of peptide and protein identification data: Working Group on Publication Guidelines for Peptide and Protein Identification Data. Mol Cell Proteomics., 3:531–533, 2004.

[49] V. B. Carruthers. Host cell invasion by the opportunistic pathogen Toxoplasma gondii. Acta Trop., 81:111–122, 2002.

[50] J. I. Castrillo and S. G. Oliver. Yeast as a Touchstone in Post-genomic Research: Strategies for Integrative Analysis in Functional Genomics. J Biochem Mol Biol., 37:93–106, 2004.

[51] CellML. http://www.cellml.org/.

[52] S. Celniker, D. Wheeler, B. Kronmiller, J. Carlson, A. Halpern, S. Patel, et al. Finishing a whole-genome shotgun: Release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biol., 3:research0079.1–0079.14, 2002.

[53] Chagas disease information. The UNICEF-UNDP-World Bank-WHO Special Programme for Research and Training in Tropical Diseases. http://www.who.int/tdr/diseases/chagas/diseaseinfo.htm.

[54] K. H. Cheung, K. White, and J. Hager. YMD: A microarray database for large-scale gene expression analysis. In Proceedings of the American Medical Informatics Association Annual Symposium, pages 140–144, 2002.

[55] The Chipping Forecast. Supplement to Nat Genet., 21:1–60, 1999.

[56] S. Cho, S. G. Park, D. H. Lee, and B. Chul. Protein-protein Interaction Networks: from Interactions to Networks. J Biochem Mol Biol., 37:45–52, 2004.

[57] D. Christendat, A. Yee, A. Dharamsi, Y. Kluger, A. Savchenko, J. R. Cort, et al. Structural proteomics of an archaeon. Nat Struct Biol., 7:903–909, 2000.

[58] M. Clamp, D. Andrews, D. Barker, P. Bevan, G. Cameron, Y. Chen, et al. Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res., 31:38–42, 2003.

[59] J-M. Claverie. What If There Are Only 30,000 Human Genes? Science, 291:1255–1257, 2001.

[60] C. E. Clayton. Life without transcriptional control? from fly to man and back again. EMBO J., 21:1881–1888, 2002.

[61] A. M. Cohen, K. Rumpel, G. H. Coombs, and J. M. Wastling. Characterisation of global protein expression by two-dimensional electrophoresis and mass spectrometry: proteomics of Toxoplasma gondii. Int J Parasitol., 32:39–51, 2002.

[62] B. Cooper, N. Sample, M. J. Franklin, G. R. Hjaltason, and M. Shadmon. A Fast Index for Semistructured Data. In Proceedings of 27th International Conference on Very Large Data Bases, pages 341–350, 2001.

[63] Cprogramming.com - Your Resource for C++ Programming. http://www.cprogramming.com/.

[64] F. Crick. Central Dogma of Molecular Biology. Nature, 227:561–563, 1970.

[65] Database of Interacting Proteins (DIP). http://dip.doe-mbi.ucla.edu/.

[66] C. J. Date. An Introduction to Database Systems - Volume 1, 6th Edition. Addison-Wesley, 1995.

[67] S. Davidson, J. Crabtree, B. Brunk, J. Schug, V. Tannen, G. C. Overton, and C. J. Stoeckert Jr. K2/Kleisli and GUS: Experiments in integrated access to genomic data sources. IBM Systems Journal, 40(2):512–531, 2001.

[68] S. B. Davidson, G. C. Overton, V. Tannen, and L. Wong. BioKleisli: A Digital Library for Biomedical Researchers. Int. J. on Digital Libraries, 1:36–53, 1997.

[69] T. N. Davis. Protein localization in proteomics. Curr Opin Chem Biol., 8:49–53, 2004.

[70] DB2 published by IBM. http://www.ibm.com/.

[71] DBLP, Computer Science Bibliography. http://dblp.uni-trier.de/.

[72] The DDBJ/EMBL/GenBank Feature Table: Definition. http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html.

[73] S. V. de Avalos, I. J. Blader, M. Fisher, J. C. Boothroyd, and B. A. Burleigh. Immediate/Early Response to Trypanosoma cruzi Infection Involves Minimal Modulation of Host Cell Transcription. J. Biol. Chem., 277:639–644, 2002.

[74] DeCyder™ published by Amersham Biosciences. http://www.apbiotech.com/.

[75] The definition of Document Type Definition (DTD). http://www.w3.org/TR/REC-html40/sgml/dtd.html.

[76] J. DeRisi, L. Penland, P. O. Brown, M. L. Bittner, P. S. Meltzer, M. Ray, Y. Chen, Y. A. Su, and J. M. Trent. Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat Genet., 14:457–460, 1996.

[77] A. Deutsch, M. Fernandez, and D. Suciu. Storing semistructured data with STORED. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 431–442, 1999.

[78] M. Diehn, G. Sherlock, G. Binkley, H. Jin, J. C. Matese, and T. Hernandez-Boussard. SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res., 31:219–223, 2003.

[79] H. Dlugonska, K. Dytnerska, G. Reichmann, S. Stachelhaus, and H. G. Fischer. Towards the Toxoplasma gondii proteome: position of 13 parasite excretory antigens on a standardized map of two-dimensionally separated tachyzoite proteins. Parasitol Res., 87:634–637, 2001.

[80] DNA Data Bank of Japan. http://www.ddbj.nig.ac.jp/.

[81] A. Doan, P. Domingos, and A. Levy. Learning Source Descriptions for Data Integration. In Proceedings of the International Workshop on The Web and Databases (WebDB), 2000.

[82] Document Object Model (DOM). http://www.w3.org/DOM/.

[83] A. W. Dowsey, M. J. Dunn, and G. Z. Yang. The role of bioinformatics in two-dimensional gel electrophoresis. Proteomics, 3:1567–1596, 2003.

[84] B. Edde, J. Rossier, J-P. LeCaer, F. Desbruyeres, F. Gros, and P. Denoulet. Post-translational glutamylation of alpha-tubulin. Science, 247:83–85, 1990.

[85] R. Edgar, M. Domrachev, and A. E. Lash. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30:207–210, 2002.

[86] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A., 95:14863–14868, 1998.

[87] N. M. El-Sayed, E. Ghedin, J. Song, A. MacLeod, F. Bringaud, C. Larkin, et al. The sequence and analysis of Trypanosoma brucei chromosome II. Nucleic Acids Res., 31:4856–4863, 2003.

[88] The Electronic Statistics Textbook. http://www.statsoftinc.com/textbook/stathome.html.

[89] R. A. Elmasri and S. B. Navathe. Fundamentals of Database Systems, 3rd edition. Addison-Wesley, 2000.

[90] EMAP: The Edinburgh Mouse Atlas Project. http://genex.hgu.mrc.ac.uk/.

[91] The EMBL Nucleotide Sequence Database. http://www.ebi.ac.uk/embl/.

[92] EMBOSS. http://www.hgmp.mrc.ac.uk/Software/EMBOSS/.

[93] J. Eng and J. Yates. SEQUEST. http://fields.scripps.edu/sequest/.

[94] Ensembl Genome Browser. http://www.ensembl.org/.

[95] Ensembl Trace Server. http://trace.ensembl.org/.

[96] Enterprise Architect v 4.1, published by Sparx Systems. http://www.sparxsystems.com.au/.

[97] Entrez, The Life Sciences Search Engine. http://www.ncbi.nih.gov/Entrez/.

[98] Ettan DIGE: Fluorescence 2D Difference Gel Electrophoresis. http://www.amershambiosciences.com/proteomics/dige/.

[99] T. Etzold, A. Ulyanow, and P. Argos. SRS: Information Retrieval System for Molecular Biology Data Banks. Methods Enzymol., 266:114–128, 1996.

[100] eVOC: The Human Gene Expression VOCabulary. http://www.sanbi.ac.za/evoc/.

[101] Extensible Markup Language (XML). http://www.w3c.org/XML/.

[102] J. B. Fenn, M. Mann, C. K. Meng, S. F. Wong, and C. M. Whitehouse. Electrospray ionization for mass spectrometry of large biomolecules. Science, 246:64–71, 1989.

[103] S. B. Ficarro, M. L. McCleland, P. T. Stukenberg, D. J. Burke, M. M. Ross, J. Shabanowitz, D. F. Hunt, and F. M. White. Phosphoproteome analysis by mass spectrometry and its application to Saccharomyces cerevisiae. Nat Biotechnol., 20:301–305, 2002.

[104] T. Fiebig, S. Helmer, C-C. Kanne, G. Moerkotte, J. Neumann, R. Schiele, and T. Westmann. Anatomy of a native XML base management system. VLDB J., 11:292–314, 2002.

[105] O. Fiehn, J. Kopka, R. N. Trethewey, and L. Willmitzer. Identification of uncommon plant metabolites based on calculation of elemental compositions using gas chromatography and quadrupole mass spectrometry. Anal Chem., 72:3573–3580, 2000.

[106] H. I. Field, D. Fenyo, and R. C. Beavis. RADARS, a bioinformatics solution that automates proteome mass spectral analysis, optimises protein identification, and archives data in a relational database. Proteomics, 2:36–47, 2002.

[107] S. Fields and O. Song. A novel genetic system to detect protein-protein interactions. Nature, 340:245–246, 1989.

[108] A. Fire, S. Xu, M. K. Montgomery, S. A. Kostas, S. E. Driver, and C. C. Mello. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature, 391:806–811, 1998.

[109] G. Fischer, S. M. Ibrahim, G. A. Brockmann, J. Pahnke, E. Bartocci, H-J. Thiesen, P. Serrano-Fernandez, and S. Moller. Expressionview: visualization of quantitative trait loci and gene-expression data in Ensembl. Genome Biol., 4:R77, 2003.

[110] L. Florens, M. P. Washburn, J. D. Raine, R. M. Anthony, M. Grainger, J. D. Haynes, et al. A proteomic view of the Plasmodium falciparum life cycle. Nature, 419:520–526, 2002.

[111] D. Florescu and D. Kossmann. Storing and Querying XML Data using an RDMBS. IEEE Data Engineering Bulletin, 22:27–34, 1999.

[112] FlyBase: A Database of the Drosophila Genome. http://www.flybase.org.

[113] R. Fogh, J. Ionides, E. Ulrich, W. Boucher, W. Vranken, J. P. Linge, et al. The CCPN project: an interim report on a data model for the NMR community. Nat Struct Biol., 9:416–418, 2002.

[114] A. Freier, R. Hofestadt, M. Lange, U. Scholz, and A. Stephanik. BioDataServer: A SQL-based service for the online integration of life science data. In Silico Biol., 2:37–57, 2002.

[115] B. Futcher, G. I. Latter, P. Monardo, C. S. McLaughlin, and J. I. Garrels. A sampling of the yeast proteome. Mol Cell Biol., 19:7357–7368, 1999.

[116] M. Gail, U. Gross, and W. Bohne. Transcriptional profile of Toxoplasma gondii-infected human fibroblasts as revealed by gene-array hybridization. Mol Genet Genomics., 265:905–912, 2001.

[117] M. Y. Galperin. The Molecular Biology Database Collection: 2004 update. Nucleic Acids Res., 32, Database issue:D3–D22, 2004.

[118] H. Garcia-Molina, J. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice Hall, 2002.

[119] M. Gardiner-Garden and T. G. Littlejohn. A comparison of microarray databases. Brief. Bioinformatics, 2:143–158, 2001.

[120] A. C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, and A. Bauer. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415:141–147, 2002.

[121] GenAtlas. http://www.genatlas.org/.

[122] Genbank. http://www.ncbi.nlm.nih.gov/Genbank/.

[123] Gene Expression Omnibus (GEO). http://www.ncbi.nlm.nih.gov/geo/.

[124] The Gene Ontology Consortium. http://www.geneontology.org/.

[125] The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nat Genet., 25:25–29, 2000.

[126] The Gene Ontology Consortium. Creating the gene ontology resource: design and implementation. Genome Res., 11:1425–1433, 2001.

[127] GeneDB. http://www.genedb.org.

[128] Generalized Analytical Markup Language. http://www.gaml.org.

[129] S. Gharbi, P. Gaffney, A. Yang, M. J. Zvelebil, R. Cramer, M. D. Waterfield, and J. F. Timms. Evaluation of two-dimensional differential gel electrophoresis for proteomic expression analysis of a model breast cancer cell system. Mol Cell Proteomics., 1:91–98, 2002.

[130] M. Girolami and R. Breitling. Biologically valid linear factor models of gene expression. Bioinformatics, 20:3021–3033, 2004.

[131] G. V. Gkoutos, P. Murray-Rust, H. S. Rzepa, and M. Wright. Chemical markup, XML and the World-Wide Web. 3. Toward a signed semantic chemical web of trust. J Chem Inf Comput Sci., 41:1124–1130, 2001.

[132] The Global Grid Forum (GGF). http://www.gridforum.org/.

[133] C. A. Goble. The Semantic Web: A Killer App for AI? In Artificial Intelligence: Methodology, Systems, and Applications, 10th International Conference, AIMSA 2002, Varna, Bulgaria, pages 274–278, 2002.

[134] J. Gollub, C. A. Ball, G. Binkley, J. Demeter, D. B. Finkelstein, J. M. Hebert, et al. The Stanford Microarray Database: data access and quality assessment tools. Nucleic Acids Res., 31:94–96, 2003.

[135] A. Gorg, C. Obermaier, G. Boguth, A. Harder, B. Scheibe, R. Wildgruber, and W. Weiss. The current state of two-dimensional electrophoresis with immobilized pH gradients. Electrophoresis, 21:1037–1053, 2000.

[136] A. Gorg, W. Postel, and S. Gunther. The current state of two-dimensional electrophoresis with immobilized pH gradients. Electrophoresis, 9:531–546, 1988.

[137] P. R. Graves and T. A. Haystead. Molecular biologist’s guide to proteomics. Microbiol Mol Biol Rev., 66:39–63, 2002.

[138] T. R. Gruber. A translation approach to portable ontologies. Knowledge Acquisition, 5:199–220, 1993.

[139] M. E. Guicciardi, J. Deussing, H. Miyoshi, S. F. Bronk, P. A. Svingen, C. Peters, S. H. Kaufmann, and G. J. Gores. Cathepsin B contributes to TNF-alpha-mediated hepatocyte apoptosis by promoting mitochondrial release of cytochrome c. J Clin Invest., 106:1127–1137, 2000.

[140] K. Gull. The cytoskeleton of trypanosomatid parasites. Annu Rev Microbiol., 53:629–655, 1999.

[141] The GUS 3.0 schema. http://www.gusdb.org/cgi-bin/schemaBrowser.

[142] S. P. Gygi, B. Rist, S. A. Gerber, F. Turecek, M. H. Gelb, and R. Aebersold. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol., 17:994–999, 1999.

[143] S. P. Gygi, Y. Rochon, B. R. Franza, and R. Aebersold. Correlation between protein and mRNA abundance in yeast. Mol Cell Biol., 19:1720–30, 1999.

[144] L. M. Haas, P. M. Schwarz, P. Kodali, E. Kotlar, J. E. Rice, and W. C. Swope. DiscoveryLink: A system for integrated access to life sciences data sources. IBM Systems Journal, 40:489–511, 2001.

[145] J. G. Hacia, L. C. Brody, M. S. Chee, S. P. Fodor, and F. S. Collins. Detection of heterozygous mutations in BRCA1 using high density oligonucleotide arrays and two-colour fluorescence analysis. Nat Genet., 14:441–447, 1996.

[146] N. Hall, M. Berriman, N. J. Lennard, B. R. Harris, C. Hertz-Fowler, E. N. Bart-Delabesse, et al. The DNA sequence of chromosome I of an African trypanosome: gene content, chromosome organisation, recombination and polymorphism. Nucleic Acids Res., 31:4864–4873, 2003.

[147] G. J. Hannon. RNA interference. Nature, 418:244–251, 2002.

[148] P. M. Haverty, Z. Weng, N. L. Best, K. R. Auerbach, L. L. Hsiao, R. V. Jensen, and S. R. Gullans. Hugeindex: a database with visualization tools for high-density oligonucleotide array data from normal human tissues. Nucleic Acids Res., 30:214–217, 2002.

[149] S. Hennig, D. Groth, and H. Lehrach. Automated gene ontology annotation for anonymous sequence data. Nucleic Acids Res., 31:3712–3715, 2003.

[150] H. Hermjakob, L. Montecchi-Palazzi, G. Bader, J. Wojcik, L. Salwinski, A. Ceol, et al. The HUPO PSI’s molecular interaction format–a community standard for the representation of protein interaction data. Nat Biotechnol., 22:177–183, 2004.

[151] F. Hillenkamp and M. Karas. Mass spectrometry of peptides and proteins by matrix-assisted ultraviolet laser desorption/ionization. Methods Enzymol., 193:280–95, 1990.

[152] Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S. L. Adams, et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415:180–183, 2002.

[153] C. Hoogland, J. C. Sanchez, L. Tonella, P. A. Binz, A. Bairoch, D. F. Hochstrasser, and R. D. Appel. The 1999 SWISS-2DPAGE database update. Nucleic Acids Res., 28:286–288, 2000.

[154] I. Horrocks. DAML+OIL: a reason-able web ontology language. In Proceedings of EDBT 2002, number 2287 in Lecture Notes in Computer Science, pages 2–13, March 2002.

[155] M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, et al. The Systems Biology Markup Language (SBML): A Medium for Representation and Exchange of Biochemical Network Models. Bioinformatics, 19:524–531, 2003.

[156] HUGO Gene Nomenclature Committee (HGNC). http://www.gene.ucl.ac.uk/nomenclature/.

[157] W. K. Huh, J. V. Falvo, L. C. Gerke, A. S. Carroll, R. W. Howson, J. S. Weissman, and E. K. O’Shea. Global analysis of protein localization in budding yeast. Nature, 425:686–691, 2003.

[158] Human-Mouse Homology Map. http://www.ncbi.nlm.nih.gov/Homology/.

[159] E. Hunt, E. Pafilis, I. Tulloch, and J. Wilson. Index-Driven XML Data Integration to Support Functional Genomics. In Proceedings of the International Workshop on Data Integration in Life Sciences, Lecture Notes in Computer Science, volume 2994, pages 95–109, 2004.

[160] HUP-ML format is available as a DTD (Document Type Definition). http://www1.biz.biglobe.ne.jp/~jhupo/HUP-ML/hup-ml.dtd.

[161] HUPO - The Human Proteome Organisation. http://www.hupo.org/.

[162] ImageMaster published by Amersham Biosciences. http://www.apbiotech.com/.

[163] Immunohistochemistry - In Situ Hybridization. http://home.no.net/immuno/.

[164] The International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409:860–921, 2001.

[165] R. Jansen and M. Gerstein. Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res., 28:1481–1488, 2000.

[166] Japanese Human Proteome Organisation (J-HUPO). http://www.jhupo.org/.

[167] Java 2 Platform, Standard Edition (J2SE), v1.4 Overview. http://java.sun.com/j2se/1.4/.

[168] Java Applet. http://java.sun.com/applets/.

[169] Java Technology. http://java.sun.com/.

[170] Java Web Start Technology. http://java.sun.com/products/javawebstart/.

[171] JavaScript.com™ - The Definitive JavaScript Resource. http://www.javascript.com/.

[172] O. N. Jensen. Modification-specific proteomics: characterization of post-translational modifications by mass spectrometry. Curr Opin Chem Biol., 8:33–41, 2004.

[173] T. K. Jenssen and E. Hovig. The semantic web and biology. Drug Discov Today., 7:992, 2002.

[174] A. Jones. A database for storing the results of 2D-PAGE experiments. Master’s thesis, University of Glasgow, 2001.

[175] A. Jones, E. Hunt, J. M. Wastling, A. Pizarro, and C. J. Stoeckert Jr. An object model and database for functional genomics. Bioinformatics, 20:1583–1590, 2004.

[176] A. Jones, J. Wastling, and E. Hunt. Proposal for a standard representation of two-dimensional gel electrophoresis data. Comp. Funct. Genom., 4:492–501, 2003.

[177] K. R. Jonscher and J. R. Yates 3rd. The quadrupole ion trap mass spectrometer–a small solution to a big challenge. Anal Biochem., 244:1–15, 1997.

[178] K. Kadota, D. Tominaga, R. Asai, and K. Takahashi. Correlation Analysis of mRNA and Protein Abundances in Human Tissues. Genome Lett., 2:139–148, 2003.

[179] D. E. Kalume, H. Molina, and A. Pandey. Tackling the phosphoproteome: tools and strategies. Curr Opin Chem Biol., 7:64–9, 2003.

[180] M. Karas and F. Hillenkamp. Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons. Anal Chem., 60:2299–2301, 1988.

[181] N. A. Karp, D. P. Kreil, and K. S. Lilley. Determining a significant change in protein expression with DeCyder™ during a pair-wise comparison using two-dimensional difference gel electrophoresis. Proteomics, 4:1421–1432, 2004.

[182] P. Karp, M. Riley, S. Paley, A. Pellegrini-Toole, and M. Krummenacker. EcoCyc: Electronic Encyclopedia of E. coli Genes and Metabolism. Nucleic Acids Res., 27:55–58, 1999.

[183] P. D. Karp. A strategy for database interoperation. J. Comput. Biol., 2:573–586, 1995.

[184] KEGG: Kyoto Encyclopedia of Genes and Genomes. http://www.genome.ad.jp/kegg/.

[185] K. Kim, D. Soldati, and J. C. Boothroyd. Gene replacement in Toxoplasma gondii with chloramphenicol acetyltransferase as selectable marker. Science, 262:911–914, 1993.

[186] K. Kim and L. M. Weiss. Toxoplasma gondii: the model apicomplexan. Int J Parasitol., 34:423–432, 2004.

[187] J. C. Kissinger, B. Gajria, L. Li, I. T. Paulsen, and D. S. Roos. ToxoDB: accessing the Toxoplasma gondii genome. Nucleic Acids Res., 31:234–236, 2003.

[188] H. Kitano. Systems Biology: A Brief Overview. Science, 295:1662–1664, 2002.

[189] T. G. Kleno, C. M. Andreasen, H. O. Kjeldal, L. R. Leonardsen, T. N. Krogh, P. F. Nielsen, M. V. Sorensen, and O. N. Jensen. MALDI MS peptide mapping performance by in-gel digestion on a probe with prestructured sample supports. Anal Chem., 76:3576–3583, 2004.

[190] A. Kumar, P. M. Harrison, K. H. Cheung, N. Lan, N. Echols, P. Bertone, P. Miller, M. B. Gerstein, and M. Snyder. An integrated approach for finding overlooked genes in yeast. Nat Biotechnol., 20:58–63, 2002.

[191] J. Lee, S. Nam, S. B. Hwang, M. Hong, J. Y. Kwon, K. S. Joeng, S. H. Im, J. Shim, and M. C. Park. Functional genomic approaches using the nematode Caenorhabditis elegans as a model system. J Biochem Mol Biol., 37:107–113, 2004.

[192] J-H. Lee, D-E. Lee, B-U. Lee, and H-S. Kim. Global Analyses of Transcriptomes and Proteomes of a Parent Strain and an L-Threonine-Overproducing Mutant Strain. J Bacteriol., 185:5442–5451, 2003.

[193] M. G. Lee. The 3’ untranslated region of the hsp 70 genes maintains the level of steady state mRNA in Trypanosoma brucei upon heat shock. Nucleic Acids Res., 26:4025–4033, 1998.

[194] M. L. Lee, L. H. Yang, W. Hsu, and X. Yang. XClust: clustering XML schemas for effective integration. In Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, pages 292–299, 2002.

[195] A. J. Link, J. Eng, D. M. Schieltz, E. Carmack, G. J. Mize, D. R. Morris, B. M. Garvik, and J. R. Yates 3rd. Direct analysis of protein complexes using mass spectrometry. Nat Biotechnol., 17:676–682, 1999.

[196] C. M. Lloyd, M. D. B. Halstead, and P. F. Nielsen. CellML: its future, present and past. Prog. Biophys. Mol. Biol., 85:433–450, 2004.

[197] G. W. Lubega, D. K. Byarugaba, and R. K. Prichard. Immunization with a tubulin-rich preparation from Trypanosoma brucei confers broad protection against African trypanosomosis. Exp Parasitol., 102:9–22, 2002.

[198] R. E. Lyons, R. McLeod, and C. W. Roberts. Toxoplasma gondii: tachyzoite to bradyzoite interconversion. Trends Parasitol., 18:198–201, 2002.

[199] Macromedia. http://www.macromedia.com/.

[200] P. Mahon and P. Dupree. Quantitative and reproducible two-dimensional gel analysis using Phoretix 2D Full. Electrophoresis, 22:2075–2085, 2001.

[201] H. Mamitsuka, Y. Okuno, and A. Yamaguchi. Mining biologically active patterns in metabolic pathways using microarray expression profiles. ACM SIGKDD Explorations Newsletter, 5:113–121, 2003.

[202] E. Manduchi, G. R. Grant, H. He, J. Liu, M. D. Mailman, A. D. Pizarro, P. L. Whetzel, and C. J. Stoeckert Jr. RAD and the RAD Study-Annotator: an approach to collection, organization and exchange of all relevant information for high-throughput gene expression studies. Bioinformatics, 20:452–459, 2004.

[203] M. Mann, R. C. Hendrickson, and A. Pandey. Analysis of proteins and proteomes by mass spectrometry. Annu. Rev. Biochem., 70:437–473, 2001.

[204] M. Mann and O. N. Jensen. Proteomic analysis of post-translational modifications. Nat Biotechnol., 21:255–261, 2003.

[205] A. G. Marshall, C. L. Hendrickson, and G. S. Jackson. Fourier transform ion cyclotron resonance mass spectrometry: a primer. Mass Spectrom Rev., 17:1–35, 1998.

[206] C. J. Marshall. Specificity of receptor tyrosine kinase signaling: transient versus sustained extracellular signal-regulated kinase activation. Cell, 80:179–185, 1995.

[207] MASCOT, published by Matrix Science. http://www.matrixscience.com.

[208] M. H. Maurer, C. Berger, M. Wolf, C. D. Futterer, R. E. Feldmann Jr., S. Schwab, and W. Kuschinsky. The proteome of human brain microdialysate. Proteome Sci., 1(7), 2003.

[209] S. M. Maurer, R. B. Firestone, and C. R. Scriver. Science’s neglected legacy. Nature, 405:117–120, 2000.

[210] Melanie3 published by GeneBio. http://www.GeneBio.com/Melanie.html.

[211] The MGED Ontology. http://mged.sourceforge.net/ontologies/MGEDontology.php.

[212] Microarray Gene Expression Data Society (MGED). http://www.mged.org/.

[213] Microsoft .NET Information. http://www.microsoft.com/net/.

[214] O. A. Mirgorodskaya, Y. P. Kozmin, M. I. Titov, R. Korner, C. P. Sonksen, and P. Roepstorff. Quantitation of peptides and proteins by matrix-assisted laser desorption/ionization mass spectrometry using (18)O-labeled internal standards. Rapid Commun Mass Spectrom., 14:1226–1232, 2000.

[215] B. Modrek, A. Resch, C. Grasso, and C. Lee. Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res., 29:2850–2859, 2001.

[216] Molecular Visualization Resources: CHIME. http://www.umass.edu/microbio/chime/.

[217] M. P. Molloy. Two-Dimensional Electrophoresis of Membrane Proteins Using Immobilized pH Gradients. Anal Biochem., 280:1–10, 2000.

[218] The Mouse Anatomical Dictionary Browser. http://www.informatics.jax.org/searches/anatdict_form.shtml.

[219] N. J. Mulder, R. Apweiler, T. K. Attwood, A. Bairoch, D. Barrell, A. Bateman, et al. The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31:315–318, 2003.

[220] P. Murray-Rust, H. S. Rzepa, M. J. Williamson, and E. L. Willighagen. Chemical markup, XML, and the World Wide Web. 5. Applications of chemical metadata in RSS aggregators. J Chem Inf Comput Sci., 44:462–469, 2004.

[221] MySQL. http://www.mysql.com/.

[222] National Institute for Standards and Technology. http://www.nist.gov.

[223] C. Navarre, H. Degand, K. L. Bennett, J. S. Crawford, E. Mortz, and M. Boutry. Subproteomics: Identification of plasma membrane proteins from the yeast Saccharomyces cerevisiae. Proteomics, 12:1706–1714, 2002.

[224] The NCBI Taxonomy Homepage. http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/.

[225] NCBI Trace Archive. http://www.ncbi.nlm.nih.gov/Traces/.

[226] W. Ni and T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. In Eighth International Conference on Database Systems for Advanced Applications (DASFAA), pages 363–370, 2003.

[227] J. K. Nicholson, J. Connelly, J. C. Lindon, and E. Holmes. Metabonomics: a platform for studying drug toxicity and gene function. Nat Rev Drug Discov., 1:153–161, 2002.

[228] M. Nilsson. The semantic web: How RDF will change learning technology standards, 2001. http://www.cetis.ac.uk/content/20010927172953.

[229] N. Nirmalan, P. F. G. Sims, and J. E. Hyde. Quantitative proteomics of the human malaria parasite Plasmodium falciparum and its application to studies of development and inhibition. Mol Microbiol., 52:1187–1199, 2004.

[230] N. F. Noy, R. W. Fergerson, and M. A. Musen. The knowledge model of Protege-2000: Combining interoperability and flexibility. In 12th International Conference on Knowledge Engineering and Knowledge Management, pages 17–32, 2001.

[231] The Object Management Group. http://www.omg.org/.

[232] OPD: Open Proteomics Database. http://bioinformatics.icmb.utexas.edu/OPD/.

[233] Open Biological Ontologies (OBO). http://obo.sourceforge.net/.

[234] Open Grid Services Architecture Data Access and Integration (OGSA-DAI). http://www.ogsadai.org.uk/.

[235] Oracle 9i. http://www.oracle.com/.

[236] S. Orchard, P. Kersey, H. Hermjakob, and R. Apweiler. The HUPO Proteomics Standards Initiative meeting: towards common standards for exchanging proteomics data. Comp Funct Genom, 4:16–19, 2003.

[237] S. Orchard, P. Kersey, W. Zhu, L. Montecchi-Palazzi, H. Hermjakob, and R. Apweiler. Progress in establishing common standards for exchanging proteomics data: The second meeting of the HUPO Proteomics Standards Initiative. Comp Funct Genom, 4:203–206, 2003.

[238] OWL Web Ontology Language. http://www.w3.org/TR/owl-features/.

[239] H. Papageorgiou, F. Pentaris, E. Theodoruou, M. Vardaki, and M. Petrakos. Modeling statistical metadata. In Proceedings of the 13th International Conference on Scientific and Statistical Database Management, pages 25–35, 2001.

[240] G. M. Pasinetti and L. Ho. From cDNA microarrays to high-throughput proteomics. Implications in the search for preventive initiatives to slow the clinical progression of Alzheimer’s disease dementia. Restor Neurol Neurosci., 18:137–142, 2001.

[241] N. W. Paton, R. Stevens, P. G. Baker, C. A. Goble, S. Bechhofer, and A. Brass. Query Processing in the TAMBIS Bioinformatics Source Integration System. In Proceedings 11th Int. Conf. on Scientific and Statistical Databases (SSDBM), pages 138–147, 1999.

[242] PEDRo (Proteomics Experiment Data Repository). http://pedro.man.ac.uk/.

[243] J. Peng, J. E. Elias, C. C. Thoreen, L. J. Licklider, and S. P. Gygi. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J Proteome Res., 2:43–50, 2003.

[244] C. A. Pereira, G. D. Alonso, H. N. Torres, and M. M. Flawia. Arginine kinase: a common feature for management of energy reserves in African and American flagellated trypanosomatids. J Eukaryot Microbiol., 49:82–85, 2002.

[245] M. Perrot, F. Sagliocco, T. Mini, C. Monribot, U. Schneider, A. Shevchenko, M. Mann, P. Jeno, and H. Boucherie. Two-dimensional gel protein database of Saccharomyces cerevisiae (update 1999). Electrophoresis, 20:2280–2298, 1999.

[246] PHP: Hypertext Preprocessor. http://www.php.net.

[247] The Plant Ontology Consortium. http://www.plantontology.org/.

[248] PlasmoDB: The Plasmodium Genome Resource. http://www.plasmodb.org.

[249] Poseidon for UML™, available from Gentleware. http://www.gentleware.com.

[250] Powerdesigner 9™, available from Sybase Inc. http://www.sybase.com.

[251] P. F. Predki. Functional protein microarrays: ripe for discovery. Curr Opin Chem Biol., 8:8–13, 2004.

[252] J. T. Prince, M. W. Carlson, R. Wang, P. Lu, and E. M. Marcotte. The need for a public proteomics repository. Nat Biotechnol., 22:471–472, 2004.

[253] The Protein Data Bank. http://www.rcsb.org/pdb/.

[254] Protein Information Resource. http://pir.georgetown.edu/.

[255] Proteome 2D-PAGE database at Max-Planck. http://www.mpiib-berlin.mpg.de/2D-PAGE/.

[256] ProteomeGRID. http://vip.doc.ic.ac.uk/proteomegrid/.

[257] The Proteomics Standards Initiative. http://psidev.sourceforge.net/.

[258] PSI-MS XML Data Format. http://psidev.sourceforge.net/ms/.

[259] S. Purvine, A. F. Picone, and E. Kolker. Standard mixtures for proteome studies. OMICS, 8:79–92, 2004.

[260] X. Que, H. Ngo, J. Lawton, M. Gray, Q. Liu, J. Engel, et al. The cathepsin B of Toxoplasma gondii, toxopain-1, is critical for parasite invasion and rhoptry protein processing. J Biol Chem., 277:25791–25797, 2002.

[261] The R Project for Statistical Computing. http://www.r-project.org/.

[262] RAD (RNA Abundance Database). http://www.cbil.upenn.edu/RAD/.

[263] J. C. Rain, L. Selig, H. De Reuse, V. Battaglia, C. Reverdy, S. Simon, et al. The protein-protein interaction map of Helicobacter pylori. Nature, 409:211–215, 2001.

[264] B. Raman, A. Cheung, and M. R. Marten. Quantitative comparison and evaluation of two commercially available, two-dimensional electrophoresis image analysis software packages, Z3 and Melanie. Electrophoresis, 23:2194–2202, 2002.

[265] W. D. Ransom, P-C. Lao, D. A. Gage, and W. F. Boss. Phosphoglycerylethanolamine Posttranslational Modification of Plant Eukaryotic Elongation Factor 1α. Plant Physiol., 117:949–960, 1998.

[266] Rational Rose 2000e, published by Rational Software. http://www.rational.com/.

[267] S. Raychaudhuri, J. Stuart, and R. Altman. Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput., 5:455–66, 2000.

[268] M. Rebhan, V. Chalifa-Caspi, J. Prilusky, and D. Lancet. GeneCards: encyclopedia for genes, proteins and diseases. Weizmann Institute of Science, Bioinformatics Unit and Genome Center (Rehovot, Israel). http://bioinformatics.weizmann.ac.il/cards.

[269] Resource Description Framework (RDF). http://www.w3.org/RDF.

[270] G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, M. Mann, and B. Seraphin. A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol., 17:1030–1032, 1999.

[271] U. Roessner, C. Wagner, J. Kopka, R. N. Trethewey, and L. Willmitzer. Technical advance: simultaneous analysis of metabolites in potato tuber by gas chromatography-mass spectrometry. Plant J., 23:131–142, 2000.

[272] M. Rogers, J. Graham, and R. P. Tonge. Using statistical image models for objective evaluation of spot detection in two-dimensional gels. Proteomics, 3:879–86, 2003.

[273] D. S. Roos. Bioinformatics–trying to swim in a sea of data. Science, 291:1260–1261, 2001.

[274] J. Rumbaugh, I. Jacobson, and G. Booch. The Unified Modeling Language Reference Manual. Addison Wesley, 1999.

[275] L. H. Saal, C. Troein, J. Vallon-Christersson, S. Gruvberger, A. Borg, and C. Peterson. BioArray Software Environment: A Platform for Comprehensive Management and Analysis of Microarray Data. Genome Biol., 3:software0003.1–0003.6, 2002.

[276] F. Sanger, G. M. Air, B. G. Barrell, N. L. Brown, A. R. Coulson, C. A. Fiddes, C. A. Hutchison, P. M. Slocombe, and M. Smith. Nucleotide sequence of bacteriophage phiX174 DNA. Nature, 265:687–695, 1977.

[277] V. Santoni, S. Kieffer, D. Desclaux, F. Masson, and T. Rabilloud. Membrane proteomics: use of additive main effects with multiplicative interaction model to classify plasma membrane proteins according to their solubility and electrophoretic properties. Electrophoresis, 21:3329–3344, 2000.

[278] SASHIMI. http://sashimi.sourceforge.net/.

[279] SAX (Simple API for XML). http://sax.sourceforge.net/.

[280] R. A. Sayle and E. J. Milner-White. RasMol: Biomolecular graphics for all. Trends Biochem Sci., 20:374–376, 1995.

[281] Scalable Vector Graphics (SVG). http://www.w3.org/Graphics/SVG/.

[282] D. G. Schmid, F. D. von der Mulbe, B. Fleckenstein, T. Weinschenk, and G. Jung. Broadband detection electrospray ionization Fourier transform ion cyclotron resonance mass spectrometry to reveal enzymatically and chemically induced deamidation reactions within peptides. Anal Chem., 73:6008–6013, 2001.

[283] A. Schneider, U. Plessmann, and K. Weber. Subpellicular and flagellar microtubules of Trypanosoma brucei are extensively glutamylated. J Cell Sci., 110:431–437, 1997.

[284] L. V. Schneider and M. P. Hall. Stable Isotope Methods for High-Precision Proteomics. Drug Discov Today., in press, 2005.

[285] J. Seo and K-J. Lee. Post-translational modifications and their biological functions: Proteomic analysis and systematic approaches. J Biochem Mol Biol., 37:35–44, 2004.

[286] The Sequence Ontology Project. http://song.sourceforge.net/.

[287] D. Shalon, S. J. Smith, and P. O. Brown. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Res., 6:639–645, 1996.

[288] J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton. Relational Databases for Querying XML Documents: Limitations and Opportunities. In Proceedings of 25th International Conference on Very Large Data Bases, pages 302–314, 1999.

[289] T. Sherwin, A. Schneider, R. Sasse, T. Seebeck, and K. Gull. Distinct localization and cell cycle dependence of COOH terminally tyrosinolated alpha-tubulin in the microtubules of Trypanosoma brucei brucei. J Cell Biol., 104:439–446, 1987.

[290] Y. Shi, R. Xiang, C. Horvath, and J. A. Wilkins. The role of liquid chromatography in proteomics. J Chromatogr A., 1053:27–36, 2004.

[291] L. D. Sibley. Intracellular Parasite Invasion Strategies. Science, 304:248–253, 2004.

[292] A. P. Sinai, T. M. Payne, J. C. Carmen, L. Hardi, S. J. Watson, and R. E. Molestina. Mechanisms underlying the manipulation of host apoptotic pathways by Toxoplasma gondii. Int J Parasitol., 34:381–391, 2004.

[293] Sir Henry Wellcome Functional Genomics Facility (SHWFGF), based in the University of Glasgow. http://www.gla.ac.uk/functionalgenomics/.

[294] D. H. Smith, J. Pepin, and A. H. Stich. Human African trypanosomiasis: an emerging public health crisis. Br Med Bull., 54:341–355, 1998.

[295] W. Smyth. Computing Patterns in Strings. Addison-Wesley, 2003.

[296] SourceForge.net: Project Info - Life Science Identifier (LSID). http://sourceforge.net/projects/lsid/.

[297] P. T. Spellman, M. Miller, J. Stewart, C. Troup, U. Sarkans, S. Chervitz, et al. Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol., 23, 2002. RESEARCH0046.

[298] Standards and Ontologies for Functional Genomics. http://www.sofg.org/.

[299] L. D. Stein. Integrating biological databases. Nat Rev Genet., 4:337–345, 2003.

[300] R. D. Stevens, A. J. Robinson, and C. A. Goble. myGrid: personalised bioinformatics on the information grid. Bioinformatics, 19:I302–I304, 2003.

[301] A. Stich, P. M. Abel, and S. Krishna. Human African trypanosomiasis. BMJ, 325:203–206, 2002.

[302] C. Stoeckert, A. Pizarro, E. Manduchi, M. Gibson, B. Brunk, J. Crabtree, J. Schug, S. Shen-Orr, and G. C. Overton. A relational schema for both array-based and SAGE gene expression experiments. Bioinformatics, 17:300–308, 2001.

[303] C. J. Stoeckert, H. C. Causton, and C. A. Ball. Microarray databases: standards and ontologies. Nat Genet., 32:469–473, 2002.

[304] C. J. Stoeckert and H. Parkinson. The MGED ontology: a framework for describing functional genomics experiments. Comp. Funct. Genom., 4:127–132, 2003.

[305] E. C. Strauss, J. A. Kobori, G. Siu, and L. E. Hood. Specific-primer-directed DNA sequencing. Anal Biochem., 154:353–360, 1986.

[306] L. W. Sumner, P. Mendes, and R. A. Dixon. Plant Metabolomics: Large-scale Phytochemistry in the Functional Genomics Era. Phytochemistry, 62:817–836, 2003.

[307] Sun Microsystems, Inc. http://www.sun.com/.

[308] Y. H. Sung, J. Song, and H-W. Lee. Functional Genomics Approach Using Mice. J Biochem Mol Biol., 37:122–132, 2004.

[309] SWISS-2DPAGE: Two-dimensional polyacrylamide gel electrophoresis database. http://ca.expasy.org/ch2d/.

[310] Swiss-Prot. http://www.expasy.ch/sprot/.

[311] The Systems Biology Markup Language. http://sbml.org/.

[312] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, and T. Golub. Interpreting gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A., 96:2907–2912, 1999.

[313] Tamino XML server. http://www.softwareag.com/tamino/.

[314] T. A. Tatusova, L. Karsch-Mizrachi, and J. A. Ostell. Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 15:536–543, 1999.

[315] C. F. Taylor, N. W. Paton, K. L. Garwood, P. D. Kirby, D. A. Stead, Z. Yin, et al. A systematic approach to modeling, capturing, and disseminating proteomics experimental data. Nat. Biotechnol., 21:247–254, 2003.

[316] S. W. Taylor, E. Fahy, B. Zhang, G. M. Glenn, D. E. Warnock, S. Wiley, et al. Characterization of the human heart mitochondrial proteome. Nat Biotechnol., 21:281–286, 2003.

[317] D. E. Terry and D. M. Desiderio. Between-gel reproducibility of the human cerebrospinal fluid proteome. Proteomics, 3:3, 2003.

[318] J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22:4673–4680, 1994.

[319] P. Toronen, M. Kolehmainen, G. Wong, and E. Castren. Analysis of gene expression data using self-organizing maps. FEBS, 451:142–146, 1999.

[320] ToxoDB: The Toxoplasma Genome Resource. http://www.toxodb.org/.

[321] Toxoplasma Genome Page. www.ebi.ac.uk/parasites/toxo/toxpage.html.

[322] M. Tyers and M. Mann. From genomics to proteomics. Nature, 422:193–197, 2003.

[323] P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403:623–627, 2000.

[324] Unified Modeling Language. http://www.uml.org/.

[325] UniParc, The UniProt Archive. http://www.ebi.ac.uk/uniparc/.

[326] UniProt (Universal Protein Resource). http://www.uniprot.org.

[327] M. Unlu, M. E. Morgan, and J. S. Minden. Difference gel electrophoresis: a single gel method for detecting changes in cell extracts. Electrophoresis, 18:2071–2077, 1997.

[328] G. Van den Bergh, S. Clerens, F. Vandesande, and L. Arckens. Reversed-phase high-performance liquid chromatography prefractionation prior to two-dimensional difference gel electrophoresis and mass spectrometry identifies new differentially expressed proteins between striate cortex of kitten and adult cat. Electrophoresis, 24:1471–1481, 2003.

[329] F. J. van Deursen, S. K. Shahi, C. M. Turner, C. Hartmann, C. Guerra-Giraldez, K. R. Matthews, and C. E. Clayton. Characterisation of the growth and differentiation in vivo and in vitro of bloodstream-form Trypanosoma brucei strain TREU 927. Mol Biochem Parasitol., 112:163–171, 2001.

[330] F. J. van Deursen, D. J. Thornton, and K. R. Matthews. A reproducible protocol for analysis of the proteome of Trypanosoma brucei by 2-dimensional gel electrophoresis. Mol Biochem Parasitol., 128:107–110, 2003.

[331] S. Veeser, M. J. Dunn, and G. Z. Yang. Multiresolution image registration for two-dimensional gel electrophoresis. Proteomics, 1:856–870, 2001.

[332] V. E. Velculescu, L. Zhang, B. Vogelstein, and K. W. Kinzler. Serial analysis of gene expression. Science, 270:484–487, 1995.

[333] V. E. Velculescu, L. Zhang, W. Zhou, J. Vogelstein, M. A. Basrai, D. E. Bassett Jr, P. Hieter, B. Vogelstein, and K. W. Kinzler. Characterization of the yeast transcriptome. Cell, 88:243–251, 1997.

[334] J. C. Venter, M. D. Adams, and E. W. Myers. The Sequence of the Human Genome. Science, 291:1304–1351, 2001.

[335] K. Vickerman. On the surface coat and flagellar adhesion in trypanosomes. Cell Sci., 5:163–194, 1969.

[336] E. O. Voit. Metabolic modeling: a tool of drug discovery in the post-genomic era. Drug Discov. Today, 7:621–628, 2002.

[337] C-W. von der Lieth, A. Bohne-Lang, K. K. Lohmann, and M. Frank. Bioinformatics for glycomics: Status, methods, requirements and perspectives. Brief. Bioinformatics, 5:164–178, 2004.

[338] T. Voss and P. Haberl. Observations on the reproducibility and matching efficiency of two-dimensional electrophoresis gels: consequences for comprehensive data analysis. Electrophoresis, 21:3345–3350, 2000.

[339] Voyager Version 5 with Data Explorer Software, published by Applied Biosystems. http://www.appliedbiosystems.com/.

[340] W3C Math home page. http://www.w3.org/Math/.

[341] W3C Recommendation for XML Schema. http://www.w3.org/XML/Schema.

[342] W3C Semantic Web. http://www.w3.org/2001/sw/.

[343] A. J. Walhout, R. Sordella, X. Lu, J. L. Hartley, G. F. Temple, M. A. Brasch, N. Thierry-Mieg, and M. Vidal. Protein interaction mapping in C. elegans using proteins involved in vulval development. Science, 287:116–122, 2000.

[344] M. P. Washburn, D. Wolters, and J. R. Yates III. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol., 19:242–247, 2001.

[345] V. C. Wasinger, S. J. Cordwell, A. Cerpa-Poljak, J. X. Yan, A. A. Gooley, M. R. Wilkins, M. W. Duncan, R. Harris, K. L. Williams, and I. Humphery-Smith. Progress with gene-product mapping of the Mollicutes: Mycoplasma genitalium. Electrophoresis, 16:1090–1094, 1995.

[346] W. Weckwerth. Metabolomics in systems biology. Annu Rev Plant Biol., 54:669–689, 2003.

[347] W. Weckwerth, V. Tolstikov, and O. Fiehn. Metabolomic characterization of transgenic potato plants using GC/TOF and LC/MS analysis reveals silent metabolic phenotypes. In Proceedings of the 49th ASMS Conference on Mass spectrometry and Allied Topics, pages 1–2. Chicago: Am. Soc. Mass Spectrom., 2001.

[348] G. Wiederhold. Intelligent integration of diverse information (invited talk). In Int. Conf. on Information and Knowledge Management, Baltimore, 1992.

[349] M. R. Wilkins, J. C. Sanchez, A. A. Gooley, R. D. Appel, I. Humphery-Smith, D. F. Hochstrasser, and K. L. Williams. Progress with proteome projects: why all proteins expressed by a genome should be identified and how to do it. Biotechnol Genet Eng Rev., 13:19–50, 1996.

[350] WordNet - a lexical database for the English language. http://www.cogsci.princeton.edu/~wn/.

[351] WORLD-2DPAGE: Index to 2-D PAGE databases and services. http://us.expasy.org/ch2d/2d-index.html.

[352] The World Wide Web Consortium. http://www.w3c.org/.

[353] WormBase. http://www.wormbase.org/.

[354] W. Xhou, B. A. Merrick, M. G. Khaledi, and K. B. Tomer. Detection and sequencing of phosphopeptides affinity bound to immobilized metal ion beads by matrix-assisted laser desorption/ionization mass spectrometry. J Am Soc Mass Spectrom., 11:273–282, 2000.

[355] S. Xirasagar, S. Gustafson, A. Merrick, K. B. Tomer, S. Stasiewicz, D. D. Chan, et al. CEBS Object Model for Systems Biology Data, CEBS MAGE SysBio-OM. Bioinformatics, 20:2004–2015, 2004.

[356] XML Metadata Interchange (XMI). http://www.omg.org/technology/documents/formal/xmi.htm.

[357] XQuery 1.0: An XML Query Language. http://www.w3.org/TR/xquery/.

[358] XSPAN - A Cross-Species Anatomy Project. http://www.xspan.org/.

[359] Xtect. http://xtect.cis.strath.ac.uk/.

[360] A. F. Yakunin, A. A. Yee, A. Savchenko, A. M. Edwards, and C. H. Arrowsmith. Structural proteomics: a tool for genome annotation. Curr Opin Chem Biol., 8:42–48, 2004.

[361] W. Yan, H. Lee, E. C. Yi, D. Reiss, P. Shannon, B. K. Kwieciszewski, et al. System-based proteomic analysis of the interferon response in human liver cells. Genome Biol., 5:R54, 2004.

[362] M. Yanagida. Functional proteomics; current achievements. J Chromatogr B Analyt Technol Biomed Life Sci., 771:89–106, 2002.

[363] X. Yang, M. L. Lee, and T. W. Ling. Resolving Structural Conflicts in the Integration of XML Schemas: A Semantic Approach. In 22nd International Conference on Conceptual Modeling (ER), pages 520–533, 2003.

[364] M. Yoshikawa, T. Amagasa, T. Shimura, and S. Uemura. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Transactions on Internet Technology, 1:110–141, 2001.

[365] N. Young, Z. Chang, and D. S. Wishart. GelScape: a web-based server for interactively annotating, manipulating, comparing and archiving 1D and 2D gel images. Bioinformatics, 20:976–978, 2004.

[366] Z3 published by Compugen. http://www.2dgels.com/.

[367] A. Zanzoni, L. Montecchi-Palazzi, M. Quondam, G. Ausiello, M. Helmer-Citterich, and G. Cesareni. MINT: a Molecular INTeraction database. FEBS Lett., 513:135–140, 2002.

[368] B. R. Zeeberg, W. Feng, G. Wang, M. D. Wang, A. T. Fojo, M. Sunshine, et al. GoMiner: A Resource for Biological Interpretation of Genomic and Proteomic Data. Genome Biol., 4:R28, 2003.

[369] R. Zeng, H. Q. Ruan, X. S. Jiang, H. Zhou, L. Shi, L. Zhang, Q. H. Sheng, Q. Tu, Q. C. Xia, and J. R. Wu. Proteomic analysis of SARS associated coronavirus using two-dimensional liquid chromatography mass spectrometry and one-dimensional sodium dodecyl sulfate-polyacrylamide gel electrophoresis followed by mass spectrometric analysis. J Proteome Res., 3:549–555, 2004.

[370] X. Zuo and D. W. Speicher. Comprehensive analysis of complex proteomes using microscale solution isoelectrofocusing prior to narrow pH range two-dimensional electrophoresis. Proteomics, 2:58–68, 2002.
