Integration, Warehousing, and Analysis Strategies of Omics Data
Srinubabu Gedela
Abstract
"-Omics" is a current suffix for numerous types of large-scale biological data generation procedures, which naturally demand the development of novel algorithms for data storage and analysis. With next-generation genome sequencing burgeoning, it is pivotal to decipher the coding sites on the genome, a gene's function, and information on transcripts, beyond the mere availability of sequence information. To explore a genome and its downstream molecular processes, we need numerous results at the various levels of cellular organization, obtained with different experimental designs, data analysis strategies, and methodologies. Hence the need for controlled vocabularies and data integration to annotate, store, and update the flow of experimental data. This chapter explores key methodologies to merge Omics data by semantic data carriers, discusses controlled vocabularies such as the eXtensible Markup Language (XML), and provides practical guidance, databases, and software links supporting the integration of Omics data.
Living cells are organized around some central aspects, including complex and integrated structure, regulatory mechanisms (such as homeostasis), growth and development, energy utilization, response to environmental stimuli, reproduction (DNA guarantees (semi)exact replication), and evolution (the capacity of living entities to adapt over time), which all together are reflected in Systems Biology. Recent research has expanded toward a systems view of complex diseases, also including different species as, e.g., in the Systems Biology of host-pathogen interactions. Data warehouses holding biological data on gene products, metabolites, and, most importantly, their relationships and biochemical organization in metabolic pathways are a central prerequisite for such Systems Biology approaches. As an example, organizations such as BioCyc (http://www.biocyc.org) provide tools for developing organism-specific metabolic pathway databases from previously annotated metabolites, also allowing the inclusion of newly annotated metabolites (1).
Genome-wide or at least large-scale quantification of molecular components and experimental assessment of how these components interact have offered a broader insight into cellular function and into the effect of genetic and environmental perturbations. Omics data include quantification of mRNA transcripts (transcriptome), protein abundance (proteome), metabolic fluxes (fluxome), the concentration profiles of intracellular and extracellular metabolites (metabolome), and information on protein-protein and protein-DNA interactions (interactome).
A variety of methods has been derived for data analysis, interpretation of phenomenological observations, and quantitative prediction of cellular behavior. These methods include comparative analysis of Omics profiling (e.g., statistical tests and dimensionality reduction methods), models for integrative analysis (e.g., graph theory-based models), and predictive models (2, 3). Selecting a type of model that can appropriately handle a given problem plays an important role in the extraction of knowledge. The experimental design should anticipate the planned data analysis strategy and should follow from a clear definition of the biological hypothesis.
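As a minimal sketch of the comparative analysis mentioned above, the following Python fragment ranks features by Welch's t statistic between two conditions. The gene names and expression values are invented for illustration, and a real analysis would add p-values and multiple-testing correction, which are omitted here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error


def rank_features(control, treated):
    """Return (feature, t) pairs sorted by decreasing absolute t statistic."""
    scores = {name: welch_t(control[name], treated[name]) for name in control}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical log-scale expression values, three replicates per condition.
control = {"geneA": [5.0, 5.1, 4.9], "geneB": [2.0, 2.1, 1.9]}
treated = {"geneA": [5.0, 5.2, 4.8], "geneB": [4.0, 4.1, 3.9]}

ranking = rank_features(control, treated)
for name, t_value in ranking:
    print(name, round(t_value, 2))
```

The function assumes at least two replicates per condition and nonzero variance; features with identical replicate values would need separate handling.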
Biological information extracted in Omics includes all levels of exploration of cellular activities, spanning from the gene to expression and further to the phenotype level. The main data levels are genomics, transcriptomics, proteomics, glycomics, lipidomics, metabolomics, and localizomics. The functional states of these Omics data are, e.g., phenomics and fluxomics, involving the effective expression level reflected by Omics data and the flux required for metabolites in pathways. Further Omics levels describe interactions such as protein-protein or protein-DNA interactions. Data flows describing various components of Omics within a cell are shown in Fig. 1.
These Omics procedures in turn generate enormous amounts of data, which need to be stored in efficient ways. Large-scale information is readily available in genome-scale Omics repositories, while most of the dedicated databases store experimental data on the gene and protein level taken from various sources. A selected list of available data repositories, including a short description and URL, is provided in Table 1.
Omics data are naturally retrieved from specific experimental procedures, and raw data as well as processed data and analysis results are represented in databases. Formally, the integration of Omics data can be described as an array in which all the individual array elements are interlinked. To respect this fact, a design has to be chosen that is capable of representing both the experiments as such and their result vectors. The experimental designs noted in Fig. 2 span from the genome level to sequence annotation and further to ORF validation by, e.g., utilizing microarrays and SAGE. Subsequent experiments such as proteomics provide, e.g., information on posttranslational modifications (PTMs) centrally used for enzyme annotation in metabolic pathways. Analysis of these Omics levels allows studying interactions on gene regulatory networks and protein-protein interaction networks (see Note 1). Finally, functional annotation contributes to an understanding of the overall expression on the gene level, as, e.g., represented in OmicBrowse (http://omicspace.riken.jp/omicBrowse), interconnecting different Omics data component levels in a semantic fashion (4).
Fig. 1. Schematic representation of different components of Omics data and information flow within a cell.
2. Materials
Omics data are frequently represented in vocabularies, represented, e.g., as hierarchical data elements, also providing the interface to numerous data repositories and analysis tools (5). For example, the Protein Ontology (http://pir.georgetown.edu/pro)
Table 1
Major Omics data resources
Data type | Online resource | Description | URL
Components
Genomics | Genomes OnLine Database (GOLD) | Repository of completed and ongoing genome projects | http://www.genomesonline.org
Transcriptomics | Gene Expression Omnibus (GEO) | Microarray and SAGE-based genome-wide expression profiles | http://www.ncbi.nlm.nih.gov/geo
Transcriptomics | Stanford Microarray Database (SMD) | Microarray-based genome-wide expression data | http://genome-www.stanford.edu/microarray
Proteomics | World-2DPAGE | Links to 2D-PAGE data | http://us.expasy.org/ch2d/2d-index.html
Proteomics | Open Proteomics Database (OPD) | Mass-spectrometry-based proteomic data | http://bioinformatics.icmb.utexas.edu/OPD
Lipidomics | Lipid Metabolites and Pathways Strategy (LIPID MAPS) | Genome-scale lipids database | http://www.lipidmaps.org
Localizomics | Yeast GFP Fusion Localization Database | Yeast genome-scale protein-localization data | http://yeastgfp.ucsf.edu
Glycomics | Consortium for Functional Glycomics | Glycan array and profile data | http://www.functionalglycomics.org/
Interactions
Protein-DNA | Biomolecular Network Database (BIND) | Published protein-DNA interactions | http://www.bind.ca/Action/
Protein-DNA | Encyclopedia of DNA Elements (ENCODE) | Database of functional elements in human DNA | http://genome.ucsc.edu/ENCODE/index.html
Protein-protein | Munich Information Center for Protein Sequences (MIPS) | Links to protein-protein interaction data and resources | http://mips.gsf.de/proj/ppi
Protein-protein | Database of Interacting Proteins (DIP) | Published protein-protein interactions | http://dip.doe-mbi.ucla.edu
Functional states
Phenomics | RNAi database | C. elegans RNAi screen data | http://rnai.org
Phenomics | General Repository for Interaction Datasets (GRID) | Synthetic-lethal interactions in yeast | http://biodata.mshri.on.ca/grid
Phenomics | A Systematic Annotation Package For Community Analysis of Genomes (ASAP) | Single-gene-deletion microarray data for E. coli phenotypes | http://www.genome.wisc.edu/tools/asap.htm
Fig. 2. Types of experimental designs: ChIP-chip (chromatin-immunoprecipitation-DNA-microarray); co-AP-MS (co-affinity purification-mass-spectrometry); RNAi (RNA interference); SAGE (serial analysis of gene expression); yeast 2H (yeast two-hybrid analysis).
is a standard for supporting data integration, data mining, and models for deriving protein structural and functional properties (6). The Gene Ontology (GO, http://www.geneontology.org) is a controlled vocabulary to functionally annotate gene products with respect to their biological process, molecular function, and cellular location (7). Various frameworks have been derived for implementing vocabularies, including the eXtensible Markup Language (XML), Resource Description Framework (RDF), Open Biomedical Ontologies (OBO), and the OWL Web Ontology Language (OWL).
2.1. eXtensible Markup Languages
XML is a general-purpose markup language that supports data sharing across heterogeneous systems and provides a format of choice for storing information with an inherent hierarchical structure (see Note 2). XML has been widely accepted in the Omics sciences as a standard for data exchange. Examples of powerful XML-based data integration are the GLYcan Data Exchange (GLYDE, http://lsdis.cs.uga.edu/projects/glycomics), enabling interoperability and exchange of glycomics data and, more generally, of data on structures carrying glycan moieties, as developed by Sahoo and his team (8), and BioMart (http://www.biomart.org), an integration tool using XML syntax for building data elements from different databases and providing user-defined queries (9).
2.2. Resource Description Framework
RDF (http://www.w3.org/RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model. RDF is used as a general syntax for linking a wide variety of data in a single framework. RDF is, e.g., used to combine genome data and public domain annotations within GO, KEGG, and the SUPERFAMILY database (10).
2.3. Open Biomedical Ontologies
OBO (http://obo.sourceforge.net) is a collection of controlled vocabularies freely available to the biomedical community. Web-based ontology portals, such as the BioPortal (http://bioportal.bioontology.org), allow users to browse, search, submit, and visualize ontologies. The need for innovative technology and methods that allow scientists to record, manage, and disseminate biomedical information and knowledge in machine-processable form gave rise to the National Center for Biomedical Ontology (NCBO, http://www.bioontology.org) initiative created in 2005 (11).
2.4. OWL Web Ontology Language
OWL (http://www.w3.org/TR/owl-guide) facilitates greater machine interpretability of Web content than XML, RDF, and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics. OWL has three sublanguages: OWL Lite, OWL DL, and OWL Full. They are described as follows:
• OWL DL is an ontology language based on description logics (DLs).
• OWL Lite supports a classification hierarchy and simple constraints.
• OWL Full offers maximum expressiveness and syntactic freedom.
BioPAX (http://www.biopax.org) is an effort to create a data exchange format for biological pathway data utilizing OWL semantics.
3. Methods
Various methods for Omics data integration and analysis are available (12). The first criterion for data integration is the type of data; hence, most of the available algorithms are based on genomics, transcriptomics, and proteomics experimental data. Other Omics data components, such as phenomics and fluxomics, can be studied by adding further analysis tools to the integration environments.
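One simple way to picture data-type-driven integration is a merge of per-level measurements on a shared identifier. The sketch below assumes that every Omics level has already been mapped to common gene identifiers; the layer names, identifiers, and values are hypothetical.

```python
def integrate(layers):
    """Merge per-layer measurements into one record per gene identifier."""
    merged = {}
    for layer_name, measurements in layers.items():
        for gene_id, value in measurements.items():
            merged.setdefault(gene_id, {})[layer_name] = value
    return merged


# Hypothetical measurements keyed by a shared (made-up) gene identifier.
layers = {
    "transcriptome": {"gene1": 2.5, "gene2": 0.4},
    "proteome": {"gene1": 1.2},
}

records = integrate(layers)

# Identifiers observed in every layer are candidates for joint analysis.
complete = [gene for gene, record in records.items() if len(record) == len(layers)]
print(records)
print(complete)
```

Keeping incomplete records instead of dropping them makes the coverage of each layer visible before any joint analysis is attempted.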
Fig. 3. A flow chart describing various steps of data integration toward network reconstruction and model building: (a) identifying the network scaffold from experimental data; (b) scaffold decomposition (network scaffold with motif size 3; transcription network built from Mdraw and Mfinder); (c) cellular modeling and analysis (whole-cell or system model built by BioTapestry simulations).
3.1. Identifying, Decomposing, and Modeling
The procedure of identifying and decomposing a network scaffold, followed by cellular systems modeling and analysis, is schematically depicted in Fig. 3.
3.1.1. Identifying a Network Scaffold
This task depicts the strategy for identifying all interactions between Omics components. A typical example is the identification of a gene-regulatory network scaffold by integrating chromatin immunoprecipitation (ChIP) and microarray gene expression data (referred to as ChIP-chip data). Such Omics data specify the interactions between a transcriptional regulator and its target gene, and various statistical approaches are available to derive the specific regulatory relationship (namely, transcriptional activation or repression). Data on protein-DNA and protein-protein interactomes reflect the activity of a cellular network, and a typical analysis strategy follows clustering of high-throughput gene expression data sets, complemented by isolating the upstream regions of clustered genes for identifying common cis-regulatory motifs.
Tools like Module construction using gene expression and sequence motifs (MODEM) (13) and regulatory-element detection using correlation with expression (REDUCE, http://bussemaker.bio.columbia.edu/reduce) (14), which implement scaffold-building algorithms based on the transcriptional motifs found in clustered gene expression data, are available. Another approach is Genetic regulatory modules (GRAM, http://psrg.lcs.mit.edu/GRAM/Index.html) (15) for identifying protein-DNA binding events within sets of transcription factors (see Note 3).
3.1.2. Network Scaffold Decomposition
Integrating Omics data in network modules and aligning such modules into more complete networks is the common procedure in reconstructing networks. Network modules rest on available interactome data and are typically composed of a limited number of nodes. Such identified motifs represent the basic building blocks that comprise the cellular network. The incorporation of localizomics data further supports isolation of biologically relevant motifs, as interacting components are with higher probability found in the same subcellular location.
Methods such as Statistical Analysis of Network Dynamics (SANDY, http://sandy.topnet.gersteinlab.org/), methods for bicluster analysis (SAMBA, http://www.cs.tau.ac.il/%7Ershamir/expander/expander.html), and tools like Mdraw (http://www.weizmann.ac.il/mcb/UriAlon/NetworkMotifsSW/mdraw/) and Mfinder (http://www.weizmann.ac.il/mcb/UriAlon/NetworkMotifsSW/mfinder/MfinderManual.pdf) are available for constructing correlative maps (see Note 4). A representative map constructed by using Mdraw is depicted in Fig. 3.
3.1.3. Cellular Systems Modeling and Analysis
The availability of Omics data sets opens the way for efforts aimed at integrating diverse Omics profiles into whole-cell or system models, spanning from identification of network modules to quantitative modeling and simulation (see Note 5). The constraint-based reconstruction and analysis (COBRA, http://gcrg.ucsd.edu/Downloads/Cobra_Toolbox) technique (16) has emerged in recent years as a successful approach for modeling systems on a genome scale, integrating genomic, proteomic, and other high-throughput data. This toolbox can be downloaded for Matlab.
Next to a quantitative description of a cellular state, Omics data may also be seen in the context of overall constraints from thermodynamics, mass conservation, reactions involved, etc. A reconstruction is here defined as the list of biochemical reactions occurring in a particular cellular process (such as metabolism), and the associations between these reactions and relevant proteins, transcripts, and genes. A reconstruction can be converted to a model by including the assumptions necessary for computational simulation, for example, maximum reaction rates and nutrient uptake rates, which results in a model of the cellular process encoded within the Omics data. Latest methods in developing such cellular simulations involve tools like BioTapestry (http://www.biotapestry.org) (17). A sample process done using BioTapestry (see Note 6) is depicted in Fig. 4.
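The constraints named above can be illustrated with a toy steady-state check. This is not the COBRA Toolbox itself (which runs in Matlab) but a plain Python sketch under stated assumptions: a hypothetical three-reaction network, mass conservation expressed as S·v = 0, and capacity constraints expressed as lower and upper flux bounds.

```python
def is_feasible(stoichiometry, fluxes, lower, upper, tolerance=1e-9):
    """Check mass balance (S.v = 0) and flux bounds for one flux vector."""
    balanced = all(
        abs(sum(coef * flux for coef, flux in zip(row, fluxes))) <= tolerance
        for row in stoichiometry
    )
    bounded = all(
        lo - tolerance <= flux <= hi + tolerance
        for flux, lo, hi in zip(fluxes, lower, upper)
    )
    return balanced and bounded


# Hypothetical network. Rows: metabolites A and B.
# Columns: R1 (uptake -> A), R2 (A -> B), R3 (B -> export).
S = [
    [1, -1, 0],
    [0, 1, -1],
]
lower = [0.0, 0.0, 0.0]
upper = [10.0, 1000.0, 1000.0]  # assumed maximum uptake rate of 10 for R1

print(is_feasible(S, [5.0, 5.0, 5.0], lower, upper))     # balanced, within bounds
print(is_feasible(S, [5.0, 2.0, 2.0], lower, upper))     # metabolite A accumulates
print(is_feasible(S, [20.0, 20.0, 20.0], lower, upper))  # exceeds the uptake bound
```

In a full constraint-based analysis, an optimizer would then search among the feasible flux vectors for one that maximizes an objective such as growth; that step is left out here.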
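For the size-3 network motifs referred to in Subheading 3.1.2 and Fig. 3, a brute-force count of feed-forward loops can be sketched as follows. This is not the Mfinder algorithm: it omits the comparison against randomized networks that motif detection requires, and the edge list is a made-up example.

```python
from itertools import permutations


def count_feed_forward_loops(edges):
    """Count ordered triples (x, y, z) with edges x->y, y->z, and x->z."""
    edge_set = set(edges)
    nodes = {node for edge in edges for node in edge}
    return sum(
        1
        for x, y, z in permutations(nodes, 3)
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set
    )


# Hypothetical regulatory edges (regulator, target).
network = [("tfA", "tfB"), ("tfB", "geneC"), ("tfA", "geneC"), ("geneC", "geneD")]
chain = [("tfA", "tfB"), ("tfB", "geneC")]

print(count_feed_forward_loops(network))
print(count_feed_forward_loops(chain))
```

Enumerating all node triples scales cubically with network size, so this form is only suitable for small modules.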
Fig. 4. Constraint-Based Reconstruction and Analysis (COBRA) method. Panel labels: network reconstruction, application of constraints, states.
3.2. DBE
The Data analysis and visualization system for Biological Experiments (DBE, http://www.bic-gh.de/dbe) (18) describes a method for mapping metabolomics data into metabolite data, where DBE helps scientists in managing, analyzing, and visualizing experimental data. DBE has a flow of components for handling Omics data in a multidisciplinary way. The DBE Web site provides the user interface, the DBE-Database supports consistent data storage, data import is realized via Excel-based templates, DBE-Pictures supports handling of, e.g., image files, and DBE-Gravisto provides network analysis and visualization. Selected components are shown in Fig. 5.
For demonstrating DBE functionalities, we use metabolite data available for seed development of beans (Vicia narbonensis). In this case, transgenic technology was applied for increasing protein accumulation via introducing the bacterial enzyme phosphoenolpyruvate carboxylase (PEPC). The enzyme refixes HCO3- liberated by respiration and, together with phosphoenolpyruvate (PEP), yields oxaloacetate that can either be converted to aspartate or into malate and other intermediates of the citric acid cycle. To characterize the responsible metabolic shift within seeds from sugars/starch into organic acids/amino acids/proteins, the metabolite pattern for glycolysis, citrate cycle, as well as related sugars and free amino acids was analyzed. Visualization of metabolites within their pathways (Fig. 6) gives an immediate overview of specific changes in metabolism within transgenic seeds.
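The idea of placing measurements in their pathway context can be sketched independently of DBE. The fragment below attaches a log2 ratio (transgenic versus wild type) to each measured metabolite of a pathway; the pathway membership and concentrations are invented placeholders, not values from the Vicia narbonensis data set, and the function name is hypothetical.

```python
from math import log2


def annotate_pathways(pathways, wild_type, transgenic):
    """Attach log2(transgenic / wild type) to every measured pathway metabolite."""
    annotated = {}
    for pathway, metabolites in pathways.items():
        annotated[pathway] = {
            metabolite: log2(transgenic[metabolite] / wild_type[metabolite])
            for metabolite in metabolites
            if metabolite in wild_type and metabolite in transgenic
        }
    return annotated


# Placeholder pathway membership and concentrations (arbitrary units).
pathways = {
    "citrate cycle": ["malate", "citrate"],
    "sugars": ["sucrose"],
}
wild_type = {"malate": 2.0, "sucrose": 8.0}
transgenic = {"malate": 4.0, "sucrose": 4.0}

changes = annotate_pathways(pathways, wild_type, transgenic)
print(changes)
```

Metabolites without measurements in both lines (citrate in this toy input) are simply left unannotated, which mirrors how unmeasured nodes stay uncolored in a pathway map.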
Fig. 5. DBE-Gravisto, a network analysis and graph visualization system. The figure shows individual components of DBE, including storage of metabolite data, data in MS Excel, visualization of data-enriched networks, and upload and assignment of image files to experiments.
3.3. BIOMART
BioMart (http://www.biomart.org) (9) is an open source data management system that comes with a range of query interfaces allowing the user to group and refine data based upon many different criteria. The capabilities of BioMart are further extended by integrating several widely used software packages, such as BioConductor, DAS, Galaxy, Cytoscape, or Taverna. BioMart provides a graphical as well as a command line interface, and furthermore Web services and APIs written in Perl and Java supporting various database systems such as MySQL, Oracle, and Postgres. Data integration involves four steps, namely, (1) querying, (2) configuration, (3) transformation, and (4) source data (Fig. 7).
Querying allows the user to select data, including filtering on the basis of attributes like the Gene ID or GO terms, providing a structured XML view. Configuration rests on XML for aligning heterogeneous data supporting structured querying. Transformation allows the data integration into the XML format from source data, and source data are available data sets which are parsed through Perl APIs into a MySQL database.
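A structured XML query of the kind described above can be assembled with Python's standard library. The sketch follows the general Query/Dataset/Filter/Attribute layout of BioMart XML queries; the dataset, filter, and attribute names and the gene identifier used here are assumptions for illustration and should be checked against the configuration of the mart being queried.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes):
    """Serialize a BioMart-style XML query as a string."""
    query = ET.Element(
        "Query",
        {
            "virtualSchemaName": "default",
            "formatter": "TSV",
            "header": "0",
            "uniqueRows": "1",
            "datasetConfigVersion": "0.6",
        },
    )
    dataset_node = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(dataset_node, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(dataset_node, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed names: verify them in the mart's own filter and attribute lists.
query_xml = build_query(
    dataset="hsapiens_gene_ensembl",
    filters={"ensembl_gene_id": "ENSG00000139618"},
    attributes=["ensembl_gene_id", "go_id"],
)
print(query_xml)
```

Building the document through an XML library rather than string concatenation keeps attribute values correctly escaped when identifiers contain special characters.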
Three-tier architecture:
The first tier consists of one or more relational databases. Two tools are present in the first tier:
• MartBuilder, to construct SQL statements for transforming a schema into a mart.
• MartEditor, for generating a data set configuration XML stored in metadata tables within the actual mart database.
The second tier is the Perl API, which interacts with both the data set configuration and the mart databases.
The third tier consists of the query interfaces, which utilize the API to present the possible BioMart queries and results:
• MartView, a Web browser interface.
• MartService, a Web services interface.
• MartURLAccess, a mart view based on Web URL.
Fig. 6. Visualization of experimental data in the context of a metabolic network constructed by using the DBE-Gravisto standalone version 1.1 (beta).
Fig. 7. Steps of data integration: source data, transformation, XML configuration, and querying with the BioMart software.
As a practical example, we show the analysis of the 1 kb upstream sequences of a cluster of human genes identified by an expression profile experiment using an Affymetrix GeneChip U95Av2.
The Homo sapiens genes data set is selected, and the ID list limit filter in the GENE section is chosen. Selecting the Affy hg u95av2 ID(s) option provides an upload option for Affymetrix probeset IDs using the file Browse button, or alternatively by copy and paste of the data set into the text box. Data types include complementary DNA (cDNA), peptides, coding regions, untranslated regions (UTRs), and exons with additional upstream and downstream flanking regions. In order to identify upstream regulatory features in subsequent analysis, the 1 kb upstream flank sequence for each gene has to be selected (Fig. 8). The resulting data can be used for further annotation, e.g., by assigning GO terms to the Affymetrix data via selecting the respective filters and feature attributes as shown in Fig. 9.
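Once the 1 kb upstream sequences have been exported, a first pass for shared regulatory features can be as simple as listing the k-mers present in every sequence. The sketch below uses short made-up sequences in place of real 1 kb flanks and does no statistical scoring, so it only illustrates the idea.

```python
from collections import Counter


def kmer_presence(sequences, k):
    """Count in how many sequences each k-mer occurs at least once."""
    presence = Counter()
    for sequence in sequences:
        kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
        presence.update(kmers)
    return presence


def shared_kmers(sequences, k):
    """Return the sorted k-mers that occur in every sequence."""
    presence = kmer_presence(sequences, k)
    return sorted(kmer for kmer, count in presence.items() if count == len(sequences))


# Made-up stand-ins for upstream flank sequences.
upstream = ["ACGTGACC", "TTACGTGA", "GGACGTGT"]

print(shared_kmers(upstream, 5))
```

Dedicated tools score candidate motifs against a background model; exact k-mer sharing, as used here, misses degenerate binding sites.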
A number of external software packages have incorporated BioMart for enhancing querying capabilities, e.g., for using services as Galaxy, BioConductor, Taverna, or to add further annotation and visualization of results (e.g., Cytoscape, http://www. cytoscape.org). This integration has been made possible through MartServices. BioMart can be easily configured to become a DAS annotation server for viewing of data through various Distributed Annotation System (DAS) clients.
Fig. 8. Example for sequence attributes, filters, and results after the selection of given options in the MART window.
Fig. 9. Selected features, attributes, filters, and results after the selected options from the MART viewer showing the GO-annotated tables.
4. Notes
1. Further functional studies provide additional detailed information, e.g., on druggability of target genes, adding value to the discovery of novel therapeutics (19).
2. XML is a common set of well-defined data formats and is the format of choice for storing information with an inherent hierarchical structure. XML has been widely accepted in Omics for data exchange, migration, and storage.
3. Grid Resource Allocation Manager or Globus Resource Allocation Manager (GRAM) is a software component of the Globus Toolkit that can locate, submit, monitor, and cancel jobs on Grid computing resources. It provides reliable operation, stateful monitoring, credential management, and file staging. GRAM does not provide job scheduler functionality and is in fact just a front-end (or interoperability bridge) to the functionality provided by an external scheduler that does not natively support the Globus Web service protocols. REDUCE is a general-purpose computer algebra system geared toward applications in physics.
4. Mdraw is an ANSI drawing tool written in C# using the Mono platform Mfinder.
5. So far, merging of Omics data has fundamentally contributed to basic biological research for deriving models and controlled vocabularies for annotating biological processes. On a subsequent level, pharmacogenomics and pharmacoproteomics have emerged to study, e.g., drug pharmacodynamics and pharmacokinetics with reference to human and other organisms, allowing the analysis of small molecule drugs as well as biologicals.
6. BioTapestry is an interactive tool for building, visualizing, and simulating genetic regulatory networks. The tool is also used for Interactive Web Models.
References
1. Caspi, R., Foerster, H., Fulcher, C.A., Kaipa, P., Krummenacker, M., Latendresse, M., Paley, S., Rhee, S.Y., Shearer, A.G., and Tissier, C. (2008) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 36, D623-31.
2. Srinubabu, G. (2009) Computational systems biology of -Omics data: integration, warehousing and validation. BIT Life Sciences' 2nd Annual World Summit of Antivirals, July 18-20, 2009, Beijing, China.
3. Hanuman, T., Raghava, N.M., Siva, P.A., Mrithyunjaya, R.K., Chandra, S.V., Allam, A.R., and Srinubabu, G. (2009) Performance comparative in classification algorithms using real datasets. J Comput Sci Syst Biol 2, 97-100.
4. Tetsuro, T., Yoshiki, M., Keith, P., Naohiko, H., Norio, K., and Yoshiyuki, S. (2007) OmicBrowse: a browser of multidimensional omics annotations. Bioinformatics 23, 524-26.
5. Avraham, S., Tung, C.W., Ilic, K., Jaiswal, P., Kellogg, E.A., McCouch, S., Pujar, A., Reiser, L., Rhee, S.Y., Sachs, M.M., Schaeffer, M., Stein, L., Stevens, P., Vincent, L., Zapata, F., and Ware, D. (2008) The Plant Ontology Database: a community resource for plant structure and developmental stages controlled vocabulary and annotations. Nucleic Acids Res 36, D449.
6. Sidhu, A.S., Dillon, T.S., and Chang, E. (2006) Advances in Protein Ontology Project. Computer-Based Medical Systems CBMS 19th IEEE International Symposium 588-92.
7. Ashburner, M. et al. (2000) Gene ontology: tool for the unification of biology. Nat Genet 25, 25-29.
8. Satya, S.S., Christopher, T., Amit, S., Cory, H., and William, S. (2005) GLYDE - An expressive XML standard for the representation of glycan structure. Carbohydr Res 18, 2802-7.
9. Syed, S.H., Benoit, B., Richard, H., Darin, L., Gudmundur, T., and Arek, K. (2009) BioMart - biological queries made easy. BMC Genomics 10, 22.
10. Vandervalk, B.P., McCarthy, E.L., and Wilkinson, M.D. (2009) Moby and Moby 2: creatures of the deep (web). Brief Bioinform 10, 114-28.
11. Burgun, A., and Bodenreider, O. (2008) Accessing and integrating data and knowledge for biomedical research. France Yearb Med Inform 91-101.
12. Akula, S.P., Miriyala, R.N., Thota, H., Rao, A.A., and Srinubabu, G. (2009) Techniques for integrating -omics data. Bioinformation 3, 284-86.
13. Wei, W., Michael, C.J., Yigal, N., Emmitt, J., David, B., and Hao, L. (2005) Inference of combinatorial regulation in yeast transcriptional networks: a case study of sporulation. Proc Natl Acad Sci USA 102, 1998-03.
14. Crispin, R., and Harmen, J.B. (2008) REDUCE: an online tool for inferring cis-regulatory elements and transcriptional module activities from microarray data. Nucleic Acids Res 31, 3487-90.
15. Bar-Joseph, Z., Gerber, G.K., Lee, T.I., Rinaldi, N.J., Yoo, J.Y., Robert, F., Gordon, D.B., Fraenkel, E., Jaakkola, T.S., Young, R.A., and Gifford, D.K. (2003) Computational discovery of gene modules and regulatory networks. Nat Biotechnol 21, 1337-42.