Integration, Warehousing, and Analysis Strategies of Omics Data

Srinubabu Gedela

Abstract

"-Omics" is a current suffix for numerous types of large-scale biological data generation procedures, which naturally demand the development of novel algorithms for data storage and analysis. With next generation genome sequencing burgeoning, it is pivotal to decipher a coding site on the genome, a gene's function, and information on transcripts next to the pure availability of sequence information. To explore a genome and downstream molecular processes, we need umpteen results at the various levels of cellular organization by utilizing different experimental designs, data analysis strategies, and methodologies. Here comes the need for controlled vocabularies and data integration to annotate, store, and update the flow of experimental data. This chapter explores key methodologies to merge Omics data by semantic data carriers, discusses controlled vocabularies as eXtensible Markup Language (XML), and provides practical guidance, databases, and software links supporting the integration of Omics data.

Key words: XML, RDF, Controlled vocabularies, Omics data, Warehousing, Data integration

1. Introduction

1.1. General Considerations

Living cells are organized around some central aspects, including complex and integrated structure, regulatory mechanisms (as homeostasis), growth and development, energy utilization, response to environmental stimuli, reproduction (DNA guarantees (semi)exact replication), and evolution (capacity of living entities to adapt over time), which all together are reflected in Systems Biology. Latest research expanded toward a systems view of complex diseases, also including different species as, e.g., in the Systems Biology of host-pathogen interactions. Data warehouses holding biological data of gene products, metabolites, and most important also their relationships and biochemical organization in metabolic pathways, are a central prerequisite for such Systems Biology approaches. As an example, organizations as BioCyc (http://www.biocyc.org) provide tools for developing organism-specific metabolic pathway databases from previously annotated metabolites, also allowing the inclusion of newly annotated metabolites (1).

Bernd Mayer (ed.), Bioinformatics for Omics Data: Methods and Protocols, Methods in Molecular Biology, vol. 719, DOI 10.1007/978-1-61779-027-0_18, © Springer Science+Business Media, LLC 2011

1.2. "-Omics" Data

Genome-wide or at least large-scale quantification of molecular components and experimental assessment of how these components interact have offered a broader insight into cellular function and into the effect of genetic and environmental perturbations. Omics data include quantification of mRNA transcripts (transcriptome), protein abundance (proteome), metabolic fluxes (fluxome), the concentration profiles of intracellular and extracellular metabolites (metabolome), and information on protein-protein and protein-DNA interactions (interactome).

A variety of methods has been derived for data analysis, interpretation of phenomenological observations, and quantitative prediction of cellular behavior. These methods include comparative analysis of Omics profiling (e.g., statistical tests and reduction of dimensionality methods), models for integrative analysis (e.g., graph theory-based models), and predictive models (2, 3). Selecting a type of model capable of handling the given problem plays an important role in the extraction of knowledge; the experimental design should anticipate the planned data analysis strategy and follow a clear definition of the biological hypothesis.
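The comparative analysis named above can be illustrated with a per-feature statistical test. The following is a minimal sketch, not a method prescribed by this chapter: it ranks features by Welch's t statistic between two conditions. The function names, feature names, and expression values are hypothetical and serve only as an illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t statistic for two samples with unequal variances."""
    var_a, var_b = variance(group_a), variance(group_b)
    standard_error = sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error


def rank_features(profiles):
    """Rank features (e.g., transcripts) by the absolute t statistic.

    `profiles` maps a feature name to a (control, treated) pair of
    replicate measurements.
    """
    scores = {
        name: welch_t(treated, control)
        for name, (control, treated) in profiles.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical log-scale expression values, three replicates per condition.
example_profiles = {
    "gene_A": ([5.0, 5.2, 4.8], [7.1, 6.9, 7.0]),
    "gene_B": ([3.0, 3.4, 3.2], [3.1, 3.3, 3.5]),
}

ranking = rank_features(example_profiles)
```

In this toy input, gene_A separates the two conditions far more clearly than gene_B and is therefore ranked first. A real analysis would add a correction for multiple testing, which is omitted here.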

Biological information extracted in Omics includes all levels of exploration of cellular activities, spanning from the gene to expression and further to the phenotype level. Main data levels are genomics, transcriptomics, proteomics, glycomics, lipidomics, metabolomics, and localizomics. The functional states of these Omics data are, e.g., phenomics and fluxomics, involving the effective expression level reflected by Omics data and the flux required for metabolites in pathways. Further Omics levels describe interactions as protein-protein or protein-DNA interactions. Data flows describing various components of Omics within a cell are shown in Fig. 1.

These Omics procedures in turn generate enormous amounts of data which need to be stored in efficient ways. Large-scale information is readily available in genome-scale Omics repositories, while most of the dedicated databases store experimental data on the gene and protein level taken from various sources. A selected list of available data repositories, including a short description and URL, is provided in Table 1.

Omics data are naturally retrieved from specific experimental procedures, and raw data as well as processed data and analysis results are represented in databases. Formally, the integration of Omics data can be described as an array in which all the individual array elements are interlinked. To respect this fact, a design has to be chosen that is capable of representing both the experiments as such and their result vectors. The experimental designs noted in Fig. 2 span from the genome level to sequence annotation and further to ORF validation by, e.g., utilizing microarrays and SAGE. Subsequent experiments as proteomics provide, e.g., information on post-translational modifications (PTMs) centrally used for enzyme annotation in metabolic pathways. Analysis of these Omics levels allows studying interactions on gene regulatory networks and protein-protein interaction networks (see Note 1). Finally, functional annotation contributes to an understanding of the overall expression on the gene level, as, e.g., represented in OmicBrowse (http://omicspace.riken.jp/omicBrowse), interconnecting different Omics data component levels in a semantic fashion (4).

Fig. 1. Schematic representation of different components of Omics data and information flow within a cell. (Labels recoverable from the figure: Metabolism, Phenome, Flux.)

2. Materials

Omics data are frequently represented in vocabularies, represented, e.g., as hierarchical data elements, also providing the interface to numerous data repositories and analysis tools (5). For example, the protein ontology (http://pir.georgetown.edu/pro)
is a standard for supporting data integration, data mining, and models for deriving protein structural and functional properties (6). The Gene Ontology (GO, http://www.geneontology.org) is a controlled vocabulary to functionally annotate gene products with respect to their biological process, molecular function, and cellular location (7). Various frameworks have been derived for implementing vocabularies, including eXtensible Markup Languages (XML), Resource Description Framework (RDF), Open Biomedical Ontologies (OBO), and OWL Web Ontology Language (OWL).

Table 1
Major Omics data resources

Data type | Online resource | Description | URL
Components
Genomics | Genomes OnLine Database (GOLD) | Repository of completed and ongoing genome projects | http://www.genomesonline.org
Transcriptomics | Gene Expression Omnibus (GEO) | Microarray and SAGE-based genome-wide expression profiles | http://www.ncbi.nlm.nih.gov/geo
Transcriptomics | Stanford Microarray Database (SMD) | Microarray-based genome-wide expression data | http://genome-www.stanford.edu/microarray
Proteomics | World-2DPAGE | Links to 2D-PAGE data | http://us.expasy.org/ch2d/2d-index.html
Proteomics | Open Proteomics Database (OPD) | Mass-spectrometry-based proteomic data | http://bioinformatics.icmb.utexas.edu/OPD
Lipidomics | Lipid Metabolites and Pathways Strategy (LIPID MAPS) | Genome-scale lipids database | http://www.lipidmaps.org
Localizomics | Yeast GFP Fusion Localization Database | Yeast genome-scale protein-localization data | http://yeastgfp.ucsf.edu
Glycomics | Consortium for Functional Glycomics | Glycan array and profile data | http://www.functionalglycomics.org/
Interactions
Protein-DNA | Biomolecular Network Database (BIND) | Published protein-DNA interactions | http://www.bind.ca/Action/
Protein-DNA | Encyclopedia of DNA Elements (ENCODE) | Database of functional elements in human DNA | http://genome.ucsc.edu/ENCODE/index.html
Protein-protein | Munich Information Center for Protein Sequences (MIPS) | Links to protein-protein interaction data and resources | http://mips.gsf.de/proj/ppi
Protein-protein | Database of Interacting Proteins (DIP) | Published protein-protein interactions | http://dip.doe-mbi.ucla.edu
Functional states
Phenomics | RNAi database | C. elegans RNAi screen data | http://rnai.org
Phenomics | General Repository for Interaction Datasets (GRID) | Synthetic-lethal interactions in yeast | http://biodata.mshri.on.ca/grid
Phenomics | A Systematic Annotation Package for Community Analysis of Genomes (ASAP) | Single-gene-deletion microarray data for E. coli phenotypes | http://www.genome.wisc.edu/tools/asap.htm

Fig. 2. Types of experimental designs: ChIP-chip (chromatin-immunoprecipitation-DNA-microarray); coAP-MS (co-affinity purification-mass-spectrometry); RNAi (RNA interference); SAGE (serial analysis of gene expression); yeast 2H (yeast two-hybrid analysis). (Labels recoverable from the figure: Genomics: sequence annotation; Transcriptomics: microarray, SAGE; Proteomics: post-translational modifications, mass spectrometry, MALDI-TOF; Phenomics: phenotype arrays, RNAi screens, synthetic lethal; gene-regulatory networks: yeast 2H, coAP-MS, ChIP-chip; further labels: isotopic tracing, metabolite abundance, enzyme annotation.)

2.1. eXtensible Markup Languages

XML is a general-purpose markup language that supports data sharing across heterogeneous systems, and provides a format of choice for storing information with an inherent hierarchical structure (see Note 2). XML has been widely accepted in the Omics sciences as a standard for data exchange. Examples for powerful XML-based data integration are the GLYcan Data Exchange (GLYDE, http://lsdis.cs.uga.edu/projects/glycomics), enabling interoperability and exchange of glycomics data and more generally on structures carrying glycan moieties, as developed by Sahoo
and his team (8), and BioMart (http://www.biomart.org), an integration tool using XML syntax for building data elements from different databases providing user-defined queries (9).

2.2. Resource Description Framework

RDF (http://www.w3.org/RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model. RDF is used as a general syntax for linking a wide variety of data in a single framework. RDF is, e.g., used to combine genome data and public domain annotations within GO, KEGG, and the SUPERFAMILY database (10).

2.3. Open Biomedical Ontologies

OBO (http://obi.sourceforge.net) is a collection of controlled vocabularies freely available to the biomedical community. Web-based ontology portals, such as the BioPortal (http://bioportal.bioontology.org), allow users to browse, search, submit, and visualize ontologies. The need for innovative technology and methods that allow scientists to record, manage, and disseminate biomedical information and knowledge in machine-processable form gave rise to the National Center for Biomedical Ontology (NCBO, http://www.bioontology.org) initiative created in 2005 (11).

2.4. OWL Web Ontology Language

OWL (http://www.w3.org/TR/owl-guide) facilitates further improved machine interpretability of Web content when compared to XML, RDF, and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics. OWL has three sublanguages: OWL Lite, OWL DL, and OWL Full. They are described as follows:

• OWL DL is an ontology language based on description logics (DLs).

• OWL Lite supports a classification hierarchy and simple syntax.

• OWL Full offers maximum expressiveness and syntactic freedom.

BioPAX (http://www.biopax.org) is an effort to create a data exchange format for biological pathway data utilizing OWL semantics.

3. Methods

Various methods for Omics data integration and analysis are available (12). The first criterion for data integration is the type of data; hence, most of the algorithms available are based on genomics, transcriptomics, and proteomics experimental data. Other Omics data components like phenomics and fluxomics can be studied through integration of further analysis tools of the integration environments.
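Because integration starts from the type of data, records of different Omics types need a common carrier before they can be linked. One way to picture this is the subject-predicate-object statement that underlies RDF (Subheading 2.2). The sketch below is an assumption-laden illustration in plain Python rather than an RDF implementation: the identifiers, predicate names, and helper functions are hypothetical.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) triple, the basic unit
# of the RDF data model. All identifiers below are illustrative only.
triples = [
    ("gene:G1", "encodes", "protein:P1"),
    ("protein:P1", "catalyzes", "reaction:R1"),
    ("gene:G1", "annotated_with", "GO:0008152"),
    ("protein:P1", "interacts_with", "protein:P2"),
    ("gene:G2", "encodes", "protein:P2"),
]


def build_index(statements):
    """Index triples by subject so that related facts can be followed."""
    index = defaultdict(list)
    for subject, predicate, obj in statements:
        index[subject].append((predicate, obj))
    return index


def describe(subject, index, depth=2):
    """Collect all facts reachable from `subject` within `depth` links."""
    facts = []
    frontier = [subject]
    seen = {subject}
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for predicate, obj in index.get(node, []):
                facts.append((node, predicate, obj))
                if obj not in seen:
                    seen.add(obj)
                    next_frontier.append(obj)
        frontier = next_frontier
    return facts


index = build_index(triples)
facts_for_g1 = describe("gene:G1", index)
```

Following the links from one gene collects its genomic annotation, its protein product, and that protein's reaction and interaction partner in a single pass, although these facts would typically originate from different data types and repositories.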

3.1. Identifying, Decomposing, and Modeling

The procedure of identifying and decomposing a network scaffold, followed by cellular systems modeling and analysis, is schematically depicted in Fig. 3.

Fig. 3. A flow chart describing various steps of data integration toward network reconstruction and model building. (Panels: (a) Identifying Network Scaffold; (b) Scaffold Decomposition; (c) Cellular modeling and analysis. Further labels: Experimental Data; Network scaffold with Motif size-3; Transcription network built from Mdraw and Mfinder; Whole cell or system model built by BioTapestry simulations.)

3.1.1. Identifying a Network Scaffold

This task depicts the strategy for identifying all interactions between Omics components. A typical example is the identification of a gene-regulatory network scaffold by integrating chromatin immunoprecipitation (ChIP) and microarray gene expression data (referred to as ChIP-chip data). Such Omics data specify the interactions between a transcriptional regulator and its target gene, and various statistical approaches are available to derive the specific regulatory relationship (namely, transcriptional activation or repression). Data on protein-DNA and protein-protein interactomes reflect the activity of a cellular network, and a typical analysis strategy follows clustering of high throughput gene expression data sets, complemented by isolating the upstream regions of clustered genes for identifying common cis-regulatory motifs.

Tools like Module construction using gene expression and sequence motifs (MODEM) (13) and regulatory-element detection using correlation with expression (REDUCE, http://bussemaker.bio.columbia.edu/reduce) (14), implemented with scaffold building algorithms based on the transcriptional motifs
found in clustered gene expression data are available. Another approach is Genetic regulatory modules (GRAM, http://psrg.lcs.mit.edu/GRAM/Index.html) (15) for identifying protein-DNA binding events within sets of transcription factors (see Note 3).

3.1.2. Network Scaffold Decomposition

Integrating Omics data in network modules and aligning such modules into more complete networks is the common procedure in reconstructing networks. Network modules rest on available interactome data and are typically composed of a limited number of nodes. Such identified motifs represent the basic building blocks that comprise the cellular network. The incorporation of localizomics data further supports isolation of biologically relevant motifs, as interacting components are with higher probability found in the same subcellular location.

Methods as Statistical Analysis of Network Dynamics (SANDY, http://sandy.topnet.gersteinlab.org/), methods for bicluster analysis (SAMBA, http://www.cs.tau.ac.il/%7Ershamir/expander/expander.html), and tools like Mdraw (http://www.weizmann.ac.il/mcb/UriAlon/NetworkMotifsSW/mdraw/) and Mfinder (http://www.weizmann.ac.il/mcb/UriAlon/NetworkMotifsSW/mfinder/MfinderManual.pdf) are available for constructing correlative maps (see Note 4). A representative map constructed by using Mdraw is depicted in Fig. 3.

3.1.3. Cellular Systems Modeling and Analysis

The availability of Omics data sets opens the way for efforts aimed at integrating diverse Omics profiles into whole-cell or system models, spanning from identification of network modules to quantitative modeling and simulation (see Note 5). The constraint-based reconstruction and analysis (COBRA, http://gcrg.ucsd.edu/Downloads/Cobra_Toolbox) technique (16) has emerged in recent years as a successful approach for modeling systems on a genome scale, integrating genomic, proteomic, and other high throughput data. This toolbox can be downloaded for Matlab.

Next to a quantitative description of a cellular state, Omics data may also be seen in the context of overall constraints from thermodynamics, mass conservation, reactions involved, etc. A reconstruction is here defined as the list of biochemical reactions occurring in a particular cellular procedure (as metabolism), and the associations between these reactions and relevant proteins, transcripts, and genes. A reconstruction can be converted to a model by including the assumptions necessary for computational simulation, for example, maximum reaction rates and nutrient uptake rates, which results in a reconstruction of the cellular process encoded within the Omics data. Latest methods in developing such cellular simulations involve tools like BioTapestry (http://www.biotapestry.org) (17). A sample process done using BioTapestry (see Note 6) is depicted in Fig. 4.
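The step from a reconstruction to a model can be made concrete with a toy example. The sketch below does not use the COBRA toolbox and does not perform the optimization that constraint-based analysis usually involves. It only checks whether a given flux distribution satisfies mass conservation at steady state and the imposed rate limits. The three-reaction network, the reaction names, and the bounds are hypothetical.

```python
# Toy reconstruction: each reaction maps metabolites to stoichiometric
# coefficients (negative = consumed, positive = produced). The network,
# names, and bounds are hypothetical and chosen only for illustration.
reactions = {
    "uptake": {"A": 1},            # nutrient uptake: -> A
    "convert": {"A": -1, "B": 1},  # A -> B
    "export": {"B": -1},           # B ->
}

# Turning the reconstruction into a model: add flux bounds such as a
# maximum nutrient uptake rate and maximum reaction rates.
bounds = {
    "uptake": (0.0, 10.0),
    "convert": (0.0, 8.0),
    "export": (0.0, 1000.0),
}


def mass_balance(fluxes):
    """Return the net production rate of every metabolite."""
    net = {}
    for name, stoichiometry in reactions.items():
        for metabolite, coefficient in stoichiometry.items():
            net[metabolite] = net.get(metabolite, 0.0) + coefficient * fluxes[name]
    return net


def is_feasible(fluxes, tolerance=1e-9):
    """Check steady state (no net accumulation) and the flux bounds."""
    balanced = all(abs(rate) <= tolerance for rate in mass_balance(fluxes).values())
    bounded = all(low <= fluxes[name] <= high for name, (low, high) in bounds.items())
    return balanced and bounded


steady = {"uptake": 5.0, "convert": 5.0, "export": 5.0}
too_fast = {"uptake": 9.0, "convert": 9.0, "export": 9.0}    # exceeds the convert bound
unbalanced = {"uptake": 5.0, "convert": 3.0, "export": 3.0}  # metabolite A accumulates
```

Only the first flux distribution passes both checks. The constraints thus narrow the set of cellular states that remain admissible, which is the idea summarized in Fig. 4.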

Fig. 4. Constraint-Based Reconstruction and Analysis (COBRA) method. (Labels recoverable from the figure: Network Reconstruction; Reconstructed Network; Application of constraints; States.)

3.2. DBE

The Data analysis and visualization system for Biological Experiments (DBE, http://www.bic-gh.de/dbe) (18) describes a method for mapping metabolomics data into metabolite data, where DBE helps the scientists in managing, analyzing, and visualizing experimental data. DBE has a flow of components for handling Omics data in a multidisciplinary way. The DBE Web site provides the user interface, the DBE-Database supports consistent data storage, support of data import is realized via Excel-based templates, DBE-Pictures supports handling of, e.g., image files, and DBE-Gravisto provides network analysis and visualization. Selected components are shown in Fig. 5.

For demonstrating DBE functionalities, we use metabolite data available for seed development of beans (Vicia narbonensis). In this case, transgenic technology was applied for increasing protein accumulation via introducing the bacterial enzyme phosphoenolpyruvate carboxylase (PEPC). The enzyme refixes HCO3- liberated by respiration and, together with phosphoenolpyruvate (PEP), yields oxaloacetate that can either be converted to aspartate or into malate and other intermediates of the citric acid cycle. To characterize the responsible metabolic shift within seeds from sugars/starch into organic acids/amino acids/proteins, the metabolite pattern for glycolysis, citrate cycle as well as related sugars and free amino acids was analyzed. Visualization of metabolites within their pathways (Fig. 6) gives an immediate overview of specific changes in metabolism within transgenic seeds.

Fig. 5. DBE-Gravisto, a network analysis and graph visualization system. (Labels recoverable from the figure: Individual Components of DBE; storage of metabolite data; visualize the data-enriched networks; upload and assign image files to experiments.)

Fig. 6. Visualization of experimental data in the context of a metabolic network constructed by using the DBE-Gravisto standalone version 1.1 (beta). (Node labels recoverable from the figure include maltose, UDP-glucose, D-xylose, L-arginine, L-lysine, L-ornithine, L-serine, L-aspartate, L-valine, L-cysteine, and L-norvaline.)

3.3. BioMart

BioMart (http://www.biomart.org) (9) is an open source data management system that comes with a range of query interfaces allowing the user to group and refine data based upon many different criteria. The capabilities of BioMart are further extended by integrating several widely used software packages, such as BioConductor, DAS, Galaxy, Cytoscape, or Taverna. BioMart provides a graphical as well as a command line interface, and furthermore Web services and APIs written in Perl and Java, supporting various database systems as MySQL, Oracle, and Postgres. Data integration involves four steps, namely, (1) querying, (2) configuration, (3) transformation, and (4) source data (Fig. 7).

Querying allows the user to select data, including filtering on the basis of attributes like the Gene ID or GO terms, providing a structured XML view. Configuration rests on XML for aligning heterogeneous data, supporting structured querying. Transformation allows the data integration into the XML format from source data, and source data are available data sets which are parsed through Perl APIs into a MySQL database.

Three-tier architecture:

The first tier consists of one or more relational databases. Two tools present in the first tier are:

• MartBuilder, to construct SQL statements for transforming a schema into a mart.

• MartEditor, for generating a data set configuration XML stored in metadata tables within the actual mart database.

The second tier is the Perl API, which interacts with both the data set configuration and the mart databases.

The third tier consists of the query interfaces, which utilize the API to present the possible BioMart queries and results:

• MartView, a Web browser interface.

• MartService, a Web services interface.

• MartURLAccess, a mart view based on Web URL.
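The first-tier idea of transforming a schema into a mart can be sketched in a few lines. The example below is not the SQL that MartBuilder generates, and it uses SQLite from the Python standard library instead of MySQL. It only illustrates, under these assumptions, the general principle of materializing a flattened, query-oriented table from a normalized source schema. All table names, column names, and records are hypothetical.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# A small normalized source schema (hypothetical tables and columns).
cursor.executescript(
    """
    CREATE TABLE gene (gene_id TEXT PRIMARY KEY, chromosome TEXT);
    CREATE TABLE go_annotation (gene_id TEXT, go_term TEXT);
    INSERT INTO gene VALUES ('G1', '1'), ('G2', 'X');
    INSERT INTO go_annotation VALUES ('G1', 'GO:0016020'), ('G2', 'GO:0005634');
    """
)

# The transformation step: one SQL statement that flattens the source
# tables into a single table prepared for querying.
cursor.execute(
    """
    CREATE TABLE gene_mart AS
    SELECT g.gene_id, g.chromosome, a.go_term
    FROM gene AS g
    LEFT JOIN go_annotation AS a ON a.gene_id = g.gene_id
    """
)

# Queries against the flattened table need no joins.
rows = cursor.execute(
    "SELECT gene_id, go_term FROM gene_mart WHERE chromosome = ? ORDER BY gene_id",
    ("X",),
).fetchall()
connection.close()
```

The query interfaces of the third tier can then filter on one column and return another without knowing how the source schema was organized.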

Fig. 7. Steps of data integration. (Layers as labeled in the figure: Source Data; Transformation; XML Configuration; BioMart Software Querying.)

We show as a practical example the analysis of the 1 kb upstream sequences of a cluster of human genes identified by an expression profile experiment using an Affymetrix GeneChip U95Av2.

The Homo sapiens genes data set is selected, and the ID list limit filter in the GENE section is chosen. Selecting the Affy hg u95av2 ID(s) option provides an upload option for Affymetrix probeset IDs using the file Browse button, or alternatively by copy and paste of the data set into the text box. Data types include complementary DNA (cDNA), peptides, coding regions, untranslated regions (UTRs), and exons with additional upstream and downstream flanking regions. In order to identify upstream regulatory features in subsequent analysis, the 1 kb upstream flank sequence for each gene has to be selected (Fig. 8). The subsequent data can be used for further annotation, e.g., by assigning GO terms to the Affymetrix data via selecting respective filters and features attributes as shown in Fig. 9.

A number of external software packages have incorporated BioMart for enhancing querying capabilities, e.g., for using services as Galaxy, BioConductor, Taverna, or to add further annotation and visualization of results (e.g., Cytoscape, http://www.cytoscape.org). This integration has been made possible through MartService. BioMart can be easily configured to become a DAS annotation server for viewing of data through various Distributed Annotation System (DAS) clients.
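The selection described in the practical example can also be written down as an XML query document of the kind that Web service clients submit to a mart. The following sketch builds such a document with the Python standard library. The element and attribute layout follows the BioMart XML query format as far as we are aware, while the data set name, the filter name, and the attribute names are assumptions that need to be checked against the configuration of the mart actually used. The probe set ID 1001_at is taken from the result table of the example.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes):
    """Build a BioMart-style XML query document as a string."""
    query = ET.Element(
        "Query",
        {
            "virtualSchemaName": "default",
            "formatter": "TSV",
            "header": "0",
            "uniqueRows": "1",
        },
    )
    dataset_element = ET.SubElement(
        query, "Dataset", {"name": dataset, "interface": "default"}
    )
    # Filters restrict the rows (here: a probe set ID list).
    for name, value in filters.items():
        ET.SubElement(dataset_element, "Filter", {"name": name, "value": value})
    # Attributes select the columns to be returned.
    for name in attributes:
        ET.SubElement(dataset_element, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed internal names; verify them in the mart configuration.
query_xml = build_query(
    dataset="hsapiens_gene_ensembl",
    filters={"affy_hg_u95av2": "1001_at"},
    attributes=["ensembl_gene_id", "ensembl_transcript_id", "go_id"],
)
```

The resulting string mirrors the two choices made in the graphical interface: filters correspond to the ID list limit, and attributes correspond to the selected output columns. Submitting the document to a server is deliberately left out, since the service address depends on the installation.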

Fig. 8. Example for sequence attributes, filters, and results after the selection of given options in the MART window. (Screenshot of MartView: filters for Affymetrix data with the Homo sapiens data set and the Affy hg u95av2 ID(s) option; sequence attributes with Flank (Gene) and an upstream flank of 1000 bp; header information Ensembl Gene ID, Chromosome Name, Gene Start (bp), and Gene End (bp); resulting sequences for the selected attributes and filters.)

4. Notes

1. Further functional studies provide additional detailed infor­mation, e.g., on drugability of target genes adding value to the discovery of novel therapeutics ( 19).

2. XML is a common set of well-defined data formats and is the format of choice for storing information with an inherent hierarchical structure. XML has been widely accepted in Omics for data exchange, migration, and storage.

[Figure 9 screenshot: MartView attribute selection (Features: Ensembl Gene ID, Ensembl Transcript ID, Affymetrix probe set ID, GO Description) and the resulting table of Affymetrix data annotated with Gene Ontology (GO) terms.]

Fig. 9. Selected features, attributes, filters, and results after the selected options from the MART viewer showing the GO-annotated tables.

3. Grid Resource Allocation Manager or Globus Resource Allocation Manager (GRAM) is a software component of the Globus Toolkit that can locate, submit, monitor, and cancel jobs on Grid computing resources. It provides reliable operation, stateful monitoring, credential management, and file staging. GRAM does not provide job scheduler functionality and is in fact just a front-end (or interoperability bridge) to the functionality provided by an external scheduler that does not natively support the Globus Web service protocols. REDUCE is a general-purpose computer algebra system geared toward applications in physics.


4. Mdraw is an ANSI drawing tool written in C# using the mono platform Mfinder.

5. So far, merging of Omics data has fundamentally contributed to basic biological research for deriving models and controlled vocabularies for annotating biological processes. On a subsequent level, pharmacogenomics and pharmacoproteomics have emerged to support, e.g., drug pharmacodynamic and pharmacokinetic studies with reference to human and other organisms, allowing the analysis of small molecule drugs as well as biologicals.

6. BioTapestry is an interactive tool for building, visualizing, and simulating genetic regulatory networks. The tool is also used for Interactive Web Models.
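Note 2 describes XML as the format of choice for information with an inherent hierarchical structure. The following is a minimal sketch of what that means in practice, using Python's standard `xml.etree.ElementTree` module. The record, its element names (`gene`, `transcript`, `protein`), and its identifiers are purely illustrative assumptions and do not follow any particular Omics schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names and identifiers are illustrative only.
RECORD = """
<gene id="G1" name="exampleGene">
  <transcript id="T1">
    <protein id="P1"/>
  </transcript>
  <transcript id="T2">
    <protein id="P2"/>
    <protein id="P3"/>
  </transcript>
</gene>
"""


def flatten(gene_xml):
    """Walk gene -> transcript -> protein and return one row per protein."""
    gene = ET.fromstring(gene_xml)
    rows = []
    for transcript in gene.findall("transcript"):
        for protein in transcript.findall("protein"):
            rows.append((gene.get("id"),
                         transcript.get("id"),
                         protein.get("id")))
    return rows


rows = flatten(RECORD)
for row in rows:
    print("\t".join(row))
```

The nesting carries the gene-to-transcript-to-protein relationship directly, whereas the flattened rows have to repeat the parent identifiers on every line. This is one reason a hierarchical format is convenient for exchange and a tabular form for warehouse queries.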

References

1. Caspi, R., Foerster, H., Fulcher, C.A., Kaipa, P., Krummenacker, M., Latendresse, M., Paley, S., Rhee, S.Y., Shearer, A.G., and Tissier, C. (2008) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 36, D623-31.

2. Srinubabu, G. (2009) Computational systems biology of -Omics data: integration, warehousing and validation. BIT Life Sciences' 2nd Annual World Summit of Antivirals, July 18-20, 2009, Beijing, China.

3. Hanuman, T., Raghava, N.M., Siva, P.A., Mrithyunjaya, R.K., Chandra, S.V., Allam, A.R., and Srinubabu, G. (2009) Performance comparative in classification algorithms using real datasets. J Comput Sci Syst Biol 2, 97-100.

4. Tetsuro, T., Yoshiki, M., Keith, P., Naohiko, H., Norio, K., and Yoshiyuki, S. (2007) OmicBrowse: a browser of multidimensional omics annotations. Bioinformatics 23, 524-26.

5. Avraham, S., Tung, C.W., Ilic, K., Jaiswal, P., Kellogg, E.A., McCouch, S., Pujar, A., Reiser, L., Rhee, S.Y., Sachs, M.M., Schaeffer, M., Stein, L., Stevens, P., Vincent, L., Zapata, F., and Ware, D. (2008) The Plant Ontology Database: a community resource for plant structure and developmental stages controlled vocabulary and annotations. Nucleic Acids Res 36, D449.

6. Sidhu, A.S., Dillon, T.S., and Chang, E. (2006) Advances in Protein Ontology Project. Computer-Based Medical Systems CBMS 19th IEEE International Symposium 588-92.

7. Ashburner, M. et al. (2000) Gene ontology: tool for the unification of biology. Nat Genet 25, 25-29.

8. Satya, S.S., Christopher, T., Amit, S., Cory, H., and William, S. (2005) GLYDE - An expressive XML standard for the representation of glycan structure. Carbohydr Res 18, 2802-7.

9. Syed, S.H., Benoit, B., Richard, H., Darin, L., Gudmundur, T., and Arek, K. (2009) BioMart - biological queries made easy. BMC Genomics 10, 22.

10. Vandervalk, B.P., McCarthy, E.L., and Wilkinson, M.D. (2009) Moby and Moby 2: creatures of the deep (web). Brief Bioinform 10, 114-28.

11. Burgun, A., and Bodenreider, O. (2008) Accessing and integrating data and knowledge for biomedical research. France Yearb Med Inform 91-101.

12. Akula, S.P., Miriyala, R.N., Thota, H., Rao, A.A., and Srinubabu, G. (2009) Techniques for integrating -omics data. Bioinformation 3, 284-86.

13. Wei, W., Michael, C.J., Yigal, N., Emmitt, J., David, B., and Hao, L. (2005) Inference of combinatorial regulation in yeast transcriptional networks: a case study of sporulation. Proc Natl Acad Sci USA 102, 1998-2003.

14. Crispin, R., and Harmen, J.B. (2008) REDUCE: an online tool for inferring cis-regulatory elements and transcriptional module activities from microarray data. Nucleic Acids Res 31, 3487-90.

15. Bar-Joseph, Z., Gerber, G.K., Lee, T.I., Rinaldi, N.J., Yoo, J.Y., Robert, F., Gordon, D.B., Fraenkel, E., Jaakkola, T.S., Young, R.A., and Gifford, D.K. (2003) Computational discovery of gene modules and regulatory networks. Nat Biotechnol 21, 1337-42.


16. Scott, A.B., Adam, M.F., Monica, L.M., Gregory, H., Bernhard, P., and Markus, J.H. (2007) Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox. Nat Protoc 2, 727-38.

17. Longabaugh, W.J.R., Eric, H.D., and Hamid, B. (2005) Computational representation of developmental genetic regulatory networks. Dev Biol 283, 1-16.

18. Ljudmilla, B., Mohammad-Reza, H., Christian, K., Hardy, R., and Falk, S. (2005) Integrating data from biological experiments into metabolic networks with the DBE information system. In Silico Biol 5, 93-101.

19. [illegible], W., and Srinubabu, G. (2008) Insight of new tools in glycomics research. J Proteomics Bioinform 1, 374-78.