GUS: The Genomics Unified Schema
A Platform for Genomics Databases
V. Babenko, B. Brunk, J. Crabtree, S. Diskin, S. Fischer, G. Grant, Y. Kondrahkin, L. Li, J. Liu, J. Mazzarelli, D. Pinney, A. Pizarro, E. Manduchi, S. McWeeney, J. Schug, C. Stoeckert
Center for Bioinformatics, University of Pennsylvania
stevef,[email protected]
Abstract
The Genomics Unified Schema (GUS) is a strongly typed relational database schema and accompanying portable object-based software platform used for integration, analysis, curation, mining and presentation of sequence-based genomics information. The schema is organized into five domains: a detailed model of the central dogma (gene, RNA, protein) including DNA, assembled RNA, and protein sequence, and a diversity of sequence annotation (DoTS); an MGED-compliant warehouse of transcript expression experiments (RAD); a catalogue of grammars describing regulatory regions (TESS); a wide range of controlled vocabularies and ontologies (SRES); and a detailed representation of data provenance (CORE). (A sixth domain for protein expression is in progress.) GUS’s normalized relational structure and extent of integrated data enable powerful queries not viable in many other genomics systems. The platform facilitates maintenance of the warehouse and its utilization in web and data mining applications.
Goals of GUS
  Generic platform for model-organism or disease-specific databases
  Freely available at www.gusdev.org and www.cbil.upenn.edu
  Integration of genome, transcript and protein data, including:
    Sequence
    Function
    Expression
    Interaction
    Regulation
    Orthologs and paralogs
  Support for:
    automated annotation and integration
    manual curation
    data mining/analysis and sophisticated queries
    web access
  Applications
    Annotator’s interface
    Parsers and exporters (using standards)
    Annotation and analysis programs
    Schema browser
  Utilizes Oracle 9i
Architecture of GUS
[Figure: architecture diagram. Labels: data sources (genomic sequence; GSSs and ESTs; microarray and SAGE experiments; mapping data; GenBank, InterPro, GO, etc.; annotation; QTL, POP, SNP, clinical); automated analysis and integration; Oracle/SQL schema domains (DoTS, RAD, TESS, Core, SRes); object layer; Annotator’s Interface; data mining applications; WWW queries, browsing and download via Java servlets and Perl CGI.]
Usage of GUS
  Annotation
    Of genomes: gene models, sequence features
    Of genes: function, expression, regulation
  Integration
    From sequence to expression
    Map identifiers to/from external databases
  Data mining, creating curated datasets
    Algorithm-based: GO function prediction
    Genome-wide querying: find all pancreas-specific transcripts
    PancChip: non-redundant genes expressed in pancreas found using
And vocabularies
  External database names
  Genetic codes
  Review status

Evidence trail
  Evidence and tracking
    Data tables have columns for user, date, project, algorithm invocation
    Tables dedicated to algorithm, algorithm version and parameters
    176 algorithms, including public and in-house
    Tracks automated and manual annotation, similarity and integration
  Versioning
    All updated or deleted rows are copied to version table
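The tracking and versioning behaviour described above can be illustrated with a small, self-contained sketch. It is not the GUS schema or the GUS code: the table names (gene, gene_ver), the column names and the function update_gene_name are hypothetical, and SQLite stands in for Oracle so that the example runs anywhere. It assumes only what the slide states: data rows carry user, date, project and algorithm-invocation columns, and a row is copied to a version table before it is changed.

```python
import sqlite3
from datetime import date

# In-memory SQLite database as a stand-in for Oracle (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (                  -- hypothetical data table
    gene_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    row_user TEXT NOT NULL,          -- who wrote the row
    row_date TEXT NOT NULL,          -- when it was written
    row_project TEXT NOT NULL,       -- project the row belongs to
    row_alg_invocation_id INTEGER NOT NULL  -- program run that wrote it
);
CREATE TABLE gene_ver (              -- hypothetical version table
    gene_id INTEGER, name TEXT, row_user TEXT, row_date TEXT,
    row_project TEXT, row_alg_invocation_id INTEGER
);
""")


def update_gene_name(conn, gene_id, new_name, user, project, invocation_id):
    """Copy the current row to the version table, then overwrite it."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "INSERT INTO gene_ver SELECT * FROM gene WHERE gene_id = ?",
            (gene_id,),
        )
        conn.execute(
            "UPDATE gene SET name = ?, row_user = ?, row_date = ?, "
            "row_project = ?, row_alg_invocation_id = ? WHERE gene_id = ?",
            (new_name, user, date.today().isoformat(), project,
             invocation_id, gene_id),
        )


# Invented example data: an automated load followed by a manual correction.
conn.execute(
    "INSERT INTO gene VALUES (1, 'geneA', 'pipeline', '2002-01-15', 'DoTS', 7)"
)
update_gene_name(conn, 1, "geneA (curated)", "annotator", "DoTS", 8)

current = conn.execute("SELECT name, row_user FROM gene").fetchall()
history = conn.execute(
    "SELECT name, row_user, row_alg_invocation_id FROM gene_ver"
).fetchall()
```

After the update, current holds the curated row and history holds the superseded row together with the invocation that originally wrote it, which is the kind of evidence trail the slide describes.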
Sophisticated queries
Sample queries from three projects that utilize GUS’s data integration and analysis:
  www.allgenes.org
    “Is my cDNA similar to any mouse genes that are predicted to encode transcription factors and have been localized to mouse chromosome 5?”
  http://plasmodb.org
    “List all genes whose proteins are predicted to contain a signal peptide and for which there is evidence that they are expressed in Plasmodium falciparum’s late schizont stage.”
  www.cbil.upenn.edu/EPConDB
    “Which genes on chromosome 2 are expressed in pancreas and are involved in signal transduction based on GO function assignments?”
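To show why an integrated, normalized schema makes such questions tractable, the sketch below phrases the third sample query as a single SQL join. The three tables (gene, expression, go_assignment), their columns and the data are invented for illustration and are far simpler than the real GUS tables; SQLite is used only so the example is runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, heavily simplified tables (not the GUS schema).
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, chromosome TEXT);
CREATE TABLE expression (gene_id INTEGER, tissue TEXT);
CREATE TABLE go_assignment (gene_id INTEGER, go_term TEXT);

INSERT INTO gene VALUES (1, 'geneA', '2'), (2, 'geneB', '2'), (3, 'geneC', '5');
INSERT INTO expression VALUES (1, 'pancreas'), (2, 'liver'), (3, 'pancreas');
INSERT INTO go_assignment VALUES
    (1, 'signal transduction'),
    (2, 'signal transduction'),
    (3, 'signal transduction');
""")

# Location, expression and function live in one database,
# so the question becomes one join instead of three separate lookups.
QUERY = """
SELECT DISTINCT g.symbol
FROM gene g
JOIN expression e ON e.gene_id = g.gene_id
JOIN go_assignment f ON f.gene_id = g.gene_id
WHERE g.chromosome = ? AND e.tissue = ? AND f.go_term = ?
ORDER BY g.symbol
"""

hits = [row[0] for row in
        conn.execute(QUERY, ("2", "pancreas", "signal transduction"))]
```

With the invented data, only geneA satisfies all three conditions: geneB is on chromosome 2 but expressed in liver, and geneC is expressed in pancreas but lies on chromosome 5.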
Application Frameworks
GUS Object layer
  Lightweight Perl implementation
  Java on the way
  One object per table
  Parent/child relationships
  Cascading delete
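The object-layer ideas listed above (one class per table, parent/child links, cascading delete) can be sketched as follows. The actual GUS object layer is written in Perl; this Python version, including the class names Row, Gene, RNA and Protein and the method names, is a hypothetical illustration of the pattern rather than a port of that code.

```python
class Row:
    """Base class for table objects: one subclass per table (assumption)."""

    table = None  # table name, set by each subclass

    def __init__(self, **columns):
        self.columns = columns
        self.parent = None
        self.children = []
        self.deleted = False

    def add_child(self, child):
        """Link a child row (e.g. an RNA to its Gene) and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def mark_deleted(self):
        """Cascading delete: children are removed before their parent.

        Returns the table names in the order the rows would be deleted.
        """
        order = []
        for child in self.children:
            order.extend(child.mark_deleted())
        self.deleted = True
        order.append(self.table)
        return order


class Gene(Row):
    table = "Gene"


class RNA(Row):
    table = "RNA"


class Protein(Row):
    table = "Protein"


# Invented example: a gene with one RNA, which has one protein.
gene = Gene(symbol="geneA")
rna = gene.add_child(RNA(name="geneA transcript 1"))
protein = rna.add_child(Protein(name="geneA protein"))

delete_order = gene.mark_deleted()
```

Deleting children first mirrors what a relational database with foreign keys requires: a parent row cannot be removed while rows that reference it still exist.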
Data input
The GusApplication program manages inserts and updates to GUS, handling tracking and versioning.
Specific tasks are implemented as plugins. Plugins use either GUS objects or SQL access. Low-level database access is provided by DBI classes.
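The division of labour described above, with an application that owns tracking and plugins that implement specific tasks, might look roughly like the following. Everything here is an assumption made for illustration: the class names Application, Plugin, LoadGeneNames and Invocation, the submit callback and the row format are invented, and the real GusApplication is a Perl program whose interface is not reproduced here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Invocation:
    """Record of one plugin run: what ran, with which parameters, by whom."""
    plugin_name: str
    parameters: dict
    user: str
    started: str
    rows_written: list = field(default_factory=list)


class Plugin:
    """Base class: a plugin implements one specific loading or analysis task."""
    name = "base"

    def run(self, params, submit):
        raise NotImplementedError


class LoadGeneNames(Plugin):
    """Hypothetical plugin that submits one Gene row per symbol."""
    name = "LoadGeneNames"

    def run(self, params, submit):
        for symbol in params["symbols"]:
            submit({"table": "Gene", "symbol": symbol})


class Application:
    """Runs plugins and stamps every submitted row with tracking data."""

    def __init__(self, user):
        self.user = user
        self.invocations = []

    def execute(self, plugin, params):
        invocation = Invocation(
            plugin.name, dict(params), self.user,
            datetime.now(timezone.utc).isoformat(),
        )
        invocation_id = len(self.invocations)  # index this run will receive

        def submit(row):
            # The plugin never sets tracking columns; the application does.
            stamped = dict(row, row_user=self.user,
                           invocation_id=invocation_id)
            invocation.rows_written.append(stamped)

        plugin.run(params, submit)
        self.invocations.append(invocation)
        return invocation


app = Application(user="curator")
run = app.execute(LoadGeneNames(), {"symbols": ["geneA", "geneB"]})
```

Keeping the tracking logic in one place means a plugin author only writes the task itself, and every row still ends up attributable to a user and a specific run.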
[Figure: GusApplication diagram. Labels: GusApplication; Plugin; Object superclasses; objects for the RAD, TESS, DoTS, Core and SRes domains; DBI; SQL.]
Pipeline
  Perl API for defining annotation pipelines
  Supports sequential protocols
  Distributes compute-intensive work to compute cluster
  Used for 90-stage pipeline to build DoTS transcript index
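A minimal sketch of a sequential pipeline with one distributed stage is given below. The real API is Perl and is not shown on the slide, so the Pipeline class, its stage and run methods, and the example stages are all hypothetical. A local thread pool stands in for the compute cluster, which is an assumption made only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


class Pipeline:
    """Runs named stages in the order they were declared."""

    def __init__(self):
        self.stages = []
        self.log = []  # names of the stages that have completed

    def stage(self, name, func, distributed=False):
        """Declare a stage; distributed stages are applied item by item."""
        self.stages.append((name, func, distributed))
        return self

    def run(self, data, workers=4):
        for name, func, distributed in self.stages:
            if distributed:
                # Stand-in for farming work out to a cluster (assumption).
                with ThreadPoolExecutor(max_workers=workers) as pool:
                    data = list(pool.map(func, data))  # order is preserved
            else:
                data = func(data)
            self.log.append(name)
        return data


def drop_short(seqs):
    """Whole-dataset stage: discard sequences shorter than 4 bases."""
    return [s for s in seqs if len(s) >= 4]


def reverse_complement(seq):
    """Per-sequence stage, the kind of work that can be distributed."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


pipeline = (
    Pipeline()
    .stage("drop_short", drop_short)
    .stage("reverse_complement", reverse_complement, distributed=True)
    .stage("sort", sorted)
)
result = pipeline.run(["ACGTT", "AC", "GGGCA"])
```

The log of completed stage names is included because a long protocol generally benefits from knowing where a run stopped; whether and how the GUS pipeline records this is not stated in the source.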
Web
  Servlet- and CGI-based design (JSP on the way)
  Automatic generation of HTML FORMs
    Automated input checking
    Integrated help features
    INPUT elements populated from the database
  Query history facility
  Boolean queries (AND, OR, SUBTRACT)
  Declarative configuration file
  Base system is relatively independent of GUS
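The query history and Boolean query features can be thought of as set operations over the identifiers returned by earlier queries. The sketch below illustrates that idea only; the QueryHistory class, its method names and the numeric gene identifiers are invented, and the source does not describe how the web framework implements these features.

```python
import operator


class QueryHistory:
    """Keeps each query's result as a set so results can be combined later."""

    OPERATORS = {
        "AND": operator.and_,       # in both results (intersection)
        "OR": operator.or_,         # in either result (union)
        "SUBTRACT": operator.sub,   # in the left result but not the right
    }

    def __init__(self):
        self.entries = []

    def record(self, description, ids):
        """Store a result and return its 1-based history number."""
        self.entries.append((description, frozenset(ids)))
        return len(self.entries)

    def combine(self, left, op, right):
        """Combine two earlier results; the outcome is itself recorded."""
        left_desc, left_ids = self.entries[left - 1]
        right_desc, right_ids = self.entries[right - 1]
        combined = self.OPERATORS[op](left_ids, right_ids)
        return self.record(f"({left_desc}) {op} ({right_desc})", combined)

    def result(self, number):
        return sorted(self.entries[number - 1][1])


# Invented identifiers for two earlier queries.
history = QueryHistory()
q1 = history.record("expressed in pancreas", {101, 102, 103})
q2 = history.record("on chromosome 2", {102, 103, 104})

both = history.result(history.combine(q1, "AND", q2))
either = history.result(history.combine(q1, "OR", q2))
only_first = history.result(history.combine(q1, "SUBTRACT", q2))
```

Because each combination is recorded as a new history entry, a combined result can itself be an operand of a later Boolean query.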
  GO functional assignment
  Expression analysis (PaGE)
  Anatomy classification
  Library distribution
  Genes from BLAT of DoTS against genome

DoTS assembly and annotation
  Refresh warehouse
  Cluster and assemble mRNAs/ESTs into putative transcripts
  Annotate transcripts through similarity, GO function and markers
  Integrate previously existing manual curation
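As a rough illustration of the clustering step, the sketch below groups sequences that are linked, directly or transitively, by a pairwise similarity hit. This single-linkage grouping with a union-find structure is a generic stand-in and not the algorithm used to build DoTS, which the source does not specify; the function name, the sequence identifiers and the similarity pairs are invented.

```python
def cluster(ids, similar_pairs):
    """Group ids into clusters connected by at least one similarity link."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in groups.values())


# Invented input: four ESTs, one mRNA, and two similarity hits.
sequences = ["est1", "est2", "est3", "mrna1", "est4"]
hits = [("est1", "est2"), ("est2", "mrna1")]

clusters = cluster(sequences, hits)
```

Here est1 and mrna1 end up in the same cluster even though they share no direct hit, because both are similar to est2; est3 and est4 remain singletons. Each cluster would then be handed to an assembler to produce a consensus sequence.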
DoTS Pipeline
[Figure: DoTS pipeline diagram. Labels: genomic sequence; mRNA/EST sequence; clustering and assembly; DoTS consensus sequences; SIM4 or BLAT; gene predictions (GenScan/HMMer, PHAT); predicted genes; merge genes; gene index; gene/RNA cluster assignment; RNAs; proteins; translation (framefinder); BLAST similarities (BLASTP, BLASTX); protein motifs (PFAM, Smart, ProDom); GO functions; functional predictions; other computed annotation (EPCR, AssemblyAnatomyPercent, Index Key Words, SNP analysis); annotate DoTS; manual annotation tasks.]
References & Acknowledgements
References
Scearce, L.M., Brestelli, J.E., McWeeney, S.K., Lee, C.S., Mazzarelli, J., Pinney, D.F., Pizarro, A., Stoeckert, C.J. Jr., Clifton, S., Permutt, M.A., Brown, J., Melton, D.A., Kaestner, K.H. (2002) Functional Genomics of the Endocrine Pancreas: The Pancreas Clone Set and PancChip, New Resources for Diabetes Research. Diabetes 51: 1997-2004.
Schug, J., Diskin, S., Mazzarelli, J., Brunk, B.P., Stoeckert, C.J. (2002) Predicting Gene Ontology Functions from ProDom and CDD Protein Domains. Genome Res. 12: 648-655.
Bahl, A., Brunk, B., Coppel, R.L., Crabtree, J., Diskin, S.J., Fraunholz, M.J., Grant, G.R., Gupta, D., Huestis, R.L., Kissinger, J.C., Labo, P., Li, L., McWeeney, S.K., Milgram, A.J., Roos, D.S., Schug, J., Stoeckert, C.J. (2002) PlasmoDB: The Plasmodium Genome Resource. An integrated database providing tools for accessing and analyzing mapping, expression and sequence data (both finished and unfinished). Nucleic Acids Res. 30: 87-90.
Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege, F.C.P., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., Vingron, M. (2001) Minimum Information About a Microarray Experiment (MIAME): Toward Standards for Microarray Data. Nature Genetics 29: 365-371.
Manduchi, E., Pizarro, A., Stoeckert, C. (2001) RAD (RNA Abundance Database): an infrastructure for array data analysis. Proc. SPIE 4266: 68-78.
Davidson, S.B., Crabtree, J., Brunk, B.P., Schug, J., Tannen, V., Overton, G.C., Stoeckert, C.J. Jr. (2001) K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources. IBM Systems Journal 40(2): 512-531.
Crabtree, J., Wiltshire, T., Brunk, B., Zhao, S., Schug, J., Stoeckert, C., Bucan, M. (2001) High-resolution BAC-based Map of the Central Portion of Mouse Chromosome 5. Genome Res. 11: 1746-1757.
Acknowledgements
  NIH grant RO1-HG-01539-03
  DOE grant DE-FG02-00ER62893
  Burroughs Wellcome Fund
  NIDDK 56947 and 56954 with cosponsorship from the JDFI
Related posters
114A. Web-Based Biological Discovery using the GUS Integrated Database.
170A. TESS-II: Describing and Finding Gene Regulatory Sequences with Grammars
148A. Integrating Eukaryotic Genomes by Orthologous Groups: What is Unique about Apicomplexan Parasites?