A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources BOSC Stockholm, Sweden June 27, 2009 Kam D. Dahlquist Alexandrea Alphonso Chad Villaflores Department of Biology John David N. Dionisio Derek Smith Department of Electrical Engineering & Computer Science http://xmlpipedb.cs.lmu.edu Loyola Marymount University
A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources
BOSC, Stockholm, Sweden
June 27, 2009
Kam D. Dahlquist, Alexandrea Alphonso, Chad Villaflores
Department of Biology
John David N. Dionisio, Derek Smith
Department of Electrical Engineering & Computer Science
http://xmlpipedb.cs.lmu.edu
Loyola Marymount University
Outline
• Motivation
-- GenMAPP
-- Project requirements
• XMLPipeDB Implementation
-- XSD-to-DB
-- UniProtDB and GODB
-- XMLPipeDB Utilities
-- GenMAPP Builder
• Lessons Learned
-- How robust is our system to changes in XML formats?
-- How well does our system work with other common bioinformatics XML formats?
How GenMAPP Works
http://www.GenMAPP.org
• Graphics tools make MAPPs that store gene IDs and vector coordinates for all graphical objects
• Separate Expression Dataset files store data and color-coding instructions
• Gene Databases store IDs, annotation, and hyperlinks to public gene and protein databases
• Stand-alone program implemented in Visual Basic; accessory files are Microsoft Access databases
Maintaining and Updating GenMAPP Gene Databases has been a Bottleneck for Development
• Microarrays use different gene ID systems for annotation; users want as much information as possible.
• We need to capture and reliably relate gene data from different sources and keep the data updated.
• Gene Database design is data-driven; it tells GenMAPP what gene ID systems and relationships are present.
• Current GenMAPP Gene Databases are built from Ensembl as the main data source.
-- limited to (mostly) animal species
-- sensitive to changes in flat file formats
XMLPipeDB: A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources
Requirements:
• to create Gene Databases for other species (bacteria/plants) using UniProt as the main data source
• to be robust to changes in source file formats
• to use XML sources wherever possible
• to take advantage of existing open source tools
• to limit the manual manipulation of the data
Data sources required for a minimal GenMAPP Gene Database:
• UniProt XML (complete proteome sets from Integr8)
• Gene Ontology OBO-XML
• GOA gene association files (also from Integr8)
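To make the role of the UniProt XML source concrete, here is a minimal sketch of how gene IDs from several ID systems can be read out of one entry and related to its accession. It is not part of XMLPipeDB, which is built on Java tooling; Python is used here only for brevity. The fragment is hand-written in the shape of a UniProt XML entry, and the accession, gene names, and cross-reference IDs are placeholder values, not real data. The function name `id_pairs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment shaped like a UniProt XML entry.
# All IDs below are placeholders chosen for illustration only.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00001</accession>
    <name>EXAMPLE_ECOLI</name>
    <gene>
      <name type="primary">exaA</name>
      <name type="ordered locus">b0001</name>
    </gene>
    <dbReference type="GO" id="GO:0000001"/>
    <dbReference type="RefSeq" id="NP_000001.1"/>
  </entry>
</uniprot>"""

# UniProt XML elements live in this namespace.
NS = {"u": "http://uniprot.org/uniprot"}


def id_pairs(xml_text):
    """Return (accession, ID system, ID) tuples for every entry in the XML."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("u:entry", NS):
        # The first <accession> child is used as the entry's accession.
        accession = entry.findtext("u:accession", namespaces=NS)
        # Gene names carry their own ID system in the "type" attribute.
        for gene_name in entry.findall("u:gene/u:name", NS):
            rows.append((accession, gene_name.get("type"), gene_name.text))
        # Cross-references to other databases (GO, RefSeq, ...).
        for ref in entry.findall("u:dbReference", NS):
            rows.append((accession, ref.get("type"), ref.get("id")))
    return rows


for row in id_pairs(SAMPLE):
    print(row)
```

Because elements are looked up by name rather than by position, extra elements added to the format later would not disturb this kind of lookup, which is the robustness the requirements above ask for.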
XMLPipeDB Use Case Diagram
XSD-to-DB is based on Hyperjaxb2
• Reads an XSD or DTD
• Automatically generates: