A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources
BOSC, Stockholm, Sweden
June 27, 2009
Kam D. Dahlquist, Alexandrea Alphonso, Chad Villaflores
Department of Biology
John David N. Dionisio, Derek Smith
Department of Electrical Engineering & Computer Science
http://xmlpipedb.cs.lmu.edu
Loyola Marymount University
Outline
• Motivation
-- GenMAPP
-- Project requirements
• XMLPipeDB Implementation
-- XSD-to-DB
-- UniProtDB and GODB
-- XMLPipeDB Utilities
-- GenMAPP Builder
• Lessons Learned
-- How robust is our system to changes in XML formats?
-- How well does our system work with other common bioinformatics XML formats?
How GenMAPP Works
http://www.GenMAPP.org
• Graphics tools make MAPPs that store gene IDs and vector coordinates for all graphical objects
• Separate Expression Dataset files store data and color-coding instructions
• Gene Databases store IDs, annotation, and hyperlinks to public gene and protein databases
• MAPPFinder performs Gene Ontology over-representation analysis
• Stand-alone program implemented in Visual Basic; accessory files are Microsoft Access databases
Maintaining and Updating GenMAPP Gene Databases has been a Bottleneck for Development
• Microarrays use different gene ID systems for annotation; users want as much information as possible.
• We need to capture and reliably relate gene data from different sources and keep the data updated.
• Gene Database design is data-driven; it tells GenMAPP what gene ID systems and relationships are present.
• Current GenMAPP Gene Databases are built from Ensembl as the main data source.
-- limited to (mostly) animal species
-- sensitive to changes in flat file formats
XMLPipeDB: A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources
Requirements:
• to create Gene Databases for other species (bacteria/plants) using UniProt as the main data source
• to be robust to changes in source file formats
• to use XML sources wherever possible
• to take advantage of existing open source tools
• to limit the manual manipulation of the data
Data sources required for a minimal GenMAPP Gene Database:
• UniProt XML (complete proteome sets from Integr8)
• Gene Ontology OBO-XML
• GOA gene association files (also from Integr8)
XMLPipeDB Use Case Diagram
XSD-to-DB is based on Hyperjaxb2
• Reads an XSD or DTD
• Automatically generates:
-- SQL schema
-- Java classes
-- Hibernate mappings
-- Apache Ant build.xml file
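To make the schema-generation step concrete, here is a minimal sketch of the idea in Python. It is not how XSD-to-DB or Hyperjaxb2 work internally (those are Java tools and also emit classes, Hibernate mappings, and a build file); the sample XSD, the `XSD_TO_SQL` type table, and the function name `xsd_to_create_table` are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical, deliberately tiny mapping from XSD built-in types to SQL types.
# Matching on the literal "xs:" prefix is a simplification for this sketch.
XSD_TO_SQL = {
    "xs:string": "VARCHAR",
    "xs:int": "INTEGER",
    "xs:integer": "INTEGER",
    "xs:date": "DATE",
    "xs:boolean": "BOOLEAN",
}

# Invented sample schema: one top-level element with two simple children.
SAMPLE_XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="gene">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="symbol" type="xs:string"/>
        <xs:element name="length" type="xs:int"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def xsd_to_create_table(xsd_text):
    """Return one CREATE TABLE statement per top-level element in the XSD."""
    root = ET.fromstring(xsd_text)
    statements = []
    for element in root.findall(f"{XS}element"):
        columns = []
        path = f"{XS}complexType/{XS}sequence/{XS}element"
        for child in element.findall(path):
            # Unknown types fall back to VARCHAR in this sketch.
            sql_type = XSD_TO_SQL.get(child.get("type"), "VARCHAR")
            columns.append(f"{child.get('name')} {sql_type}")
        name = element.get("name")
        statements.append(f"CREATE TABLE {name} ({', '.join(columns)});")
    return statements


if __name__ == "__main__":
    for statement in xsd_to_create_table(SAMPLE_XSD):
        print(statement)
```

A real schema has nested types, attributes, and references between elements, which is why the tool chain delegates this work to an existing open source generator rather than a hand-written mapping like this one.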
UniProtDB and GODB Required Only Nominal Post-processing
• XML names that become table or column names cannot be SQL reserved words
• Datatypes must be supported in SQL
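The naming constraint can be illustrated with a short Python sketch. The slides do not describe the actual post-processing applied to UniProtDB and GODB, so everything here is an assumption: the reserved-word list is a small hand-picked subset, and `safe_sql_name` and its `_field` suffix are hypothetical.

```python
# Small, hand-picked subset of SQL reserved words (assumption for this sketch;
# a real list would come from the target database's documentation).
SQL_RESERVED = {"order", "group", "table", "select", "end", "user", "references"}


def safe_sql_name(xml_name, suffix="_field"):
    """Return a name usable as a table or column identifier.

    Names that collide with a reserved word get a suffix appended;
    all other names pass through unchanged.
    """
    if xml_name.lower() in SQL_RESERVED:
        return xml_name + suffix
    return xml_name


if __name__ == "__main__":
    for name in ["accession", "order", "End"]:
        print(name, "->", safe_sql_name(name))
```

Renaming at the schema level keeps the generated SQL valid without having to quote identifiers in every query.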
XMLPipeDB Utilities are Reusable
• XML files are broken down into 25-record chunks for import
• TallyEngine counts records in XML and relational database
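The two utility behaviors above can be sketched together in Python. This is an illustration only, not the XMLPipeDB Utilities code: an in-memory SQLite database stands in for PostgreSQL, the `<entry>` records are generated placeholders, and the names `chunked`, `tally`, and `demo` are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import islice


def chunked(records, size=25):
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def tally(xml_text, tag, connection, table):
    """Count `tag` elements in the XML and rows in `table`; return both."""
    root = ET.fromstring(xml_text)
    xml_count = sum(1 for _ in root.iter(tag))
    # The table name is interpolated directly; acceptable only in a sketch.
    db_count = connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return xml_count, db_count


def demo(record_count=60):
    """Import placeholder records chunk by chunk, then tally both sides."""
    body = "".join(
        f"<entry><accession>P{i:05d}</accession></entry>"
        for i in range(record_count)
    )
    xml_text = f"<uniprot>{body}</uniprot>"

    connection = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
    connection.execute("CREATE TABLE entry (accession TEXT)")

    chunk_sizes = []
    entries = ET.fromstring(xml_text).findall("entry")
    for chunk in chunked(entries, 25):
        rows = [(e.findtext("accession"),) for e in chunk]
        connection.executemany("INSERT INTO entry VALUES (?)", rows)
        connection.commit()  # one commit per chunk
        chunk_sizes.append(len(chunk))

    return chunk_sizes, tally(xml_text, "entry", connection, "entry")


if __name__ == "__main__":
    print(demo())
```

Matching counts on both sides give a quick sanity check that no chunk was dropped during import.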
GenMAPP Builder then produces…
GenMAPP Gene Databases
• Escherichia coli K12
• Arabidopsis thaliana
• Vibrio cholerae
• Plasmodium falciparum
Workflow for Interdisciplinary Undergraduate Student Projects
• Created new species profiles for Vibrio and Plasmodium
• Re-analyzed published microarray datasets
How robust is our system to changes in XML formats?
• Data-driven GenMAPP Gene Database design allowed our system to pick up RefSeq and NCBI Gene IDs “for free” from cross-references in UniProt XML
• The UniProt and GO XML schemas have each changed twice during GenMAPP Builder development, and the changes were handled mostly automatically
• However, XML sources need to keep their own XSDs updated!
• Each new species does require additional coding to handle the vagaries of its own gene ID system
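The "for free" point about cross-references can be sketched in Python as follows. This is not GenMAPP Builder's code. It assumes the UniProt XML layout of `<dbReference type="..." id="..."/>` elements inside each `<entry>`; the accession and IDs in the sample are invented placeholders, and `collect_cross_references` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"up": "http://uniprot.org/uniprot"}

# Invented sample entry; the accession and all IDs are placeholders.
SAMPLE = """\
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P00000</accession>
    <dbReference type="RefSeq" id="NP_000000.1"/>
    <dbReference type="GeneID" id="000000"/>
    <dbReference type="GO" id="GO:0000000"/>
  </entry>
</uniprot>
"""


def collect_cross_references(xml_text):
    """Group (accession, external ID) pairs by the ID system that issued them.

    No ID system is named in the code, so any system that appears in the
    source data is picked up without a code change.
    """
    root = ET.fromstring(xml_text)
    by_system = defaultdict(list)
    for entry in root.findall("up:entry", NS):
        accession = entry.findtext("up:accession", namespaces=NS)
        for ref in entry.findall("up:dbReference", NS):
            by_system[ref.get("type")].append((accession, ref.get("id")))
    return dict(by_system)


if __name__ == "__main__":
    for system, pairs in collect_cross_references(SAMPLE).items():
        print(system, pairs)
```

What this sketch leaves out is the species-specific part: deciding which of the collected IDs is the primary gene ID for a given organism still needs per-species handling.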
How Well Do Bioinformatics XML Formats Perform with XMLPipeDB?
Data Source
XSD-to-DB
Successful creation of PostgreSQL database with automatically generated schema.sql
Successful build of XMLPipeDB Utilities with ant and import of XML data into PostgreSQL
Export Gene Database with GenMAPP Builder
UniProt XML
GO OBO-XML
(Post-processing performed)
KGML (KEGG) N/A
EMBL Nucleotide
EMBL CDS
mzML (Mass spec data)
AGML (2D gel data)
dbSNP (NCBI)
Syntax error with naming or datatypes; would require post-processing
MiniML (NCBI GEO)
GPML (GenMAPP)
BioMart
PDBML
HUP-ML (proteomics data)
PubChem
RNAML
Error: property with the same name is generated from more than one schema component
SBML
MathML
CellML
Multiple other NCBI DTDs
Multiple dependent XSDs or DTDs could not be processed
Acknowledgments
http://xmlpipedb.cs.lmu.edu
Initial Development
Joey Barrett
Joe Boyle
Adam Carasso
David Hoffman
Babak Naffas
Ryan Nakamoto
Jeffrey Nicholas
Roberto Ruiz
Scott Spicer
Current Development
Alexandrea Alphonso
Derek Smith
Chad Villaflores
… and the rest of the undergraduates from the Fall 2008 Biological Databases class
Kam D. Dahlquist [email protected]
John David N. Dionisio [email protected]
http://sourceforge.net/projects/xmlpipedb