A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources BOSC Stockholm, Sweden June 27, 2009 Kam D. Dahlquist Alexandrea Alphonso Chad Villaflores Department of Biology John David N. Dionisio Derek Smith Department of Electrical Engineering & Computer Science http://xmlpipedb.cs.lmu.edu Loyola Marymount University
Page 1: Dahlquist_XMLPipedB_BOSC2009

A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources

BOSC, Stockholm, Sweden

June 27, 2009

Kam D. Dahlquist, Alexandrea Alphonso, Chad Villaflores
Department of Biology

John David N. Dionisio, Derek Smith
Department of Electrical Engineering & Computer Science

http://xmlpipedb.cs.lmu.edu

Loyola Marymount University

Page 2: Dahlquist_XMLPipedB_BOSC2009

Outline

• Motivation
-- GenMAPP
-- Project requirements

• XMLPipeDB Implementation
-- XSD-to-DB
-- UniProtDB and GODB
-- XMLPipeDB Utilities
-- GenMAPP Builder

• Lessons Learned
-- How robust is our system to changes in XML formats?
-- How well does our system work with other common bioinformatics XML formats?

Page 3: Dahlquist_XMLPipedB_BOSC2009

How GenMAPP Works

http://www.GenMAPP.org

• Graphics tools make MAPPs that store gene IDs and vector coordinates for all graphical objects

• Separate Expression Dataset files store data and color-coding instructions

• Gene Databases store IDs, annotation, and hyperlinks to public gene and protein databases

• MAPPFinder performs Gene Ontology over-representation analysis

• Stand-alone program implemented in Visual Basic, accessory files are Microsoft Access databases

Page 4: Dahlquist_XMLPipedB_BOSC2009

Maintaining and Updating GenMAPP Gene Databases has been a Bottleneck for Development

• Microarrays use different gene ID systems for annotation; users want as much information as possible.

• We need to capture and reliably relate gene data from different sources and keep the data updated.

• Gene Database design is data-driven; it tells GenMAPP what gene ID systems and relationships are present.

• Current GenMAPP Gene Databases are built from Ensembl as the main data source.
-- limited to (mostly) animal species
-- sensitive to changes in flat file formats

Page 5: Dahlquist_XMLPipedB_BOSC2009

XMLPipeDB: A Reusable, Open Source Tool Chain for Building Relational Databases from XML Sources

Requirements:
• to create Gene Databases for other species (bacteria/plants) using UniProt as the main data source
• to be robust to changes in source file formats
• to use XML sources wherever possible
• to take advantage of existing open source tools
• to limit the manual manipulation of the data

Data sources required for a minimal GenMAPP Gene Database:
• UniProt XML (complete proteome sets from Integr8)
• Gene Ontology OBO-XML
• GOA gene association files (also from Integr8)

Page 6: Dahlquist_XMLPipedB_BOSC2009

XMLPipeDB Use Case Diagram

Page 7: Dahlquist_XMLPipedB_BOSC2009

XSD-to-DB is based on Hyperjaxb2

• Reads an XSD or DTD
• Automatically generates:
-- SQL schema
-- Java classes
-- Hibernate mappings
-- Apache Ant build.xml file
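
To make the idea of generating a SQL schema from an XSD concrete, here is a minimal Python sketch. It is not how XSD-to-DB or Hyperjaxb2 works internally (those are Java tools that also generate classes, Hibernate mappings, and a build file); the type mapping, the `xsd_to_create_table` function, and the sample schema are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical mapping from a few XSD built-in types to SQL column types.
XSD_TO_SQL = {
    "xs:string": "TEXT",
    "xs:int": "INTEGER",
    "xs:date": "DATE",
}


def xsd_to_create_table(xsd_text):
    """Emit one CREATE TABLE statement per top-level named complexType."""
    root = ET.fromstring(xsd_text)
    statements = []
    for ctype in root.findall(f"{XS}complexType"):
        columns = ["id SERIAL PRIMARY KEY"]
        for elem in ctype.iter(f"{XS}element"):
            # Unknown types fall back to TEXT in this toy version.
            sql_type = XSD_TO_SQL.get(elem.get("type"), "TEXT")
            columns.append(f"{elem.get('name')} {sql_type}")
        statements.append(
            f"CREATE TABLE {ctype.get('name')} ({', '.join(columns)});"
        )
    return statements


# Invented sample schema, not the real UniProt XSD.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="protein">
    <xs:sequence>
      <xs:element name="accession" type="xs:string"/>
      <xs:element name="length" type="xs:int"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

for statement in xsd_to_create_table(SAMPLE_XSD):
    print(statement)
```

A real generator also has to handle nested types, attributes, and relationships between tables, which this sketch ignores.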

Page 8: Dahlquist_XMLPipedB_BOSC2009

UniProtDB and GODB Required Only Nominal Post-processing

• XML element and attribute names cannot be SQL reserved words

• Datatypes must be supported in SQL
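
The naming half of this post-processing can be pictured as a rename step applied to XML names before they become table or column names. The sketch below is an assumption about what such a step might look like, not the project's actual procedure; `SQL_RESERVED` is a small illustrative subset of reserved words, and `sql_safe_name` and its `_field` suffix are hypothetical.

```python
# Small illustrative subset of SQL reserved words (not a complete list).
SQL_RESERVED = {"order", "group", "table", "select", "user", "end"}


def sql_safe_name(xml_name, suffix="_field"):
    """Return a name usable as a SQL identifier (hypothetical helper).

    Hyphens are replaced with underscores, and names that collide with a
    reserved word (case-insensitively) get a suffix appended.
    """
    name = xml_name.replace("-", "_")
    if name.lower() in SQL_RESERVED:
        return name + suffix
    return name


for original in ["accession", "end", "Group", "db-reference"]:
    print(original, "->", sql_safe_name(original))
```

Whatever renaming rule is chosen has to be applied consistently to the schema and to the mappings, so that imported data still lands in the renamed columns.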

Page 9: Dahlquist_XMLPipedB_BOSC2009

XMLPipeDB Utilities are Reusable

• XML files are broken down into 25-record chunks for import

• TallyEngine counts records in XML and relational database
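
The two utilities above can be sketched together: import records in fixed-size chunks, then compare the record count in the XML with the row count in the database. This is a simplified Python illustration, not the XMLPipeDB Utilities code. It assumes a tiny namespace-free sample document, uses an in-memory SQLite database as a stand-in for PostgreSQL, and the names `iter_chunks` and `tally_xml` are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import islice


def iter_chunks(xml_text, record_tag, chunk_size=25):
    """Yield lists of at most chunk_size record elements."""
    records = iter(ET.fromstring(xml_text).findall(record_tag))
    while True:
        chunk = list(islice(records, chunk_size))
        if not chunk:
            return
        yield chunk


def tally_xml(xml_text, record_tag):
    """Count the record elements directly under the root."""
    return len(ET.fromstring(xml_text).findall(record_tag))


# Invented sample: 60 empty records under one root element.
SAMPLE = (
    "<uniprot>"
    + "".join(f"<entry id='{i}'/>" for i in range(60))
    + "</uniprot>"
)

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute("CREATE TABLE entry (id TEXT)")

chunk_sizes = []
for chunk in iter_chunks(SAMPLE, "entry"):
    conn.executemany(
        "INSERT INTO entry (id) VALUES (?)",
        [(record.get("id"),) for record in chunk],
    )
    chunk_sizes.append(len(chunk))

xml_count = tally_xml(SAMPLE, "entry")
db_count = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]

print(chunk_sizes)            # [25, 25, 10]
print(xml_count == db_count)  # True
```

Matching counts do not prove the import is correct, but a mismatch is a cheap early signal that records were dropped or duplicated.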

Page 10: Dahlquist_XMLPipedB_BOSC2009

GenMAPP Builder then produces…

Page 11: Dahlquist_XMLPipedB_BOSC2009

GenMAPP Gene Databases

• Escherichia coli K12
• Arabidopsis thaliana
• Vibrio cholerae
• Plasmodium falciparum

Page 12: Dahlquist_XMLPipedB_BOSC2009

Workflow for Interdisciplinary Undergraduate Student Projects

• Created new species profiles for Vibrio and Plasmodium

• Re-analyzed published microarray datasets

Page 13: Dahlquist_XMLPipedB_BOSC2009

How robust is our system to changes in XML formats?

• Data-driven GenMAPP Gene Database design allowed our system to pick up RefSeq and NCBI Gene IDs “for free” from cross-references in UniProt XML

• The UniProt and GO XML schemas have each changed twice during GenMAPP Builder development, and were handled mostly automatically

Page 14: Dahlquist_XMLPipedB_BOSC2009

How robust is our system to changes in XML formats?

• However, XML sources need to keep their own XSDs updated!

• Each new species does require additional coding to handle the vagaries of its own gene ID system
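
Picking up ID systems "for free" can be pictured as follows: rather than hard-coding which ID systems exist, scan the cross-references in the data and let whatever types appear define them. The Python sketch below assumes a simplified, namespace-free fragment modeled on UniProt's `entry`, `accession`, and `dbReference` elements; the accessions and IDs are invented, and `collect_id_systems` is a hypothetical helper rather than part of GenMAPP Builder.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample data; real UniProt XML is namespaced and far richer.
SAMPLE = """<uniprot>
  <entry>
    <accession>P00001</accession>
    <dbReference type="RefSeq" id="NP_000001.1"/>
    <dbReference type="GeneID" id="100001"/>
  </entry>
  <entry>
    <accession>P00002</accession>
    <dbReference type="RefSeq" id="NP_000002.1"/>
  </entry>
</uniprot>"""


def collect_id_systems(xml_text):
    """Map each cross-reference type to its (accession, id) pairs."""
    systems = defaultdict(set)
    for entry in ET.fromstring(xml_text).iter("entry"):
        accession = entry.findtext("accession")
        for ref in entry.iter("dbReference"):
            systems[ref.get("type")].add((accession, ref.get("id")))
    return dict(systems)


id_systems = collect_id_systems(SAMPLE)
for system in sorted(id_systems):
    print(system, sorted(id_systems[system]))
```

With this shape, a new cross-reference type in the source shows up as a new key without code changes, while species-specific gene ID quirks still need dedicated handling, as the last bullet notes.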

Page 15: Dahlquist_XMLPipedB_BOSC2009

How Well Do Bioinformatics XML Formats Perform with XMLPipeDB?

Data Source XSD-to-DB

Successful creation of PostgreSQL database with automatically generated schema.sql

Successful build of XMLPipeDB Utilities with ant and import of XML data into PostgreSQL

Export Gene Database with GenMAPP Builder

UniProt XML

GO OBO-XML

(Post-processing performed)

KEML (KEGG) N/A

EMBL Nucleotide

EMBL CDS

mzML (Mass spec data)

AGML (2D gel data)

dbSNP (NCBI)

Syntax error with naming or datatypes; would require post-processing

MiniML (NCBI GEO)

GPML (GenMAPP)

BioMart

PDBML

HUP-ML (proteomics data)

PubChem

RNAML

Error: property with the same name is generated from more than one schema component

SBML

MathML

CellML

Multiple other NCBI DTDs

Multiple dependent XSDs or DTDs could not be processed

Page 16: Dahlquist_XMLPipedB_BOSC2009

Acknowledgments

http://xmlpipedb.cs.lmu.edu

Initial Development
Joey Barrett
Joe Boyle
Adam Carasso
David Hoffman
Babak Naffas
Ryan Nakamoto
Jeffrey Nicholas
Roberto Ruiz
Scott Spicer

Current Development
Alexandrea Alphonso
Derek Smith
Chad Villaflores
… and the rest of the undergraduates from the Fall 2008 Biological Databases class

Kam D. Dahlquist

John David N. Dionisio

http://sourceforge.net/projects/xmlpipedb