InterMine Integrated Data Warehouse Use Cases: Arabidopsis & Medicago Genome Projects Vivek Krishnakumar Plant Genomics Group (EUK) IFX Research WIPS Meeting, 03 October 2014
Jul 17, 2015
InterMine Integrated Data Warehouse
Use Cases: Arabidopsis & Medicago Genome Projects
Vivek Krishnakumar
Plant Genomics Group (EUK)
IFX Research WIPS Meeting, 03 October 2014
Overview
• Introduction
• InterMine
  – Integrated data warehouse, extensible data model, flexible query system
  – Web and programmatic interface
  – Other InterMine instances
• Use cases
  – Arabidopsis Information Portal (AIP)
  – Medicago truncatula Genome Database (MTGD)
• Summary
  – Advantages
  – Caveats
Introduction
For genome projects that wish to expose their data via the web (query, visualize, warehouse) to foster scientific collaboration, there are several technologies available:
• JCVI-developed software
  – Manatee (backed by an RDBMS)
• Externally developed software
  – BioMart (federated from various databases)
  – Tripal (powered by Drupal, backed by a Chado database)
  – InterMine
InterMine
• Functions as a data warehouse for the integration of complex biological data. Integration across data types is based on a common identifier (e.g. gene primary ID)
• Uses a flexible and extensible data model, controlled by XML files and driven by ontologies (Sequence Ontology [SO], Gene Ontology [GO], etc.)
  – Genomics, proteomics, interactions, homology, expression, pathways (and more data types)
  – Parsers for commonly used biological data formats
  – Provides a framework for adding your own data
• Offers a flexible query system, optimized via precomputed tables (no need for schema denormalization)
Smith, R.N. et al. InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data. Bioinformatics (2012) 28 (23): 3163-3165
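The idea of integrating data types on a common identifier can be illustrated with a toy example. This is only a minimal sketch, not InterMine's actual integration code: the `integrate` function, the `primary_id` field name, and all record values are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def integrate(*sources):
    """Merge records from several sources, keyed on a shared primary identifier.

    Assumption: every record is a dict carrying a 'primary_id' field.
    """
    merged = defaultdict(dict)
    for records in sources:
        for record in records:
            key = record["primary_id"]
            # Earlier sources win: a field already present is not overwritten.
            for field, value in record.items():
                merged[key].setdefault(field, value)
    return dict(merged)


# Hypothetical records from two data types (annotation and expression).
genes = [{"primary_id": "GENE0001", "symbol": "abc1"}]
expression = [{"primary_id": "GENE0001", "tissue": "root", "level": 12.5}]

warehouse = integrate(genes, expression)
print(warehouse["GENE0001"])
```

Keying every record on one identifier lets a single lookup return fields that originated in different sources, which is the property the slide describes.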
InterMine (contd.)
• Provides a user-friendly web interface exposing powerful features:
  – Analysis of lists (facilitates enrichment studies)
  – Full-featured report pages (one-stop shop)
  – Interactive result tables (sort, filter, summarize)
  – Visual query builder (no need to write SQL!)
  – Quick search and region-based search
• Fosters development of external applications using data hosted within InterMine via Application Programming Interfaces (APIs):
  – RESTful web services
  – Client libraries: Perl, Python, Ruby, Java, JavaScript
Kalderimis, A. et al. InterMine: extensive web services for modern biology. Nucl. Acids Res. (1 July 2014) 42 (W1): W468-W472
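As a rough illustration of programmatic access, the sketch below builds a request URL for the RESTful interface using only the Python standard library. It makes several assumptions that are not stated on the slides: that the service root is the mine URL plus `/service`, that query results are served from `/query/results`, that the query is passed as PathQuery XML in a `query` parameter, and that the model is named `genomic`. The helper `build_query_url` and the symbol value `EXAMPLE` are hypothetical. No request is sent.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlencode, urlparse


def build_query_url(service_root, root_class, views, constraints, fmt="json"):
    """Build a results URL carrying a PathQuery-style XML query (assumed format)."""
    query = ET.Element("query", {
        "model": "genomic",  # assumed default model name
        "view": " ".join(f"{root_class}.{v}" for v in views),
    })
    for path, op, value in constraints:
        ET.SubElement(query, "constraint", {
            "path": f"{root_class}.{path}",
            "op": op,
            "value": value,
        })
    xml = ET.tostring(query, encoding="unicode")
    return f"{service_root}/query/results?" + urlencode({"query": xml, "format": fmt})


url = build_query_url(
    "https://apps.araport.org/thalemine/service",  # assumed service root
    "Gene",
    ["primaryIdentifier", "symbol"],
    [("symbol", "=", "EXAMPLE")],  # placeholder value
)

# Decode the URL again to show what a server would receive.
params = parse_qs(urlparse(url).query)
print(params["query"][0])
```

In practice the language-specific client libraries wrap this step, so users would normally not assemble the XML by hand.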
Public “Mines”
• InterMine supports querying across mines for cross-database integration
• Many warehouses powered by InterMine already exist
Arabidopsis Information Portal (AIP)
• AIP origins
  – Funded by NSF in response to community needs, following termination of funding to TAIR
• AIP objectives
  – Develop a community web resource that:
    – is sustainable, fundable and community-extensible
    – hosts analysis & visualization tools and user data spaces
  – Federation: integrate diverse data sets from distributed data sources; foster development of tools for and by the community
  – Maintain the Col-0 gold standard annotation
• AIP methods
  – Assimilate TAIR data
  – Host an InterMine instance devoted to Arabidopsis (thale cress)
  – Offer and consume RESTful web services
  – Integrate and utilize iPlant resources
ThaleMine
https://apps.araport.org/thalemine
• An InterMine interface to Arabidopsis genomic data
• Integrates a wide variety of data types (A-E, H), some of which are warehoused and others are federated via web services
• Embedded elements visualizing gene structure (JBrowse, not shown), interaction networks (F), expression patterns (G)
Visual Query Builder
Image created by Benjamin Rosen (Bioinformatics Analyst, Plant Genomics Group)
Images created by Benjamin Rosen (Bioinformatics Analyst, Plant Genomics Group)
Interactive Result Tables
Region-based search
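The logic behind a region-based search can be sketched as an interval-overlap test. This is an illustrative example only, not InterMine's implementation: the `Feature` type, the `Chr1:400..1000` region syntax, and the feature names and coordinates are all assumptions made for the sketch.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def parse_region(text):
    """Parse a region written as 'chrom:start..end' (assumed syntax)."""
    chrom, span = text.split(":")
    start, end = span.split("..")
    return chrom, int(start), int(end)


def features_in_region(features, region):
    """Return features on the same chromosome that overlap the region."""
    chrom, start, end = parse_region(region)
    return [
        f for f in features
        if f.chrom == chrom and f.start <= end and f.end >= start
    ]


# Hypothetical features; names and coordinates are illustrative only.
features = [
    Feature("geneA", "Chr1", 100, 500),
    Feature("geneB", "Chr1", 900, 1500),
    Feature("geneC", "Chr2", 100, 500),
]

hits = features_in_region(features, "Chr1:400..1000")
print([f.name for f in hits])
```

Two closed intervals overlap exactly when each one starts no later than the other ends, which is the condition used in the list comprehension.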
MedicMine
http://medicmine.jcvi.org
• NSF-funded project to assist with the curation of the Medicago truncatula genome assembly and annotation (funding ended August 2014)
• To warehouse the project data and prolong its availability, an InterMine interface for Medicago was implemented (backed by a Chado database)
• Provides functionality similar to that available via ThaleMine
Summary
• Advantages
  – InterMine is a powerful biological data warehouse
  – Performs complex data integration
  – Allows fast and flexible querying
  – Well-documented programmatic interface
  – Cookie-cutter, user-friendly web interface
  – Facilitates cross-talk between “mines”
• Caveats
  – Adding more data requires a full database rebuild (incremental loading is not possible) because of the integration step
• About InterMine
  – Developed by the Micklem Lab at the University of Cambridge, UK
  – Written in Java, backed by PostgreSQL, deployed under Tomcat
  – Documentation and downloads available at http://www.intermine.org
Chris Town, PI
Lisa McDonald, Education and Outreach Coordinator
Chris Nelson, PM
Jason Miller, Co-PI, Technical Lead
Erik Ferlanti, SE
Vivek Krishnakumar, BE
Svetlana Karamycheva, BE
Maria Kim, BE
Ben Rosen, BA
Eva Huala, Project lead, TAIR
Bob Muller, Technical lead, TAIR
Gos Micklem, co-PI
Sergio Contrino, Software Engineer
Matt Vaughn, co-PI
Steve Mock, Advanced Computing Interfaces
Rion Dooley, Web and Cloud Services
Matt Hanlon, Web and Mobile Applications