Quick Intro to InterMine within AIP and MTGD - JCVI Research Works-in-Progress Meeting

Post on 17-Jul-2015


InterMine: Integrated Data Warehouse

Use Cases: Arabidopsis & Medicago Genome Projects

Vivek Krishnakumar, Plant Genomics Group (EUK)

IFX Research WIPS Meeting, 03 October 2014

Overview

• Introduction

• InterMine
    – Integrated data warehouse, extensible data model, flexible query system
    – Web and programmatic interface
    – Other InterMine instances

• Use cases
    – Arabidopsis Information Portal (AIP)
    – Medicago truncatula Genome Database (MTGD)

• Summary
    – Advantages
    – Caveats

Introduction

For genome projects that wish to expose their data via the web (query, visualize, warehouse) to foster scientific collaboration, there are several technologies available:

• JCVI-developed software
    – Manatee (backed by an RDBMS)

• Externally developed software
    – BioMart (federated from various databases)
    – Tripal (powered by Drupal, backed by a CHADO database)
    – InterMine

InterMine

• Functions as a data warehouse for the integration of complex biological data. Integration across data types occurs based on a common identifier (e.g. gene primary ID)

• Uses a flexible and extensible data model, controlled by XML files, driven by ontologies (Sequence Ontology [SO], Gene Ontology [GO], etc.)
    – Genomics, Proteomics, Interactions, Homology, Expression, Pathways (and more data types)
    – Parsers for commonly used biological data formats
    – Provides a framework for adding your own data

• Offers a flexible query system, optimized via precomputed tables (no need for schema denormalization)

Smith, RN. et al. InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data. Bioinformatics (2012) 28 (23): 3163-3165
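The idea of integrating data types on a common identifier can be sketched in a few lines of Python. This is only an illustration of merging on a shared key, not InterMine's own integration code (which is written in Java and is considerably more involved). The function name `integrate`, the field name `primary_id`, and all records below are hypothetical and invented for this example.

```python
from collections import defaultdict


def integrate(*sources):
    """Merge records from several sources on a shared primary identifier."""
    merged = defaultdict(dict)
    for source in sources:
        for record in source:
            key = record["primary_id"]
            # Each source contributes its own fields; a field that is
            # already present is kept as-is (first source wins).
            for field, value in record.items():
                merged[key].setdefault(field, value)
    return dict(merged)


# Hypothetical records from three data types, invented for illustration.
genes = [{"primary_id": "GENE_0001", "symbol": "abc1"}]
expression = [{"primary_id": "GENE_0001", "tissue": "root", "level": 12.5}]
pathways = [{"primary_id": "GENE_0001", "pathway": "example pathway"}]

warehouse = integrate(genes, expression, pathways)
print(warehouse["GENE_0001"])
```

Because every source is keyed on the same identifier, a single lookup returns the gene together with its expression and pathway fields, which is the "one-stop" view a warehouse aims for.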

InterMine (contd.)

• Provides a user-friendly web interface exposing powerful features:
    – Analysis of lists (facilitates enrichment studies)
    – Full-featured report pages (one-stop shop)
    – Interactive result tables (sort, filter, summarize)
    – Visual query builder (no need to write SQL!)
    – Quick search and region-based search

• Fosters development of external applications using data hosted within InterMine via Application Programming Interfaces (APIs):
    – RESTful web services
    – Client libraries: Perl, Python, Ruby, Java, JavaScript

Kalderimis, A. et al. InterMine: extensive web services for modern biology. Nucl. Acids Res. (1 July 2014) 42 (W1): W468-W472
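As a rough sketch of what programmatic access over the RESTful interface might look like, the snippet below builds (but does not send) a query URL using only the Python standard library. Several things here are assumptions rather than facts from the talk: the `/service/query/results` path, the `query`, `format` and `size` parameters, the `genomic` model name, and the `Gene.primaryIdentifier` and `Gene.symbol` paths. The base address is the ThaleMine URL given later in this talk, and `AT1G01010` is used purely as an example locus identifier. The helper `build_query_url` is hypothetical; the official client libraries wrap these details.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

# Base address from the ThaleMine slide; the "/service" suffix is an
# assumption about how the web services are mounted.
BASE_URL = "https://apps.araport.org/thalemine/service"


def build_query_url(views, path, op, value, size=10):
    """Build a results URL for a one-constraint query (hypothetical helper)."""
    # quoteattr() escapes the value and wraps it in quotes for use as an
    # XML attribute.
    query_xml = (
        f"<query model=\"genomic\" view={quoteattr(' '.join(views))}>"
        f"<constraint path={quoteattr(path)} op={quoteattr(op)} "
        f"value={quoteattr(value)}/>"
        "</query>"
    )
    params = {"query": query_xml, "format": "json", "size": size}
    # Assumed endpoint: <base>/query/results
    return f"{BASE_URL}/query/results?{urlencode(params)}"


url = build_query_url(
    views=["Gene.primaryIdentifier", "Gene.symbol"],
    path="Gene.primaryIdentifier",
    op="=",
    value="AT1G01010",  # example identifier, for illustration only
)
print(url)
```

The resulting URL could then be fetched with any HTTP client. Building the query as XML mirrors what the visual query builder produces, so a query designed in the web interface can in principle be reused from a script.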

Public “Mines”

• InterMine supports querying across mines for cross-database integration

• A large number of warehouses powered by InterMine already exist

Arabidopsis Information Portal (AIP)

• AIP origins
    – Funded by NSF in response to community needs, following termination of funding to TAIR

• AIP objectives
    – Develop a community web resource that…
        – is sustainable, fundable and community-extensible
        – hosts analysis & visualization tools, user data spaces
    – Federation: integrate diverse data sets from distributed data sources; foster development of tools for and by the community
    – Maintenance of the Col-0 gold standard annotation

• AIP methods
    – Assimilate TAIR data
    – Host an InterMine instance devoted to Arabidopsis (thale cress)
    – Offer and consume RESTful web services
    – Integrate and utilize iPlant resources

ThaleMine: https://apps.araport.org/thalemine

• An InterMine interface to Arabidopsis genomic data

• Integrates a wide variety of data types (A-E, H), some of which are warehoused and others are federated via web services

• Embedded elements visualizing gene structure (JBrowse, not shown), interaction networks (F), expression patterns (G)

Visual Query Builder

Image created by Benjamin Rosen (Bioinformatics Analyst, Plant Genomics Group)

Images created by Benjamin Rosen (Bioinformatics Analyst, Plant Genomics Group)

Interactive Result Tables

Region-based search

MedicMine: http://medicmine.jcvi.org

• NSF-funded project to assist with the curation of the Medicago truncatula Genome Assembly and Annotation (funding ended August 2014)

• To warehouse the project data and keep it available beyond the funding period, an InterMine interface for Medicago was implemented (backed by a CHADO database)

• Provides functionality similar to that available via ThaleMine

Summary

• Advantages
    – InterMine is a powerful biological data warehouse
    – Performs complex data integration
    – Allows fast and flexible querying
    – Well-documented programmatic interface
    – Cookie-cutter, user-friendly web interface
    – Facilitates cross-talk between “mines”

• Caveats
    – Adding more data requires a full database rebuild (incremental loading is not possible) because of the integration step

• About InterMine
    – Developed by the Micklem Lab at the University of Cambridge, UK
    – Written in Java, backed by PostgreSQL, deployed under Tomcat
    – Documentation and downloads available at http://www.intermine.org

Chris Town, PI
Lisa McDonald, Education and Outreach Coordinator
Chris Nelson, PM
Jason Miller, Co-PI, Technical Lead
Erik Ferlanti, SE
Vivek Krishnakumar, BE
Svetlana Karamycheva, BE
Maria Kim, BE
Ben Rosen, BA

Eva Huala, Project lead, TAIR
Bob Muller, Technical lead, TAIR

Gos Micklem, co-PI
Sergio Contrino, Software Engineer

Matt Vaughn, co-PI
Steve Mock, Advanced Computing Interfaces
Rion Dooley, Web and Cloud Services
Matt Hanlon, Web and Mobile Applications
