araport.org Arabidopsis Information Portal: A new approach to data sharing and cooperative development Matt Vaughn Director, Life Sciences Computing Texas Advanced Computing Center
Arabidopsis Information Portal overview from Plant Biology Europe 2014

May 22, 2015

Matthew Vaughn

An overview of the design, technical decisions, and implementation of the Arabidopsis Information Portal community-extensible data sharing and analytics platform.
Transcript
Page 1: Arabidopsis Information Portal overview from Plant Biology Europe 2014

araport.org

Arabidopsis Information Portal: A new approach to data sharing and cooperative development

Matt Vaughn
Director, Life Sciences Computing
Texas Advanced Computing Center

Page 2: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Overview

• Rationale for the AIP
• Strategic objectives
• Current state of the platform
• Data federation architecture
• Immediate future plans
• How you can participate

Page 3: Arabidopsis Information Portal overview from Plant Biology Europe 2014

The Rationale for AIP

• Loss of TAIR as a publicly funded shared resource for data mining and basic bioinformatics
• Centralization as a key contributing factor
  – Loading of new data into database
  – Development of new user experience
  – Curation and annotation
  – Community support mission
• AIP is designed to be de-centralized

Page 4: Arabidopsis Information Portal overview from Plant Biology Europe 2014

IAIC Workshop Design

Page 5: Arabidopsis Information Portal overview from Plant Biology Europe 2014

AIP Proposed Architecture

Page 6: Arabidopsis Information Portal overview from Plant Biology Europe 2014

The AIP Strategy (1)

• Objectives
  – Develop a community web resource
    • Sustainably fundable and community-extensible
    • Hosts diverse analysis & visualization tools + user data spaces
  – Support federation to integrate diverse data sets from distributed data sources
  – Maintain the Col-0 gold standard annotation
• Methods
  – Assimilate TAIR10 data
  – Host an Arabidopsis InterMine
  – Develop a strategy to allow federation
  – Offer and consume well-designed RESTful web services
  – Interoperate with iPlant (and other projects) wherever possible

Page 7: Arabidopsis Information Portal overview from Plant Biology Europe 2014

The AIP Strategy (2)

• Key Design Decisions
  – Centralized (but powerful) data warehousing capability PLUS infrastructure enabling data federation
  – JBrowse as a genome browser platform
  – WebApollo + Tripal for community annotation
  – App store model for graphical data interfaces (complete with 3rd party developer path)
  – Data store model for data sources
  – Accessible languages and frameworks
  – Secure & modern single sign-on
  – Web service access to Arabidopsis data for powerful bioinformatics
  – Geo-replication and high availability
  – Code re-use from other projects wherever possible
  – Full code release in real time via GitHub

Page 8: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Araport Bill of Materials

• AIP is currently built using
  – InterMine*
  – JBrowse 1.11.3*
  – Drupal 7.25*
    • Developer-oriented content management system
  – Angular.js, Bootstrap.js and other web toolkits
  – Agave Software as a Service platform
    • Developed by the iPlant Collaborative
    • Bulk data, metadata, authentication, HPC app & job management, notifications & events, and more
    • OAuth2 single sign-on
  – Internally developed API manager

*With extensive customization

Page 9: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Page 10: Arabidopsis Information Portal overview from Plant Biology Europe 2014

JBrowse

Currently hosts TAIR10 data; hierarchical track sets coming soon, including updated miRNAs and their targets and epigenomic data from EPIC-CoGe

Page 11: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Page 12: Arabidopsis Information Portal overview from Plant Biology Europe 2014

ThaleMine

Why InterMine?

① There aren’t a lot of real Arabidopsis web services

② InterMine is a scalable, extensible data warehouse

③ InterMine offers a rich, extensible web application

④ InterMine offers high quality REST APIs

⑤ InterMine is used by other MODs

ThaleMine is an Arabidopsis-specific deployment of InterMine

Page 13: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Page 14: Arabidopsis Information Portal overview from Plant Biology Europe 2014

ThaleMine provides enhanced Gene Report functionality

Page 15: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Page 16: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Page 17: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Powerful Search (1)

Page 18: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Powerful Search (2)

Page 19: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Queries can be stored as templates for re-use or modification by you or others (if made public)

Query Builder & Templates

Page 20: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Page 21: Arabidopsis Information Portal overview from Plant Biology Europe 2014

These displays share an AIP web service and are prototypes for AIP Science Apps

Extending the AIP

Page 22: Arabidopsis Information Portal overview from Plant Biology Europe 2014

What is a Science App?

– Written in HTML/CSS/JavaScript using standard frameworks
– Presented via web browser
  • Query or Analyze, Present, Persist
– Developed by AIP and/or the community
  • Deployed in AIP “app store”
  • Choose which ones you want installed in your Araport “dashboard”
– Uses AIP Data Architecture
  • Data services: Local and remote query/retrieval
  • Data integration and aggregation services
  • Computation services

Page 23: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Araport Architecture

[Architecture diagram. Recoverable labels:]
• Clients: CLI clients, scripts, 3rd party applications
• Agave Enterprise Service Bus
  – Agave Services: apps, meta, files, profile, jobs, systems
  – Physical resources: HPC | Files | DB
• Araport API Manager (manage, enroll)
  – Single sign-on
  – Throttling
  – Unified logging
  – API versioning
  – Automatic HTTPS
• API Mediators
  – Simple proxy
  – Mediator
  – Aggregator
  – Filter
• Nodes labeled a b c d e f
• AIP & 3rd party data providers: REST*, REST-like, SOAP, POX, Cambrian CGI

Page 24: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Data API Design Details (1)

• 100% RESTful services
• Queries are JSON objects (conforming to a JSON schema)
• To enroll a new service in API Manager
  – Specify the mapping between AIP query fields and your service
  – Map common query terms to minimal controlled vocabulary
  – Describe all service-specific parameters
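As a rough sketch of the enrollment step described on this slide, the mapping between AIP query fields and a remote service's parameters can be pictured as a dictionary-driven rename of a JSON query. The field names (`locus`, `chromosome`), the remote parameter names (`agi`, `chr`), and the `translate_query` helper are all hypothetical; they do not come from any published AIP schema.

```python
import json

# Hypothetical mapping from AIP query fields to one remote service's parameters.
FIELD_MAP = {"locus": "agi", "chromosome": "chr"}


def translate_query(aip_query_json, field_map=FIELD_MAP):
    """Rename AIP query fields to the remote service's parameter names.

    Fields without a mapping are passed through unchanged, which is how
    service-specific parameters would reach the remote service in this sketch.
    """
    query = json.loads(aip_query_json)
    if not isinstance(query, dict):
        raise ValueError("an AIP query must be a JSON object")
    return {field_map.get(key, key): value for key, value in query.items()}


params = translate_query('{"locus": "AT1G01010", "format": "brief"}')
print(params)  # {'agi': 'AT1G01010', 'format': 'brief'}
```

A real enrollment would presumably also validate the query against its JSON schema before translating it; that step is left out here.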

Page 25: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Data API Details (2)

When field mapping isn’t enough:
• Code-based transformations can be specified via
  – Python
  – Java
  – Ruby
  – JavaScript
• In technical terms, this is known as MEDIATION
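To illustrate what a code-based transformation might look like when a plain field rename is not enough, here is a minimal Python sketch. The hook name `transform_input` and the query fields (`locus`, `start`, `end`, `region`) are assumptions made for the example, not an actual AIP interface. The only outside fact used is the general shape of Arabidopsis AGI locus identifiers (for example AT1G01010).

```python
import re

# Hypothetical mediator hook; names and fields are illustrative only.
AGI_PATTERN = re.compile(r"^AT[1-5CM]G\d{5}$")


def transform_input(query):
    """Rewrite an AIP-style query into what a legacy service might expect."""
    out = dict(query)
    # Normalise case so "at1g01010" and "AT1G01010" are treated alike.
    locus = out.get("locus", "").upper()
    if not AGI_PATTERN.match(locus):
        raise ValueError("not an AGI locus identifier: %r" % locus)
    out["locus"] = locus
    # A legacy service might want one "region" string instead of two fields.
    if "start" in out and "end" in out:
        out["region"] = "%d-%d" % (out.pop("start"), out.pop("end"))
    return out


print(transform_input({"locus": "at1g01010", "start": 100, "end": 200}))
```

The point of the sketch is that mediation can validate, normalise, and restructure a query, which a static key map cannot do.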

Page 26: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Data API Details (3)

• Results returned in a standard Agave JSON format*
  – status, message, result
• Result is an array of JSON objects
• These conform to specific schemas
  – drafts on AIP GitHub soon for comment

*Unless there’s an operational reason not to
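The status/message/result envelope can be sketched in a few lines of Python. The three keys come from the slide; the `"success"` status string, the helper names, and the record fields are assumptions made for illustration.

```python
import json


def make_response(records, status="success", message=""):
    """Wrap a list of record dicts in a status/message/result envelope."""
    return json.dumps(
        {"status": status, "message": message, "result": list(records)}
    )


def read_response(text):
    """Return the result array, raising if the envelope reports a failure."""
    doc = json.loads(text)
    if doc["status"] != "success":  # assumed status value
        raise RuntimeError(doc["message"])
    return doc["result"]


# Hypothetical record; real records would conform to an AIP schema.
envelope = make_response([{"locus": "AT1G01010"}])
print(read_response(envelope))  # [{'locus': 'AT1G01010'}]
```

Keeping the envelope identical across services means a client needs only one piece of unwrapping code, whatever the records inside look like.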

Page 27: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Data API Details (4)

• All Data APIs will implement:
  – Count: How many records found?
  – Pagination: Return only subsets
  – Help: Return a usage page
  – Convert: JSON (native), XML, CSV, etc.
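Three of the four common operations (count, pagination, conversion) can be sketched over an in-memory list of records. The function names, the `offset`/`limit` parameters, and the sample records are assumptions; the slide does not say how these operations are exposed.

```python
import csv
import io
import json


def count(records):
    """Count: how many records were found?"""
    return len(records)


def paginate(records, offset=0, limit=10):
    """Pagination: return only a subset of the records."""
    return records[offset:offset + limit]


def convert(records, fmt="json"):
    """Convert: serialise records as JSON (native) or CSV."""
    if fmt == "json":
        return json.dumps(records)
    if fmt == "csv":
        buf = io.StringIO()
        fields = sorted({key for rec in records for key in rec})
        writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buf.getvalue()
    raise ValueError("unsupported format: %s" % fmt)


# Hypothetical records used only to exercise the helpers.
records = [{"locus": "AT1G01010", "score": 1}, {"locus": "AT1G01020", "score": 2}]
print(count(records))
print(convert(paginate(records, offset=1, limit=1), "csv"))
```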

Page 28: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Araport Data Federation Architecture

Objectives: Facile development by end users; simple, secure deployment to AIP systems; reasonable performance

• Docker.io for packaging
• Ultra-portable dev environment
• Wide language support
• Implicit security model
• Scales horizontally for performance
• Data API is a package of metadata + a Docker file registered with a central arbiter service
• Also used for services written natively for AIP

[Diagram labels: AGAVE, API MANAGER]

https://github.com/waltermoreira/apim

Page 29: Arabidopsis Information Portal overview from Plant Biology Europe 2014

End result: Araport Data API Store

Page 30: Arabidopsis Information Portal overview from Plant Biology Europe 2014

End result: Araport Data API Store

curl -X GET -k -v -L -b cookies "https://api.araport.org/store/site/blocks/api/listing/ajax/list.jag?action=getAllPublishedAPIs"

{ "apis": [ {"name":"InteractionBrowser", "provider":"vaughn", "version":"pr2-0.1", "context":"/data/BioAnalyticResource/interactionBrowser", "status":"Deployed", "thumbnailurl":"images/api-default.png", "visibility":null, "visibleRoles":null, "description":"InteractionBrowser", "apiOwner":"vaughn", "isAdvertiseOnly":false},
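A client could pick out the deployed APIs from a listing like the one above with a few lines of Python. The response on the slide is truncated, so the `SAMPLE` string below is a closed-off copy using only fields that appear in it; the `deployed_apis` helper is a hypothetical name.

```python
import json

# Closed-off sample built from fields visible in the truncated listing;
# it is illustrative and not the full response.
SAMPLE = """{"apis": [{"name": "InteractionBrowser", "provider": "vaughn",
 "version": "pr2-0.1",
 "context": "/data/BioAnalyticResource/interactionBrowser",
 "status": "Deployed", "isAdvertiseOnly": false}]}"""


def deployed_apis(listing_json):
    """Return (name, version, context) for each API marked as Deployed."""
    listing = json.loads(listing_json)
    return [
        (api["name"], api["version"], api["context"])
        for api in listing.get("apis", [])
        if api.get("status") == "Deployed"
    ]


for name, version, context in deployed_apis(SAMPLE):
    print(name, version, context)
```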

Page 31: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Plans for next 3-6 months

• SNP data
• Epigenomic data via CoGe
• RNA-seq for expression and structural annotation
• AraCyc
• Co-expression data
• Orthologs, trees, alignments
• Various genomes & data sets
• Community annotation using Web Apollo and Tripal
• Interactions
• Developer support & training

Page 32: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Relationship between AIP and TAIR

Feature | AIP | TAIR
GBrowse with TAIR10 data | Yes | Yes
JBrowse with TAIR10 data | YES (also embedded in gene-info page) | No
Epigenomic tracks from EPIC | Yes | No
Affymetrix expression data | Yes (from BAR); embedded in gene-info pages | Some but not searchable by locus
Protein interaction data | Yes (from BAR; expansion planned) | Similar data set; view through N-Browse
Gene-info/Locus-detail page (list data types)
gene sequence | Yes | Yes
CDS | Yes | Yes
GO annotation | Yes | Yes
PO and PATO | 8/31/13 legacy data | Yes (8/31/13; some updates)
Curator summary | Yes (TAIR; 8/31/13) | Yes (8/31/13; some updates)
Computational description | Yes (TAIR; 8/31/13) | Yes (8/31/13; some updates)
Literature | Yes; TAIR legacy, UniProt and NCBI | Yes; NCBI + some manual curation
Flexible query interface | Yes | No
Paywall | NO | YES
BLAST services | Soon | Yes
Web services | Yes | No
Data downloads | Yes | Yes
Links to stock centers | In progress | Yes
1001 genomes SNP data | In progress | No
RNA-seq expression data | Soon | No
Updates to Col-0 sequence and annotation | YES | From AIP

As conceived and funded, AIP’s mission was to be a replacement for TAIR, emphasizing computational over human curation and integrating a wider range of data types through federation. With the rebirth of TAIR through a subscription mechanism, the roles of the two data centers in the Arabidopsis data marketplace are still evolving. TAIR will continue its enrichment of Col-0 annotation through literature curation etc. AIP will continue to aggregate and integrate data through a combination of federation via web services and assimilation.

Page 33: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Getting Involved with AIP

• User workshop at upcoming ICAR
• Formal developer engagement begins soon
  – Developer discussion at the ICAR meeting in conjunction with Araport Alpha release
  – SDK and tutorials available thereafter
  – 2-day dev workshop in Austin in Fall* 2014
• For now, send email to [email protected] describing what you’d like to do
  – We’ll reach out to you to discuss feasibility and timelines via video conference

Page 34: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Summary

• Next-generation MOD allowing community participation in its development
• Powerful interactive query and analysis functions available today
• Developing a data federation model
• New data sets and functions coming at a quick pace
• Be on the lookout for participation opportunities

Page 35: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Chris Town, PI
Lisa McDonald, Education and Outreach Coordinator
Chris Nelson, Project Manager
Jason Miller, Co-PI, JCVI Technical Lead
Erik Ferlanti, Software Engineer
Vivek Krishnakumar, Bioinf. Engineer
Svetlana Karamycheva, Bioinf. Engineer
Eva Huala, Project lead, TAIR
Bob Muller, Technical lead, TAIR
Gos Micklem, co-PI
Sergio Contrino, Software Engineer
Matt Vaughn, co-PI
Steve Mock, Portal Engineer
Rion Dooley, API Engineer
Matt Hanlon, Portal Engineer
Maria Kim, Bioinf. Engineer
Ben Rosen, Bioinf. Analyst
Joe Stubbs, API Engineer
Walter Moreira, API Engineer

Page 37: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Page 38: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Araport architecture (2)

API Manager + Enterprise Service Bus

[Architecture diagram. Recoverable labels:]
• Consumer Applications, reached through secure, rationalized REST services
• Mediators: Simple Proxy, Cache, XML-to-JSON, SOAP-to-REST, CGI-to-REST, Throttle
• Back ends: ThaleMine, data integration, other services; Legacy API A; Legacy API B; REST API C
• Features
  – Single sign-on
  – Throttling
  – Unified logging
  – API versioning
  – Mediation and translation
  – Dev-friendly interfaces
  – Rationalized REST for consumer apps

Page 39: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Science Objectives

• Make more, varied data available to the Arabidopsis (and other) communities within a unified user experience
• Enhance the innate value of data by offering enhanced search, retrieval, and display capabilities
• Facilitate analysis of user data
• Enable community participation in functional annotation

Page 40: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Technical Objectives

• Deploy a responsive, flexible community-extensible system
• Provide APIs everywhere!
• Promote and facilitate data integration
• Enable language- and region-specific presentation of scientific content
• Meet mobile computing on its own terms

Page 41: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Local vs. Data-driven Apps

• Local (example: Photoshop Express): Resources are local and inherently offline. Operating on local data using local computing.
• Data-driven (example: KAYAK Pro): Resources are cloud-based and inherently online. Multiple data streams integrated, queried, presented in context of broader objective.

Page 42: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Araport Bill of Materials

• Araport is currently built using
  – Drupal 7.25
    • Developer-oriented content management system
  – Bootstrap.js and some other JavaScript toolkits
  – InterMine (with modifications)
  – Bioinformatics infrastructure + misc. other bits
  – Agave 2.0 Software as a Service platform
    • Developed by iPlant Collaborative project
    • Bulk data, metadata, authentication, HPC app and job management, notifications & events, and more
    • OAuth2 out of the box
    • Enterprise service bus (ESB) architecture
    • http://agaveapi.co/

Page 43: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Araport APIM Architecture (1)

[Architecture diagram. Recoverable labels:]
• Agave WSO2 interface
• Araport API Manager (Enroll, Manage): JSON Query in, JSON Response out
• Two service adapters, each labeled: Input Key Map, Output Key Map, Input Transform, Output Transform, Listen / Respond, Send / Listen
• Cache (Technology TBD)
• ElasticSearch
• Remote Services: POLYMORPH CGI (Form, CSV); SNP by Locus REST; Indel by Position REST

Page 44: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Araport Architecture: Use Cases (1)

• 1001 Genomes POLYMORPH tools
  – Provides variation data via locus or positional search
  – Total of seven variant types available for search
  – Search parameterization depends a lot on variant type
  – Example of a plain-text CGI service
  – Returns results as CSV with named columns
• Objective: Transform into a RESTful API that expects and returns rationalized JSON

http://polymorph.weigelworld.org
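The core of the stated objective, turning CSV with named columns into JSON-ready records, can be sketched with Python's standard `csv` module. The sample CSV, its column names, and the `key_map` renames are made up for the example and are not actual POLYMORPH output.

```python
import csv
import io


def csv_to_records(csv_text, key_map=None):
    """Turn CSV text with a header row into a list of dicts.

    key_map optionally renames columns, in the spirit of an output key map.
    Values stay as strings; type conversion would be a further transform.
    """
    key_map = key_map or {}
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {key_map.get(col, col): val for col, val in row.items()}
        for row in reader
    ]


# Made-up sample; not real service output.
sample = "chr,pos,ref,alt\n1,1234,A,T\n"
print(csv_to_records(sample, {"chr": "chromosome", "pos": "position"}))
```

The resulting list could then be placed in the `result` array of the standard response envelope.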

Page 45: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Araport Architecture: Use Cases (2)

• ThaleMine
  – Has native REST interface for general queries
  – Has templates which can form basis of specific services
• Objective: Offer both InterMine-native and AIP-conformant interfaces as Data APIs
• Current path
  – Enroll native services in our APIM
  – Develop template-based AIP-conformant services

http://polymorph.weigelworld.org

Page 46: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Data APIs: Getting Started

Service | Queries | Notes
BAR eFP | Locus |
BAR Expressologs | Locus |
BAR Interactions | Locus |
CoGe | Position | Special case: output transform only
NASC $SERVICE | Locus | SOAP based but may be offline permanently
OrthologFinder | Locus | Based on a ThaleMine template
POLYMORPH | Locus, Position | Actually seven CGI services
SUBA3 | Locus |

Compiling example queries, parameter mapping and description, and ideal results for use in implementing the system

Page 47: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Developing a Data API

• In order of preference, we would like you to have ready
  • Well-documented REST
  • Moderately well-documented REST
  • SOAP services (plus WSDL or WADL)
  • Plain Old XML
  • Plaintext CGI
  • HTML CGI
  • No web services at all
• Work with us to enroll your services as a data source. This will involve a minor amount of coding.

Page 48: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Computational App Model (1)

[Diagram. Recoverable labels: Host file systems; Host OS; Docker.io; CentOS 6.4; custom-repo; Container; /scratch/; database; araport-compute-00; araport-storage-00; Host FS (250 GB); TACC Corral (PB+); sftp; Agave apps, data, jobs; REST API x JSON objects]

Page 49: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Science Apps: Grid View

• Current Scheme
  • 2-3 column view with draggable apps
  • Apps are normal, full-size, or collapsed
  • Single app screen
• Later in 2014
  • N x X grid scheme implementing resizable app “tiles” like one sees in Android or Win8.x
  • App SDK libraries will have “help” for enabling resizable design
  • Multiple app screens

Page 50: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Data API Details (2)

• For service-specific parameters
  – Provide human-readable names mapped to original parameter names
  – Offer minimal descriptive text
  – Specify validation
    • Cardinality
    • Pattern validator (regex)
    • Type (number, string, etc.)
  – Indicate whether required
  – Indicate whether they should be visible in a UI
  – Specify reasonable default values
• Seem familiar?
  – This approach is used to abstract command line apps
  – Allows automatic generation of minimally functional UI
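The parameter description outlined on this slide maps naturally onto a small record type with a validation method. The sketch below is one possible reading; the `Parameter` class, its attribute names, and the example parameters are hypothetical and are not taken from the AIP or Agave code.

```python
import re
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Parameter:
    name: str                      # original parameter name on the remote service
    label: str                     # human-readable name
    description: str = ""          # minimal descriptive text
    kind: type = str               # type (number, string, etc.)
    pattern: Optional[str] = None  # pattern validator (regex)
    max_values: int = 1            # cardinality
    required: bool = False
    visible: bool = True           # should a generated UI show it?
    default: Any = None

    def validate(self, values):
        """Return the accepted list of values, or raise ValueError."""
        if not values:
            if self.default is not None:
                return [self.default]
            if self.required:
                raise ValueError("%s is required" % self.label)
            return []
        if len(values) > self.max_values:
            raise ValueError("%s accepts at most %d value(s)" % (self.label, self.max_values))
        for value in values:
            if not isinstance(value, self.kind):
                raise ValueError("%s must be of type %s" % (self.label, self.kind.__name__))
            if self.pattern and not re.fullmatch(self.pattern, str(value)):
                raise ValueError("%s does not match %s" % (self.label, self.pattern))
        return list(values)


locus = Parameter(name="agi", label="Locus", pattern=r"AT[1-5CM]G\d{5}", required=True)
print(locus.validate(["AT1G01010"]))  # ['AT1G01010']
```

Because each parameter carries a label, a type, a default, and a visibility flag, a form generator could build a minimal input UI from a list of such records without knowing anything else about the service.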

Page 51: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Data APIs: Response types (1)

• locus_relationship – pairwise relationship between A and B
  – Directionality
  – Type
  – Array of scores (weights, etc.)
• sequence_feature – positional attribute
  – Extension of GFF model plus
  – Build
  – Attributes array
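One way to picture these two response types is as small record classes. The attribute names below are guesses made for illustration: the slide names only the concepts (directionality, type, scores, build, attributes), and the `SequenceFeature` fields borrow a subset of the standard GFF columns.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class LocusRelationship:
    """Pairwise relationship between locus A and locus B."""
    locus_a: str
    locus_b: str
    directed: bool           # directionality
    kind: str                # relationship type
    scores: List[Dict[str, float]] = field(default_factory=list)


@dataclass
class SequenceFeature:
    """Positional attribute: a subset of GFF columns plus build and attributes."""
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    strand: str
    build: str               # genome build the coordinates refer to
    attributes: List[Dict[str, str]] = field(default_factory=list)


# Hypothetical values, used only to show the serialised shape.
rel = LocusRelationship("AT1G01010", "AT1G01020", True, "interaction",
                        [{"confidence": 0.9}])
print(asdict(rel))
```

`asdict` yields plain dictionaries, so records like these could go straight into a JSON `result` array.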

Page 52: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Data APIs: Response types (2)

• locus_feature – key-value attributes per locus
  – Optional controlled vocabulary* for keys
  – Support for both slots and arrays
• raw – for returning images or other binary formats
  – Source and other metadata carried in X-headers instead of JSON result
  – Outbound transformation still supported
  – Not a preferred response mode
• text – returning either native service response or a non-conformant JSON document
  – Source and other metadata carried in X-headers instead of JSON result
  – Not a preferred response mode

Page 53: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Data API Details (6)

• Transparent caching will compensate for transient remote service failures
• Automatic indexing of certain response types via ElasticSearch, allowing for sophisticated global search
  – ElasticSearch allows us to index everything we “know about” and return it quickly
  – iPlant uses it to live-index >700 TB user data
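The idea of caching that masks transient remote failures can be sketched as a wrapper that falls back to the last good response when a fetch raises. This is a generic in-memory illustration, not the AIP cache (whose technology another slide lists as TBD); the `FallbackCache` class and its parameters are hypothetical.

```python
import time


class FallbackCache:
    """Serve the last good response when the remote call fails."""

    def __init__(self, fetch, ttl=300.0, clock=time.monotonic):
        self._fetch = fetch    # callable taking a key and returning a value
        self._ttl = ttl        # seconds a cached value counts as fresh
        self._clock = clock
        self._store = {}       # key -> (timestamp, value)

    def get(self, key):
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]      # fresh cached copy, no remote call
        try:
            value = self._fetch(key)
        except Exception:
            if hit is not None:
                return hit[1]  # stale copy masks a transient failure
            raise              # nothing cached, so the failure is surfaced
        self._store[key] = (now, value)
        return value
```

A caller would wrap whatever function queries the remote service, for example `FallbackCache(query_remote_service)`, where `query_remote_service` stands for the caller's own function. Returning stale data is a trade-off: it keeps apps responsive during short outages at the cost of possibly outdated results.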

Page 54: Arabidopsis Information Portal overview from Plant Biology Europe 2014

Developing an app

• Understand and document the user stories you’re addressing with your app
• Identify all requisite data sources AND
• Help us prepare them as Data APIs
  – This may involve coding
• Understand the data integration or aggregation needs of your app
  – This may involve coding
• Develop the user interface(s) for your app using our tool kits and suggested practices
  – This will involve coding.
  – But you will learn tools like jQuery, Bootstrap, & D3 and will thus be eminently employable!