DAS/2: Next Generation Distributed Annotation System. Gregg Helt (1), Steve Chervitz (1), Tony Cox (2), Andrew Dalke (3), Allen Day (4), Ed Erwin (1), Ed Griffiths (2), and Lincoln Stein (4). (1) Affymetrix, Inc.; (2) Sanger Institute; (3) Dalke Scientific; (4) Cold Spring Harbor Laboratory; (5) University of Alabama

Mar 31, 2015


Milton Haylett
Page 1:

DAS/2: Next Generation Distributed Annotation System

Gregg Helt (1), Steve Chervitz (1), Tony Cox (2), Andrew Dalke (3), Allen Day (4), Ed Erwin (1), Ed Griffiths (2), and Lincoln Stein (4)

(1) Affymetrix, Inc.
(2) Sanger Institute
(3) Dalke Scientific
(4) Cold Spring Harbor Laboratory
(5) University of Alabama

Page 2:

Distributed Annotation System (DAS) Overview

A specification designed for sharing genome annotations

Defines client requests and server responses

Simplified Web Services approach: HTTP GET, URLs, XML

Intended to be simple to implement

No central annotation authority

Intended to support client-side integration of annotations from different servers

First draft specification Spring 2000

Last major change to DAS1 was Spring 2002

Grant from NIH awarded June 2004 for development of next-generation DAS/2

Page 3:

DAS: Multiple Servers, Multiple Clients

[Figure: one reference server holding sequences AC003027, AC005122, and M10154, and three annotation servers supplying annotations on those sequences; labels shown include WI1029, AFM820, AFM1126, and WI443.]

Page 4:

Widespread Adoption of DAS/1

Server implementations
– Dazzle, ProServer, LDAS

Server sites
– Ensembl, UCSC, TIGR, KEGG, WormBase, Affymetrix, etc.

Clients
– GBrowse, Ensembl, Dasty, IGB

Libraries
– BioPerl, BioJava, JDAS

DAS extensions
– GeneDAS (non-positional annotations)
– DAS web services registry
– SPICE (protein structures)
– DALEC (asynchronous analysis)

Page 5:

Ensembl is an ensemble of DAS servers

Page 6:

GBrowse on Ensembl

Page 7:

Distributed GBrowse

[Figure: diagram with nodes MyGBrowse, GBrowse 1, GBrowse 2, MODs, Ensembl, and UCSC, joined by links labeled DAS.]

Page 8:

DAS Limitations

No ontology (controlled vocabulary) of feature types.
– Is a “gene” from DAS server 1 the same as a “gene” from DAS server 2?

Not particularly extensible.

Ambiguous semantics for retrieving features that overlap a range on the genome.

Page 9:

Development of DAS/2 Specification

Enhancements have largely been motivated by initial discussions on the DAS mailing list.
– Series of RFCs collected
– Though informal, still a long process!

Most recent DAS/2 draft specification is available at http://biodas.org/documents/das2/das2_protocol.html (tied to CVS repository), so anyone can review and comment

Feedback from the DAS developer and user communities will continue to guide future iterations of the DAS/2 specification

Page 10:

Preserving DAS1 Strengths in DAS/2

Specification is independent of implementation
– Many server implementations
– Many client implementations

Simple, simple, simple
– HTTP for transport
– URLs for queries
– XML for responses
– REST-like style

Ontologies are integral

Focus on location-based annotations of biological sequences

Page 11:

Basic DAS/2 Queries

Sources query: what genomes and versions of those genomes are available?

– http://server/das/genome

Regions query: what annotated sequences are available for a given version of a genome?

– http://server/das/genome/[genome]/[version]/region

Types query: what annotation types are available for a given genome version?

– http://server/das/genome/[genome]/[version]/type

Range query: return all annotations of a given type that overlap a genomic region

– http://server/das/genome/[genome]/[version]/feature?overlaps=[seq/min:max];type=[type]
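
The four query patterns above can be assembled mechanically from a server name, a genome, and a version. The Python sketch below is illustrative only: the function names and the sample values ("human", "17", "chr1", "mRNA") are invented here, and the URL layout is copied from this slide rather than from any final specification. No URL escaping is attempted.

```python
# Hypothetical helpers; only the URL layout comes from the slide.

def sources_url(server):
    """URL of the sources query: which genomes and versions exist."""
    return f"http://{server}/das/genome"

def versioned_url(server, genome, version, resource):
    """URL of a per-version query; resource is 'region', 'type', or 'feature'."""
    return f"{sources_url(server)}/{genome}/{version}/{resource}"

def range_query_url(server, genome, version, seq, start, end, feature_type):
    """URL of a range query for one feature type."""
    base = versioned_url(server, genome, version, "feature")
    # The slide separates the two filters with ';' rather than '&'.
    return f"{base}?overlaps={seq}/{start}:{end};type={feature_type}"

example = range_query_url("server", "human", "17", "chr1", 1000, 2000, "mRNA")
print(example)
```

Keeping the path construction in one helper means the regions, types, and feature queries differ only in their final path segment, which mirrors how the slide presents them.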

Page 12:

DAS/2 Enhancements: Ontologies

All features are required to be described by an ontology
– What is the feature? Gene, mRNA, transposable_element…
– What are attributes of the feature? Polycistronic_mRNA, programmed_frameshift…

Sequence Ontology (SO) is the default (song.sourceforge.net)
– Can be changed & extended
– ~500 terms in all
– Standard OBO format

Feature hierarchy allows features to be contained within others: e.g. gene->mRNA->CDS
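
As a rough illustration of the gene->mRNA->CDS containment just described, a client might hold features in a small tree. This is a minimal Python sketch under stated assumptions: the `Feature` class, its field names, and the sample IDs are hypothetical and are not taken from the DAS/2 XML format.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

@dataclass
class Feature:
    """Hypothetical in-memory feature; field names are illustrative only."""
    feature_id: str
    so_type: str  # a Sequence Ontology term name, e.g. "gene"
    children: List["Feature"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Feature"]]:
        """Yield (depth, feature) for this feature and all of its descendants."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)

# Build the example hierarchy from the slide: gene -> mRNA -> CDS.
cds = Feature("cds1", "CDS")
mrna = Feature("mrna1", "mRNA", [cds])
gene = Feature("gene1", "gene", [mrna])

for depth, feat in gene.walk():
    print("  " * depth + f"{feat.so_type} ({feat.feature_id})")
```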

Page 13:

DAS/2 Enhancements: Performance

One of the biggest complaints about DAS1
– Very verbose annotation XML

DAS/2 Solution #1: Refactoring annotation XML
– Much smaller minimum footprint

DAS/2 Solution #2: Alternative return formats
– All servers can return defined das2xml annotation format
– Servers can also specify additional return formats per annotation type
– Clients can choose from alternative formats if they desire
– Not restricted to XML, or even text
– Examples: GFF3, BED, PSL, GAME
– Extreme performance improvements possible
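
The format choice described in Solution #2 amounts to a small piece of client-side logic. The following Python sketch assumes the client already knows which formats a server offers for each annotation type; the dictionary contents and the function name are invented for illustration, and how a real server advertises its formats is not shown.

```python
# The one format the slide says every server can return.
DEFAULT_FORMAT = "das2xml"

def choose_format(client_preferences, formats_for_type):
    """Return the first client-preferred format offered for a type,
    falling back to das2xml when none of the preferences is offered."""
    for fmt in client_preferences:
        if fmt in formats_for_type:
            return fmt
    return DEFAULT_FORMAT

# Hypothetical per-type format lists, such as a server might advertise.
server_formats = {
    "mRNA": ["das2xml", "GFF3", "PSL"],
    "SNP": ["das2xml", "BED"],
}

print(choose_format(["BED", "GFF3"], server_formats["mRNA"]))  # GFF3
print(choose_format(["PSL"], server_formats["SNP"]))           # das2xml
```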

Page 14:

DAS/2 Enhancements: Resolving Ambiguities

Example: Ambiguous Range Queries

[Figure: a query range x:y and the differing responses of Servers 1 to 4 to the same query.]

Overlap or containment?

Parent based or separate?

Page 15:

DAS/2 Solution #1 – remove spec ambiguity

Specify that if parent meets region filter, also return all children

Specify whether overlap, containment, etc.

Add different region filters for different possibilities
– Overlaps
– Contains
– Within
– Identical

Allow boolean combinations of these and other filters in the query URL
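
One way to read the four region filters and the parent/child rule is as predicates over coordinate ranges. The Python sketch below is an interpretation, not the specification: it assumes half-open (start, end) ranges, reads "contains" as "the feature contains the query range" and "within" as "the feature lies inside the query range", and uses hypothetical class and function names throughout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Range = Tuple[int, int]  # (start, end), half-open: an assumption of this sketch

def overlaps(f: Range, q: Range) -> bool:
    return f[0] < q[1] and f[1] > q[0]

def contains(f: Range, q: Range) -> bool:
    # Assumed reading: the feature contains the whole query range.
    return f[0] <= q[0] and f[1] >= q[1]

def within(f: Range, q: Range) -> bool:
    # Assumed reading: the feature lies entirely inside the query range.
    return q[0] <= f[0] and f[1] <= q[1]

def identical(f: Range, q: Range) -> bool:
    return f == q

@dataclass
class Feature:
    name: str
    span: Range
    children: List["Feature"] = field(default_factory=list)

def descendants(feature):
    """Yield every child, grandchild, and so on of a feature."""
    for child in feature.children:
        yield child
        yield from descendants(child)

def query(top_level, region_filter, q):
    """Apply a region filter to top-level features; a matching parent
    brings all of its descendants along, whether or not they match."""
    hits = []
    for parent in top_level:
        if region_filter(parent.span, q):
            hits.append(parent)
            hits.extend(descendants(parent))
    return hits

exon1 = Feature("exon1", (100, 200))
exon2 = Feature("exon2", (400, 500))
mrna = Feature("mrna1", (100, 500), [exon1, exon2])

names = [f.name for f in query([mrna], overlaps, (150, 180))]
print(names)
```

Because each filter is an ordinary predicate, a boolean combination can be expressed by composing them, for example `lambda f, q: overlaps(f, q) and not within(f, q)`.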

Page 16:

DAS/2 filter spec allows client query optimization

[Figure: query ranges QueryL, QueryC (x:y), and QueryR along a sequence, with bounds L and R marked.]

Keep track of overlap bounds of all previous queries

Instead of filter = “overlaps:S/x:y”, use filter = “overlaps:S/x:y; within:S/L:R”

If annotation A is not contained within L:R, then either:
i) its bounds cross L, in which case it must overlap QueryL
ii) its bounds cross R, in which case it must overlap QueryR
iii) both

Therefore, if the client has used this approach for all previous queries (and restricts other filtering to a single “type” filter), then QueryC will return no annotations that were already returned in a previous query
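
The argument above can be checked with a small simulation. The Python sketch below is a toy model under stated assumptions: ranges are half-open (start, end) pairs, client queries never overlap one another, the server is a stand-in list filter, and all names and sample coordinates are hypothetical. An unbounded side is modeled with infinity; a real client would presumably just omit the "within" filter in that case.

```python
import math

def overlaps(f, q):
    return f[0] < q[1] and f[1] > q[0]

def within(f, q):
    return q[0] <= f[0] and f[1] <= q[1]

# Hypothetical annotations on one sequence, as half-open (start, end) pairs.
ANNOTATIONS = [(50, 150), (120, 380), (210, 260), (290, 420), (90, 450), (500, 600)]

def server_query(overlap_range, within_range):
    """Stand-in for a server applying 'overlaps' AND 'within' filters."""
    return [a for a in ANNOTATIONS
            if overlaps(a, overlap_range) and within(a, within_range)]

class Client:
    def __init__(self):
        self.loaded = []  # ranges of all previous queries
        self.seen = []    # every annotation returned so far

    def load(self, x, y):
        # L: right edge of the nearest earlier query ending at or before x.
        # R: left edge of the nearest earlier query starting at or after y.
        L = max([e for (_, e) in self.loaded if e <= x], default=-math.inf)
        R = min([s for (s, _) in self.loaded if s >= y], default=math.inf)
        result = server_query((x, y), (L, R))
        self.loaded.append((x, y))
        self.seen.extend(result)
        return result

client = Client()
query_l = client.load(100, 200)  # QueryL
query_r = client.load(300, 400)  # QueryR
query_c = client.load(200, 300)  # QueryC, between the two earlier queries
print(query_l, query_r, query_c)
```

In this run each annotation is transferred exactly once, and the three queries together still cover every annotation overlapping 100:400.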

Page 17:

Solution #2: DAS/2 Validation Suite

Verify whether a DAS/2 server is compliant with the specification.
– Critical for improving interoperability between clients and servers developed by different groups.

Standalone tool and web application, written in Python
– Enter a URL for a DAS/2 server
– Get an HTML report about DAS/2 compliance

Reference dataset
– Sequences and annotations that can be loaded into a DAS/2 server for additional validation of server implementation/configuration

Source code available at: http://sourceforge.net/projects/dasypus/

Page 18:

More DAS/2 Spec Enhancements

“Writeback” spec to allow DAS/2 clients to create and edit annotations on DAS/2 servers
– Still undergoing development

IDs are URIs
– Could be LSIDs or URLs
– Allows for integration with many other web technologies
– xml:base

Feature hierarchies

And more…
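
To illustrate the "IDs are URIs" and xml:base points above: when a document declares a base URI, short relative IDs can be expanded into full URIs with standard URI resolution. The Python sketch below uses the standard library's `urljoin`; the base URI and the relative IDs are invented values, not taken from any DAS/2 response.

```python
from urllib.parse import urljoin

# Hypothetical values: neither the base URI nor the relative IDs come from the slides.
xml_base = "http://server/das/genome/human/17/feature/"
relative_ids = ["mrna1", "exon/exon7", "../type/mRNA"]

# Resolve each relative ID against the base, as xml:base processing would.
absolute_ids = [urljoin(xml_base, rel) for rel in relative_ids]
for uri in absolute_ids:
    print(uri)
```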

Page 19:

DAS/2 UML Modeling

Page 20:

DAS/2 Reference Server

Implemented as an Apache/mod_perl 2.0 content handler
– Annotations are converted to BioPerl objects and subsequently text-transformed using Template Toolkit.

Datasources are accessible using an adaptor pattern
– Current adaptor is for Chado (GMOD schema)
– Soon any datasource accessible to the Generic Genome Browser (GBrowse) will be accessible from the DAS/2 server.
   Flatfile formats: GenBank, GFF
   Databases: Ensembl, GMOD/Chado, Bio::DB::GFF
   DAS1 web service

Source code released under Artistic License
– Available via anonymous CVS as part of GMOD
– See http://www.gmod.org for access details.

Page 21:

DAS/2 Reference Client

Implemented in Java in the Integrated Genome Browser
– IGB (“ig-bee”): a visualization app developed at Affymetrix
– Supports data loading via a variety of formats and mechanisms
– Full implementation of DAS/2 read client, partial implementation of DAS/2 writeback.

Handles large amounts of genome-scale data
– Loads hundreds of thousands of sequence annotations at once
– Loads dense quantitative graphs with millions of data points
– Maintains real-time responsiveness to user interactions
– Includes features to support exploratory data analysis
– Plugin architecture for customized extensions

Source code released under Common Public License
– http://genoviz.sourceforge.net

Page 22:

Upcoming DAS/2 Developments

Writeback protocol
– Ready for implementation

Registry and discovery protocol
– Various alternatives have been discussed
– A “playpen server” available at EBI

Page 23:

DAS/2 & caBIG

Project 1: Add DAS/2 support to caCORE
– Will enable caCORE to read genome annotations from DAS/2 servers and re-export as caCORE objects.
– Uses a flexible plug-in architecture that will be generally useful.

Project 2: Export HapMap database as DAS/2
– Will make HapMap human variation data available to caBIG grid via caCORE.

Project 3: Export Vertebrate Promoter Database as DAS/2
– Will make curated information on vertebrate transcription factors and their binding sites available to caBIG grid via caCORE.

Page 24:

Acknowledgements

DAS & DAS2 mailing list participants!

Lincoln Stein (CSHL)

Ed Erwin, Steve Chervitz, Eric Blossom, Hari Tammara (Affymetrix)

Tony Cox, Ed Griffiths (Sanger Institute)

Allen Day, Brian O’Connor (UCLA)

Andrew Dalke (Dalke Consulting)

Suzanna Lewis (LBL)

Ann Loraine (U. of Alabama)