Cancer Bioinformatics Infrastructure Objects (caBIO) Architecting the Future of Genomics http://ncicb.nci.nih.gov Himanso Sahni & Scott Gustafson December 7, 2001 Science Applications International Corporation (SAIC)

Dec 18, 2015

Page 1: Cancer Bioinformatics Infrastructure Objects (caBIO) Architecting the Future of Genomics  Himanso Sahni & Scott Gustafson December.

Cancer Bioinformatics Infrastructure Objects (caBIO)

Architecting the Future of Genomics

http://ncicb.nci.nih.gov

Himanso Sahni & Scott Gustafson

December 7, 2001

Science Applications International Corporation (SAIC)

Page 2:

caBIO Team

caBIO was developed as “open source” technology by the National Cancer Institute Center for Bioinformatics (NCICB)

Software developers teamed with scientists from the NCICB Cancer Genome Anatomy Project (CGAP) to define caBIO requirements and use cases, and to model the genomics domain

Page 3:

Purpose

“To provide an overview of caBIO, its current role in genomics, and future transitions towards intelligent genomics”

Page 4:

The Genomic Past

• Little or no data integration
• Lack of standards
• Little or no software code re-use
• Duplication of efforts
• Duplication of data
• Lack of common vocabulary
• Maintenance nightmares

Page 5:

The Genomic Challenge

INTELLIGENCE!

• Code Re-Use
• Data Integration
• Standards
• Increased Productivity
• Common Vocabulary
• Ease of Maintenance
• Normalized Data

Page 6:

caBIO Agenda

• caBIO Overview
• Development Process
• Architecture
• Data Sources
• Usage
• Benefits
• Future

Page 7:

caBIO Overview

caBIO is a standards-based set of genomic components

caBIO objects simulate the behavior of actual genomic components such as genes, chromosomes, sequences, libraries, clones, ontologies, etc.

caBIO objects provide access to a variety of genomic data sources including GenBank, UniGene, LocusLink, HomoloGene, Ensembl, GoldenPath, and NCICB’s CGAP (Cancer Genome Anatomy Project) data repositories

caBIO is “open source”

Page 8:

Development Process

Iterative Approach

– An iterative software design and development approach leveraging concepts from RUP and XP was used

– Initial Vision and High Level Analysis form the overall goals of the project

– Iterative functional design and implementation processes allow for rapid and segmented development of the application

Page 9:

Development Process

High Level Use Cases

– Initial high level Use Cases were developed for gene components

– Additional high level Use Cases were added as further functional areas were mapped (e.g. pathways, therapies, microarray objects, mouse models)

– Detailed Use Cases are derived from working with domain experts during requirements gathering

Page 10:

Development Process

Detailed Use Cases

• Detailed Use Cases were derived from teaming with domain experts during requirements gathering

• Detailed Use Cases include
– Characteristic Information
– Main Success Scenario
– Extensions
– Related Information

Page 11:

Development Process

Class Diagrams

– Class Diagrams were developed using the Unified Modeling Language (UML)

– The class diagram is continually updated as new objects and relationships are defined

Page 12:

Development Process

Objects

• Each object consists of many attributes, operations, and associations

• For example, a gene object has many attributes such as symbols and keywords. A gene also has operations including getName and getSequences, and associations which indicate that a gene “has” sequences and “has a” location
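
The gene example on this slide can be sketched in Java roughly as follows. This is a minimal illustration, not the actual caBIO class: only `getName` and `getSequences` are named in the slide, and the `Sequence` and `Location` helper types, the remaining accessors, and all field names are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a sequence record (not the real caBIO type).
class Sequence {
    private final String accession;
    Sequence(String accession) { this.accession = accession; }
    String getAccession() { return accession; }
}

// Hypothetical stand-in for a chromosomal location.
class Location {
    private final String chromosome;
    Location(String chromosome) { this.chromosome = chromosome; }
    String getChromosome() { return chromosome; }
}

// Illustrative gene object showing attributes, operations, and associations.
class Gene {
    // Attributes
    private final String name;
    private final List<String> symbols = new ArrayList<>();
    private final List<String> keywords = new ArrayList<>();

    // Associations: a gene "has" sequences and "has a" location
    private final List<Sequence> sequences = new ArrayList<>();
    private Location location;

    Gene(String name) { this.name = name; }

    // Operations named in the slide
    String getName() { return name; }
    List<Sequence> getSequences() { return Collections.unmodifiableList(sequences); }

    // Additional accessors assumed for the sketch
    List<String> getSymbols() { return Collections.unmodifiableList(symbols); }
    List<String> getKeywords() { return Collections.unmodifiableList(keywords); }
    Location getLocation() { return location; }

    void addSymbol(String symbol) { symbols.add(symbol); }
    void addKeyword(String keyword) { keywords.add(keyword); }
    void addSequence(Sequence sequence) { sequences.add(sequence); }
    void setLocation(Location location) { this.location = location; }
}
```

Returning unmodifiable views keeps callers from changing a gene's associations behind the object's back, which is one common way to keep such a model consistent.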

Page 13:

Development Process

Sequence Diagrams

– Additional methods are discovered as the objects are further explored while modeling the use cases

Page 14:

Development Process

Object Development

– Java Code is written to implement the Object Model

– Documentation (API) is built from the Java Code

Page 15:

Development Process

Application Development

– Applications leveraging caBIO can be developed using multiple programming languages

Page 16:

Development Process

Example Application

– caBIO was used to add intelligence to BioCarta pathways

• Gene, pathway, and expression objects were used

– Users can select a pathway, and view pathway components that are mutated or expressed

– Users can drill down to detailed gene information

– http://cmap-prot.nci.nih.gov

Page 17:

Architecture

[Architecture diagram with four columns: Clients, Presentation Layer, Object Layer, Data Sources]

Labels in the diagram:

• Browsers, External Java Apps, Internal Java Apps, Other Apps, Agents
• HTML/HTTP, XML/HTTP, RDF, RMI, SOAP
• Web Server, Servlet Container, JSPs, Servlets, UI Bean, XML Builder, XSLT Engine, SOAP Engine, XML Docs, DTDs, XSL Style Sheet
• Object Managers, Domain Objects (Genes, Chromosomes, Libraries, Tissues, Clusters, Sequences, Diseases, Other), Data Access Objects
• JDBC, HTTP, FTP
• Databases, URLs, Flat Files

Page 18:

Architecture

The core infrastructure exhibits an n-tiered architecture with client interfaces, server components, back-end objects and data sources

Clients (browsers, applications) can receive information (HTML and XML) from back-end objects over HTTP

– Client applications can also communicate with back-end objects via Java RMI (Java applications)

– Non-Java based applications will communicate via SOAP

Server components communicate with back-end objects via Java RMI

Back-end objects communicate directly with data sources (database, URLs, flat files)

A UDDI registry will be configured to advertise services

– RDF is currently used to advertise services to crawlers and agents

Page 19:

Architecture

Client Technology

• Industry Standard Web Browsers – Netscape 4+ and IE 4+

• Java Applications – Applications written in the Java programming language, an object-oriented language which provides portability and many other features

• Non-Java Applications – Applications usually written in Perl, C, etc.

– Non-Java applications use SOAP clients to interface with caBIO

» SOAP::Lite for Perl – A collection of Perl modules which provides a simple and lightweight interface to the Simple Object Access Protocol (SOAP)

• Agents – Software programs that perform a function for a user in a trusted fashion, can learn or be taught their function, and can perform actions for the user without permission

• RDF – The Resource Description Framework is a foundation that advertises Web services via the Web. RDF is used to describe the content and services available at a particular Web site. (passive)

• UDDI – Universal Description, Discovery and Integration is a foundation to enable businesses to discover and transact business with each other using preferred applications (active)

Page 20:

Architecture

Presentation Layer Technology

– Jakarta Tomcat – Servlet+JSP engine which is a subproject of the Jakarta Project

– JSPs – Java Server Pages are web pages with Java embedded in the HTML to incorporate dynamic content in the page

– Java Servlets – Server-side Java programs that web servers can run to generate content in response to client requests

– Java Beans – Reusable software components that work with Java

– XML – The Extensible Markup Language is a universal format for structured data on the Web

• XSL/XSLT – The Extensible Stylesheet Language is a language for expressing stylesheets. XSL Transformations (XSLT) is a language for transforming XML documents

• XLink – The XML Linking Language allows elements to be inserted into XML documents in order to create and describe links between resources

• DOM – The Document Object Model is a platform and language independent interface that allows programs to dynamically access content, structure, and document style
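
As an illustration of the XSLT step described above, the following sketch transforms a small XML document into an HTML fragment using the standard `javax.xml.transform` API. The `<gene>` document, its element names, and the stylesheet are assumptions invented for the example; the slides do not show caBIO's actual XML or stylesheets.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltSketch {
    // Hypothetical XML for a gene; the element names are assumptions.
    static final String GENE_XML =
        "<gene><name>example gene</name><symbol>EX1</symbol></gene>";

    // Hypothetical stylesheet that renders the gene as one HTML paragraph.
    static final String GENE_XSL =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\"/>"
      + "<xsl:template match=\"/gene\">"
      + "<p><b><xsl:value-of select=\"symbol\"/></b>: "
      + "<xsl:value-of select=\"name\"/></p>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML and returns the result as a string.
    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(GENE_XML, GENE_XSL));
    }
}
```

Keeping the stylesheet separate from the data means the same XML could be rendered differently for different clients by swapping only the XSL.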

Page 21:

Architecture

Presentation Layer Technology (cont.)

– SOAP – The Simple Object Access Protocol is a lightweight XML based protocol for the exchange of information in a decentralized, distributed environment. It consists of an envelope that describes the message and a framework for message transport

• Apache SOAP – An implementation of the W3C SOAP specification. Apache SOAP provides a server-side infrastructure for deploying, managing, and running SOAP enabled services.

Page 22:

Architecture

Presentation Layer

– The Mediator Servlet manages the user session, forwards requests to the JSP, and returns HTML

– JSPs forward requests to the UI Bean, and return XML or HTML to the client

– The UI Bean receives a Domain Object and converts the object to XML or HTML

– The DOM Writer performs the conversion to XML while the Conversion Servlet represents the Domain Object as a Document Object (DOM)

• The conversion is performed by calling the Domain Object's toXML() method

[Diagram labels: HTML/HTTP, XML/HTTP, RDF, RMI, SOAP; Mediator Servlet, Conversion Servlet, JSPs, XML Builder, DOM Writer, UI Bean, Manager Proxies, Domain Objects]
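
The conversion of a domain object to XML through a DOM representation might look like the sketch below. Only the method name `toXML()` comes from the slide; the `GeneRecord` class, its fields, the `toDocument()` helper, and the element names are hypothetical, and the real caBIO conversion may be organized quite differently.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical domain object; field and element names are assumptions.
class GeneRecord {
    private final String name;
    private final String symbol;

    GeneRecord(String name, String symbol) {
        this.name = name;
        this.symbol = symbol;
    }

    // Represents this object as a DOM document.
    Document toDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element gene = doc.createElement("gene");
            doc.appendChild(gene);
            appendChild(doc, gene, "name", name);
            appendChild(doc, gene, "symbol", symbol);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create DOM document", e);
        }
    }

    // Serializes the DOM document to an XML string.
    String toXML() {
        try {
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(toDocument()), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize DOM document", e);
        }
    }

    // Adds a child element holding the given text to the parent element.
    private static void appendChild(Document doc, Element parent, String tag, String text) {
        Element child = doc.createElement(tag);
        child.setTextContent(text);
        parent.appendChild(child);
    }
}
```

Building the DOM first and serializing it afterwards lets the serializer handle escaping of characters such as `&`, which hand-built strings tend to get wrong.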

Page 23:

Architecture

Object Layer Technology

– Java Applications/Objects – Applications/Objects written in the Java programming language, an object-oriented language which provides portability and many other features

– Java RMI – Remote Method Invocation is distributed computing in Java. It is a simple technology for object communication that allows remote objects to act as local Java objects

– JDBC – Java Database Connectivity is a Java API for executing SQL statements and connecting to databases

– SOAP – Simple Object Access Protocol extends object interoperability to other programming languages (Perl, Python, C++)

– DAS – The Distributed Annotation System is an emergent system for retrieving genomic annotations from a variety of data sources (GoldenPath, Ensembl)

Page 24:

Architecture

Object Layer

– Object Managers are created to implement complex scientific concepts

– The Role Model object verifies user permissions to the data if applicable

– The Remote Manager interface abstracts the RMI layer

• Allows RMI to be easily replaced by EJB or other communication mechanisms

– Domain Objects are serialized and XML enabled

[Diagram labels: RMI; Role Model; Remote Interface Manager; Object Managers (GeneManager, SequenceManager, LibraryManager, Other Managers); Data Access Objects; Domain Objects]
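
One way to read "the Remote Manager interface abstracts the RMI layer" is sketched below: client code depends on a transport-neutral interface, and a small adapter is the only place that knows about RMI. All type and method names here (`GeneLookup`, `RemoteGeneManager`, `findSymbol`, and so on) are hypothetical; the slides do not show the actual caBIO interfaces.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.HashMap;
import java.util.Map;

// Transport-neutral contract that client code programs against (hypothetical).
interface GeneLookup {
    String findSymbol(String geneId);
}

// RMI-facing interface: remote methods must declare RemoteException.
interface RemoteGeneManager extends Remote {
    String findSymbol(String geneId) throws RemoteException;
}

// Server-side manager, backed by an in-memory map purely for illustration.
class InMemoryGeneManager implements RemoteGeneManager {
    private final Map<String, String> symbolsById = new HashMap<>();

    InMemoryGeneManager() {
        symbolsById.put("1", "EX1"); // placeholder data, not a real gene
    }

    @Override
    public String findSymbol(String geneId) {
        return symbolsById.get(geneId);
    }
}

// Adapter that hides RMI from callers. Replacing RMI with another transport
// would mean writing a different GeneLookup implementation, not changing clients.
class RmiGeneLookup implements GeneLookup {
    private final RemoteGeneManager remote;

    RmiGeneLookup(RemoteGeneManager remote) {
        this.remote = remote;
    }

    @Override
    public String findSymbol(String geneId) {
        try {
            return remote.findSymbol(geneId);
        } catch (RemoteException e) {
            throw new IllegalStateException("Remote call failed", e);
        }
    }
}
```

In a deployed system the `RemoteGeneManager` handed to the adapter would typically be a stub obtained from an RMI registry; in this sketch the manager is passed in directly so the example runs without any network setup.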

Page 25:

Architecture

Data Access Objects

– Provide the object-relational mapping

– Database independent persistence

– Allow for the introduction of new data sources

– Abstract persistence details from the domain objects

– Allow for federated database topology

[Diagram labels: Data Access Objects (LibraryPersistence, GenePersistence, TissuePersistence, SequencePersistence, ClonePersistence, ProteinPersistence, Other); JDBC, HTTP, FTP; Data Sources (Databases, URLs, Flat Files)]
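
The data access object idea above can be sketched as an interface that hides where the data lives. The name `TissuePersistence` is borrowed from the diagram label, but its methods, the `Tissue` class, and the in-memory backend are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Minimal domain object; it carries no persistence logic of its own.
class Tissue {
    private final String id;
    private final String name;

    Tissue(String id, String name) {
        this.id = id;
        this.name = name;
    }

    String getId() { return id; }
    String getName() { return name; }
}

// Persistence contract (methods are hypothetical).
interface TissuePersistence {
    Optional<Tissue> findById(String id);
    void save(Tissue tissue);
}

// One possible backend: an in-memory map. A JDBC- or flat-file-backed class
// could implement the same interface without changes to calling code.
class InMemoryTissuePersistence implements TissuePersistence {
    private final Map<String, Tissue> rows = new LinkedHashMap<>();

    @Override
    public Optional<Tissue> findById(String id) {
        return Optional.ofNullable(rows.get(id));
    }

    @Override
    public void save(Tissue tissue) {
        rows.put(tissue.getId(), tissue);
    }
}

// Higher-level code depends only on the interface, not on the data source.
class TissueCatalog {
    private final TissuePersistence persistence;

    TissueCatalog(TissuePersistence persistence) {
        this.persistence = persistence;
    }

    String nameOf(String id) {
        return persistence.findById(id).map(Tissue::getName).orElse("unknown");
    }
}
```

Because `TissueCatalog` sees only the interface, adding a new data source amounts to adding another `TissuePersistence` implementation.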

Page 26:

Data Sources

CGAP (NCI Cancer Genome Anatomy Project)

– Provides gene expression profiles of normal, pre-cancer, and cancer cells

GenBank (NCBI)

– Annotated collection of publicly available nucleotide sequences

UniGene (NCBI)

– Partitions sequences into unique gene-oriented clusters

LocusLink (NCBI)

– Provides an interface to curated sequence and descriptive info about genetic loci

Page 27:

Data Sources

HomoloGene (NCBI)

– Provides curated orthologs and homologs for UniGene and LocusLink genes

GoldenPath (UCSC)

– Contains a working draft of the human genome

Page 28:

Usage

Java developers use the caBIO jar file for immediate access to genomic object attributes and methods

– caBIO objects can easily be extended to support customized objects

Non-Java and Java developers can use SOAP to access genomic object attributes and methods over HTTP

Developers can obtain detailed descriptions of available services via the RDF site

– URIs of desired services can be obtained via RDF

Developers can host a caBIO server or leverage the existing NCICB caBIO server
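
To give a feel for SOAP access over HTTP, the sketch below builds a SOAP 1.1 request envelope as a string. The envelope namespace is the standard SOAP 1.1 one; the service namespace, operation name, and parameter name used in the usage note are purely hypothetical, since the slides do not list the operations the caBIO SOAP service exposes.

```java
class SoapRequestSketch {
    // Standard SOAP 1.1 envelope namespace.
    static final String SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/";

    // Escapes characters that are not allowed verbatim in XML text or attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds an envelope for a single-parameter call. The caller supplies the
    // service namespace, operation, and parameter name; none are real caBIO names.
    static String buildEnvelope(String serviceNamespace, String operation,
                                String paramName, String paramValue) {
        return "<SOAP-ENV:Envelope xmlns:SOAP-ENV=\"" + SOAP_ENV + "\">"
             + "<SOAP-ENV:Body>"
             + "<ns1:" + operation + " xmlns:ns1=\"" + escape(serviceNamespace) + "\">"
             + "<" + paramName + ">" + escape(paramValue) + "</" + paramName + ">"
             + "</ns1:" + operation + ">"
             + "</SOAP-ENV:Body>"
             + "</SOAP-ENV:Envelope>";
    }

    public static void main(String[] args) {
        // Hypothetical service namespace, operation, and parameter.
        System.out.println(
            buildEnvelope("urn:example-gene-service", "getGeneName", "geneId", "1"));
    }
}
```

A client would send such an envelope in the body of an HTTP POST, conventionally with a `text/xml` content type and a `SOAPAction` header; in practice a SOAP toolkit generates the envelope rather than hand-built strings.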

Page 29:

Benefits

Provides an abstraction layer that allows developers to access genomic information using a standardized tool set without concerns for implementation details

Allows developers to obtain the information they need from a variety of data sources without managing the data themselves

Manages the display of large volumes of data to assist in load balancing

Provides an effective mechanism for comparing similar objects that rely on diverse data sources

Facilitates information sharing without managing linkages between multiple data sources

Page 30:

Future

The Semantic Web

– Concept

• “The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” -- Tim Berners-Lee, James Hendler, Ora Lassila (W3C)

– Implementation

• Agent Based Technology may be used to create programs that implement semantic web concepts

Natural Language Processing (NLP)

– NLP involves understanding natural language and extracting information

– NLP can be used as a tool to help automate tasks, refine searches and improve genomic algorithms

Page 31:

Questions?