Joachim Hammer and Markus Schneider, University of Florida. CIDR 2003, Asilomar, CA, Jan. 5-8, 2003

Genomics Algebra: A New, Integrating Data Model, Language, and Tool for Processing and Querying Genomic Information



Overview

Data Management Problems in Bioinformatics

Proposed Solution: Genomics Algebra and Unifying Database

Summary and Expected Impact


Bioinformatics
  Growing field of problems in biological sciences that require application of computing and mathematics
  The term “bioinformatics” was coined in the mid-1980s

Genome Projects
  Construct detailed genetic and physical maps of a variety of organisms
  E.g., Human Genome Project

Functional Genomics
  What do genes do and how do they interact?
  E.g., drug discovery, agro-food, pharmacogenomics (individualized medicine)


Why is Bioinformatics Important?

Acquiring sequences is first step …

Ultimate goal is to decipher structural, functional, and evolutionary information encoded in language of biological sequences
  Alphabet (amino acids), words (motifs), sentences (proteins)

Decoding an unknown language
  To date, unable to predict structure (i.e., words and sentences) from sequence
  Mostly pattern-matching techniques: detect similarity between sequences and infer related structures and functions
  Number of experimentally determined protein structures is VERY small


An Information Revolution …

Emergence of rapid DNA sequencing and high-throughput gene analysis techniques

Flood of genomic data
  Nucleic acid and protein sequences, motifs, folding units, modules, interaction information, etc.
  Complex data, e.g., sequential lists, deeply nested record structures, image & video data

Data stored in more than 500 repositories
  E.g., EMBL (150 GB, 2001), GenBank, SWISS-PROT, SANGER Centre (20 TB, 2001), …
  Sequence repositories increase 4x per year
  Known sequence data outweighs protein structural data ~100:1 (sequence/structure deficit)


… and the Resulting Problems for Biologists

Scientists are overwhelmed by data which is awaiting further refinement and analysis
  Number and size of available data sources continuously growing
  Overlap and conflicting information
  Proliferation of interfaces and portals
  Familiar sources sometimes disappear or get merged
  Little or no agreement on terminology
  Unmanageable query results

Forced to understand low-level data management
  Often required to learn and write SQL or code in some other programming language (Perl)

Noisy data
  E.g., estimated that 30-60% of sequences in GenBank are erroneous


Corresponding CS Problems

Management of heterogeneous, autonomous sources

Missing standard for genomic data representation
  Formatted files prevail over conventional database representations (few sources use DBMSs)
  Lots of redundancies and inconsistencies

Many different interfaces (e.g., Web-based, specialized GUIs and retrieval packages)

Query languages not suitable for intended users

Limited interaction functionality of repositories

Query results are often unmanageable


CS Problems Cont’d

Low-level treatment of data
  Users manipulate strings and integers instead of genes and sequences
  No high-level operations either

Lack of extensibility of software managing sources
  Not possible to integrate new, specialty evaluation functions

Extraction of new knowledge from existing sources without much computational support

Integration of new knowledge into repositories is tedious
  E.g., no personal scratch pad that can be integrated with existing data

Dealing with uncertainty and erroneous data
  E.g., frameshift problem
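
The frameshift problem named above can be illustrated with a minimal Python sketch, assuming a sequence is read in non-overlapping triplets starting at its first base. The helper `codons` and the example sequence are hypothetical and not taken from any repository; the sketch only shows that a single dropped base changes every codon downstream of the error.

```python
def codons(seq):
    """Split a nucleotide string into complete, non-overlapping triplets."""
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial triplet
    return [seq[i:i + 3] for i in range(0, usable, 3)]


# Hypothetical example sequence, chosen only for illustration.
original = "ATGGCCATTGTA"

# Simulate a sequencing error: the base at index 4 is dropped.
with_deletion = original[:4] + original[5:]

reading_ok = codons(original)            # reading frame as intended
reading_shifted = codons(with_deletion)  # frame shifted after the deletion

print(reading_ok)
print(reading_shifted)
```

Only the first codon survives the deletion; this is why string-level storage without error annotations makes downstream analysis fragile.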


State-of-the-Art

Current research is focused mainly on integrating existing repositories
  Federated and query-driven approaches (e.g., SRS, BioNavigator, DiscoveryLink, K2/Kleisli, Tambis, …)
  Work on standardizing terminology and representations (e.g., Gene Ontology, EcoCyc, …)

Analysis is performed outside of the repositories
  Sequence similarity search: e.g., Basic Local Alignment Search Tool (BLAST) and its derivatives, …
  Visualization tools: e.g., BEAUTY, BioWidgets, …

Complex middleware tiers between end-users and the data servers
  Inefficient, lots of user involvement (human query processor)


Iterative Query and Analysis

While not done …
  Construct a database query
  Store query output
  Analyze query results

[Figure: query loop with the steps Query Relevant Database(s), Store Query Output, Analyze Output, and the decision “Done?”]


Fundamental Challenge

Development of a more principled approach to genomic data management
  Leverage capabilities provided by modern DBMS
  Services tightly integrated
  Shields scientists from knowing low-level data management details as much as possible


Integrating Approach to Genomics Data Management

Extensible Genomics Algebra
  Formal data model, query language, and software for representing, storing, retrieving, querying, and manipulating genomic information
  Provides a set of high-level genomic data types (GDTs) together with genomic operations or functions

Unifying Database
  Persistent storage for high-level, structured GDT values of Genomics Algebra
  Warehouse for data from existing genomic repositories


Mini Genomics Algebra

types
  codon, aminoAcid, gene, primaryTranscript, mRNA, protein

operators
  decode: codon → aminoAcid
    “given a codon, computes the corresponding amino acid”
  transcribe: gene → primaryTranscript
    “given a gene, returns its primary transcript”
  splice: primaryTranscript → mRNA
    “given a primary transcript, removes its introns to produce the mRNA”
  translate: mRNA → protein
    “given a messenger RNA, determines the corresponding protein”
  …
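
The four operators above can be sketched as typed Python functions. This is a minimal illustration under loudly stated assumptions, not the authors' implementation: a gene is modeled as a coding-strand DNA string plus intron intervals, transcription is reduced to replacing T with U, and the codon table is deliberately partial (its entries follow the standard genetic code). All class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import NewType, Tuple

Codon = NewType("Codon", str)
AminoAcid = NewType("AminoAcid", str)
Interval = Tuple[int, int]  # half-open [start, end) positions


@dataclass(frozen=True)
class Gene:
    coding_strand: str                  # assumption: DNA given 5'->3'
    introns: Tuple[Interval, ...] = ()  # assumption: introns are known


@dataclass(frozen=True)
class PrimaryTranscript:
    rna: str
    introns: Tuple[Interval, ...] = ()


@dataclass(frozen=True)
class MRNA:
    rna: str


@dataclass(frozen=True)
class Protein:
    residues: Tuple[AminoAcid, ...]


# Partial table for illustration; a real one has 64 entries.
CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UGG": "Trp",
               "UAA": "Stop", "UAG": "Stop", "UGA": "Stop"}


def decode(codon: Codon) -> AminoAcid:
    """decode: codon -> aminoAcid"""
    return AminoAcid(CODON_TABLE[codon])


def transcribe(g: Gene) -> PrimaryTranscript:
    """transcribe: gene -> primaryTranscript (simplified to T -> U)"""
    return PrimaryTranscript(g.coding_strand.replace("T", "U"), g.introns)


def splice(t: PrimaryTranscript) -> MRNA:
    """splice: primaryTranscript -> mRNA (drop the intron intervals)"""
    kept, pos = [], 0
    for start, end in sorted(t.introns):
        kept.append(t.rna[pos:start])
        pos = end
    kept.append(t.rna[pos:])
    return MRNA("".join(kept))


def translate(m: MRNA) -> Protein:
    """translate: mRNA -> protein (read triplets until a stop codon)"""
    residues = []
    for i in range(0, len(m.rna) - 2, 3):
        amino_acid = decode(Codon(m.rna[i:i + 3]))
        if amino_acid == "Stop":
            break
        residues.append(amino_acid)
    return Protein(tuple(residues))


# Hypothetical gene: exon ATGTTT, intron GTCCAG, exon GGCTGGTAA.
example_gene = Gene("ATGTTTGTCCAGGGCTGGTAA", introns=((6, 12),))
example_protein = translate(splice(transcribe(example_gene)))
print(example_protein)
```

Because each function accepts only the type its predecessor returns, an ill-formed pipeline such as `splice(example_gene)` can be flagged by a static type checker before it runs.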


What Can We Do with a Genomics Algebra?

Can use the algebra to formally express existing biological operations
  E.g., given DNA fragment and sequence, returns true if fragment contains specified sequence
    contains(frag, “ATTGCCATA”)

Create new operations using function composition
  E.g., express central dogma of molecular biology as
    translate(splice(transcribe(g)))
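
Both ideas on this slide can be sketched in a few lines of Python. The sketch assumes fragments and sequences are plain strings, and it uses symbolic stand-ins for `transcribe`, `splice`, and `translate` that only record the order of application; they do no biology. The helper `compose` and the name `express` are hypothetical.

```python
from functools import reduce


def contains(fragment: str, sequence: str) -> bool:
    """Return True if `sequence` occurs somewhere in `fragment`."""
    return sequence in fragment


def compose(*functions):
    """compose(f, g, h)(x) == f(g(h(x))): the rightmost function runs first."""
    return reduce(lambda f, g: lambda x: f(g(x)), functions)


# Symbolic stand-ins: each stage wraps its input in the name of its result type.
def transcribe(g):
    return f"primaryTranscript({g})"


def splice(t):
    return f"mRNA({t})"


def translate(m):
    return f"protein({m})"


# A new operation built purely from existing ones.
express = compose(translate, splice, transcribe)

print(contains("GGATTGCCATACC", "ATTGCCATA"))
print(express("g"))
```

The composed operation behaves like any other operator of the algebra, which is what lets users extend the operation set without touching the underlying implementation.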


Research Challenges

What data types and operations do we need?
  Need comprehensive ontology defining terminology, data objects, and operations

Formalize definition of GDTs and operations
  Vague or lacking knowledge of many biological processes makes this hard

Implement algebra
  Design of data structures and efficient algorithms for genomic operations
  Must be extensible
  Suitable for integration with a database system


Unifying Database

Persistent storage manager for Genomics Algebra

Integrated repository (warehouse) for genomics sources
  GUS (U Penn) is only other known genomics warehouse prototype system

Provides superior query processing performance in multi-source environments

Ability to maintain and annotate extracted source data after it has been cleansed, reconciled and corrected

Option to preserve historical data from those repositories that do not archive their contents


Integrated System Architecture

[Figure: system architecture. Labeled components: GUI; Genomics Algebra; DBMS-specific Adapter; Extensible DBMS (Oracle, DB2, …); Unifying Database with a public space and several user spaces; ETL; External Repositories (e.g., GenBank, NCBI, …)]


Implementation

Adapter provides DBMS-specific coupling mechanism between Genomics Algebra and DBMS
  Use UDT mechanism (opaque types and user-defined operators linked as external functions)
  Supported by all major DB vendors

User interface component consisting of
  Biological query language together with graphical output
  XML application as standardized exchange format for sharing genomics data
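
The idea of linking user-defined operators into the DBMS as external functions can be sketched with Python's built-in `sqlite3` module. SQLite is only a stand-in here: it is not one of the systems named in the slides and it has no opaque UDTs, so the sketch shows only the operator half of the mechanism. The table `fragments`, its columns, and the sample rows are hypothetical.

```python
import sqlite3


def contains(fragment, sequence):
    """Genomic operator implemented outside the DBMS.

    Returns 1 or 0 because SQLite has no separate boolean type.
    """
    return int(sequence in fragment)


conn = sqlite3.connect(":memory:")

# Register the Python function so SQL queries can call it by name.
conn.create_function("contains", 2, contains)

conn.execute("CREATE TABLE fragments (id INTEGER PRIMARY KEY, seq TEXT)")
conn.executemany(
    "INSERT INTO fragments (seq) VALUES (?)",
    [("GGATTGCCATACC",), ("TTTTCCCC",)],
)

# The operator is now usable inside an ordinary SQL predicate.
rows = conn.execute(
    "SELECT id FROM fragments WHERE contains(seq, 'ATTGCCATA')"
).fetchall()
print(rows)
```

In a full adapter the `seq` column would hold an opaque genomic data type rather than text, and the registration step would use the vendor's own extension interface.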


Research Challenges

Design of the integrated schema
  Iterative process with input from domain experts

Detecting changes in underlying sources
  Push capabilities are slowly being offered
  Tools for computing what has changed

Database maintenance
  View maintenance problem
  Derived data (annotations) based on update must be recomputed
  Knowing provenance of data could be used to determine which annotations need to be recomputed
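
One way provenance could narrow down recomputation is sketched below, assuming each annotation records the identifiers of the source records it was derived from. The class `ProvenanceIndex`, its methods, and all identifiers are hypothetical; the slides do not prescribe a particular design.

```python
from collections import defaultdict


class ProvenanceIndex:
    """Maps each source record to the annotations derived from it."""

    def __init__(self):
        self._dependents = defaultdict(set)  # source id -> annotation ids

    def record(self, annotation_id, source_ids):
        """Remember which source records an annotation was derived from."""
        for source_id in source_ids:
            self._dependents[source_id].add(annotation_id)

    def stale_annotations(self, updated_source_ids):
        """Return the annotations that depend on any updated source record."""
        stale = set()
        for source_id in updated_source_ids:
            stale |= self._dependents.get(source_id, set())
        return stale


index = ProvenanceIndex()
index.record("ann1", ["GB:1", "SP:7"])  # made-up identifiers
index.record("ann2", ["SP:7"])
index.record("ann3", ["GB:2"])

# Only annotations derived from the changed record need recomputation.
to_recompute = index.stale_annotations(["SP:7"])
print(sorted(to_recompute))
```

Without such an index, every annotation would have to be treated as potentially stale after each source update.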


Vision and Expected Impact

Advocate a “back to the roots” strategy of database technology for bioinformatics

Fundamental change in way biologists analyze data
  Single interface specifically designed for biologists
  No need to become “computer scientists”

New knowledge about design and implementation of biological type system and its operations

Demonstrate extensibility of modern DBMS

Help development of algebras for other applications

