Genomics Algebra: A New, Integrating Data Model, Language, and Tool for Processing and Querying Genomic Information
Joachim Hammer and Markus Schneider, University of Florida
CIDR 2003, Asilomar, CA, Jan. 5-8, 2003
Jan 27, 2016
Joachim Hammer CIDR 2003 2
Overview
- Data Management Problems in Bioinformatics
- Proposed Solution: Genomics Algebra and Unifying Database
- Summary and Expected Impact
Bioinformatics
- Growing field of problems in the biological sciences that require the application of computing and mathematics
  - The term "bioinformatics" was coined in the mid-1980s
- Genome Projects
  - Construct detailed genetic and physical maps of a variety of organisms
  - E.g., Human Genome Project
- Functional Genomics
  - What do genes do and how do they interact?
  - E.g., drug discovery, agro-food, pharmacogenomics (individualized medicine)
Why is Bioinformatics Important?
- Acquiring sequences is the first step …
- Ultimate goal is to decipher structural, functional, and evolutionary information encoded in the language of biological sequences
  - Alphabet (amino acids), words (motifs), sentences (proteins)
- Decoding an unknown language
  - To date, unable to predict structure (i.e., words and sentences) from sequence
  - Mostly pattern-matching techniques: detect similarity between sequences and infer related structures and functions
  - Number of experimentally determined protein structures is VERY small
An Information Revolution …
- Emergence of rapid DNA sequencing and high-throughput gene analysis techniques
- Flood of genomic data
  - Nucleic acid and protein sequences, motifs, folding units, modules, interaction information, etc.
  - Complex data, e.g., sequential lists, deeply nested record structures, image & video data
- Data stored in more than 500 repositories
  - E.g., EMBL (150 GB, 2001), GenBank, SWISS-PROT, SANGER Centre (20 TB, 2001), …
  - Sequence repositories increase 4x per year
  - Known sequence data outweighs protein structural data ~100:1 (sequence/structure deficit)
… and the Resulting Problems for Biologists
- Scientists are overwhelmed by data that is awaiting further refinement and analysis
- Number and size of available data sources continuously growing
  - Overlapping and conflicting information
  - Proliferation of interfaces and portals
  - Familiar sources sometimes disappear or get merged
  - Little or no agreement on terminology
- Unmanageable query results
- Forced to understand low-level data management
  - Often required to learn and write SQL or code in some other programming language (Perl)
- Noisy data
  - E.g., estimated that 30-60% of sequences in GenBank are erroneous
Corresponding CS Problems
- Management of heterogeneous, autonomous sources
  - Missing standard for genomic data representation
  - Formatted files prevail over conventional database representations (few sources use DBMSs)
  - Lots of redundancies and inconsistencies
  - Many different interfaces (e.g., Web-based, specialized GUIs and retrieval packages)
- Query languages not suitable for intended users
  - Limited interaction functionality of repositories
  - Query results are often unmanageable
CS Problems Cont’d
- Low-level treatment of data
  - Users manipulate strings and integers instead of genes and sequences
  - No high-level operations either
- Lack of extensibility of software managing sources
  - Not possible to integrate new, specialty evaluation functions
- Extraction of new knowledge from existing sources without much computational support
- Integration of new knowledge into repositories is tedious
  - E.g., no personal scratch pad that can be integrated with existing data
- Dealing with uncertainty and erroneous data
  - E.g., frameshift problem
State-of-the-Art
- Current research is focused mainly on integrating existing repositories
  - Federated and query-driven approaches (e.g., SRS, BioNavigator, DiscoveryLink, K2/Kleisli, Tambis, …)
  - Work on standardizing terminology and representations (e.g., Gene Ontology, EcoCyc, …)
- Analysis is performed outside of the repositories
  - Sequence similarity search: e.g., Basic Local Alignment Search Tool (BLAST) and its derivatives, …
  - Visualization tools: e.g., BEAUTY, BioWidgets, …
- Complex middleware tiers between end-users and the data servers
  - Inefficient, lots of user involvement (human query processor)
Iterative Query and Analysis
While not done …
  Construct a database query
  Store query output
  Analyze query results
[Figure: flowchart of the loop with the steps Query Relevant Database(s), Store Query Output, Analyze Output, and the decision Done?]
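The loop on this slide can be sketched in a few lines of Python. This is only an illustration of the manual workflow, not code from the talk: the in-memory `SEQS` "repository", the record ids, the motif-prefix refinement rule, and all function names are assumptions made for the example.

```python
# Toy stand-in for a sequence repository (hypothetical ids and data).
SEQS = {"s1": "ATTGCCATA", "s2": "ATTGAAATA", "s3": "CCATTGCC"}
TARGET = "ATTGCCATA"  # the motif the scientist is ultimately interested in


def run_query(motif):
    """Query step: return ids of all sequences containing the motif."""
    return sorted(k for k, v in SEQS.items() if motif in v)


def analyze(hits, query):
    """Analysis step: stop when at most one hit is left, else refine the query."""
    if len(hits) <= 1 or len(query) == len(TARGET):
        return True, query
    return False, TARGET[:len(query) + 1]  # lengthen the motif by one base


def iterative_analysis(first_query, max_rounds=10):
    stored = []                               # "store query output"
    query = first_query
    for _ in range(max_rounds):               # "while not done"
        hits = run_query(query)               # "construct a database query"
        stored.append(hits)
        done, query = analyze(hits, query)    # "analyze query results"
        if done:
            break
    return stored


rounds = iterative_analysis("ATTG")
print(len(rounds), rounds[-1])
```

Every round requires the user to inspect the stored output and phrase the next query, which is the "human query processor" role criticized on the previous slide.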
Fundamental Challenge
- Development of a more principled approach to genomic data management
  - Leverage capabilities provided by modern DBMS
  - Services tightly integrated
  - Shields scientists from knowing low-level data management details as much as possible
Integrating Approach to Genomics Data Management
- Extensible Genomics Algebra
  - Formal data model, query language, and software for representing, storing, retrieving, querying, and manipulating genomic information
  - Provides a set of high-level genomic data types (GDTs) together with genomic operations or functions
- Unifying Database
  - Persistent storage for high-level, structured GDT values of Genomics Algebra
  - Warehouse for data from existing genomic repositories
Mini Genomics Algebra
types
  codon, aminoAcid, gene, primaryTranscript, mRNA, protein
operators
  decode: codon → aminoAcid
    “given a codon, computes the corresponding amino acid”
  transcribe: gene → primaryTranscript
    “given a gene, returns its primary transcript”
  splice: primaryTranscript → mRNA
    “given a primary transcript, removes its introns to produce the mRNA”
  translate: mRNA → protein
    “given a messenger RNA, determines the corresponding protein”
  …
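One way to make the mini algebra concrete is a small Python sketch with one class per type and one function per operator. This is an illustration under strong simplifying assumptions, not the authors' implementation: the class layouts, the `exons` field (half-open index pairs), the partial codon table, and the rule "translate from the first AUG to the first stop codon" are all choices made for this example, and transcription simply copies the coding strand with T replaced by U.

```python
from dataclasses import dataclass

# Partial standard genetic code; the full table has 64 entries.
# decode() raises KeyError for codons not listed here.
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "UUC": "F", "GGU": "G", "GGC": "G",
    "AAA": "K", "UGG": "W", "UAA": "*", "UAG": "*", "UGA": "*",
}


@dataclass(frozen=True)
class Gene:
    dna: str        # coding-strand DNA
    exons: tuple    # hypothetical layout: ((start, end), ...) half-open


@dataclass(frozen=True)
class PrimaryTranscript:
    rna: str
    exons: tuple


@dataclass(frozen=True)
class MRNA:
    rna: str


@dataclass(frozen=True)
class Protein:
    residues: str   # one-letter amino acid codes


def decode(codon: str) -> str:
    """codon -> aminoAcid ('*' marks a stop codon)."""
    return CODON_TABLE[codon]


def transcribe(gene: Gene) -> PrimaryTranscript:
    """gene -> primaryTranscript (simplified: T becomes U)."""
    return PrimaryTranscript(gene.dna.replace("T", "U"), gene.exons)


def splice(pt: PrimaryTranscript) -> MRNA:
    """primaryTranscript -> mRNA: keep the exons, drop the introns."""
    return MRNA("".join(pt.rna[s:e] for s, e in pt.exons))


def translate(mrna: MRNA) -> Protein:
    """mRNA -> protein: read codons from the first AUG until a stop codon."""
    start = mrna.rna.find("AUG")
    if start < 0:
        return Protein("")
    residues = []
    for i in range(start, len(mrna.rna) - 2, 3):
        amino_acid = decode(mrna.rna[i:i + 3])
        if amino_acid == "*":
            break
        residues.append(amino_acid)
    return Protein("".join(residues))


# Two exons (ATGTTT, GGCTAA) separated by a four-base intron (CCCC).
g = Gene(dna="ATGTTTCCCCGGCTAA", exons=((0, 6), (10, 16)))
print(translate(splice(transcribe(g))).residues)
```

Giving each biological notion its own type lets a type checker reject ill-formed expressions such as `translate(g)`, which is the point of working with genes and proteins instead of raw strings.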
What Can We Do with a Genomics Algebra?
- Can use the algebra to formally express existing biological operations
  - E.g., given a DNA fragment and a sequence, returns true if the fragment contains the specified sequence:
    contains(frag, "ATTGCCATA")
- Create new operations using function composition
  - E.g., express the central dogma of molecular biology as
    translate(splice(transcribe(g)))
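Both ideas on this slide, a predicate such as `contains` and new operations built by composition, can be sketched in Python. The operators below are deliberately tiny stand-ins that work on plain strings; the lowercase-intron convention, the four-entry codon table, and the `compose` and `express` names are assumptions for illustration only.

```python
from functools import reduce
from typing import Callable


def contains(fragment: str, sequence: str) -> bool:
    """True if the DNA fragment contains the given sequence."""
    return sequence in fragment


def compose(*funcs: Callable) -> Callable:
    """compose(f, g, h)(x) == f(g(h(x)))"""
    return reduce(lambda f, g: lambda x: f(g(x)), funcs)


# Stand-in operators on strings (assumed convention: introns are lowercase).
TINY_CODON_TABLE = {"AUG": "M", "GCC": "A", "AUA": "I", "UAA": "*"}


def transcribe(gene: str) -> str:
    return gene.replace("T", "U").replace("t", "u")


def splice(transcript: str) -> str:
    return "".join(ch for ch in transcript if ch.isupper())


def translate(mrna: str) -> str:
    residues = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = TINY_CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":      # stop codon ends the protein
            break
        residues.append(amino_acid)
    return "".join(residues)


# A new operation defined purely by composing existing ones.
express = compose(translate, splice, transcribe)

print(contains("GGATTGCCATAC", "ATTGCCATA"))
print(express("ATGggtaGCCATATAA"))
```

Because `express` is an ordinary function, it can itself be composed further, which is how an algebra grows without changes to its core operators.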
Research Challenges
- What data types and operations do we need?
  - Need comprehensive ontology defining terminology, data objects, and operations
- Formalize definition of GDTs and operations
  - Vague or lacking knowledge of many biological processes makes this hard
- Implement algebra
  - Design of data structures and efficient algorithms for genomic operations
  - Must be extensible
  - Suitable for integration with a database system
Unifying Database
- Persistent storage manager for Genomics Algebra
- Integrated repository (warehouse) for genomics sources
  - GUS (U Penn) is the only other known genomics warehouse prototype system
- Provides superior query processing performance in multi-source environments
- Ability to maintain and annotate extracted source data after it has been cleansed, reconciled, and corrected
- Option to preserve historical data from those repositories that do not archive their contents
Integrated System Architecture
[Figure: integrated system architecture. Components shown: GUI; Genomics Algebra; DBMS-specific adapter; extensible DBMS (Oracle, DB2, …); Unifying Database with a public space and several user spaces; ETL; external repositories (e.g., GenBank, NCBI, …).]
Implementation
- Adapter provides DBMS-specific coupling mechanism between Genomics Algebra and DBMS
  - Use UDT mechanism (opaque types and user-defined operators linked as external functions)
  - Supported by all major DB vendors
- User interface component consisting of
  - Biological query language together with graphical output
  - XML application as standardized exchange format for sharing genomics data
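The coupling idea, an operator implemented outside the DBMS and then called from inside a query, can be illustrated with Python's built-in `sqlite3` module. This is an analogy only: SQLite stands in for the extensible DBMSs named in the talk, it has no opaque types (so only the user-defined operator half of the UDT mechanism is shown), and the `fragments` table, its rows, and the `contains` function are made up for the example.

```python
import sqlite3


def contains(fragment, sequence):
    """External function: 1 if the DNA fragment contains the sequence, else 0."""
    if fragment is None or sequence is None:
        return 0
    return int(sequence in fragment)


conn = sqlite3.connect(":memory:")

# Register the Python function as a two-argument SQL function.
conn.create_function("contains", 2, contains)

conn.execute("CREATE TABLE fragments (id TEXT PRIMARY KEY, dna TEXT)")
conn.executemany(
    "INSERT INTO fragments VALUES (?, ?)",
    [("f1", "GGATTGCCATAC"), ("f2", "TTTTAAAA")],
)

# The genomic operator is now usable inside an ordinary SQL query.
rows = conn.execute(
    "SELECT id FROM fragments WHERE contains(dna, ?) ORDER BY id",
    ("ATTGCCATA",),
).fetchall()
print(rows)
```

In a production DBMS the adapter would additionally map each GDT to an opaque column type, so that values are stored and passed to the operators in the algebra's own representation.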
Research Challenges
- Design of the integrated schema
  - Iterative process with input from domain experts
- Detecting changes in underlying sources
  - Push capabilities are slowly being offered
  - Tools for computing what has changed
- Database maintenance
  - View maintenance problem
  - Derived data (annotations) based on update must be recomputed
  - Knowing provenance of data could be used to determine which annotations need to be recomputed
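The last two challenges can be sketched together: compare two snapshots of a source to find what changed, then use recorded provenance to find the annotations that must be recomputed. This is a minimal illustration under stated assumptions, not a proposal from the talk; the snapshot-as-dictionary model, the record and annotation ids, and the `AnnotationStore` class are all hypothetical.

```python
from collections import defaultdict


def diff_snapshots(old, new):
    """Return ids of source records that were added, removed, or modified."""
    return {rid for rid in old.keys() | new.keys() if old.get(rid) != new.get(rid)}


class AnnotationStore:
    """Keeps provenance: which source records each annotation was derived from."""

    def __init__(self):
        self._depends_on = {}              # annotation id -> source record ids
        self._used_by = defaultdict(set)   # source record id -> annotation ids

    def add_annotation(self, ann_id, source_ids):
        self._depends_on[ann_id] = set(source_ids)
        for sid in source_ids:
            self._used_by[sid].add(ann_id)

    def stale_after(self, changed_source_ids):
        """Annotations derived from at least one changed source record."""
        stale = set()
        for sid in changed_source_ids:
            stale |= self._used_by.get(sid, set())
        return stale


# Two snapshots of (hypothetical) external repositories.
old = {"repoA:rec1": "ATTGCC", "repoA:rec2": "GGCATA", "repoB:rec9": "TTTAAA"}
new = {"repoA:rec1": "ATTGCC", "repoA:rec2": "GGCATT",
       "repoB:rec9": "TTTAAA", "repoB:rec10": "CCCGGG"}

store = AnnotationStore()
store.add_annotation("ann1", ["repoA:rec1", "repoA:rec2"])
store.add_annotation("ann2", ["repoB:rec9"])
store.add_annotation("ann3", ["repoA:rec2", "repoB:rec9"])

changed = diff_snapshots(old, new)
print(sorted(changed), sorted(store.stale_after(changed)))
```

Only `ann1` and `ann3` depend on the modified record, so `ann2` can be kept as is; without provenance, every annotation would have to be recomputed after each refresh.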
Vision and Expected Impact
- Advocate a “back to the roots” strategy of database technology for bioinformatics
- Fundamental change in the way biologists analyze data
  - Single interface specifically designed for biologists
  - No need to become “computer scientists”
- New knowledge about design and implementation of a biological type system and its operations
- Demonstrate extensibility of modern DBMS
- Help development of algebras for other applications