The MoBIoS Project: Molecular Biological Information System
Daniel P. Miranker
Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics
University of Texas
Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Lulu Zhang
Problem:
In the life sciences, database management systems (DBMS) serve as glorified file managers.
Little use of sophisticated data and pattern-based retrieval
Real scientific and technological problems
When biological data is put into an RDBMS
• Primary data is stored in text or blob fields
  – Annotations may be relational
• Data retrieval
  – Filter DB, sequential dump, O(n), to utilities
• E.g. BLAST
Organism   Function   Sequence
Yeast      membrane   AACCGGTTT
Yeast      mitosis    TATCGAAA
E. coli    membrane   AGGCCTA
Linear Data Scans, O(n), Endemic in Life Sciences
• Sequences: DNA, RNA, protein databases
• Mass spectra: proteomics
• Small molecules & protein structure: protein interaction, rational drug design
• Pathways (graphs)
• Phylogenies (graphs, trees in particular)
Scope: To Find Common Ground Both Biology and DBMS’ Have to Move
DBMS
Biological
Information
System
Metric-Space Database as the Common Ground
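A metric space is a pair, M = (D, d), where D is a set of points and d is a [metric] distance function with the following properties:
• Positivity: d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y
• Symmetry: d(x, y) = d(y, x)
• Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)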
Geographic information systems Topographic maps Buildings and the like
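The metric-space definition above can be made concrete with a small check. The following is a minimal sketch, not part of the MoBIoS code: it assumes unit-cost edit distance as the distance function d, and the helper names `edit_distance` and `is_metric_on` are hypothetical. It tests positivity, symmetry, and the triangle inequality on a finite sample only, so passing the check is evidence rather than proof.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit (Levenshtein) distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def is_metric_on(points, d) -> bool:
    """Check the metric properties on every pair and triple of a finite sample."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0 or d(x, y) != d(y, x):
            return False                      # positivity or symmetry fails
        if (d(x, y) == 0) != (x == y):
            return False                      # zero distance must mean identity
    return all(d(x, z) <= d(x, y) + d(y, z)   # triangle inequality
               for x, y, z in product(points, repeat=3))


# Sample sequences taken from the Organism/Function/Sequence example.
sample = ["AACCGGTTT", "TATCGAAA", "AGGCCTA"]
print(is_metric_on(sample, edit_distance))
```

A function such as squared difference on numbers fails the same check, because the triangle inequality does not hold for it.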
A Metric-Space Database Management System
Extend Relational DBMS
• Special indexes for metric spaces
• New data types
Biological information system
• Life science data types
Develop index structures to support distance & nearest-neighbor queries
• Well studied in main memory
  – But by no means a closed problem
• In databases (external/disk-based methods)
  – Embryonic
  – Many myths
• Often assumed to be the basis of multimedia database systems
How to build a metric-space index
• Three algorithmic classes [Tasan, Ozsoyoglu 04]
  – Vantage points
  – Hyperplanes
  – Bounding spheres
Vantage Point Method [Burkhard&Keller73]
Choose a point, VP
And a radius, R
• Given VP, R
The predicates
• d(VP,x) < R
• d(VP,x) ≥ R
Divide the set into two equal halves
• apply recursively
Query, q, range r
[Figure: vantage point VP with radius R; query q with range r]
• If d(q,VP) > R + r, then all neighbors are outside the sphere
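The vantage-point construction and the pruning rule above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the MoBIoS implementation: the vantage point is simply the first point of each subset, R is the median distance to the remaining points, and the names `Node`, `build`, and `range_search` are hypothetical. The example data are numbers under absolute difference, which is a metric.

```python
import random
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Node:
    vp: object                         # vantage point stored at this node
    radius: float = 0.0                # R: median distance from vp to the rest
    inside: Optional["Node"] = None    # points x with d(vp, x) < R
    outside: Optional["Node"] = None   # points x with d(vp, x) >= R


def build(points, d):
    """Recursively split the points by the predicates d(VP,x) < R and d(VP,x) >= R."""
    if not points:
        return None
    vp, rest = points[0], points[1:]   # assumption: first point is the vantage point
    if not rest:
        return Node(vp)
    dists = [d(vp, x) for x in rest]
    radius = median(dists)
    inside = [x for x, dx in zip(rest, dists) if dx < radius]
    outside = [x for x, dx in zip(rest, dists) if dx >= radius]
    return Node(vp, radius, build(inside, d), build(outside, d))


def range_search(node, q, r, d, out):
    """Collect every indexed point within distance r of q."""
    if node is None:
        return out
    dq = d(q, node.vp)
    if dq <= r:
        out.append(node.vp)
    # Triangle inequality: the inside half can hold a match only if dq - r < R.
    if dq - r < node.radius:
        range_search(node.inside, q, r, d, out)
    # The outside half can hold a match only if dq + r >= R.
    if dq + r >= node.radius:
        range_search(node.outside, q, r, d, out)
    return out


def dist(a, b):
    return abs(a - b)


rng = random.Random(0)
data = [rng.uniform(0, 100) for _ in range(200)]
tree = build(data, dist)
hits = range_search(tree, 50.0, 5.0, dist, [])
brute = [x for x in data if dist(50.0, x) <= 5.0]
print(sorted(hits) == sorted(brute))
```

The comparison against a brute-force scan is only a sanity check; the point of the index is that pruned subtrees cost no distance calculations.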
Multi-vantage point method
• Consider d(VPi, x) a projection onto an axis
• Looks like a k-d tree
  – Choose number k & d
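The projection idea can be illustrated without building a tree. The sketch below is a flat pivot table rather than an MVP tree, and it rests on assumptions of its own: 2-D points under Euclidean distance, the first three data points used as vantage points, and hypothetical helper names `project` and `range_query`. Each point is mapped to its vector of distances to the vantage points, and the triangle inequality gives |d(q,VPi) - d(x,VPi)| ≤ d(q,x), so any axis that differs by more than r rules the point out.

```python
import math
import random


def project(x, pivots, d):
    """Coordinates of x in pivot space: one axis per vantage point."""
    return [d(p, x) for p in pivots]


def range_query(data, coords, pivots, q, r, d):
    """Return (hits, number of real distance calculations against the data)."""
    qc = project(q, pivots, d)
    hits, computed = [], 0
    for x, xc in zip(data, coords):
        # If any projected axis differs by more than r, x cannot be within r of q.
        if any(abs(a - b) > r for a, b in zip(qc, xc)):
            continue
        computed += 1
        if d(q, x) <= r:
            hits.append(x)
    return hits, computed


rng = random.Random(1)
data = [(rng.random(), rng.random()) for _ in range(500)]
pivots = data[:3]                                  # assumption: arbitrary pivots
coords = [project(x, pivots, math.dist) for x in data]

hits, computed = range_query(data, coords, pivots, (0.5, 0.5), 0.1, math.dist)
print(len(hits), computed, len(data))
```

The filter never discards a true match; how many distance calculations it saves depends on the data and on the choice of vantage points.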
Myths
• Solved problem; M-trees [Ciaccia et al. 96, 97]
  – I can’t get them to work on anything but their original synthetic data generator
• A good choice for vantage points is to find corners [Yianilos93] (farthest-first clustering)
  – Might be true for Euclidean spaces
  – Early result, not true for our data
• High-dimensional indexing always asymptotically reduces to linear scans
  – Formal result based on an assumption of uniform data distributions
[Chart: #dist. cal., RBT vs. GHT vs. MVPT. x-axis: radius (0 to 10); y-axis: number of distance calculations]
[Chart: #I/O, RBT vs. GHT vs. MVPT. x-axis: radius (0 to 10); y-axis: number of I/Os]
Figure 9. Comparison of metric-space index structures: RBT, GHT, and VPT
Comparison of Three Methods of Metric-Space Indexing
Open problems
• Is there a general metric-space index structure that is good for most workloads?
  – We are optimistic that MVP trees, with further tuning, will be a useful answer
  – Hyperplane methods are fair game; there is circumstantial evidence that they are a key component in Google’s search engine
• No work addresses clustering data pages on disk
• Metric-space join algorithms
Biological Models are Usually Based on Similarity
Similarity
• Biologists like scoring functions that reward each similar feature with a positive number
• Intuitive
Distance
• More similar: smaller numbers
• Identical: 0
But Do Metric Models Capture Biology?
• Metrics are a subset of possible mathematical models
Sequence Problem 1
Sequence similarity based on weighted edit distance
Accepted weight matrices, PAM & BLOSUM, are not metric
DS.accession_id = MS.accession_id = Prot.accesion_id and
DS.ms_peak = P and MPAM250(PS, DS.sequence, range2);
Matching Electrostatic Shape of Molecules
Still benefit from grid services:
• Intermittently, but regularly, compile (recluster) the indices: O(n log n), n > 10^6
• Rational drug design: O(log n) finite element solutions to traverse the search tree
  – Make a service call to the grid for these operations only
  – Mirror data contents to minimize I/O
  – Since need is intermittent, one grid serves many MoBIoS servers
[Diagram: a MoBIoS server and the GRID, which mirrors the DB contents over high-speed I/O. Labeled exchanges: recluster / new index; shape match (FEM) / distance (real)]
Hyper-planes [Uhlmann91]
• If d(x,h1) < d(x,h2) then x is assigned to h1
[Figure: points h1, h2, and x]
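The hyperplane assignment rule can be sketched in a few lines. This is a hedged illustration, not the authors' code: the names `partition` and `sides_to_search` are hypothetical, ties are sent to h2 by assumption (the slide only states the strict case), and the search rule is the standard consequence of the triangle inequality, namely that a side can hold a match for a query q with range r only if q is not more than 2r closer to the other center.

```python
def partition(points, h1, h2, d):
    """Assign each point to the closer of the two centers h1 and h2."""
    side1 = [x for x in points if d(x, h1) < d(x, h2)]
    side2 = [x for x in points if d(x, h1) >= d(x, h2)]  # assumption: ties go to h2
    return side1, side2


def sides_to_search(q, r, h1, h2, d):
    """Return (search side1?, search side2?) for a range query (q, r)."""
    d1, d2 = d(q, h1), d(q, h2)
    return d1 - d2 < 2 * r, d2 - d1 <= 2 * r


def dist(a, b):
    return abs(a - b)


points = [1, 2, 3, 4, 6, 7, 8, 9]
h1, h2 = 2, 8
side1, side2 = partition(points, h1, h2, dist)
search1, search2 = sides_to_search(1.5, 1.0, h1, h2, dist)
print(side1, side2, search1, search2)
```

For the query q = 1.5 with r = 1.0, only the h1 side needs to be searched, since q is far closer to h1 than to h2.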
Develop a Hierarchical Clustering
• Hierarchy of bounding spheres, (center, radius)
• Bounding spheres may overlap
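A single level of such a hierarchy can be sketched as below. The sketch makes several assumptions that are not in the slides: clusters are given in advance, the first member of each cluster serves as its center, and the names `bounding_sphere` and `range_search` are hypothetical. Because spheres may overlap, a range query has to visit every sphere that intersects the query ball, not just the nearest one.

```python
def bounding_sphere(members, d):
    """Return (center, radius) covering all members of a cluster."""
    center = members[0]                      # assumption: first member as center
    radius = max(d(center, x) for x in members)
    return center, radius


def range_search(clusters, q, r, d):
    """Scan only clusters whose bounding sphere intersects the query ball."""
    hits, visited = [], 0
    for members in clusters:
        center, radius = bounding_sphere(members, d)
        if d(q, center) > radius + r:        # sphere and query ball are disjoint
            continue
        visited += 1
        hits.extend(x for x in members if d(q, x) <= r)
    return hits, visited


def dist(a, b):
    return abs(a - b)


# Two overlapping clusters: spheres (center 1, radius 5) and (center 5, radius 4).
clusters = [[1, 2, 6], [5, 8, 9]]
print(range_search(clusters, 5.5, 1.0, dist))
print(range_search(clusters, 20, 1.0, dist))
```

In a full index the spheres would be stored with the tree nodes rather than recomputed per query, and the same test would be applied recursively at each level.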