The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics University of Texas Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Lulu Zhang

Dec 27, 2015

Page 1:

The MoBIoS Project

Molecular Biological Information System

Daniel P. Miranker

Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics

University of Texas

Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Lulu Zhang

Page 2:

Problem:

In the life sciences, database management systems (DBMS) serve as glorified file managers.

Little use of sophisticated data and pattern-based retrieval

Real scientific and technological problems

Page 3:

When biological data is put into an RDBMS

• Primary data is stored in text or blob fields
  – Annotations may be relational

• Data retrieval
  – Filter the DB, then sequentially dump the result, O(n), to external utilities
  – E.g. BLAST

Organism   Function   Sequence
Yeast      membrane   AACCGGTTT
Yeast      mitosis    TATCGAAA
E. Coli    membrane   AGGCCTA

Page 4:

Linear Data Scans, O(n), Endemic in Life Sciences

• Sequences: DNA, RNA, protein databases

• Mass spectra: proteomics

• Small molecules & protein structure: protein interaction, rational drug design

• Pathways (graphs)

• Phylogenies (graphs, trees in particular)

Page 5:

Scope: To Find Common Ground, Both Biology and DBMSs Have to Move

DBMS

Biological Information System

Metric-Space Database as the Common Ground

Page 6:

A metric space is a pair, M = (D, d), where

• D is a set of points

• d is a [metric] distance function with the following properties:
  – d(x,y) = d(y,x) (symmetry)
  – d(x,y) > 0 for x ≠ y, and d(x,x) = 0 (non-negativity)
  – d(x,z) <= d(x,y) + d(y,z) (triangle inequality)

[Figure: triangle with vertices x, y, z]
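The three properties can be checked by brute force on a small point set. The sketch below does that for Hamming distance over a handful of 3-mers, and shows a squared difference failing the triangle inequality. The names `hamming` and `is_metric` and the sample points are illustrative assumptions, not part of MoBIoS.

```python
from itertools import product

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def is_metric(points, d) -> bool:
    """Exhaustively test the metric properties on a finite point set."""
    for x, y in product(points, repeat=2):
        if d(x, y) != d(y, x):            # symmetry
            return False
        if d(x, y) < 0:                   # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):    # zero only for identical points
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):   # triangle inequality
            return False
    return True

kmers = ["ACA", "CAA", "AAC", "ATC", "TCA", "AAA"]
print(is_metric(kmers, hamming))                          # Hamming distance passes
print(is_metric([0, 1, 2], lambda x, y: (x - y) ** 2))    # squared difference fails
```

A check like this only covers the sampled points; it can refute a candidate distance function but cannot prove it is a metric in general.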

Page 7:

Definition - By Analogy

A Spatial Database Management System:

• Extend relational DBMS
  – Special indexes for 2D and 3D data; k-d and R-trees
  – New data types

• Geographic information systems
  – Topographic maps
  – Buildings and the like

A Metric-Space Database Management System:

• Extend relational DBMS
  – Special indexes for metric spaces
  – New data types

• Biological information system
  – Life science data types

Page 8:

Develop index structures to support distance & nearest-neighbor queries

• Well studied in main memory
  – But by no means a closed problem

• In databases (external/disk-based methods)
  – Embryonic
  – Many myths

• Often assumed to be the basis of multimedia database systems

Page 9:

How to build a metric-space index

• Three algorithmic classes [Tasan, Ozsoyoglu 04]
  – Vantage points
  – Hyperplanes
  – Bounding spheres

Page 10:

Vantage Point Method [Burkhard & Keller 73]

Page 11:

Vantage Point Method

Choose a point, VP

And a radius, R

Page 12:

Vantage Point Method

Choose a point, VP

And a radius, R

• Given VP, R

The predicates

• d(VP,x) < R

• d(VP,x) ≥ R

Divide the set into two equal halves

• apply recursively
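The recursive split can be sketched as follows. This is a minimal illustration under stated assumptions, not the MoBIoS implementation: the vantage point is picked at random, R is taken as the median distance so the two halves are roughly equal, and the names `VPNode`, `build_vp_tree`, and `collect` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class VPNode:
    vp: object                    # the vantage point, VP
    radius: float                 # the radius, R
    inside: Optional["VPNode"]    # subtree of points with d(VP, x) <  R
    outside: Optional["VPNode"]   # subtree of points with d(VP, x) >= R

def build_vp_tree(points: List, d: Callable, rng: random.Random) -> Optional[VPNode]:
    """Pick VP, set R to the median distance, split, and recurse."""
    if not points:
        return None
    rest = list(points)
    vp = rest.pop(rng.randrange(len(rest)))      # assumption: random vantage point
    if not rest:
        return VPNode(vp, 0.0, None, None)
    dists = sorted(d(vp, x) for x in rest)
    radius = dists[len(dists) // 2]              # median distance from VP
    inside = [x for x in rest if d(vp, x) < radius]
    outside = [x for x in rest if d(vp, x) >= radius]
    return VPNode(vp, radius,
                  build_vp_tree(inside, d, rng),
                  build_vp_tree(outside, d, rng))

def collect(node: Optional[VPNode]) -> List:
    """All points stored in a subtree (used here only to inspect the result)."""
    if node is None:
        return []
    return [node.vp] + collect(node.inside) + collect(node.outside)

dist = lambda a, b: abs(a - b)
tree = build_vp_tree(list(range(20)), dist, random.Random(0))
print(tree.vp, tree.radius, len(collect(tree.inside)), len(collect(tree.outside)))
```

With many ties in the distances the halves can be unequal; every recursive call still removes the vantage point, so construction terminates.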

Page 13:

Query, q, range r

[Figure: query point q with range r]

Page 14:

Query, q, range r

[Figure: sphere of radius R around VP, and query point q with range r]

if

• d(q,VP) > R + r

then

• all neighbors are outside the sphere
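The pruning rule follows from the triangle inequality: any x with d(q,x) <= r satisfies d(q,VP) - r <= d(VP,x) <= d(q,VP) + r. A small sketch of the resulting decision, assuming the split predicates d(VP,x) < R (inside) and d(VP,x) >= R (outside); the function name `subtrees_to_search` is hypothetical.

```python
def subtrees_to_search(d_q_vp: float, R: float, r: float):
    """Return which halves of a vantage-point split a range query must visit.

    d_q_vp is d(q, VP); R is the split radius; r is the query range.
    """
    visit = []
    # Every answer x has d(VP, x) >= d(q, VP) - r, so the inside half
    # (d(VP, x) < R) can contain an answer only if d(q, VP) - r < R.
    if d_q_vp - r < R:
        visit.append("inside")
    # Every answer x has d(VP, x) <= d(q, VP) + r, so the outside half
    # (d(VP, x) >= R) can contain an answer only if d(q, VP) + r >= R.
    if d_q_vp + r >= R:
        visit.append("outside")
    return visit

print(subtrees_to_search(10, 4, 2))   # d(q,VP) > R + r: the inside half is skipped
print(subtrees_to_search(1, 4, 2))    # query ball lies wholly inside the sphere
print(subtrees_to_search(4, 4, 2))    # query ball straddles the boundary
```

When the query ball straddles the boundary both halves must be searched, which is why the choice of vantage points and radii matters for performance.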

Page 15:

Multi-vantage point method

Page 16:

Multi-vantage point method

• Consider d(VPi, x) a projection onto an axis

• Looks like a k-d tree
  – Choose number k & d
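One way to read the projection idea: each vantage point contributes one coordinate, d(VPi, x), and the triangle inequality guarantees |d(VPi, q) - d(VPi, x)| <= d(q, x) on every axis. The sketch below uses that bound to discard candidates before computing the real distance. The flat table stands in for a tree, and `project`, `lower_bound`, and `range_query` are hypothetical names.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def project(x, pivots, d):
    """Coordinates of x in pivot space: one axis per vantage point."""
    return tuple(d(vp, x) for vp in pivots)

def lower_bound(q_coords, x_coords):
    """Largest per-axis gap; never exceeds the true distance d(q, x)."""
    return max(abs(a - b) for a, b in zip(q_coords, x_coords))

def range_query(q, r, data, pivots, d):
    table = [(x, project(x, pivots, d)) for x in data]   # precomputed in a real index
    q_coords = project(q, pivots, d)
    hits = []
    for x, x_coords in table:
        if lower_bound(q_coords, x_coords) > r:
            continue                  # pruned without computing d(q, x)
        if d(q, x) <= r:              # verify the surviving candidates
            hits.append(x)
    return hits

data = ["ACA", "CAA", "AAC", "ATC", "TCA", "AAA"]
pivots = ["ACA", "AAA"]               # assumption: two arbitrarily chosen vantage points
print(range_query("ACA", 1, data, pivots, hamming))
```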

Page 17:

Myths

• Solved problem; M-trees [Ciaccia et al. 96, 97]
  – I can’t get them to work on anything but their original synthetic data generator

• Good choice for vantage points is to find corners [Yianilos 93] (farthest-first clustering)
  – Might be true for Euclidean spaces
  – Early result, not true for our data

• High-dimensional indexing always asymptotically reduces to linear scans
  – Formal result based on an assumption of uniform data distributions

Page 18:

[Figure 9. Comparison of metric-space index structures: RBT, GHT, and MVPT. Panel 1: number of distance calculations vs. radius (0 to 10). Panel 2: number of I/Os vs. radius (0 to 10).]

Comparison of Three Methods of Metric-Space Indexing

Page 19:

Open problems

• Is there a general metric-space index structure that is good for most workloads?
  – We are optimistic that MVP trees, with further tuning, will be a useful answer
  – Hyperplane methods are fair game; there is circumstantial evidence that they are a key component in Google’s search engine

• No work addresses clustering data pages on disk

• Metric-space join algorithms

Page 20:

Biological Models are Usually Based on Similarity

Similarity:

• Biologists like scoring functions that reward each similar feature with a positive number

• Intuitive

Distance:

• More similar: smaller numbers

• Identical: 0

Page 21:

But Do Metric Models Capture Biology?

• Metrics are a subset of possible mathematical models

Page 22:

Sequence Problem 1

Sequence similarity based on weighted edit distance

Accepted weight matrices, PAM & BLOSUM, are not metric

Log-odds matrices: negative values

Defy simple algebraic normalization [Taylor & Jones 93, Linial et al. 97]

Page 23:

Our First Result: mPAM [Xu & Miranker 04]

Dayhoff et al.’s PAM Derivation [74]

• Took a set of closely related protein sequences

• Developed a phylogenetic tree

• Counted substitutions to transform one sequence to another

• Tree determines a measure of time

Page 24:

PAM vs. mPAM: t = 1/f

Using original substitution counts

PAM: frequency of substitution

$S(a,b \mid t) = \log \dfrac{P(b \mid a,t)}{q_b}$

mPAM: expected time between substitutions

D(a,b) = 1/log(1 – (P(a,x)P(b,x))x

Page 25:

Sequence Problem 2

• Sequences are long units (the unit of identity for storage and retrieval)
  – Genes
  – Chromosomes

• Analysis comprises comparing small substrings

Page 26:

Soln: Sequence View

• New view type

• Breaks sequences into q-grams

CREATE SEQUENCEVIEW rice_sview AS
SELECT CREATE FRAGMENTS (…, 3, 1)
FROM …
WHERE …
USING HAMMING-DISTANCE
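As a rough illustration of what such a view holds, the sketch below breaks a sequence into overlapping q-grams and looks up fragments within a Hamming range. It assumes the arguments 3 and 1 mean fragment length 3 and step 1, and that offsets are 1-based; `create_fragments` and `fragments_within` are hypothetical helper names, not mSQL functions.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def create_fragments(seq: str, length: int, step: int):
    """List (offset, fragment) pairs for every q-gram of the given length."""
    return [(i + 1, seq[i:i + length])                 # 1-based offsets (assumption)
            for i in range(0, len(seq) - length + 1, step)]

def fragments_within(fragments, query: str, max_dist: int):
    """Linear-scan stand-in for an indexed Hamming range query."""
    return [(off, frag) for off, frag in fragments
            if hamming(frag, query) <= max_dist]

frags = create_fragments("ATCAAA", 3, 1)
print(frags)
print(fragments_within(frags, "CAA", 1))
```

A sequence of length n yields n - q + 1 fragments at step 1, so the view is roughly as large as the genome itself, which is why it is worth materializing as an index rather than scanning.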

Page 27:

Materialize as an Index

Genomes

Rowid   Seq
R1      ACAACA
R2      ATCAAA
R3      …

Rowid   Offset   Logical Fragment
R1      1        ACA
R1      2        CAA
R1      3        AAC
R1      4        ACA
…       …        …
R2      1        ATC
R2      2        TCA
R2      3        CAA
R2      4        AAA
…       …        …

[Figure: index nodes over the fragments, labeled D(ACA) ≤ 1, D(CAA) ≤ 0, D(ATC) ≤ 1, D(AAA) ≤ 2]

Page 28:

Status

• Started with McKoi
  – A Java open-source object-relational DBMS
  – (Think of Postgres written in Java)

• Added
  – Biological data types
  – Metric-space index
  – Extending SQL engine (in progress)

Page 29:

Computed in MoBIoS: Compare Arabidopsis Genome × Rice Genome

1. Locate nucleotide patterns of the form of a primer pair candidate

[Figure: primer pair candidate. In both Rice and Arab., two runs of 18 matching nucleotides, separated by a Rice gap 400 – 3000 long and an Arab. gap 400 – 3000 long]

2. Eliminate non-unique primer candidates

3. Merge overlapping primer candidates

• Usual implementations O(n^2), n = 10^9

Page 30:

mSQL Query to locate candidate primer pairs

SELECT merge(R1.fragment, A1.fragment)
FROM G1_sview R1, G1_sview R2, G2_sview A1, G2_sview A2
WHERE distance('HAMMINGDISTANCE', R1.fragment, A1.fragment) <= 1.0
  AND distance('HAMMINGDISTANCE', R2.fragment, A2.fragment) <= 1.0
  AND (FRAGOFFSET(R2.fragment) - FRAGOFFSET(R1.fragment)) >= 400
  AND (FRAGOFFSET(R2.fragment) - FRAGOFFSET(R1.fragment)) <= 3000
  AND (FRAGOFFSET(A2.fragment) - FRAGOFFSET(A1.fragment)) >= 400
  AND (FRAGOFFSET(A2.fragment) - FRAGOFFSET(A1.fragment)) <= 3000
GROUP BY R1.fragment, A1.fragment;
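To make the join conditions concrete, here is a brute-force sketch of what the query asks for, run on toy sequences with a short fragment length and small gap bounds. It is not how MoBIoS evaluates the query: the quadratic scan stands in for the indexed lookup, the merge and GROUP BY steps are omitted, and `primer_pair_candidates` and its parameters are hypothetical names.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def fragments(seq: str, q: int):
    """(offset, fragment) pairs for every q-gram, 0-based offsets."""
    return [(i, seq[i:i + q]) for i in range(len(seq) - q + 1)]

def primer_pair_candidates(rice, arab, q, max_dist, min_gap, max_gap):
    """Return (r1, r2, a1, a2) offsets satisfying the query's WHERE clause."""
    # Fragment pairs across the two genomes that nearly match.
    matches = [(ro, ao)
               for ro, rf in fragments(rice, q)
               for ao, af in fragments(arab, q)
               if hamming(rf, af) <= max_dist]
    # Two near-matches form a candidate if both offset gaps are in range.
    return [(r1, r2, a1, a2)
            for r1, a1 in matches
            for r2, a2 in matches
            if min_gap <= r2 - r1 <= max_gap and min_gap <= a2 - a1 <= max_gap]

rice = "ACGTTTGGCA"    # toy data: ACGT ... GGCA
arab = "ACGTCCCGGCA"   # same flanking 4-mers, different spacer
print(primer_pair_candidates(rice, arab, q=4, max_dist=0, min_gap=5, max_gap=8))
```

The nested scan over all fragment pairs is the O(n^2) cost the slides attribute to usual implementations; replacing it with a metric-space index lookup per fragment is the point of the query plan.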

Page 31:

Query Plan

Inputs: Arab. Genome, O(n); Rice Genome, O(m)

1. Offline: Build Sequence View, O(n log n)

2. Compare, O(m log n), Indexed Nested Loop

3. Eliminate Duplicates

4. Eliminate Low-Complexity Primers (LZ compression)

5. Merge Overlapping Primers

Result: ~10,000 conserved primer pair candidates
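The low-complexity step can be illustrated with a compression-based filter: repetitive fragments compress to fewer bytes than varied ones. The sketch below uses Python's `zlib` (DEFLATE, an LZ77 variant) as a stand-in, since the slides do not say which LZ scheme or threshold was used; `compressed_size`, `drop_low_complexity`, and the `min_size` cut-off are assumptions.

```python
import zlib

def compressed_size(fragment: str) -> int:
    """Bytes needed by zlib at maximum compression level."""
    return len(zlib.compress(fragment.encode("ascii"), 9))

def drop_low_complexity(fragments, min_size: int):
    """Keep fragments whose compressed size reaches min_size bytes."""
    return [f for f in fragments if compressed_size(f) >= min_size]

varied = "ACGTTGCAAGCTTCGGAT"      # 18 nucleotides, no repeated 3-mer
homopolymer = "A" * 18
dinucleotide = "AC" * 9

for f in (varied, homopolymer, dinucleotide):
    print(f, compressed_size(f))

# Assumption: the cut-off is calibrated against a fragment known to be complex.
print(drop_low_complexity([varied, homopolymer, dinucleotide],
                          min_size=compressed_size(varied)))
```

For fragments this short the fixed zlib header dominates the byte count, so only relative sizes are meaningful; a production filter would need a threshold tuned on real primer data.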

Page 32:

Preliminary Results

• Found 13,418 possible primer pairs from MoBIoS

• 100 best candidates BLASTed for matches in GenBank
  – 15 matched other plant genes and the primers
  – At least 2 of 15 showed potential after PCR amplification against Helianthus and Phalaenopsis

Page 33:

MoBIoS Architecture (Molecular Biological Information System)

Page 34:

Analysing Mass-Spectra

Spectrum = Histogram of Mass/Charge Ratios of a collection of peptides

Similarity = Shared peaks count = Inner Product

(0100101) • (0111100) = 2

Page 35:

Cosine Distance Approximates Inner Product

$D_{rs} = 1 - \dfrac{x_r x_s'}{(x_r x_r')^{1/2}\,(x_s x_s')^{1/2}}$

Shown: we can store and retrieve mass spectra using cosine distance, and it scales
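A short numerical sketch of the two measures, using the binary spectra from the shared-peaks example, (0100101) and (0111100). The function names are illustrative, and spectra are assumed to be equal-length vectors of peak indicators.

```python
import math

def inner_product(x, y):
    """Shared peaks count when x and y are 0/1 peak indicators."""
    return sum(a * b for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 minus the cosine of the angle between the two spectra."""
    norm = math.sqrt(inner_product(x, x)) * math.sqrt(inner_product(y, y))
    if norm == 0:
        raise ValueError("cosine distance is undefined for an empty spectrum")
    return 1.0 - inner_product(x, y) / norm

r = [0, 1, 0, 0, 1, 0, 1]
s = [0, 1, 1, 1, 1, 0, 0]

print(inner_product(r, s))      # shared peaks
print(cosine_distance(r, s))    # 1 - 2 / (sqrt(3) * sqrt(4))
```

Normalizing by the vector lengths keeps a spectrum with many peaks from looking similar to everything, which a raw shared-peaks count does not.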

Page 36:

mSQL Query for Protein Identification by Mass-Spec.

Signature Database Look

SELECT Prot.accesion_id, Prot.sequence
FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
WHERE MS.enzyme = DS.enzyme = E and
      Cosine_Distance(S, MS.spectrum, range1) and
      DS.accession_id = MS.accession_id = Prot.accesion_id and
      DS.ms_peak = P and
      MPAM250(PS, DS.sequence, range2);

Page 37:

Matching Electrostatic Shape of Molecules

Page 38:

Still benefit from grid services:

• Intermittently, but regularly, compile (recluster) the indices: O(n log n), n > 10^6

• Rational drug design: O(log n) finite element solutions to traverse the search tree

• Make a service call to the grid for these operations only

• Mirror data contents to minimize I/O

• Since the need is intermittent, one grid serves many MoBIoS servers

[Figure: GRID and MoBIoS Server linked by high-speed I/O. Labels: Mirror DB-Contents, recluster, New index, Shape match (FEM), Distance (real)]

Page 39:

Hyper-planes [Uhlmann 91]

• If d(x,h1) < d(x,h2) then x is assigned to h1

[Figure: point x between pivots h1 and h2]
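A minimal sketch of the hyperplane split and the pruning it allows. The assignment rule is the one above; the pruning test is the standard triangle-inequality bound for this kind of split (a side can be skipped when the query is more than 2r closer to the other pivot), stated here as background rather than taken from the slides. The names `partition` and `sides_to_search` are hypothetical.

```python
def partition(points, h1, h2, d):
    """Assign each point to the closer pivot; ties go to h2."""
    near_h1 = [x for x in points if d(x, h1) < d(x, h2)]
    near_h2 = [x for x in points if d(x, h1) >= d(x, h2)]
    return near_h1, near_h2

def sides_to_search(q, h1, h2, r, d):
    """Which sides can contain a point within range r of q."""
    d1, d2 = d(q, h1), d(q, h2)
    sides = []
    # A point on the h1 side within r of q forces d1 - d2 < 2r.
    if d1 - d2 < 2 * r:
        sides.append("h1")
    # A point on the h2 side within r of q forces d2 - d1 <= 2r.
    if d2 - d1 <= 2 * r:
        sides.append("h2")
    return sides

dist = lambda a, b: abs(a - b)
print(partition([1, 4, 5, 6, 9], 0, 10, dist))
print(sides_to_search(9, 0, 10, 1, dist))   # query near h2: the h1 side is skipped
print(sides_to_search(5, 0, 10, 1, dist))   # query on the boundary: both sides
```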

Page 40:

Develop a Hierarchical Clustering

Hierarchy of bounding spheres, (center, radius)

• Bounding spheres may overlap

• Inspired by R-trees

[Figure: overlapping bounding spheres labeled A through F]
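A range query over such a hierarchy can skip any sphere whose boundary is farther than r from the query, and must descend into every sphere that is not ruled out, since overlapping spheres may both hold answers. The sketch below is a toy illustration with hand-built spheres over numbers; the `Sphere` class and `range_search` are hypothetical names, and no construction or balancing algorithm is implied.

```python
from dataclasses import dataclass, field

@dataclass
class Sphere:
    center: object
    radius: float
    children: list = field(default_factory=list)   # nested bounding spheres
    points: list = field(default_factory=list)     # data stored at a leaf

def range_search(node, q, r, d, out=None):
    """Collect every stored point within distance r of q."""
    if out is None:
        out = []
    # Everything in the sphere is within `radius` of `center`, so if q is
    # farther than radius + r from the center, nothing inside can match.
    if d(q, node.center) > node.radius + r:
        return out
    out.extend(x for x in node.points if d(q, x) <= r)
    for child in node.children:        # overlapping children may all be visited
        range_search(child, q, r, d, out)
    return out

dist = lambda a, b: abs(a - b)
a = Sphere(center=2, radius=2, points=[0, 2, 4])     # covers 0..4
b = Sphere(center=5, radius=2, points=[3, 5, 7])     # covers 3..7, overlaps a
root = Sphere(center=3.5, radius=3.5, children=[a, b])

print(range_search(root, 3.5, 0.5, dist))   # answers come from both spheres
print(range_search(root, 7, 0.5, dist))     # sphere a is pruned
```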