BODHI, A Bio-diversity Database Pla(n)tform. Jayant Haritsa, Database Systems Lab, Supercomputer Education and Research Centre, Indian Institute of Science

Dec 25, 2015


BODHI 1

BODHI, A Bio-diversity Database Pla(n)tform

Jayant Haritsa

Database Systems Lab

Supercomputer Education and Research Centre

Indian Institute of Science

BODHI 2

Team

B. J. Srikanta (next talk)

Prof. Madhav Gadgil and Prof. V. Nanjundiah (Centre for Ecological Sciences, IISc)

Several Masters Students

Funded by DBT

BODHI 3

Motivation

GATT – Patent Laws: to be in place by 2005

Loss: Neem, Basmati (estimated export value: Rs. 1,198 crore), Turmeric

Global and local efforts: GBIF (Global Biodiversity Information Facility); Karnataka Bio-diversity Board [Deccan Herald - Aug 26 2000]

BODHI 4

Bio-diversity Data

Taxonomy of species: phenetic (physical) characteristics; phylogenetic (evolutionary) characteristics

Habitat / Spatial distribution: political layout; geographic layout; biospheres

Genetic information: bio-molecular sequences; structural information

BODHI 5

MULTI-DOMAIN QUERY

With respect to “Michelia-champa”, retrieve all plant species that share a common habitat, have identical inflorescence characteristics, and have a DNA sequence within a BLAST score of 80.
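
A minimal in-memory sketch of how such a query combines the three domains is given below. Everything in it is an assumption made for illustration: the `Species` record, the sample data, and `toy_similarity`, which is a crude positional-match percentage standing in for a real BLAST score. The threshold is read here as "score of at least 80"; the slide does not say how BODHI interprets it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Species:
    name: str
    habitats: frozenset    # spatial domain: names of regions where it occurs
    inflorescence: str     # taxonomy domain: one phenetic characteristic
    dna: str               # genomic domain: a DNA sequence


def toy_similarity(a: str, b: str) -> int:
    """Percentage of matching positions. NOT BLAST, only a stand-in score."""
    if not a or not b:
        return 0
    matches = sum(x == y for x, y in zip(a, b))
    return 100 * matches // max(len(a), len(b))


def multi_domain_query(db, target_name, min_score=80):
    """Species sharing a habitat, the inflorescence type, and a similar sequence."""
    target = next(s for s in db if s.name == target_name)
    return [
        s.name
        for s in db
        if s.name != target_name
        and s.habitats & target.habitats                     # common habitat
        and s.inflorescence == target.inflorescence          # identical trait
        and toy_similarity(s.dna, target.dna) >= min_score   # sequence filter
    ]


# Made-up sample data, for illustration only.
db = [
    Species("Michelia-champa", frozenset({"Western Ghats"}), "solitary", "ACGTACGTAC"),
    Species("species-A", frozenset({"Western Ghats", "Nilgiris"}), "solitary", "ACGTACGTAA"),
    Species("species-B", frozenset({"Western Ghats"}), "raceme", "ACGTACGTAC"),
    Species("species-C", frozenset({"Thar"}), "solitary", "ACGTACGTAC"),
    Species("species-D", frozenset({"Western Ghats"}), "solitary", "TTTTACGTAC"),
]
result = multi_domain_query(db, "Michelia-champa")
print(result)
```

Only `species-A` passes all three filters in this sample. A real system would evaluate each predicate through a domain-specific index rather than scanning every record.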

BODHI 6

Difficulties:

Complex range of data types: sets, hierarchies, aggregations, sequences, geometries, maps, audio, images …

Multidimensional data: from spatial (latitude, longitude, elevation) to proteins (hundreds of coordinates)

Computationally-intensive operators: species relationships, spatial distributions, sequence alignments, ...
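
To give a feel for why sequence alignment counts as computationally intensive, here is a sketch of the standard Smith-Waterman local alignment score. It is not taken from BODHI; the scoring parameters (`match`, `mismatch`, `gap`) are arbitrary illustrative defaults. The point is the nested loop: the cost grows with the product of the two sequence lengths.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b, with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)          # previous row of the scoring matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Comparing one query sequence against every stored sequence this way repeats the quadratic work per pair, which is why heuristics such as BLAST and FASTA, and indexes over sequences, matter.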

BODHI 7

Current Solutions

Small-Scale: MS-Access / FoxPro / Excel / ... on Pentium PCs

Large-Scale: RDBMS (Oracle / DB2 / Informix / Sybase / …) on Unix servers (Sun / SGI / IBM / HP / ...)

BODHI 8

Limitations:

The RDBMS approach of “the world is a flat collection of tables with simple attributes” suits financial applications, NOT scientific (biological) applications.

In particular, taxonomic / spatial / sequence / multimedia data modeling and processing are very cumbersome and coarse.

BODHI 9

Limitations (contd)

Spatial and other applications are not within the database kernel but are connected externally. For example, many GIS systems have ArcInfo and MS-Access hooked up in a “black-box” manner, and BLAST/FASTA are run on sequence files generated from Oracle.

Problem: Slow and ugly!

BODHI 10

Is there Hope?

Object-Oriented DBMS: “natural” for biological applications

High-performance data access methods: Path Dictionary Index, Multi-key Type Index, Pyramid Tree, ...

High-performance specialized operators: spatial join, data mining, sequence processing, …

XML = HTML + Semantics
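
The claim that an object model is “natural” for biological data can be illustrated with a small sketch of a taxonomic hierarchy. The `Taxon` class, its methods, and the trait values are hypothetical and are not BODHI's schema. Each taxon aggregates its sub-taxa, and a characteristic stated at a higher rank is inherited by everything below it.

```python
class Taxon:
    """A node in the taxonomic hierarchy (hypothetical class, for illustration)."""

    def __init__(self, rank, name, parent=None, **traits):
        self.rank = rank          # e.g. "Order", "Family", "Genus", "Species"
        self.name = name
        self.parent = parent      # link to the enclosing taxon
        self.traits = traits      # characteristics stated at this level
        self.children = []        # aggregation: a taxon holds its sub-taxa
        if parent is not None:
            parent.children.append(self)

    def lineage(self):
        """Names from the root of the hierarchy down to this taxon."""
        path, node = [], self
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path[::-1]

    def trait(self, key):
        """Look up a trait here, else inherit it from the nearest ancestor."""
        node = self
        while node is not None:
            if key in node.traits:
                return node.traits[key]
            node = node.parent
        return None

    def species(self):
        """Names of all species at or below this taxon."""
        if self.rank == "Species":
            return [self.name]
        return [name for child in self.children for name in child.species()]


# Trait values below are illustrative assumptions.
order = Taxon("Order", "Magnoliales", habit="woody")
family = Taxon("Family", "Magnoliaceae", order)
genus = Taxon("Genus", "Michelia", family)
champa = Taxon("Species", "Michelia-champa", genus, inflorescence="solitary")

print(champa.lineage())
print(champa.trait("habit"))
print(order.species())
```

In a flat relational schema the same lookups typically need a self-join per level of the hierarchy, whereas here they are simple pointer traversals.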

BODHI 11

Goals of BODHI

Seamless integration of taxonomic, spatial and genomic data using OO technology

Latest access methods and operators for all three types of data

Utilize XML for data exchange

Low-cost (ideally, free!)
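
As an illustration of XML-based data exchange, the sketch below round-trips one species record through XML using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`species`, `family`, `habitats`, `habitat`, `dna`) and the sample values are assumptions; the slides do not give BODHI's actual exchange format.

```python
import xml.etree.ElementTree as ET


def species_to_xml(record: dict) -> str:
    """Serialize a species record to an XML string (hypothetical layout)."""
    root = ET.Element("species", name=record["name"])
    ET.SubElement(root, "family").text = record["family"]
    habitats = ET.SubElement(root, "habitats")
    for h in record["habitats"]:
        ET.SubElement(habitats, "habitat").text = h
    ET.SubElement(root, "dna").text = record["dna"]
    return ET.tostring(root, encoding="unicode")


def species_from_xml(text: str) -> dict:
    """Parse the XML produced by species_to_xml back into a record."""
    root = ET.fromstring(text)
    return {
        "name": root.get("name"),
        "family": root.findtext("family"),
        "habitats": [h.text for h in root.iter("habitat")],
        "dna": root.findtext("dna"),
    }


# Sample values are made up for illustration.
record = {
    "name": "Michelia-champa",
    "family": "Magnoliaceae",
    "habitats": ["Western Ghats"],
    "dna": "ACGTACGTAC",
}
xml_text = species_to_xml(record)
print(xml_text)
```

Because the tags carry the meaning of each field, a receiving system can recover the record without knowing how the sender stores it.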

BODHI 12

Architecture of BODHI

[Figure: BODHI architecture. Components shown: The Internet; Client Interface Framework; Query Processor; Taxonomy Model, Spatial Model, Genome Model; Object Operations, Spatial Operations, Genome Operations; Object Indexes, Spatial Indexes, Genome Indexes; Object Services, Spatial Services, Sequence Services; OBJECT STORAGE MANAGER.]

BODHI 13

Implementation of BODHI

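[Figure: BODHI implementation, filling in the architecture components.]

The Internet

Client Interface Framework–DB

Taxonomy: Species, Genera, Family, Order

Object operations: Inheritance, Aggregation

Object indexes: Multi-Key Type, Path-Dictionary

Spatial: Country, State, City, River, Road

Spatial operations: Overlaps, Contains, Closest, Within

Spatial indexes: R*-tree, Hilbert R-tree

Genome: DNA, Protein

Genome operations: Alignment (BLAST, FASTA)

Genome indexes: ??? (next talk)

Basic Types (Point, Line, Polygon, Sets, Sequences, ...)

Spatial Services, Object Services, Sequence Services

SHORE MICRO-KERNEL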
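
The spatial operators named on this slide (Overlaps, Contains, Closest, Within) can be sketched on axis-aligned rectangles, the kind of bounding boxes that R-tree style indexes store. The `Rect` class and the sample coordinates are assumptions for illustration, not BODHI's code, and "Within" is read here as the inverse of "Contains"; the slide does not define it.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding box (hypothetical type, for illustration)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def overlaps(self, other: "Rect") -> bool:
        """True if the two boxes share at least one point."""
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)

    def contains(self, other: "Rect") -> bool:
        """True if other lies entirely inside this box."""
        return (self.xmin <= other.xmin and other.xmax <= self.xmax
                and self.ymin <= other.ymin and other.ymax <= self.ymax)

    def within(self, other: "Rect") -> bool:
        """True if this box lies entirely inside other."""
        return other.contains(self)

    def distance_to_point(self, x: float, y: float) -> float:
        """Euclidean distance from (x, y) to the box; 0 if the point is inside."""
        dx = max(self.xmin - x, 0.0, x - self.xmax)
        dy = max(self.ymin - y, 0.0, y - self.ymax)
        return math.hypot(dx, dy)


def closest(rects, x, y):
    """The box nearest to the point (x, y)."""
    return min(rects, key=lambda r: r.distance_to_point(x, y))


# Made-up coordinates standing in for a state, a city, a river and a remote area.
state = Rect(0, 0, 10, 10)
city = Rect(2, 2, 3, 3)
river = Rect(8, 5, 15, 6)
far = Rect(20, 20, 21, 21)

print(state.contains(city), city.within(state), state.overlaps(river))
print(closest([city, river, far], 9, 9))
```

An index such as an R*-tree applies these cheap box tests first to prune candidates, and only then runs exact geometry tests on the survivors.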

BODHI 15

Query Flow

BODHI 16

Project Status

Prototype (minus Client Interface Framework) has been operational since last month!

Platform: PIII-700MHz running Redhat Linux.

For code, contact “[email protected]”

BODHI 17

Performance Evaluation

SEQUOIA 2000 spatial benchmark: Competitive with Paradise GIS from Wisconsin

Taxonomy + Spatial Queries: Reasonably fast

But genomic queries slow things down a lot, due to the absence of genome indexes (next talk)

BODHI 18

More details

“Design and Implementation of a Biodiversity Information System”, Proc. of Intl. Conf. on Management of Data (COMAD), Pune, December 2000

“The Building of BODHI, A Bio-diversity Database System”, TechRep-2001-02, DSL/SERC, IISc

Available at http://dsl.serc.iisc.ernet.in

BODHI 19

End of Talk