Top Banner
Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University
21

Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Dec 17, 2015

Download

Documents

Lenard Horton
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Database Engines for Bioscience

John Corwin

Avi Silberschatz

Swathi Yadlapalli

Yale University

Page 2: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Database Engines for Bioscience

The Yale Center for Medical Informatics, led by Perry Miller, makes extensive use of biomedical and bioscience databases.

Large database projects include– Trial/DB– SenseLab

Page 3: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Trial/DB

Trial/DB is an open-source clinical study data management system– Manages patient data across multiple studies.– No limit on the number of patients per study.– No limit on the number of parameters that are

tracked in each study. – Currently used for a large number of clinical

studies at Yale and other universities.

Page 4: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

SenseLab

Stores integrated, multidisciplinary models of neurons and neural systems.– Cell properties (receptors, currents, and

transmitters)– Neurons– Networks of neurons (computational neuron

models)

Page 5: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Database Engines for Bioscience

What’s different about these databases?– Sparse data– Frequent schema changes– Row-level security policies– Lots of metadata

Page 6: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

EAV

Entity-Attribute-Value (EAV)– Also known as Object-Attribute-Value

Data is stored in 3-column tables– Column 1 stores the object name– Column 2 stores the attribute name– Column 3 stores the attribute value

Page 7: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

EAV – Example

Suppose we’re building a database of fruit types

Entity Attribute Value

Apple Color Red

Apple AvgWeight 8.2

Banana AvgWeight 11.5

Banana Color Yellow

Apple Potassium 159

Page 8: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

EAV/CR

EAV/CR: extends EAV with classes and relationships– Classes – the value field in each EAV triple may

contain complex, structured data– Relationships – values may refer to other objects

in the database, thus relationships in the database are stored explicitly

Page 9: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Drawbacks of EAV and EAV/CR

Since all data is conceptually in a single table, the only operations we can perform are filters and self-joins

Support for regular queries must be re-implemented on top of the EAV database

The performance of attribute-centered queries is significantly worse than a conventional database

High storage overhead for dense data

Page 10: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Our work

Extend a conventional database engine to better support the requirements of bioscience databases.– Take advantage of existing tools– Better performance– Better space-efficiency

Page 11: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Schema modification

Use 1-column tables Each attribute is stored in an individual table The original table is formed by joining the 1-column

tables by tuple index

Fruit Color AvgWeight Potassium

Apple Red 8.2 159

Banana Yellow 11.5 400

Grape Green 0.2 3

Page 12: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Schema modification

Adding and Removing attributes– One-column tables can be added and removed without

affecting the existing tables

Fruit Color AvgWeight Sodium

Apple Red 8.2 15

Banana Yellow 11.5 3

Grape Green 0.2 2

Fruit Color AvgWeight Sodium Potassium

Apple Red 8.2 15 159

Banana Yellow 11.5 3 400

Grape Green 0.2 2 3

Fruit Color AvgWeight Potassium

Apple Red 8.2 159

Banana Yellow 11.5 400

Grape Green 0.2 3

Page 13: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Schema modification

Query– Original table is formed by joining each one-

column table by tuple index

Page 14: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Schema modification

To test this new storage mechanism, we implemented it in PostgreSQL

PostgreSQL is an open-source relational database engine

– Based on the original Postgres engine developed at Berkeley

– Supports modern database features such as complex queries, transactional integrity, and extensible functions and data types

– 500,000 lines of C code

Page 15: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Dynamic Tables: Interface

Create an extension of the SQL syntax All other table operations work normally on dynamic

tables – the implementation of dynamic tables is transparent to the database user

CREATE DYNAMIC TABLE fruit_table (fruit string,avgWeight float,potassium int);

Page 16: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Dynamic Tables: Implementation

Original statement:CREATE DYNAMIC TABLE t(c1 t1, c2 t2, ..., cn tn);

Create individual tables, for 1 ≤ i ≤ n:CREATE TABLE dyn_t_i(ci ti);

Expose original table as a view:CREATE VIEW t AS (SELECT c1, c2, ..., cn FROM

dyn_t_1, dyn_t_2, ..., dyn_t_n WHERE (c1.ctid = c2.ctid) AND ... AND (cn-1.ctid = cn.ctid));

Page 17: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Dynamic Tables: Implementation

Override implementation of insert, update, and delete

Insert: Given the queryINSERT INTO t VALUES (v1, ..., vn);

Translates to

INSERT INTO dyn_t_i VALUES (v_i);

for 1 ≤ i ≤ n

Page 18: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Dynamic Tables: Implementation

Update and Delete: take advantage of PostgreSQL’s rule system to translate operations to individual tables

Must maintain the invariant that each table has the same number of entries

Page 19: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Sparse Data

Make the column index explicit Change our one-column tables to two-

column tables– Column 1: row-index of this tuple– Column 2: data value

Table is exposed by joining sub-tables by explicit row index

Dense and sparse attributes can be mixed within the same relation

Page 20: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Database Engines for Bioscience

Future work:– Row-level security– Metadata– Implement SenseLab database and compare

performance

Page 21: Database Engines for Bioscience John Corwin Avi Silberschatz Swathi Yadlapalli Yale University.

Database Engines for Bioscience

Questions?