CS Guest Lecture 2015 10-05 advanced databases

Background

Golden Helix
- Founded in 1998
- Genetic association software
- Analytic services
- Hundreds of users worldwide
- Over 900 customer citations in scientific journals

Products I Build with My Team
- SNP & Variation Suite (SVS)
  - SNP, CNV, NGS tertiary analysis
  - Import and deal with all flavors of upstream data
- VarSeq
  - Annotate and filter variants in gene panels, exomes and genomes for clinical labs and researchers.
- GenomeBrowse (Free!)
  - Visualization of everything with genomic coordinates. All standardized file formats.

Database Trends

VarSeq

VSReports
- Tertiary analysis to report in one click
- Focused and actionable data
- Modeled on ACMG guidelines
- Hereditary and cancer templates
- OMIM included

VSPipeline
- Command line runner
- Integrate with your current bioinformatics pipeline
- Create repeatable clinical workflows for CLIA and CAP certified analysis
- Supports high throughput scenarios

90s - SQL
- Transactions
- Disk structure optimized
- Fixed schema
- SQL matures
- Small mem footprints
- Master <-> Slaves
- Threaded / Locking
- Expensive large mainframes/servers

00s - NoSQL
- “Web Scale” - distributed
- Eventually consistent
- Schema-less, key-based
- Avoid joins
- Peer-to-peer
- Memory cheap
- Many cheap commodity servers in datacenter configurations

10s - NewSQL
- Scale out
- First class sharding
- Utilize cheap memory
- Don’t let disk be bottleneck
- Support stream / distributed analytics

> SELECT * FROM trends GROUP BY decade;

The “Database” Market in Thirds

OLTP
- ACID / Transactions
- “Traditional” row-based: MySQL, Postgres, Oracle, MSSQL
- NewSQL: VoltDB (scale-out), Google Spanner/F1, MemSQL, Clustrix

Other
- Key and Hierarchical Based
- Wide Columnar Stores: BigTable / HBase, Cassandra
- Hierarchical/Document: MongoDB, Couchbase
- Key-Value Stores: Redis, Memcached, FoundationDB
- Tuple/Triple-stores

Data Warehousing
- Query Optimized
- Amazon Redshift, HP Vertica, Infobright, Google BigQuery, Teradata, Cloudera Impala, Hadoop+Hive

http://www.se-radio.net/2013/12/episode-199-michael-stonebraker/

Mike Stonebraker

Illustra (from Postgres), acquired by IBM Informix (1996)

StreamBase (from Aurora), acquired by TIBCO (2013)

Vertica (from C-Store), acquired by HP (2011)

VoltDB (from H-Store), 23M in funding over 4 rounds

Paradigm4 (from SciDB)

INGRES – 73 -> 90

Postgres – 84 -> 92

Mariposa – 92 -> 97

Aurora – 01 -> 08

C-Store – 05 -> 09

H-Store – 07 -> Present

SciDB – 08 -> Present

Data Warehouse Solutions

Big Data, Small Analytics => Don’t use MapReduce

http://www.slideshare.net/Hapyrus/amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive

Data Warehousing / Scientific Analysis => Columnar

You’ve got to know what regression means, what Naïve Bayes means, what k-Nearest Neighbors means. It’s all statistics.

All of that stuff turns out to be defined on arrays. It’s not defined on tables. The tools of future data scientists are going to be array-based tools. Those may live on top of relational database systems. They may live on top of an array database system, or perhaps something else. It’s completely open.

• Columns -> Faster Queries
• Divide columns into chunks
• Compress chunks (better ratios than rows)
• Pre-compute chunk-level attributes (min/max etc)
• Flexible storage layer
  • Distributed
  • Encodings (Parquet, ORC/Hive, custom)
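The chunking bullets above can be sketched in a few lines of Python. This is only an illustration, not any particular engine's format: the names `Chunk`, `build_chunks` and `count_greater_than` are made up for this example, and `zlib` stands in for whichever codec a real store would use. The point is that pre-computed min/max values let a query skip or fully accept most chunks without decompressing them.

```python
import zlib
from array import array
from dataclasses import dataclass


@dataclass
class Chunk:
    lo: int          # pre-computed minimum of the chunk
    hi: int          # pre-computed maximum of the chunk
    count: int       # number of values in the chunk
    payload: bytes   # compressed values


def build_chunks(values, chunk_size=1000):
    """Split one column into fixed-size chunks and compress each chunk."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        part = values[start:start + chunk_size]
        raw = array("q", part).tobytes()
        chunks.append(Chunk(min(part), max(part), len(part), zlib.compress(raw)))
    return chunks


def count_greater_than(chunks, threshold):
    """Count values > threshold, decompressing only chunks that might match."""
    matches = 0
    decompressed = 0
    for chunk in chunks:
        if chunk.hi <= threshold:
            continue                  # nothing in this chunk can match
        if chunk.lo > threshold:
            matches += chunk.count    # everything in this chunk matches
            continue
        decompressed += 1
        vals = array("q")
        vals.frombytes(zlib.decompress(chunk.payload))
        matches += sum(1 for v in vals if v > threshold)
    return matches, decompressed


column = list(range(10_000))          # toy column, already sorted
chunks = build_chunks(column)
matches, decompressed = count_greater_than(chunks, 8_500)
```

With this toy column only one of the ten chunks has to be decompressed to answer the query; on unsorted data the min/max ranges overlap more and the benefit shrinks.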

Extract, Transform, Load (ETL)

“Dimensional Modeling”
- Fact tables & dimensional tables
- Fact tables often measurements over time
- Dimensional table goes into item details
- Denormalized data, complexity hidden
- Often many sources loaded into same warehouse
  - Logs
  - One or more relational databases (sales, customer-facing etc)
  - Vendor / Payment information

Example
“Like table”: datetime, user_id, post_id, client_data
“User table”: user_id, subscription_type, last_paid, has_android_app
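The Like/User example can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: the table names `likes` and `users`, the column types and the sample rows are assumptions made for illustration (the slide only gives the column names). The fact table holds events over time, the dimension table holds details about each user, and a typical warehouse query aggregates facts grouped by a dimension attribute.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (              -- dimension table: item details
    user_id INTEGER PRIMARY KEY,
    subscription_type TEXT,
    last_paid TEXT,
    has_android_app INTEGER
);
CREATE TABLE likes (              -- fact table: one row per event
    datetime TEXT,
    user_id INTEGER,
    post_id INTEGER,
    client_data TEXT
);
""")

# Made-up sample rows, for illustration only.
con.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", [
    (1, "free", "2015-01-01", 1),
    (2, "paid", "2015-09-01", 0),
    (3, "paid", "2015-09-15", 1),
])
con.executemany("INSERT INTO likes VALUES (?, ?, ?, ?)", [
    ("2015-10-01T09:00", 1, 10, "web"),
    ("2015-10-01T09:05", 2, 10, "ios"),
    ("2015-10-02T11:30", 2, 11, "ios"),
    ("2015-10-03T08:15", 3, 12, "android"),
])

# Aggregate the fact table by an attribute of the dimension table.
rows = con.execute("""
SELECT u.subscription_type, COUNT(*) AS n_likes
FROM likes AS l
JOIN users AS u ON u.user_id = l.user_id
GROUP BY u.subscription_type
ORDER BY u.subscription_type
""").fetchall()
```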

Genomics (Other Life Science) Data

Data Warehouse Like

Gabe’s Adjusted “Moore’s Law” NGS Cost Graph

Sequencers: Versatile tools for science

Genomics is Big Data

5,000 public data repositories

Broad Institute:
- Process 40K samples/year
- 1000 people
- 51 High Throughput Sequencers
- 10+ PB of storage

1 Genome in Data
- ~300GB Compressed Sequence Data
- ~150MB Compressed Variant Data
- Seq data went through 5-6 steps
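A quick back-of-the-envelope calculation puts the numbers above side by side. It assumes, purely for illustration, that every one of the 40K samples is a whole genome at the per-genome sizes quoted on the slide; real workloads mix genomes, exomes and panels, so treat the result as an order-of-magnitude figure only.

```python
GB = 10**9
TB = 10**12
PB = 10**15

samples_per_year = 40_000           # from the slide
sequence_bytes = 300 * GB           # ~300GB compressed sequence data per genome
variant_bytes = 150 * 10**6         # ~150MB compressed variant data per genome

# Assumption: every sample is a whole genome at the sizes above.
yearly_sequence_pb = samples_per_year * sequence_bytes / PB
yearly_variant_tb = samples_per_year * variant_bytes / TB
reduction_factor = sequence_bytes / variant_bytes

print(f"sequence data per year: {yearly_sequence_pb:.0f} PB")
print(f"variant data per year:  {yearly_variant_tb:.0f} TB")
print(f"sequence-to-variant reduction: {reduction_factor:.0f}x")
```

Under that assumption the raw sequence data is in the same range as the 10+ PB of storage quoted above, while the variant data that tertiary analysis works with is about 2000 times smaller.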

We Want Variants

Differences between your DNA and a reference come in many sizes:
- Single letter substitutions are called Single Nucleotide Polymorphisms (SNPs)
- Small “length polymorphisms” are called Insertions/Deletions (InDels)
- Large duplications/deletions are called Copy Number Variations (CNVs)

The average European has ~3 million small variations relative to the reference. About 100K of those fall in the 30K “gene coding” regions (~2% of the genome).
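For small variants written as a reference allele and an alternate allele (as in a VCF record), the distinction above comes down to comparing allele lengths. The function below is a hypothetical helper written for this example, not part of any library, and it deliberately leaves out large copy number variations, which are not represented as short allele strings here.

```python
def classify_small_variant(ref: str, alt: str) -> str:
    """Classify a small variant from its reference and alternate alleles."""
    if len(ref) == 1 and len(alt) == 1:
        # Single letter substitution.
        return "SNP" if ref != alt else "no change"
    if len(alt) > len(ref):
        return "insertion"        # alternate allele adds bases
    if len(alt) < len(ref):
        return "deletion"         # alternate allele removes bases
    return "other"                # same length, multiple bases substituted


examples = [("A", "G"), ("A", "ATG"), ("ATG", "A"), ("AT", "GC")]
labels = [classify_small_variant(ref, alt) for ref, alt in examples]
```

The two length-changing cases together are what the slide calls InDels.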

Next Generation Sequencing Analysis

Primary Analysis
- Analysis of hardware generated data, software built by vendors
- Use FPGA and GPUs to handle real-time optical or electrical signals from sequencing hardware

Secondary Analysis
- Filtering/clipping of “reads” and their qualities
- Alignment/Assembly of reads
- Recalibrating, de-duplication, variant calling on aligned reads

Tertiary Analysis (“Sense Making”)
- QA and filtering of variant calls
- Annotation of variants (querying databases), filtering on results
- Merging/comparing multiple samples (multiple files)
- Visualization of variants in genomic context
- Statistics on matrices

Applications of NGS Data in the Clinic

Carrier screening – prenatal and standard

Lifetime risk prediction

Genetic disorder diagnostics

Oncology care

PGx – dosage and care

Public Annotations – Left Joins

Exact Matching “Variants”
- “Population Catalogs”
  - 1000 Genomes (84M variants)
  - NHLBI 6,500 Exomes (2M variants)
  - ExAC 61,486 exomes (10M variants)
- Clinical Classifications
- Precomputed predictions / scores
  - dbNSFP - 89.6M predictions

Algorithmic Classification
- How variant interacts with genes (85K transcripts)

Region Based
- Disease regions
- Gene Lists
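Both the exact-match and the region-based annotations above can be written as left joins, so that every variant from the sample is kept whether or not an annotation source knows about it. The sketch below uses Python's built-in `sqlite3`; the table layouts, the gene name `GENE_A` and all rows are invented for illustration and do not reflect the schema of any of the catalogs named on the slide.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT);
CREATE TABLE catalog  (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, alt_freq REAL);
CREATE TABLE genes    (chrom TEXT, start INTEGER, stop INTEGER, gene TEXT);

-- Made-up sample variants and annotation rows.
INSERT INTO variants VALUES ('1', 100, 'A', 'G'), ('1', 250, 'C', 'T'), ('2', 50, 'G', 'A');
INSERT INTO catalog  VALUES ('1', 100, 'A', 'G', 0.12), ('1', 250, 'C', 'A', 0.01);
INSERT INTO genes    VALUES ('1', 200, 300, 'GENE_A');
""")

rows = con.execute("""
SELECT v.chrom, v.pos, v.ref, v.alt, c.alt_freq, g.gene
FROM variants AS v
LEFT JOIN catalog AS c            -- exact match on all four fields
  ON c.chrom = v.chrom AND c.pos = v.pos AND c.ref = v.ref AND c.alt = v.alt
LEFT JOIN genes AS g              -- region based: position falls inside interval
  ON g.chrom = v.chrom AND v.pos BETWEEN g.start AND g.stop
ORDER BY v.chrom, v.pos
""").fetchall()
```

The variant at 1:250 shares a position with a catalog entry but has a different alternate allele, so the exact match leaves its frequency as NULL while the region join still finds the gene.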

Annotations are Hard!

HGVS is a standard that is not standard
- Tries to serve different goals
- Many representations of same variant
- Should not be used as IDs, but not many good alternatives

Transcripts
- Transcript set choice extremely important, hard to curate with meaningful attributes as well.

Public Data Curation
- ClinVar: multi-record lines
- NHLBI: MAF vs AAF, splitting “glob” fields
- 1kG: No genotype counts
- ExAC: Multi-allelic splitting, left-align
- COSMIC: No Ref/Alt, only HGVS
- dbNSFP: Abbreviations and aggregate scores

Versioning and Issues
- ClinVar missing variants in VCF
- dbSNP patches without version changes
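To give a feel for the curation work listed above, here is a small sketch of multi-allelic splitting followed by allele trimming. The function names are made up for this example. It only removes bases shared by both alleles; true left-alignment through repeats also needs the reference sequence and is not attempted here.

```python
def trim_alleles(pos, ref, alt):
    """Remove bases shared by ref and alt, keeping at least one base in each."""
    # Trim the shared suffix first; the position does not change.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim the shared prefix, moving the position forward.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def split_multiallelic(chrom, pos, ref, alts):
    """Turn one record with comma-separated alternates into one record each."""
    return [(chrom,) + trim_alleles(pos, ref, alt) for alt in alts.split(",")]


# One multi-allelic record: a substitution and a deletion at the same site.
records = split_multiallelic("1", 100, "CAG", "CAT,C")
```

After splitting, the substitution is reduced to a single base at position 102 while the deletion keeps its original position and alleles, which is why exact matching against a catalog only works when both sides were normalized the same way.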

Splice Mutation


N-Glycanase Deficiency

http://www.ngly1.org/

Matthew Might and Matt Wilsey. The shifting model in clinical diagnostics: how next-generation sequencing and families are altering the way rare diseases are discovered, studied, and treated. Genetics in Medicine. March 2014.


Personalized Medicine


- Cancer is a disease of the genome
- “Molecular Targeted” drugs are effective and usually side-effect free
- Required genetic testing to direct cancer treatment is becoming affordable



Tabular Storage Format

Postgres FDW

TSF

Use SQLite as container.

SQLite has great caching, multi-threading and read/write properties

Specialized genomic index, plus lexicographical indexes (LevelDB to do string sorting)

GZIP / BLOSC chunk compression

Primitive, Enums and List Types
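The idea of using SQLite as a container for compressed column chunks can be sketched as follows. This is not the actual TSF layout, which the slides do not spell out: the `chunks` table, the `write_column` and `read_value` helpers, and the tiny chunk size are assumptions made for illustration, and `zlib` stands in for the GZIP / BLOSC codecs mentioned above.

```python
import sqlite3
import zlib
from array import array

CHUNK_ROWS = 4    # tiny chunk size so the example stays readable

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE chunks (
    field    TEXT,
    chunk_no INTEGER,
    data     BLOB,
    PRIMARY KEY (field, chunk_no)
)
""")


def write_column(field, values):
    """Store one integer column as compressed, fixed-size chunks."""
    for chunk_no, start in enumerate(range(0, len(values), CHUNK_ROWS)):
        raw = array("i", values[start:start + CHUNK_ROWS]).tobytes()
        con.execute("INSERT INTO chunks VALUES (?, ?, ?)",
                    (field, chunk_no, zlib.compress(raw)))


def read_value(field, row_id):
    """Fetch a single value by decompressing only the chunk that holds it."""
    chunk_no, offset = divmod(row_id, CHUNK_ROWS)
    (blob,) = con.execute(
        "SELECT data FROM chunks WHERE field = ? AND chunk_no = ?",
        (field, chunk_no)).fetchone()
    vals = array("i")
    vals.frombytes(zlib.decompress(blob))
    return vals[offset]


write_column("pos", [100, 250, 300, 410, 520, 630])
```

A call such as `read_value("pos", 4)` touches exactly one row of the container table, so SQLite's page cache and locking do the heavy lifting while the column data stays compressed on disk.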

TSF in Practice - VarSeq

TSF Backed Relational Data Store

- More efficient conditional queries
- Invisible Joins (i.e. row_id => array offset)
- Size on disk

“NULL [NA, Missing values] values are part of the domain space, which avoids auxiliary bit masks at the expense of 'losing' a single value from the domain.”

SQL front-end allows using as back-end to existing analytic and web-stacks
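Two of the points above, missing values living inside the domain and joins by array offset, fit in a short sketch. The sentinel choice (the smallest 32-bit integer), the column names `depth` and `quality`, and the helper functions are all assumptions made for illustration, not a description of how TSF encodes its data.

```python
from array import array

INT32_MISSING = -2**31    # one value of the int32 domain is reserved for NULL


def encode(values):
    """Pack Python values into an int32 array, mapping None to the sentinel."""
    return array("i", [INT32_MISSING if v is None else v for v in values])


def decode_one(stored_value):
    """Map the sentinel back to None; no separate bit mask is consulted."""
    return None if stored_value == INT32_MISSING else stored_value


# Two columns of the same table; the row_id is simply the array offset.
depth = encode([31, None, 18, 44])
quality = encode([99, 40, None, 60])


def quality_where_depth_over(threshold):
    """Filter on one column, then fetch the other column by the same offset."""
    return [(row_id, decode_one(quality[row_id]))
            for row_id, d in enumerate(depth)
            if d != INT32_MISSING and d > threshold]


result = quality_where_depth_over(20)
```

The filter and the lookup never build a join table: the matching row ids are used directly as offsets into the second column, and the only cost of the NULL encoding is that -2**31 can no longer be stored as a real value.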
