Big Bio and Fun with HBase
Brian O'Connor, Pipeline Architect
UNC Lineberger Comprehensive Cancer Center
Overview
● The era of “Big Data” in biology
● Sequencing and scientific queries
● Computational requirements and data growth
● The appeal of HBase/Hadoop
● The SeqWare project and the Query Engine (QE)
● Implementation of HBase QE
● How well it works
● The future
● Adding indexing and a search engine
● Why HBase/Hadoop are important to modern biology
Biology In Transition
● Biologists are used to working on one gene their entire careers
● In the last 15 years biology has been transitioning to a high-throughput, data-driven discipline
● Now scientists study systems:
  – Billions of bases of sequence
  – Millions of SNPs
  – Thousands of genes
● Biology is not physics
The Beginnings
● The Human Genome with Sanger sequencing (1990-2000, $3 billion)
● Gave us the blueprints for a human being, let us find most of the genes in humans, and understand variation in humans
Capillary Sequencer
Output ~1000 bases
Illumina GAIIx
● 2x76 run, ~8 days
● >= 98.5% accuracy at 76 bases
● ~300M reads/flowcell, ~20-25GB high qual; 38M reads/lane, 2.5GB/lane, about 1.25GB aligned
● Cost ~$10K, about $450/GB
● ~20GB in ~8 days
SOLiD 3 Plus
● 2x50 run, ~14 days
● >= 99.94% accuracy (colorspace corrected)
● ~1G reads/run, ~60GB high quality
● Conservatively maybe 40GB aligned
● ~60GB in ~14 days
● Cost ~$20K, $333/GB
● A human genome in 2 weeks!
454 GS FLX
● 400bp reads, ~10 hours
● Q20 at 400 bases
● ~1M reads/run, ~400MB high qual
● Homopolymer issue
● Cost ~$14K for 70x75, ~$90K/GB
● So ~1GB in ~1 day
HeliScope
● ~35bp reads, ~8 day run
● <5% raw error rate
● ~700M reads/run, ~25GB/run
● Cost ?
● So ~25GB in ~8 days
The Era of “Big Data” in Biology
Types of Questions Biologists Ask
● Want to ask simple questions:
  – “What SNVs are in the 5'UTR of phosphatases?”
  – “What frameshifts affect PTEN in lung cancer?”
  – “What genes include homozygous, non-synonymous SNVs in all ovarian cancers?”
● Biologists want to see data across samples
● A database is the natural choice for this data; many examples exist
Impact of Next Gen Sequencing
● If the human genome gave us the generic blueprints, next gen sequencing lets us look at individuals' blueprints
● Applications:
  – Sequence many people and look at mutation patterns (have disease vs. does not have disease: subtract)
  – Compare normal vs. tumor to find mutations just in the tumor
  – Sequence an individual's cancer and find distinct druggable targets
The Big Problem
● Biologists think about reagents and tumors, not hard drives and CPUs
● The crazy growth of sequencing data exceeds the growth of all IT components
● Dramatically different models for scalability are required
● We're soon reaching a point where a genome will cost less to sequence than it does to look at the data!
Increase in Sequencer Output
[Chart: “Illumina Sequencer Output” — sequence file sizes per lane (bytes, log scale) vs. date, 08/2006-12/2010]
Suggests sequencer output increases by 10x every 2 years!
Far outpacing hard drive, CPU, and bandwidth growth
Moore's Law: CPU power doubles every 2 years
Kryder's Law: storage quadruples every 2 years
http://genome.wellcome.ac.uk/doc_WTX059576.html
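The gap compounds quickly. A back-of-the-envelope sketch, using only the growth rates quoted on these slides (10x sequencer output, 2x CPU, 4x storage, each per 2 years):

```python
# Fold-change after `years`, given a fold-change per 2-year period.
# Rates are the ones quoted on the slides.
def growth(fold_per_2yr, years):
    return fold_per_2yr ** (years / 2)

years = 6
seq = growth(10, years)   # sequencer output: 10^3 = 1000x
cpu = growth(2, years)    # Moore's law:      2^3  = 8x
disk = growth(4, years)   # Kryder's law:     4^3  = 64x

# After just 6 years, data has outgrown CPU power by 125x
# and storage density by ~16x.
gap_vs_cpu = seq / cpu
gap_vs_disk = seq / disk
```

The exponents differ, so no fixed hardware budget can keep pace: the gap itself grows exponentially.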
Lowering Costs = Bigger Projects
● Falling costs increase the scope of projects
● The Cancer Genome Atlas (TCGA):
  – 20 tumor types, 200 samples each, in 4 years
  – Around 4 petabytes of data
● Once costs drop below $1,000, many predict the technology will move into clinical applications:
  – 1.5 million new cancer patients in 2010
  – 1,500 petabytes per year!?
  – 4 PB/day vs. 0.2 PB/day for Facebook
http://www.datacenterknowledge.com/archives/2009/10/13/facebook-now-has-30000-servers/http://seer.cancer.gov/statfacts/html/all.html
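The 1,500 PB/year projection is easy to sanity-check against the slide's own numbers (it implicitly assumes roughly 1 TB of retained sequence data per patient):

```python
patients_per_year = 1_500_000   # new cancer patients in 2010 (slide figure)
total_pb_per_year = 1_500       # projected petabytes per year (slide figure)

# Implied retained data per patient: 1,500 PB / 1.5M patients = 1 TB each.
tb_per_patient = total_pb_per_year * 1_000 / patients_per_year

# Daily rate, matching the slide's "4 PB/day vs 0.2 PB/day Facebook".
pb_per_day = total_pb_per_year / 365
facebook_pb_per_day = 0.2
```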
So We Need to Rethink Scaling Up, Particularly for Databases
● The old ways of growing can't keep up... can't just “buy a bigger box”
● We need to embrace what the Peta-scale community is currently doing
● The Googles, Facebooks, Twitters, etc of the world are solving their scalability problems, biology needs to learn from their solutions
● Clusters can be expanded and distributed storage used; the least scalable portion is the database, which biologists need in order to ask questions
Technologies That Can Help
● Many open source tools designed for Peta-scale:
  – Map/Reduce for data processing
  – Hadoop for a distributed file system (HDFS) and robust process control
  – Pig/Hive/etc. for processing unstructured/semi-structured data
  – HBase/Cassandra/etc. for databasing
● Could go unstructured, but biological data is most useful when aggregated, and a database is extremely good for this
HBase to the Rescue
● Billions of rows x millions of columns!
● Focus on random access (vs. HDFS)
● Table is column oriented, a sparse matrix
● Versioning (timestamps) built in
● Flexible storage of different data types
● Splits the DB across many nodes transparently
● Locality of data: I can run map/reduce jobs that process the table rows present on a given node
● 22M variants processed in <1 minute on a 5 node cluster
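The sparse, versioned data model is worth seeing concretely. This is not HBase code, just a minimal in-memory analogue of its row → column → {timestamp: value} layout (class and method names are mine):

```python
class SparseTable:
    """Toy model of HBase's row -> column -> {timestamp: value} layout.

    Absent cells store nothing (sparse matrix), and old versions are
    kept under their timestamps rather than overwritten."""

    def __init__(self):
        self.rows = {}

    def put(self, row, column, value, ts):
        self.rows.setdefault(row, {}).setdefault(column, {})[ts] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if it is absent."""
        versions = self.rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

t = SparseTable()
t.put("chr15:00000123454", "variant:genome4", b"A->G", ts=1)
t.put("chr15:00000123454", "variant:genome4", b"A->T", ts=2)  # newer version
```

Reads return the latest version by default; the empty cells for other genomes cost no storage at all, which is why "millions of columns" is practical.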
Magically Scalable Databases
● Talking about distributed databases forces a huge shift in what you can and can't do
● “NoSQL is an umbrella term for a loosely defined class of non-relational data stores that break with a long history of relational databases and ACID guarantees. Data stores that fall under this term may not require fixed table schemas, and usually avoid join operations. The term was first popularized in early 2009.”
http://en.wikipedia.org/wiki/Nosql
What You Give Up
● SQL queries
● Well defined schema, normalized data structure
● Relationships managed by the DB
● Flexible and easy indexing of table columns
● Existing tools that query a SQL database must be re-written
● Certain ACID aspects
● Software maturity; most distributed NoSQL projects are very new
What You Gain
● Scalability is the clear win, you can have many processes on many nodes hit the collection of database servers
● Ability to look at very large datasets and do complex computations across a cluster
● More flexibility in representing information now and in the future
● HBase includes data timestamps/versions
● Integration with Hadoop
HBase Architecture
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
HBase Tables
[Diagram: conceptual table view vs. physical column-family storage view]
HBase APIs
● Basic API:
  – Connect to an HBase server like it's a single server
  – Lets you iterate over the contents of the DB, with flexible filtering of the results
  – Can pull back/write key/values easily
● Map/Reduce source/sink:
  – Can use HBase tables easily from Map/Reduce
  – May be easier/faster just to Map/Reduce than filter with the API
  – I want to use this more in the future
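The "iterate with flexible filtering" pattern of the basic API can be sketched as a scan over a key range with a predicate (the real client does this with Scan and Filter objects in Java; the helper below is a simplified stand-in):

```python
def scan(table, start_row, stop_row, row_filter=None):
    """Yield (row, columns) pairs in key order within [start_row, stop_row),
    like an HBase Scan with an optional Filter."""
    for row in sorted(table):
        if start_row <= row < stop_row:
            if row_filter is None or row_filter(row, table[row]):
                yield row, table[row]

# Toy table: row keys are chromosome:zero-padded-position.
table = {
    "chr10:000000100": {"variant:g1": "SNV"},
    "chr10:000000250": {"variant:g1": "indel"},
    "chr11:000000050": {"variant:g1": "SNV"},
}

def only_snvs(row, cols):
    return cols.get("variant:g1") == "SNV"

# Scan just chr10, filtering to SNVs ("~" sorts after all digits).
hits = [row for row, _ in scan(table, "chr10:", "chr10:~", only_snvs)]
```

Because rows are stored sorted by key, a range scan touches only the region you ask for instead of the whole table.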
HBase Installation
● Install Hadoop first, version 0.20.x
● I installed HBase version 0.20.1
● Documented: http://tinyurl.com/23ftbrk
● Currently have a 7 node Hadoop/HBase cluster running at UNC with about 40TB of storage, 56 CPUs, 168GB RAM
● RPMs from Cloudera (0.89.20100924): http://www.cloudera.com
The SeqWare Project
[Architecture diagram; fully open source: http://seqware.sf.net]
● SeqWare MetaDB: central DB coordinates all analysis and results metadata
● SeqWare LIMS: central portal for users that links to the tools to upload samples, trigger analysis, and view results
● Import daemons: data import tool facilitates sequence delivery to the storage nodes via the network
● SeqWare API* (Thrift, with Java/Perl/Python clients): developer API; savvy users can control all of SeqWare's tools programmatically
● Genome browser: the genome browser and query engine frontend
● SeqWare Pipeline (SGE clusters): controls analysis on the cluster; models analysis workflows for RNA-Seq and other NGS experimental designs
● SeqWare Query Engine (HBase on Hadoop clusters): high-performance, distributed database and web query engine; powers both the browser and interactive queries
* future project
The SeqWare Query Engine Project
[Diagram: genome browser and interactive web forms (the query engine frontend) talk to a Web API service backed by HBase on a Hadoop cluster]
● SeqWare Query Engine is our HBase database for next gen sequence data
● High-performance, distributed database and web query engine; powers both the browser and interactive queries
Webservice Interfaces
track name="SeqWare BED Mon Oct 26 22:16:18 PDT 2009" chr10 89675294 89675295 G->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])
SeqWare Query Engine includes a RESTful web service that returns XML describing the variant DBs, a web form for querying, and BED/WIG data available via well-defined URLs
http://server.ucla.edu/seqware/queryengine/realtime/variants/mismatches/1?format=bed&filter.tag=PTEN
RESTful XML Client API or HTML Forms
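The example BED line above packs the call and its supporting evidence into the name field. A small parser makes the encoding visible; the field meanings (read counts, percentages, forward/reverse breakdown) are inferred from this one example, not from a published format spec:

```python
import re

BED_LINE = "chr10\t89675294\t89675295\tG->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])"

def parse_variant_bed(line):
    """Split a SeqWare-style BED line into position and call fields.

    The name-field layout ref->var(a:b:pct%[...]) is inferred from the
    slide's example; a and b look like total and variant read counts."""
    chrom, start, stop, name = line.split("\t")
    m = re.match(r"(\w)->(\w)\((\d+):(\d+):([\d.]+)%", name)
    ref, var, a, b, pct = m.groups()
    return {
        "chrom": chrom,
        "start": int(start),
        "stop": int(stop),
        "ref": ref,
        "var": var,
        "counts": (int(a), int(b)),
        "percent": float(pct),
    }

rec = parse_variant_bed(BED_LINE)
```

Because the output is plain BED, it drops straight into a genome browser track while still carrying the evidence needed to judge each call.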
SeqWare QE Architecture Concepts
● Types: variants, translocations, coverage, consequence, and generic “features”
● Each can be “tagged” with arbitrary key-value pairs, allows for the encoding of a surprising amount of annotation!
● Fundamental query is a list of objects filtered by these “tags”
[Diagram: example features with tags]
● SNV (chr12): tagged nonsynonymous; is_dbSNP | rs10292192
● Indel (chr1): tagged nonsynonymous; is_dbSNP | rs10292192
● Translocation (chr12 ↔ chr3): tagged gene_fusion
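The tag model is easy to prototype: each feature carries a dict of key → (possibly empty) value pairs, and the fundamental query filters a feature list by tag. These are toy dicts, not the actual SeqWare classes:

```python
# Features tagged with arbitrary key-value pairs, as on the slide.
features = [
    {"type": "SNV", "pos": ("chr12", 101),
     "tags": {"nonsynonymous": None, "is_dbSNP": "rs10292192"}},
    {"type": "translocation", "pos": ("chr12", 500),
     "tags": {"gene_fusion": None}},
    {"type": "indel", "pos": ("chr1", 42),
     "tags": {"nonsynonymous": None}},
]

def query_by_tag(feats, key, value=None):
    """Fundamental QE query: features carrying a tag,
    optionally restricted to a specific tag value."""
    return [f for f in feats
            if key in f["tags"] and (value is None or f["tags"][key] == value)]

nonsyn = query_by_tag(features, "nonsynonymous")
dbsnp_hits = query_by_tag(features, "is_dbSNP", "rs10292192")
```

Since tags are just key-value pairs, new annotation sources (dbSNP, consequence predictions, study membership) need no schema change at all.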
Requirements for Query Engine Backend
The backend must:
– Represent many types of data
– Support a rich level of annotation
– Support very large variant databases (~3 billion rows x thousands of columns)
– Be distributed across a cluster
– Support processing, annotating, querying & comparing samples (variants, coverage, annotations)
– Support a crazy growth of data
HBase Query Engine Backend
● Stores variants, translocations, coverage info, coding consequence reports, annotations as “tags”, and generic “features”
● Common interface; uses HBase as the backend
● Creates a single database for all genomes
● Database is auto-sharded across the cluster
● Can use Map/Reduce and other projects to do sophisticated queries
● Performance seems very good!
HBase SeqWare Query Engine
[Architecture diagram]
● HBase on HDFS: the variant & coverage database system (HMaster plus RegionServers)
● Analysis nodes: querying & loading nodes process queries via the API (HBase API or MapReduce clients)
● ETL map and reduce jobs extract, transform, &/or load in parallel
● Web nodes: a RESTful web service serves BED/WIG files and XML metadata, backed by the MetaDB
Underlying HBase Tables
hg18Table, conceptual view (database lives on the filesystem, HDFS):
  | key               | variant:genome4 | variant:genome7 | coverage:genome7 |
  | chr15:00000123454 | byte[]          | byte[]          | byte[]           |
  (values are serialized Variant object byte arrays; “variant” and “coverage” are column family labels)
Physical storage view (one cell):
  | key               | column          | timestamp | value  |
  | chr15:00000123454 | variant:genome7 | t1        | byte[] |
TagIndexTable (secondary index):
  | key                                                          | rowId:Genome1102N |
  | is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123 | byte[]            |
Queries look up by tag, then filter the variant results
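Two details of this layout are worth sketching: the row key zero-pads the position so lexicographic key order matches genomic order, and the secondary table maps tag-prefixed keys back to row IDs. Key formats are copied from the slide; the helper names and padding width are my inference from the example key:

```python
def row_key(chrom, position, width=11):
    """Zero-pad the position so string sort order equals numeric order.
    Without padding, "chr15:9" would sort after "chr15:123454"."""
    return f"{chrom}:{position:0{width}d}"

# Secondary index: tag-prefixed keys -> row IDs, as in TagIndexTable.
tag_index = {
    "is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123":
        ["chr15:00000123454"],
}

def rows_for_tag(index, tag):
    """Prefix scan over index keys: look up by tag, return matching rows
    (the QE then filters the actual variant objects)."""
    return sorted({row for key, rows in index.items()
                   if key.startswith(tag + "|") for row in rows})

hits = rows_for_tag(tag_index, "is_dbSNP")
```

Packing the position into the key is what makes range scans over a genomic region cheap, and the tag index turns "find everything in dbSNP" into another cheap prefix scan.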
Backend Performance Comparison
[Chart: “Pileup Load Time 1102N, HBase vs. Berkeley DB” — variants loaded (up to ~8M) vs. time in seconds (up to ~18,000); series: load time bdb, load time hbase]
Backend Performance Comparison
[Chart: “BED Export Time 1102N, HBase API vs. M/R vs. BerkeleyDB” — variants dumped (up to ~8M) vs. time in seconds (up to ~7,000); series: dump time bdb, dump time hbase, dump time m/r]
Querying with Map/Reduce
● Finding variants with a tag, or comparing variants at the same genomic position, is very efficient
● Problem: overlapping features that do not start at the same genomic location
[Diagram: chr1/chr2/chr3 variants spread across nodes 1-3; example query: “Find all variants seen in ovarian cancer genomes tagged as frameshift”]
  – Map: iterate over every variant; bin it if ovarian and tagged frameshift
  – Reduce: for each item in the bin, print out the variant information
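The frameshift query in the figure maps cleanly onto map/reduce: each node's map emits only matching variants, and reduce merges the bins. A single-process sketch with made-up variants:

```python
from itertools import chain

# Variants partitioned per node/chromosome, as in the figure (toy data).
partitions = {
    "chr1": [{"id": "v1", "disease": "ovarian", "tags": ["frameshift"]},
             {"id": "v2", "disease": "lung",    "tags": ["frameshift"]}],
    "chr2": [{"id": "v3", "disease": "ovarian", "tags": ["missense"]}],
    "chr3": [{"id": "v4", "disease": "ovarian", "tags": ["frameshift"]}],
}

def map_phase(variants):
    """Bin a variant if it is ovarian and tagged frameshift."""
    return [v["id"] for v in variants
            if v["disease"] == "ovarian" and "frameshift" in v["tags"]]

def reduce_phase(binned):
    """Merge the per-node bins and emit the variant information."""
    return sorted(chain.from_iterable(binned))

result = reduce_phase(map_phase(p) for p in partitions.values())
```

The map runs where the data lives (one task per region on each node), so the filtering cost is spread across the whole cluster.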
Problematic Querying with Map/Reduce
● Hard to do overlap queries, which are really important for biological DBs
● Here, if I did a Map/Reduce query, only SNV 1 and SNV 2 would “overlap”
In the genome: Indel 1 spans chr1:1200-1500 and Indel 2 spans chr1:1400-1700 (they overlap); SNV 1 and SNV 2 sit at the same position
In the DB, features are keyed by start position only:
  chr1:000001200 → Indel 1 byte[]
  chr1:000001400 → Indel 2 byte[]
  chr1:000001450 → SNV 1 byte[], SNV 2 byte[]
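The failure mode is that keying by start position only detects features that share a start; a true overlap test must compare intervals. Using the coordinates from the slide (SNV end coordinates are assumed, since SNVs span one base):

```python
# Features from the slide, as (start, stop) intervals on chr1.
features = {
    "Indel 1": (1200, 1500),
    "Indel 2": (1400, 1700),
    "SNV 1":   (1450, 1451),
    "SNV 2":   (1450, 1451),
}

def overlaps(a, b):
    """True interval overlap: each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]

# What a naive Map/Reduce over start-position row keys sees:
snvs_same_start = features["SNV 1"][0] == features["SNV 2"][0]
indels_same_start = features["Indel 1"][0] == features["Indel 2"][0]

# What an interval test sees: the indels do overlap, over 1400-1500.
indels_overlap = overlaps(features["Indel 1"], features["Indel 2"])
```

This is exactly why the slide after this one looks at spatial indexes (R-trees, nested containment lists): they make interval overlap a first-class query instead of a key-equality accident.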
SeqWare Query Engine Status
● Open source, you can try it now!
● Both BerkeleyDB & HBase backends
● Multiple genomes stored in the same table; very Map/Reduce compatible for SNVs
● Basic secondary indexing for “tags”
● API used for queries via the webservice
● Prototype Map/Reduce examples, including “somatic” mutation detection in paired normal/cancer samples
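Conceptually, the somatic-mutation prototype amounts to subtracting germline calls from tumor calls at matched positions. The set-difference sketch below is my simplification of that idea, with illustrative coordinates:

```python
# Variant calls keyed by (chromosome, position, alt allele); toy data.
normal = {("chr10", 89675294, "T"), ("chr17", 7578406, "A")}
tumor  = {("chr10", 89675294, "T"), ("chr17", 7578406, "A"),
          ("chr13", 32906729, "G")}

def somatic(tumor_calls, normal_calls):
    """Calls present in the tumor but absent from the paired normal:
    shared calls are germline, the remainder are candidate somatic."""
    return tumor_calls - normal_calls

somatic_calls = somatic(tumor, normal)
```

Because both genomes live in the same HBase table under the same row keys, this subtraction is a natural per-row map step rather than a cross-database join.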
SeqWare Query Engine Future
● Dynamically building R-tree indexes or Nested Containment Lists with Map/Reduce (“Experiences on Processing Spatial Data with MapReduce” by Cary et al.)
● Looking at using Katta or Solr for indexing free text data such as gene descriptions, OMIM entries, etc.
● Queries across samples with simple logic
● More testing: pushing our 7 node cluster, finding the max number of genomes this cluster can support
Final Thoughts
● Era of Big Data for Biology is here!
● CPU-bound problems remain, no doubt, but as short reads become long reads and the price per gigabase drops, the problems shift to handling and mining data
● Tools designed for Peta-scale datasets are key
[Photo: Yahoo's Hadoop cluster]
For More Information
● http://seqware.sf.net
● http://hadoop.apache.org
● http://hbase.apache.org
● Brian O'Connor <[email protected]>
● Article (12/21/2010): “SeqWare Query Engine: storing and searching sequence data in the cloud,” Brian D O'Connor, Barry Merriman and Stanley F Nelson, BMC Bioinformatics 2010, 11(Suppl 12)
● We have job openings!!