Big Bio and Fun with HBase
Brian O'Connor, Pipeline Architect
UNC Lineberger Comprehensive Cancer Center
Overview
● The era of “Big Data” in biology
● Sequencing and scientific queries
● Computational requirements and data growth
● The appeal of HBase/Hadoop
● The SeqWare project and the Query Engine (QE)
● Implementation of HBase QE
● How well it works
● The future
● Adding indexing and a search engine
● Why HBase/Hadoop are important to modern biology
Biology In Transition
● Biologists are used to working on one gene their entire careers
● In the last 15 years biology has been transitioning to a high-throughput, data-driven discipline
● Now scientists study systems:
  – Billions of bases of sequence
  – Millions of SNPs
  – Thousands of genes
● Biology is not physics
The Beginnings
● The Human Genome with Sanger sequencing (1990-2000, $3 billion)
● Gave us the blueprints for a human being, let us find most of the genes in humans, and understand variation in humans
Capillary Sequencer
Output ~1000 bases
Illumina GAIIx
● 2x76 run, ~8 days
● >= 98.5% accuracy at 76 bases
● ~300M reads/flowcell, ~20-25GB high qual; 38M reads/lane, 2.5GB/lane, about 1.25GB aligned
● Cost ~$10K, about $450/GB
● ~20GB in ~8 days
SOLiD 3 Plus
● 2x50 run, ~14 days
● >= 99.94% accuracy (colorspace corrected)
● ~1G reads/run, ~60GB high quality
● Conservatively maybe 40GB aligned
● ~60GB in ~14 days
● Cost ~$20K, $333/GB
● A human genome in 2 weeks!
454 GS FLX
● 400bp reads, ~10 hours
● Q20 at 400 bases
● ~1M reads/run, ~400MB high qual
● Homopolymer issue
● Cost ~$14K for 70x75, ~$90K/GB
● So ~1GB in ~1 day
HeliScope
● ~35bp reads, ~8 day run
● <5% raw error rate
● ~700M reads/run, ~25GB/run
● Cost ?
● So ~25GB in ~8 days
The Era of “Big Data” in Biology
Types of Questions Biologists Ask
● Want to ask simple questions:
  – “What SNVs are in the 5'UTR of phosphatases?”
  – “What frameshifts affect PTEN in lung cancer?”
  – “What genes include homozygous, non-synonymous SNVs in all ovarian cancers?”
● Biologists want to see data across samples
● A database is the natural choice for this data; many examples exist
Impact of Next Gen Sequencing
● If the human genome gave us the generic blueprints, next gen sequencing lets us look at individuals' blueprints
● Applications:
  – Sequence many people and look at mutation patterns (have disease vs. does not have disease: subtract)
  – Compare normal vs. tumor to find mutations just in the tumor
  – Sequence an individual's cancer and find distinct druggable targets
The Big Problem
● Biologists think about reagents and tumors, not hard drives and CPUs
● The crazy growth of sequencing data exceeds the growth of all IT components
● Dramatically different models for scalability are required
● We're soon reaching a point where a genome will cost less to sequence than it does to look at the data!
Increase in Sequencer Output
[Chart: “Illumina Sequencer Output” — sequence file sizes per lane (bytes, log scale) vs. date, 08/2006-12/2010]
Suggests sequencer output increases by 10x every 2 years!
Far outpacing hard drive, CPU, and bandwidth growth
Moore's Law: CPU power doubles every 2 years
Kryder's Law: storage quadruples every 2 years
http://genome.wellcome.ac.uk/doc_WTX059576.html
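The gap compounds quickly. A back-of-the-envelope sketch, using only the growth rates quoted on these slides (10x sequencer output, 2x CPU, 4x storage, each per 2 years):

```python
# Fold-change after `years`, given a fold-change per 2-year period.
# Rates are the ones quoted on the slides.
def growth(fold_per_2yr, years):
    return fold_per_2yr ** (years / 2)

years = 6
seq = growth(10, years)   # sequencer output: 10^3 = 1000x
cpu = growth(2, years)    # Moore's law:      2^3  = 8x
disk = growth(4, years)   # Kryder's law:     4^3  = 64x

# After just 6 years, data has outgrown CPU power by 125x
# and storage density by ~16x.
gap_vs_cpu = seq / cpu
gap_vs_disk = seq / disk
```

The exponents differ, so no fixed hardware budget can keep pace: the gap itself grows exponentially.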
Lowering Costs = Bigger Projects
● Falling costs increase the scope of projects
● The Cancer Genome Atlas (TCGA):
  – 20 tumor types, 200 samples each, in 4 years
  – Around 4 petabytes of data
● Once costs drop below $1,000, many predict the technology will move into clinical applications:
  – 1.5 million new cancer patients in 2010
  – 1,500 petabytes per year!?
  – 4 PB/day vs. 0.2 PB/day for Facebook
http://www.datacenterknowledge.com/archives/2009/10/13/facebook-now-has-30000-servers/http://seer.cancer.gov/statfacts/html/all.html
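The 1,500 PB/year projection is easy to sanity-check against the slide's own numbers (it implicitly assumes roughly 1 TB of retained sequence data per patient):

```python
patients_per_year = 1_500_000   # new cancer patients in 2010 (slide figure)
total_pb_per_year = 1_500       # projected petabytes per year (slide figure)

# Implied retained data per patient: 1,500 PB / 1.5M patients = 1 TB each.
tb_per_patient = total_pb_per_year * 1_000 / patients_per_year

# Daily rate, matching the slide's "4 PB/day vs 0.2 PB/day Facebook".
pb_per_day = total_pb_per_year / 365
facebook_pb_per_day = 0.2
```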
So We Need to Rethink Scaling Up, Particularly for Databases
● The old ways of growing can't keep up... can't just “buy a bigger box”
● We need to embrace what the Peta-scale community is currently doing
● The Googles, Facebooks, Twitters, etc of the world are solving their scalability problems, biology needs to learn from their solutions
● Clusters can be expanded and distributed storage used; the least scalable portion is the database, which biologists need in order to ask questions
Technologies That Can Help
● Many open source tools designed for Peta-scale:
  – Map/Reduce for data processing
  – Hadoop for a distributed file system (HDFS) and robust process control
  – Pig/Hive/etc. for processing unstructured/semi-structured data
  – HBase/Cassandra/etc. for databasing
● Could go unstructured, but biological data is most useful when aggregated, and a database is extremely good for this
HBase to the Rescue
● Billions of rows x millions of columns!
● Focus on random access (vs. HDFS)
● Table is column oriented, a sparse matrix
● Versioning (timestamps) built in
● Flexible storage of different data types
● Splits the DB across many nodes transparently
● Locality of data: I can run map/reduce jobs that process the table rows present on a given node
● 22M variants processed in <1 minute on a 5 node cluster
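The sparse, versioned data model is worth seeing concretely. This is not HBase code, just a minimal in-memory analogue of its row → column → {timestamp: value} layout (class and method names are mine):

```python
class SparseTable:
    """Toy model of HBase's row -> column -> {timestamp: value} layout.

    Absent cells store nothing (sparse matrix), and old versions are
    kept under their timestamps rather than overwritten."""

    def __init__(self):
        self.rows = {}

    def put(self, row, column, value, ts):
        self.rows.setdefault(row, {}).setdefault(column, {})[ts] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if it is absent."""
        versions = self.rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

t = SparseTable()
t.put("chr15:00000123454", "variant:genome4", b"A->G", ts=1)
t.put("chr15:00000123454", "variant:genome4", b"A->T", ts=2)  # newer version
```

Reads return the latest version by default; the empty cells for other genomes cost no storage at all, which is why "millions of columns" is practical.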
Magically Scalable Databases
● Talking about distributed databases forces a huge shift in what you can and can't do
● “NoSQL is an umbrella term for a loosely defined class of non-relational data stores that break with a long history of relational databases and ACID guarantees. Data stores that fall under this term may not require fixed table schemas, and usually avoid join operations. The term was first popularized in early 2009.”
http://en.wikipedia.org/wiki/Nosql
What You Give Up
● SQL queries
● Well defined schema, normalized data structure
● Relationships managed by the DB
● Flexible and easy indexing of table columns
● Existing tools that query a SQL database must be re-written
● Certain ACID aspects
● Software maturity; most distributed NoSQL projects are very new
What You Gain
● Scalability is the clear win, you can have many processes on many nodes hit the collection of database servers
● Ability to look at very large datasets and do complex computations across a cluster
● More flexibility in representing information now and in the future
● HBase includes data timestamps/versions
● Integration with Hadoop
HBase Architecture
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
HBase Tables
[Diagram: conceptual table view vs. physical column-family storage view]
HBase APIs
● Basic API:
  – Connect to an HBase server like it's a single server
  – Lets you iterate over the contents of the DB, with flexible filtering of the results
  – Can pull back/write key/values easily
● Map/Reduce source/sink:
  – Can use HBase tables easily from Map/Reduce
  – May be easier/faster just to Map/Reduce than filter with the API
  – I want to use this more in the future
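The "iterate with flexible filtering" pattern of the basic API can be sketched as a scan over a key range with a predicate (the real client does this with Scan and Filter objects in Java; the helper below is a simplified stand-in):

```python
def scan(table, start_row, stop_row, row_filter=None):
    """Yield (row, columns) pairs in key order within [start_row, stop_row),
    like an HBase Scan with an optional Filter."""
    for row in sorted(table):
        if start_row <= row < stop_row:
            if row_filter is None or row_filter(row, table[row]):
                yield row, table[row]

# Toy table: row keys are chromosome:zero-padded-position.
table = {
    "chr10:000000100": {"variant:g1": "SNV"},
    "chr10:000000250": {"variant:g1": "indel"},
    "chr11:000000050": {"variant:g1": "SNV"},
}

def only_snvs(row, cols):
    return cols.get("variant:g1") == "SNV"

# Scan just chr10, filtering to SNVs ("~" sorts after all digits).
hits = [row for row, _ in scan(table, "chr10:", "chr10:~", only_snvs)]
```

Because rows are stored sorted by key, a range scan touches only the region you ask for instead of the whole table.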
HBase Installation
● Install Hadoop first, version 0.20.x
● I installed HBase version 0.20.1
● Documented: http://tinyurl.com/23ftbrk
● Currently have a 7 node Hadoop/HBase cluster running at UNC with about 40TB of storage, 56 CPUs, 168GB RAM
● RPMs from Cloudera (0.89.20100924): http://www.cloudera.com
The SeqWare Project
[Architecture diagram; fully open source: http://seqware.sf.net]
● SeqWare MetaDB: central DB coordinates all analysis and results metadata
● SeqWare LIMS: central portal for users that links to the tools to upload samples, trigger analysis, and view results
● Import daemons: data import tool facilitates sequence delivery to the storage nodes via the network
● SeqWare API* (Thrift, with Java/Perl/Python clients): developer API; savvy users can control all of SeqWare's tools programmatically
● Genome browser: the genome browser and query engine frontend
● SeqWare Pipeline (SGE clusters): controls analysis on the cluster; models analysis workflows for RNA-Seq and other NGS experimental designs
● SeqWare Query Engine (HBase on Hadoop clusters): high-performance, distributed database and web query engine; powers both the browser and interactive queries
* future project
The SeqWare Query Engine Project
[Diagram: genome browser and interactive web forms (the query engine frontend) talk to a Web API service backed by HBase on a Hadoop cluster]
● SeqWare Query Engine is our HBase database for next gen sequence data
● High-performance, distributed database and web query engine; powers both the browser and interactive queries
Webservice Interfaces
track name="SeqWare BED Mon Oct 26 22:16:18 PDT 2009" chr10 89675294 89675295 G->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])
SeqWare Query Engine includes a RESTful web service that returns XML describing the variant DBs, a web form for querying, and BED/WIG data available via well-defined URLs
http://server.ucla.edu/seqware/queryengine/realtime/variants/mismatches/1?format=bed&filter.tag=PTEN
RESTful XML Client API or HTML Forms
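The example BED line above packs the call and its supporting evidence into the name field. A small parser makes the encoding visible; the field meanings (read counts, percentages, forward/reverse breakdown) are inferred from this one example, not from a published format spec:

```python
import re

BED_LINE = "chr10\t89675294\t89675295\tG->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])"

def parse_variant_bed(line):
    """Split a SeqWare-style BED line into position and call fields.

    The name-field layout ref->var(a:b:pct%[...]) is inferred from the
    slide's example; a and b look like total and variant read counts."""
    chrom, start, stop, name = line.split("\t")
    m = re.match(r"(\w)->(\w)\((\d+):(\d+):([\d.]+)%", name)
    ref, var, a, b, pct = m.groups()
    return {
        "chrom": chrom,
        "start": int(start),
        "stop": int(stop),
        "ref": ref,
        "var": var,
        "counts": (int(a), int(b)),
        "percent": float(pct),
    }

rec = parse_variant_bed(BED_LINE)
```

Because the output is plain BED, it drops straight into a genome browser track while still carrying the evidence needed to judge each call.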
SeqWare QE Architecture Concepts
● Types: variants, translocations, coverage, consequence, and generic “features”
● Each can be “tagged” with arbitrary key-value pairs, allows for the encoding of a surprising amount of annotation!
● Fundamental query is a list of objects filtered by these “tags”
[Diagram: example features with tags]
● SNV (chr12): tagged nonsynonymous; is_dbSNP | rs10292192
● Indel (chr1): tagged nonsynonymous; is_dbSNP | rs10292192
● Translocation (chr12 ↔ chr3): tagged gene_fusion
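The tag model is easy to prototype: each feature carries a dict of key → (possibly empty) value pairs, and the fundamental query filters a feature list by tag. These are toy dicts, not the actual SeqWare classes:

```python
# Features tagged with arbitrary key-value pairs, as on the slide.
features = [
    {"type": "SNV", "pos": ("chr12", 101),
     "tags": {"nonsynonymous": None, "is_dbSNP": "rs10292192"}},
    {"type": "translocation", "pos": ("chr12", 500),
     "tags": {"gene_fusion": None}},
    {"type": "indel", "pos": ("chr1", 42),
     "tags": {"nonsynonymous": None}},
]

def query_by_tag(feats, key, value=None):
    """Fundamental QE query: features carrying a tag,
    optionally restricted to a specific tag value."""
    return [f for f in feats
            if key in f["tags"] and (value is None or f["tags"][key] == value)]

nonsyn = query_by_tag(features, "nonsynonymous")
dbsnp_hits = query_by_tag(features, "is_dbSNP", "rs10292192")
```

Since tags are just key-value pairs, new annotation sources (dbSNP, consequence predictions, study membership) need no schema change at all.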
Requirements for Query Engine Backend
The backend must:
– Represent many types of data
– Support a rich level of annotation
– Support very large variant databases (~3 billion rows x thousands of columns)
– Be distributed across a cluster
– Support processing, annotating, querying & comparing samples (variants, coverage, annotations)
– Support a crazy growth of data
HBase Query Engine Backend
● Stores variants, translocations, coverage info, coding consequence reports, annotations as “tags”, and generic “features”
● Common interface; uses HBase as the backend
● Creates a single database for all genomes
● Database is auto-sharded across the cluster
● Can use Map/Reduce and other projects to do sophisticated queries
● Performance seems very good!
HBase SeqWare Query Engine
[Architecture diagram]
● HBase on HDFS: the variant & coverage database system (HMaster plus RegionServers)
● Analysis nodes: querying & loading nodes process queries via the API (HBase API or MapReduce clients)
● ETL map and reduce jobs extract, transform, &/or load in parallel
● Web nodes: a RESTful web service serves BED/WIG files and XML metadata, backed by the MetaDB
Underlying HBase Tables
hg18Table, conceptual view (database lives on the filesystem, HDFS):
  | key               | variant:genome4 | variant:genome7 | coverage:genome7 |
  | chr15:00000123454 | byte[]          | byte[]          | byte[]           |
  (values are serialized Variant object byte arrays; “variant” and “coverage” are column family labels)
Physical storage view (one cell):
  | key               | column          | timestamp | value  |
  | chr15:00000123454 | variant:genome7 | t1        | byte[] |
TagIndexTable (secondary index):
  | key                                                          | rowId:Genome1102N |
  | is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123 | byte[]            |
Queries look up by tag, then filter the variant results
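Two details of this layout are worth sketching: the row key zero-pads the position so lexicographic key order matches genomic order, and the secondary table maps tag-prefixed keys back to row IDs. Key formats are copied from the slide; the helper names and padding width are my inference from the example key:

```python
def row_key(chrom, position, width=11):
    """Zero-pad the position so string sort order equals numeric order.
    Without padding, "chr15:9" would sort after "chr15:123454"."""
    return f"{chrom}:{position:0{width}d}"

# Secondary index: tag-prefixed keys -> row IDs, as in TagIndexTable.
tag_index = {
    "is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123":
        ["chr15:00000123454"],
}

def rows_for_tag(index, tag):
    """Prefix scan over index keys: look up by tag, return matching rows
    (the QE then filters the actual variant objects)."""
    return sorted({row for key, rows in index.items()
                   if key.startswith(tag + "|") for row in rows})

hits = rows_for_tag(tag_index, "is_dbSNP")
```

Packing the position into the key is what makes range scans over a genomic region cheap, and the tag index turns "find everything in dbSNP" into another cheap prefix scan.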
Backend Performance Comparison
[Chart: “Pileup Load Time 1102N, HBase vs. Berkeley DB” — variants loaded (up to ~8M) vs. time in seconds (up to ~18,000); series: load time bdb, load time hbase]
Backend Performance Comparison
[Chart: “BED Export Time 1102N, HBase API vs. M/R vs. BerkeleyDB” — variants dumped (up to ~8M) vs. time in seconds (up to ~7,000); series: dump time bdb, dump time hbase, dump time m/r]
Querying with Map/Reduce
● Finding variants with a tag, or comparing variants at the same genomic position, is very efficient
● Problem: overlapping features that do not start at the same genomic location
[Diagram: chr1/chr2/chr3 variants spread across nodes 1-3; example query: “Find all variants seen in ovarian cancer genomes tagged as frameshift”]
  – Map: iterate over every variant; bin it if ovarian and tagged frameshift
  – Reduce: for each item in the bin, print out the variant information
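The frameshift query in the figure maps cleanly onto map/reduce: each node's map emits only matching variants, and reduce merges the bins. A single-process sketch with made-up variants:

```python
from itertools import chain

# Variants partitioned per node/chromosome, as in the figure (toy data).
partitions = {
    "chr1": [{"id": "v1", "disease": "ovarian", "tags": ["frameshift"]},
             {"id": "v2", "disease": "lung",    "tags": ["frameshift"]}],
    "chr2": [{"id": "v3", "disease": "ovarian", "tags": ["missense"]}],
    "chr3": [{"id": "v4", "disease": "ovarian", "tags": ["frameshift"]}],
}

def map_phase(variants):
    """Bin a variant if it is ovarian and tagged frameshift."""
    return [v["id"] for v in variants
            if v["disease"] == "ovarian" and "frameshift" in v["tags"]]

def reduce_phase(binned):
    """Merge the per-node bins and emit the variant information."""
    return sorted(chain.from_iterable(binned))

result = reduce_phase(map_phase(p) for p in partitions.values())
```

The map runs where the data lives (one task per region on each node), so the filtering cost is spread across the whole cluster.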
Problematic Querying with Map/Reduce
● Hard to do overlap queries, which are really important for biological DBs
● Here, if I did a Map/Reduce query, only SNV 1 and SNV 2 would “overlap”
In the genome: Indel 1 spans chr1:1200-1500 and Indel 2 spans chr1:1400-1700 (they overlap); SNV 1 and SNV 2 sit at the same position
In the DB, features are keyed by start position only:
  chr1:000001200 → Indel 1 byte[]
  chr1:000001400 → Indel 2 byte[]
  chr1:000001450 → SNV 1 byte[], SNV 2 byte[]
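The failure mode is that keying by start position only detects features that share a start; a true overlap test must compare intervals. Using the coordinates from the slide (SNV end coordinates are assumed, since SNVs span one base):

```python
# Features from the slide, as (start, stop) intervals on chr1.
features = {
    "Indel 1": (1200, 1500),
    "Indel 2": (1400, 1700),
    "SNV 1":   (1450, 1451),
    "SNV 2":   (1450, 1451),
}

def overlaps(a, b):
    """True interval overlap: each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]

# What a naive Map/Reduce over start-position row keys sees:
snvs_same_start = features["SNV 1"][0] == features["SNV 2"][0]
indels_same_start = features["Indel 1"][0] == features["Indel 2"][0]

# What an interval test sees: the indels do overlap, over 1400-1500.
indels_overlap = overlaps(features["Indel 1"], features["Indel 2"])
```

This is exactly why the slide after this one looks at spatial indexes (R-trees, nested containment lists): they make interval overlap a first-class query instead of a key-equality accident.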
SeqWare Query Engine Status
● Open source, you can try it now!
● Both BerkeleyDB & HBase backends
● Multiple genomes stored in the same table; very Map/Reduce compatible for SNVs
● Basic secondary indexing for “tags”
● API used for queries via the webservice
● Prototype Map/Reduce examples, including “somatic” mutation detection in paired normal/cancer samples
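Conceptually, the somatic-mutation prototype amounts to subtracting germline calls from tumor calls at matched positions. The set-difference sketch below is my simplification of that idea, with illustrative coordinates:

```python
# Variant calls keyed by (chromosome, position, alt allele); toy data.
normal = {("chr10", 89675294, "T"), ("chr17", 7578406, "A")}
tumor  = {("chr10", 89675294, "T"), ("chr17", 7578406, "A"),
          ("chr13", 32906729, "G")}

def somatic(tumor_calls, normal_calls):
    """Calls present in the tumor but absent from the paired normal:
    shared calls are germline, the remainder are candidate somatic."""
    return tumor_calls - normal_calls

somatic_calls = somatic(tumor, normal)
```

Because both genomes live in the same HBase table under the same row keys, this subtraction is a natural per-row map step rather than a cross-database join.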
SeqWare Query Engine Future
● Dynamically building R-tree indexes or Nested Containment Lists with Map/Reduce (“Experiences on Processing Spatial Data with MapReduce” by Cary et al.)
● Looking at using Katta or Solr for indexing free text data such as gene descriptions, OMIM entries, etc.
● Queries across samples with simple logic
● More testing: pushing our 7 node cluster, finding the max number of genomes this cluster can support
Final Thoughts
● Era of Big Data for Biology is here!
● CPU-bound problems remain, no doubt, but as short reads become long reads and the price per gigabase drops, the problems shift to handling and mining data
● Tools designed for Peta-scale datasets are key
[Photo: Yahoo's Hadoop cluster]
For More Information
● http://seqware.sf.net
● http://hadoop.apache.org
● http://hbase.apache.org
● Brian O'Connor <[email protected]>
● Article (12/21/2010): “SeqWare Query Engine: storing and searching sequence data in the cloud,” Brian D O'Connor, Barry Merriman and Stanley F Nelson, BMC Bioinformatics 2010, 11(Suppl 12)
● We have job openings!!