Challenges and Opportunities of Big Data Genomics
Yasin Memari, Wellcome Trust Sanger Institute
January 2014
Outline
• Big data genomics: hype or reality?
• Limitations of big data analysis
• Hardware and software solutions
• Bioinformatics using MapReduce
• Hadoop Distributed File System
• Cloud computing for genomics
• Configuring VRPipe in the cloud
• Lessons from cloud computing
• A unified bioinformatics platform
Big Data Genomics: Hype or reality?
• Bottleneck in sequencing has moved from data generation to data handling.
• World’s sequencing capacity stood at ~15PB in 2013 and is expected to double every year.
• 10 petabytes of storage required for 100,000 human genomes (50X, ~100GB each); see the arithmetic sketch below.
• ~$100 per year to store each genome.
• A data deluge is inevitable in the interim as sequencing becomes cheaper.
• In the long term, DNA itself is a better storage medium!
• Throughput from metagenomic and single-cell sequencing will rapidly outpace hard gains in compression.
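A quick back-of-the-envelope check of the numbers above (a minimal sketch; the figures are the ones quoted on this slide):

```python
# Storage arithmetic for 100,000 human genomes at ~100GB each (50X coverage)
genomes = 100000
gb_per_genome = 100            # ~100GB per compressed 50X genome
dollars_per_genome_year = 100  # ~$100/genome/year storage cost

total_pb = genomes * gb_per_genome / 1e6  # 1PB = 1e6 GB (decimal units)
cost_per_year = genomes * dollars_per_genome_year

print("%.0f PB of storage" % total_pb)  # -> 10 PB
print("$%d per year" % cost_per_year)   # -> $10000000 per year
```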
Use case scenario: run lobSTR on the above datasets to understand variation at Short Tandem Repeats on a genome-wide and population-wide scale, and how they contribute to phenotypic variation.
“D and A” Model
Sanger’s farm data flow
Download/transfer and Analyze:
• I/O-intensive jobs can overload NAS fileservers.
• High-performance file systems provide fast access to data for multiple clients.
• Network performance is the limiting factor for big data.
Filesystem load
Compress the data!
• High coverage equals high redundancy!
• Images/TIFF files no longer in use.
• No intermediate FASTQ files: BCL and locs/clocs -> BAM (directly).
• BAM is being replaced by CRAM (30% reduction in size).
• Discard the read data every 5-10 years!
• More compression? Smooth out sequencing errors, normalize the coverage, down-sample, etc.
How can we improve storage performance?
• Scale-out architectures are still costly and impractical, e.g. scale-out NAS ($1000/TB) or SAN over Fibre Channel.
• Solid-state drives (SSDs) are being used to enhance cache memory and IOPS performance.
• Hybrid storage systems integrate SSDs into traditional HDD-based storage arrays as a first tier of storage.
• Avere FXT and Nexsan NST store warm data in SSDs for storage acceleration and migrate cold data to powered-down drives.
• Fast random access can be achieved by storing metadata in flash SSDs. Limited gain for sequential access!
• Alternatively, archive the data in cheap object stores in the cloud, but invest in bandwidth!
What can be done about network latency?
• Use high-performance network protocols (e.g. UDP-based UDT) to achieve higher speeds than are possible with TCP.
• Aspera’s fasp accelerates transfers over high-latency, high-loss networks where the transport protocol is the bottleneck.
• Transmission rates can be enhanced using multiple concurrent transfers (multi-part downloads), as sketched after this list.
• GeneTorrent is a file transfer client application based on BitTorrent technology (up to 200MB/s over the internet).
• GridFTP (implemented in the Globus Toolkit) enables reliable, high-speed transmission of very large files (up to ~800MB/s where scp manages 17MB/s).
• High-speed internet connection via StarLight/Internet2? Firewall and network security problems.
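GeneTorrent and GridFTP are the production tools here; purely to illustrate the multi-part idea, the following minimal Python sketch splits a download into byte ranges fetched concurrently (the URL and part count are hypothetical, and the server must support HTTP Range requests):

```python
import concurrent.futures
import urllib.request

URL = "http://example.org/large.bam"  # hypothetical file location
PARTS = 8                             # number of concurrent range requests

def content_length(url):
    # Ask the server for the file size without downloading the body
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return int(resp.headers["Content-Length"])

def fetch_range(url, start, end):
    # Fetch bytes [start, end] with an HTTP Range request
    req = urllib.request.Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
    with urllib.request.urlopen(req) as resp:
        return start, resp.read()

size = content_length(URL)
step = -(-size // PARTS)  # ceiling division: bytes per part
ranges = [(i, min(i + step, size) - 1) for i in range(0, size, step)]

# Download all parts in parallel, then reassemble in order
with concurrent.futures.ThreadPoolExecutor(max_workers=PARTS) as pool:
    chunks = dict(pool.map(lambda r: fetch_range(URL, *r), ranges))

with open("large.bam", "wb") as out:
    for start in sorted(chunks):
        out.write(chunks[start])
```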
Alternative Models
What types of analyses do we run in genomics?
• Embarrassingly parallel algorithms: most sequence analysis software has distributed solutions, e.g. alignment, imputation, etc. Use genome chunking and run in batches!
• Tightly-coupled algorithms: some require message passing or shared memory, e.g. genome assembly, pathway analysis.
Forms of parallelism:
• Task parallelism: distribute the execution threads across different nodes.
• Data parallelism: distribute the data across different execution nodes.
Healthcare data need to be stored and analyzed centrally!
Map-Reduce Framework
A distributed solution to a data-centric problem:
• Map: divide up the problem into smaller chunks and send each compute task to where the data resides.
• Reduce: collect the answers to each sub-problem and combine the results.
Example: K-mer Counting (Michael Schatz)
Application developers focus on 2 (+1 internal) functions; map, shuffle and reduce all run in parallel:
• Map: input -> key:value pairs
• Shuffle: group together pairs with the same key
• Reduce: key, value-lists -> output
Input reads: ATGAACCTTA, GAACAACTTA, TTTAGGCAAC
Map emits one (k-mer:1) pair per 3-mer, e.g. (ATG:1), (TGA:1), (GAA:1), (AAC:1), ...
Shuffle groups the pairs by key, e.g. AAC -> 1,1,1,1; CTT -> 1,1; GAA -> 1,1; TTA -> 1,1,1
Reduce sums each list: AAC:4, TTA:3, CAA:2, CTT:2, GAA:2, ACA:1, ACC:1, ACT:1, AGG:1, ATG:1, CCT:1, GCA:1, GGC:1, TAG:1, TGA:1, TTT:1
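The whole figure fits in a few lines of Python. Here is an in-memory simulation of the three phases (not a real Hadoop job), using the reads from the example above:

```python
from collections import defaultdict

reads = ["ATGAACCTTA", "GAACAACTTA", "TTTAGGCAAC"]
K = 3

# Map: emit a (k-mer, 1) pair for every k-mer of every read
pairs = [(read[i:i + K], 1) for read in reads
         for i in range(len(read) - K + 1)]

# Shuffle: group the values by key
groups = defaultdict(list)
for kmer, one in pairs:
    groups[kmer].append(one)

# Reduce: sum the value list for each key
counts = {kmer: sum(values) for kmer, values in groups.items()}

print(counts["AAC"], counts["TTA"], counts["CAA"])  # -> 4 3 2
```

In a real MapReduce run the mapper would execute on the node holding each block of reads, and only the much smaller (k-mer, count) pairs would cross the network.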
Hadoop Distributed File System (HDFS)
• Apache Hadoop is an open-source implementation of Google’s MapReduce and the Google File System (GFS).
• A highly reliable and scalable solution for storing and processing massive data using cheap commodity hardware.
• Optimised for high-throughput access to data. Data is replicated for fault tolerance.
HDFS vs Lustre
Hadoop:
• Data is local: data nodes act as compute nodes.
• I/O is not very relevant here, although it can be improved by concurrency.
• Optimised for batch processing.
• Single-node bottlenecks or name node failures.
Lustre:
• Data is shared: compute clients talk to object store servers.
• High aggregate I/O can be achieved with striping.
• Optimised for HPC. Used in the Top500!
• The bottleneck is getting the data onto Lustre!
[Diagram: Hadoop couples CPU with direct-attached storage (DAS) on each node across the network, while Lustre clients reach object storage targets (OSTs) through object storage servers (OSSs).]
Bioinformatics Tools for Hadoop
Suites of tools actively under development:
• SeqPig: a library which utilizes Apache Pig to translate sequence data analysis into a sequence of MapReduce jobs.
• Seal: a collection of distributed applications for alignment and manipulation of short-read sequence data.
• SeqWare: a toolkit for building high-throughput sequencing data analysis workflows in cloud-based environments.
• And many algorithms for sequence mapping (CloudAligner), SNP calling (Crossbow), de novo assembly (Contrail), peak calling (PeakRanger) and RNA-Seq data analysis (Eoulsan, FX and Myrna).
Hardware Virtualization
Virtualization increases utilization of costly hardware:
• Entire workflows run as virtual machines residing in a SAN. VMs are sent to hypervisors for execution.
[Diagram: apps and guest OSes in VMs running on a hypervisor (Xen, Hyper-V, etc.) over physical hardware (CPU, memory, etc.), with VM images on a storage-area network (SAN) reached over Fibre Channel and a management console overseeing the hosts.]
Cloud Computing
What does the AWS cloud have to offer? (A minimal usage sketch follows this list.)
• Networking: Direct Connect, Virtual Private Cloud (VPC), Route 53
• Compute: Elastic Compute Cloud (EC2), Elastic MapReduce
• Storage: Simple Storage Service (S3), Glacier, Storage Gateway, CloudFront
• Database: Relational Database Service (RDS), DynamoDB, ElastiCache, Redshift
• Management: Identity and Access Management (IAM), CloudWatch, CloudFormation, Elastic Beanstalk
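To give a flavour of the self-service model, a minimal sketch using the boto library (the bucket name, file, AMI id and key pair are placeholders; AWS credentials are assumed to be configured):

```python
import boto
import boto.ec2

# Storage: push a BAM file into S3
s3 = boto.connect_s3()
bucket = s3.create_bucket('my-genomes-bucket')  # hypothetical bucket name
key = bucket.new_key('NA12878.bam')
key.set_contents_from_filename('NA12878.bam')   # upload from local disk

# Compute: launch an EC2 instance to analyse it
ec2 = boto.ec2.connect_to_region('eu-west-1')
ec2.run_instances('ami-00000000',               # placeholder AMI id
                  instance_type='m1.xlarge',
                  key_name='my-keypair')        # placeholder key pair
```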
Network Performance within Amazon
• Bandwidth within AWS is far too low for moving big genome data.
• Experiments achieve at most 70-80MB/s between two EC2 instances and 10-20MB/s between EC2 and S3.
• Download from S3 to EC2 is unreliable and constrained, given that data ingestion happens over HTTP.
• Gigabit Ethernet in EC2 is only available with cluster instances.
• Enhanced networking using network virtualization may provide higher I/O performance.
• CloudFront, Amazon’s content delivery service, provides streaming at HD rates only.
• AWS Data Pipeline is not up to the task of big data workflows.
VRPipe in the Cloud
To deploy VRPipe in the cloud one needs to satisfy the following requirements (Sendu Bala):
• Set up a database management system (DBMS) for VRPipe in AWS RDS, or use SQLite or a locally installed MySQL database (see the provisioning sketch after this list)
• Create a distributed file system to provide shared access to software and data (adjust for speed or redundancy)
• Configure VRPipe and provide the required permissions and security credentials
• Install and configure a job scheduling system supported by VRPipe, e.g. SGE or LSF
https://github.com/VertebrateResequencing/vr-pipe/wiki
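For the first requirement, a hedged sketch of provisioning the RDS database with boto (the instance identifier and credentials are placeholders; VRPipe’s own configuration is documented in the wiki linked above):

```python
import boto.rds

# Provision a small MySQL instance in RDS to hold the VRPipe database
rds = boto.rds.connect_to_region('eu-west-1')
db = rds.create_dbinstance(
    id='vrpipe-db',            # placeholder instance identifier
    allocated_storage=10,      # GB
    instance_class='db.m1.small',
    master_username='vrpipe',  # placeholder credentials
    master_password='changeme')

# Once the instance is available, db.endpoint gives the host/port
# to point VRPipe's DBMS configuration at.
```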
Testing VRPipe in the AWS Cloud
Alignment and calling of 110 Phase 3 YRI exomes (~1.1TB), starting from a sequence.index file and chained through the following pipelines:
(sequence.index) ->
2. 1000genomes_illumina_mapping_with_improvement
27. bam_merge_lanes_and_fix_rgs
61. snp_calling_mpileup
59. snp_calling_gatk_unified_genotyper_and_annotate
89. vcf_gatk_filter
90. vcf_merge
93. vcf_vep_annotate
• Set up a GlusterFS volume using EBS blocks attached to EC2 instances.
• Enable Elastic Load Balancing within the VPC and grant r/w privileges to the DBMS.
• Optionally use SGE job scheduling in conjunction with EC2 load balancing.
Lessons from AWS Cloud
• The bulk of the cloud is made of general-purpose hardware suited to enterprise computing.
• Scientific applications require compute-optimised HPC platforms and high-speed I/O and storage.
• On-demand services are expensive, but large organizations may benefit from the economy of scale!?
• In a self-service environment, the user must handle sysadmin tasks including provisioning and configuration.
• EC2 not being able to compute against S3 (high-I/O tasks) recalls the same “D and A” problem!
• Elastic MapReduce (EMR) runs on EC2 instances, with ephemeral disks used to build HDFS, so data need to be streamed in/out of S3.
• Virtualization imposes performance penalties as the available physical resources are shared among VMs.
Bio-cloud Prototypes
• The EBI has developed an in-house cloud for public sequence repositories such as the European Genome-Phenome Archive (EGA).
• The National Center for Biotechnology Information is working on cloud implementations for storing genomic data such as dbGaP.
• The Beijing Genomics Institute has developed five bio-cloud computing centers in different locations that store and process genomes.
• The US National Cancer Institute maintains the Cancer Genomics Hub (CGHub), a system for storing large genome data.
• The Broad Institute has instantiated its analysis pipeline for germline and cancer somatic data on commercial cloud environments.
• The AMP Lab at UC Berkeley has developed, and is deploying, its genome analysis pipeline on commercial cloud environments.
• Illumina uploads data directly to the cloud, where it has created a platform for sequence analysis called BaseSpace.
Source: Global Alliance White Paper, 3 June 2013
Data/Pipeline Sharing
• Grid computing in the cloud enables sharing of data and resources across virtualized servers.
• Cloud APIs enable application interoperability and cross-platform compatibility.
• Applications are able to launch and access distributed data irrespective of the underlying IT infrastructure.
[Diagram: private clouds at BGI, Sanger, Broad, NCBI and the EBI federated through a public cloud.]
A Unified Platform?
An open-source platform for storing, organizing, processing, and sharing very large genomic and biomedical data on premise or in the cloud:
• Data Management: file and metadata storage, structured/unstructured data, provenance tracking, security and access control.
• Content-addressable Distributed File System: scalability and fault tolerance, block storage of data, high performance over low latency.
• Computation and Pipeline Processing: pipeline creation tools, revision control system, MapReduce engine, etc.
• APIs and SDKs: REST and native APIs, web-based user interface, command-line interface, programming languages and tools, etc.
• Cloud OS and Virtualization: networking, self-service provisioning, administration, block storage, user management, etc.
Discussion
• Compute is much cheaper. Algorithms run faster and more efficiently.
• Transmission of big data will be a bottleneck. Network latency and storage I/O are the limiting factors.
• Minimize the data flow!
• Distributed file systems have reduced the costs; routine analytics of big data has been made possible using cheap commodity hardware.
• We should feel lucky that sequence analysis is mainly embarrassingly parallel!
• MapReduce engines may be deployed in genome data centres?
• Cloud computing enables data and application sharing across consolidated IT infrastructures.