Challenges and Opportunities of Big Data Genomics
Yasin Memari, Wellcome Trust Sanger Institute
January 2014
Outline
• Big data genomics: hype or reality?
• Limitations of big data analysis
• Hardware and software solutions
• Bioinformatics using MapReduce
• Hadoop Distributed File System
• Cloud computing for genomics
• Configuring VRPipe in the cloud
• Lessons from cloud computing
• A unified bioinformatics platform
Big Data Genomics: Hype or reality?
• Bottleneck in sequencing has moved from data generation to data handling.
• World’s sequencing capacity stood at ~15PB in 2013 and is expected to double every year.
• 10 petabytes of storage required for 100,000 human genomes (50X, ~100GB each); see the arithmetic sketch below.
• ~$100 per year to store each genome.
• A data deluge is inevitable in the interim as sequencing becomes cheaper.
• In the long term, DNA itself is a better storage medium!
• Throughput from metagenomic and single-cell sequencing will rapidly outpace hard gains in compression.
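A quick back-of-the-envelope check of the numbers above (a minimal sketch; the figures are the ones quoted on this slide):

```python
# Storage arithmetic for 100,000 human genomes at ~100GB each (50X coverage)
genomes = 100000
gb_per_genome = 100            # ~100GB per compressed 50X genome
dollars_per_genome_year = 100  # ~$100/genome/year storage cost

total_pb = genomes * gb_per_genome / 1e6  # 1PB = 1e6 GB (decimal units)
cost_per_year = genomes * dollars_per_genome_year

print("%.0f PB of storage" % total_pb)  # -> 10 PB
print("$%d per year" % cost_per_year)   # -> $10000000 per year
```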
Use case scenario: run lobSTR on the above datasets to understand variation at Short Tandem Repeats on a genome-wide and population-wide scale, and how they contribute to phenotypic variation.
“D and A” Model
Sanger’s farm data flow
Download/transfer and Analyze:
• I/O-intensive jobs can overload NAS fileservers.
• High-performance file systems provide fast access to data for multiple clients.
• Network performance is the limiting factor for big data.
Filesystem load
Compress the data!
• High coverage equals high redundancy!
• Images/TIFF files no longer in use.
• No intermediate FASTQ files: BCL and locs/clocs -> BAM (directly).
• BAM is being replaced by CRAM (30% reduction in size).
• Discard the read data every 5-10 years!
• More compression? Smooth out sequencing errors, normalize the coverage, down-sample, etc.
How can we improve storage performance?
• Scale-out architectures are still costly and impractical, e.g. scale-out NAS ($1000/TB) or SAN over Fibre Channel.
• Solid-state drives (SSDs) are being used to enhance cache memory and IOPS performance.
• Hybrid storage systems integrate SSDs into traditional HDD-based storage arrays as a first tier of storage.
• Avere FXT and Nexsan NST store warm data in SSDs for storage acceleration and migrate cold data to powered-down drives.
• Fast random access can be achieved by storing metadata in flash SSDs. Limited gain for sequential access!
• Alternatively, archive the data in cheap object stores in the cloud, but invest in bandwidth!
What can be done about network latency?
• Use high-performance network protocols (e.g. UDP-based UDT) to achieve higher speeds than are possible with TCP.
• Aspera’s fasp accelerates transfers over high-latency, high-loss networks where the transport protocol is the bottleneck.
• Transmission rates can be enhanced using multiple concurrent transfers (multi-part downloads), as sketched after this list.
• GeneTorrent is a file transfer client application based on BitTorrent technology (up to 200MB/s over the internet).
• GridFTP (implemented in the Globus Toolkit) enables reliable, high-speed transmission of very large files (up to ~800MB/s where scp manages 17MB/s).
• High-speed internet connection via StarLight/Internet2? Firewall and network security problems.
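GeneTorrent and GridFTP are the production tools here; purely to illustrate the multi-part idea, the following minimal Python sketch splits a download into byte ranges fetched concurrently (the URL and part count are hypothetical, and the server must support HTTP Range requests):

```python
import concurrent.futures
import urllib.request

URL = "http://example.org/large.bam"  # hypothetical file location
PARTS = 8                             # number of concurrent range requests

def content_length(url):
    # Ask the server for the file size without downloading the body
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return int(resp.headers["Content-Length"])

def fetch_range(url, start, end):
    # Fetch bytes [start, end] with an HTTP Range request
    req = urllib.request.Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
    with urllib.request.urlopen(req) as resp:
        return start, resp.read()

size = content_length(URL)
step = -(-size // PARTS)  # ceiling division: bytes per part
ranges = [(i, min(i + step, size) - 1) for i in range(0, size, step)]

# Download all parts in parallel, then reassemble in order
with concurrent.futures.ThreadPoolExecutor(max_workers=PARTS) as pool:
    chunks = dict(pool.map(lambda r: fetch_range(URL, *r), ranges))

with open("large.bam", "wb") as out:
    for start in sorted(chunks):
        out.write(chunks[start])
```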
Alternative Models
What types of analyses do we run in genomics?
• Embarrassingly parallel algorithms: most sequence analysis software has distributed solutions, e.g. alignment, imputation, etc. Use genome chunking and run in batches!
• Tightly-coupled algorithms: some require message passing or shared memory, e.g. genome assembly, pathway analysis.
Forms of parallelism:
• Task parallelism: distribute the execution threads across different nodes.
• Data parallelism: distribute the data across different execution nodes.
Healthcare data need to be stored and analyzed centrally!
Map-Reduce Framework
A distributed solution to a data-centric problem:
• Map: divide up the problem into smaller chunks and send each compute task to where the data resides.
• Reduce: collect the answers to each sub-problem and combine the results.
Example: K-mer Counting (Michael Schatz)
Application developers focus on 2 (+1 internal) functions; map, shuffle and reduce all run in parallel:
• Map: input -> key:value pairs
• Shuffle: group together pairs with the same key
• Reduce: key, value-lists -> output
Input reads: ATGAACCTTA, GAACAACTTA, TTTAGGCAAC
Map emits one (k-mer:1) pair per 3-mer, e.g. (ATG:1), (TGA:1), (GAA:1), (AAC:1), ...
Shuffle groups the pairs by key, e.g. AAC -> 1,1,1,1; CTT -> 1,1; GAA -> 1,1; TTA -> 1,1,1
Reduce sums each list: AAC:4, TTA:3, CAA:2, CTT:2, GAA:2, ACA:1, ACC:1, ACT:1, AGG:1, ATG:1, CCT:1, GCA:1, GGC:1, TAG:1, TGA:1, TTT:1
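The whole figure fits in a few lines of Python. Here is an in-memory simulation of the three phases (not a real Hadoop job), using the reads from the example above:

```python
from collections import defaultdict

reads = ["ATGAACCTTA", "GAACAACTTA", "TTTAGGCAAC"]
K = 3

# Map: emit a (k-mer, 1) pair for every k-mer of every read
pairs = [(read[i:i + K], 1) for read in reads
         for i in range(len(read) - K + 1)]

# Shuffle: group the values by key
groups = defaultdict(list)
for kmer, one in pairs:
    groups[kmer].append(one)

# Reduce: sum the value list for each key
counts = {kmer: sum(values) for kmer, values in groups.items()}

print(counts["AAC"], counts["TTA"], counts["CAA"])  # -> 4 3 2
```

In a real MapReduce run the mapper would execute on the node holding each block of reads, and only the much smaller (k-mer, count) pairs would cross the network.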
Hadoop Distributed File System (HDFS)
• Apache Hadoop is an open-source implementation of Google’s MapReduce and the Google File System (GFS).
• A highly reliable and scalable solution for storing and processing massive data using cheap commodity hardware.
• Optimised for high-throughput access to data. Data is replicated for fault tolerance.
HDFS vs Lustre
Hadoop:
• Data is local: data nodes act as compute nodes.
• I/O is not very relevant here, although it can be improved by concurrency.
• Optimised for batch processing.
• Single-node bottlenecks or name node failures.
Lustre:
• Data is shared: compute clients talk to object store servers.
• High aggregate I/O can be achieved with striping.
• Optimised for HPC. Used in the Top500!
• The bottleneck is getting the data onto Lustre!
[Diagram: Hadoop couples CPU with direct-attached storage (DAS) on each node across the network, while Lustre clients reach object storage targets (OSTs) through object storage servers (OSSs).]
Bioinformatics Tools for Hadoop
Suites of tools actively under development:
• SeqPig: a library which utilizes Apache Pig to translate sequence data analysis into a sequence of MapReduce jobs.
• Seal: a collection of distributed applications for alignment and manipulation of short-read sequence data.
• SeqWare: a toolkit for building high-throughput sequencing data analysis workflows in cloud-based environments.
• And many algorithms for sequence mapping (CloudAligner), SNP calling (Crossbow), de novo assembly (Contrail), peak calling (PeakRanger) and RNA-Seq data analysis (Eoulsan, FX and Myrna).
Hardware Virtualization
Virtualization increases utilization of costly hardware:
• Entire workflows run as virtual machines residing in a SAN. VMs are sent to hypervisors for execution.
[Diagram: apps and guest OSes in VMs running on a hypervisor (Xen, Hyper-V, etc.) over physical hardware (CPU, memory, etc.), with VM images on a storage-area network (SAN) reached over Fibre Channel and a management console overseeing the hosts.]
Cloud Computing
What does the AWS cloud have to offer? (A minimal usage sketch follows this list.)
• Networking: Direct Connect, Virtual Private Cloud (VPC), Route 53
• Compute: Elastic Compute Cloud (EC2), Elastic MapReduce
• Storage: Simple Storage Service (S3), Glacier, Storage Gateway, CloudFront
• Database: Relational Database Service (RDS), DynamoDB, ElastiCache, Redshift
• Management: Identity and Access Management (IAM), CloudWatch, CloudFormation, Elastic Beanstalk
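To give a flavour of the self-service model, a minimal sketch using the boto library (the bucket name, file, AMI id and key pair are placeholders; AWS credentials are assumed to be configured):

```python
import boto
import boto.ec2

# Storage: push a BAM file into S3
s3 = boto.connect_s3()
bucket = s3.create_bucket('my-genomes-bucket')  # hypothetical bucket name
key = bucket.new_key('NA12878.bam')
key.set_contents_from_filename('NA12878.bam')   # upload from local disk

# Compute: launch an EC2 instance to analyse it
ec2 = boto.ec2.connect_to_region('eu-west-1')
ec2.run_instances('ami-00000000',               # placeholder AMI id
                  instance_type='m1.xlarge',
                  key_name='my-keypair')        # placeholder key pair
```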
Network Performance within Amazon
• Bandwidth within AWS is far too low for moving big genome data.
• Experiments achieve at most 70-80MB/s between two EC2 instances and 10-20MB/s between EC2 and S3.
• Download from S3 to EC2 is unreliable and constrained, given that data ingestion happens over HTTP.
• Gigabit Ethernet in EC2 is only available with cluster instances.
• Enhanced networking using network virtualization may provide higher I/O performance.
• CloudFront, Amazon’s content delivery service, provides streaming at HD rates only.
• AWS Data Pipeline is not up to the task of big data workflows.
VRPipe in the Cloud
To deploy VRPipe in the cloud one needs to satisfy the following requirements (Sendu Bala):
• Set up a database management system (DBMS) for VRPipe in AWS RDS, or use SQLite or a locally installed MySQL database (see the provisioning sketch after this list)
• Create a distributed file system to provide shared access to software and data (adjust for speed or redundancy)
• Configure VRPipe and provide the required permissions and security credentials
• Install and configure a job scheduling system supported by VRPipe, e.g. SGE or LSF
https://github.com/VertebrateResequencing/vr-pipe/wiki
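For the first requirement, a hedged sketch of provisioning the RDS database with boto (the instance identifier and credentials are placeholders; VRPipe’s own configuration is documented in the wiki linked above):

```python
import boto.rds

# Provision a small MySQL instance in RDS to hold the VRPipe database
rds = boto.rds.connect_to_region('eu-west-1')
db = rds.create_dbinstance(
    id='vrpipe-db',            # placeholder instance identifier
    allocated_storage=10,      # GB
    instance_class='db.m1.small',
    master_username='vrpipe',  # placeholder credentials
    master_password='changeme')

# Once the instance is available, db.endpoint gives the host/port
# to point VRPipe's DBMS configuration at.
```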
Testing VRPipe in the AWS Cloud
Alignment and calling of 110 Phase 3 YRI exomes (~1.1TB), starting from a sequence.index file and chained through the following pipelines:
(sequence.index) ->
2. 1000genomes_illumina_mapping_with_improvement
27. bam_merge_lanes_and_fix_rgs
61. snp_calling_mpileup
59. snp_calling_gatk_unified_genotyper_and_annotate
89. vcf_gatk_filter
90. vcf_merge
93. vcf_vep_annotate
• Set up a GlusterFS volume using EBS blocks attached to EC2 instances.
• Enable Elastic Load Balancing within the VPC and grant r/w privileges to the DBMS.
• Optionally use SGE job scheduling in conjunction with EC2 load balancing.
Lessons from AWS Cloud
• The bulk of the cloud is made of general-purpose hardware suited to enterprise computing.
• Scientific applications require compute-optimised HPC platforms and high-speed I/O and storage.
• On-demand services are expensive, but large organizations may benefit from the economy of scale!?
• In a self-service environment, the user must handle sysadmin tasks including provisioning and configuration.
• EC2 not being able to compute against S3 (high-I/O tasks) recalls the same “D and A” problem!
• Elastic MapReduce (EMR) runs on EC2 instances, with ephemeral disks used to build HDFS, so data need to be streamed in/out of S3.
• Virtualization imposes performance penalties as the available physical resources are shared among VMs.
Bio-cloud Prototypes
• The EBI has developed an in-house cloud for public sequence repositories such as the European Genome-Phenome Archive (EGA).
• The National Center for Biotechnology Information is working on cloud implementations for storing genomic data such as dbGaP.
• The Beijing Genomics Institute has developed five bio-cloud computing centers in different locations that store and process genomes.
• The US National Cancer Institute maintains the Cancer Genomics Hub (CGHub), a system for storing large genome data.
• The Broad Institute has instantiated its analysis pipeline for germline and cancer somatic data on commercial cloud environments.
• The AMP Lab at UC Berkeley has developed, and is deploying, its genome analysis pipeline on commercial cloud environments.
• Illumina uploads data directly to the cloud, where it has created a platform for sequence analysis called BaseSpace.
Source: Global Alliance White Paper, 3 June 2013
Data/Pipeline Sharing
• Grid computing in the cloud enables sharing of data and resources across virtualized servers.
• Cloud APIs enable application interoperability and cross-platform compatibility.
• Applications are able to launch and access distributed data irrespective of the underlying IT infrastructure.
[Diagram: private clouds at BGI, Sanger, Broad, NCBI and the EBI federated through a public cloud.]
A Unified Platform?
An open-source platform for storing, organizing, processing, and sharing very large genomic and biomedical data on premise or in the cloud:
• Data Management: file and metadata storage, structured/unstructured data, provenance tracking, security and access control.
• Content-addressable Distributed File System: scalability and fault tolerance, block storage of data, high performance over low latency.
• Computation and Pipeline Processing: pipeline creation tools, revision control system, MapReduce engine, etc.
• APIs and SDKs: REST and native APIs, web-based user interface, command-line interface, programming languages and tools, etc.
• Cloud OS and Virtualization: networking, self-service provisioning, administration, block storage, user management, etc.
Discussion
• Compute is much cheaper. Algorithms run faster and more efficiently.
• Transmission of big data will be a bottleneck. Network latency and storage I/O are the limiting factors.
• Minimize the data flow!
• Distributed file systems have reduced the costs; routine analytics of big data has been made possible using cheap commodity hardware.
• We should feel lucky that sequence analysis is mainly embarrassingly parallel!
• MapReduce engines may be deployed in genome data centres?
• Cloud computing enables data and application sharing across consolidated IT infrastructures.