RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications. Mucahid Kutlu, Gagan Agrawal. Department of Computer Science and Engineering, The Ohio State University. Cluster 2015, Chicago, Illinois.

Jan 18, 2016


Raymond Ryan
Transcript
Page 1: RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications

Mucahid Kutlu, Gagan Agrawal

Department of Computer Science and Engineering

The Ohio State University

Cluster 2015, Chicago, Illinois

Page 2: Motivation

• Sequencing costs are decreasing
• Available data is increasing!

*Figures adapted from www.genome.gov/sequencingcosts and www.nlm.nih.gov/about/2015CJ.html

Parallel processing is inevitable!

Page 3: Typical Analysis on Genomic Data

• Single Nucleotide Polymorphism (SNP) calling

Reference: A G C G T A C C (positions 1 to 8)

Alignment File-1
Read-1: A G C G
Read-2: G C G G
Read-3: G C G T A
Read-4: C G T T C C

Alignment File-2
Read-1: A G A G
Read-2: A G A G T
Read-3: G A G T
Read-4: G T T C C

*Adapted from Wikipedia

A single SNP may cause Mendelian disease!
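To make the SNP calling step concrete, the following is a minimal sketch of a naive pileup-based caller, not the method used by any particular tool. The read start positions are assumptions (the slide text does not give the alignment offsets), and `call_snps`, `min_depth`, and `min_fraction` are hypothetical names chosen for illustration.

```python
from collections import Counter, defaultdict

def call_snps(reference, reads, min_depth=2, min_fraction=0.8):
    """Naive SNP caller: report positions where the dominant read base
    differs from the reference base.

    reference: reference sequence as a string (position 1 = first base)
    reads: list of (start_position, bases); start positions are 1-based
    """
    # Build the pileup: position -> count of each observed base.
    pileup = defaultdict(Counter)
    for start, bases in reads:
        for offset, base in enumerate(bases):
            pileup[start + offset][base] += 1

    snps = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, n = counts.most_common(1)[0]
        ref_base = reference[pos - 1]
        # Call a SNP only with enough coverage and a clear majority base.
        if base != ref_base and depth >= min_depth and n / depth >= min_fraction:
            snps.append((pos, ref_base, base))
    return snps

# Reads of Alignment File-2, with ASSUMED start positions.
reference = "AGCGTACC"
reads_file2 = [(1, "AGAG"), (1, "AGAGT"), (2, "GAGT"), (4, "GTTCC")]
print(call_snps(reference, reads_file2))
```

Under these assumed offsets, position 3 is covered by three reads that all disagree with the reference, while the mismatch at position 6 is seen in only one read and falls below `min_depth`.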

Page 4: Existing Solutions for Implementation

• Serial tools
– SamTools, VCFTools, BedTools: file merging, sorting, etc.
– VarScan: SNP calling

• Parallel implementations
– Turboblast: searching local alignments
– SEAL: read mapping and duplicate removal
– Biodoop: statistical analysis

• Middleware systems
– Hadoop
    • Not designed for specific needs of genetic data
    • Limited programmability
– Genome Analysis Tool Kit (GATK)
    • Designed for genetic data processing
    • Provides special data traversal patterns
    • Limited parallelization for some of its tools

Page 5: Our Goal

• We want to develop a middleware system that
– Is specific to parallel genetic data processing
– Allows parallelization of a variety of genetic algorithms
– Works with different popular genetic data formats
– Allows use of existing programs

Page 6: Challenges

• Load imbalance due to the nature of genomic data
– It is not just an array of A, G, C and T characters

• High overhead of tasks

• I/O contention

Figure: Coverage variance

Page 7: Background: PAGE (IPDPS'14)

• PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications

• Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language

Page 8: Parallel Genomic Applications

• RE-PAGE: A Map-Reduce-like middleware for easy parallelization of data-intensive genomic applications (like PAGE)

• Main goals (unlike PAGE)
– Decrease I/O contention by employing a distributed file system
– Workload balance in data-intensive tasks
– Avoid data transfers

Page 9: Execution Model

Page 10: RE-PAGE

• Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language

• Applicability
– The algorithm should be safe to parallelize by processing different regions of the genome independently
– SNP calling, statistical tools and others
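The idea of using executables as mappers and reducers can be sketched as follows. This is an illustration only, not the RE-PAGE interface: `MAPPER_CMD`, `REDUCER_CMD`, `run_executable`, and `map_reduce` are hypothetical names, and two tiny Python one-liners stand in for existing genomic tools so that the example is self-contained.

```python
import subprocess
import sys

# Hypothetical stand-ins for existing tools. Any executable that reads
# stdin and writes stdout could be used instead, in any language.
# Mapper: prints the number of bases in the region it receives.
MAPPER_CMD = [sys.executable, "-c",
              "import sys; print(len(sys.stdin.read().strip()))"]
# Reducer: sums the integers produced by the mappers.
REDUCER_CMD = [sys.executable, "-c",
               "import sys; print(sum(int(x) for x in sys.stdin.read().split()))"]

def run_executable(cmd, stdin_text):
    """Run an external program, feed it text on stdin, return its stdout."""
    result = subprocess.run(cmd, input=stdin_text, capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()

def map_reduce(regions):
    """Apply the mapper executable to each region independently,
    then combine the partial outputs with the reducer executable."""
    partials = [run_executable(MAPPER_CMD, region) for region in regions]
    return run_executable(REDUCER_CMD, "\n".join(partials))

print(map_reduce(["AGCG", "TACC"]))
```

Because the framework only launches processes and passes data through them, the mapper and reducer never need to be rewritten against a framework API, which is what makes reuse of existing applications possible.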

Page 11: RE-PAGE Parallelization

• PAGE can parallelize all applications that have the following property
• M: map task
• R, R1 and R2 are three regions such that R = concatenation of R1 and R2
• M(R) = M(R1) ⊕ M(R2), where ⊕ is the reduction function
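The property M(R) = M(R1) ⊕ M(R2) can be checked on a small example. The sketch below assumes a map task that counts bases and a reduction that adds the counts; both are illustrative choices, not the applications from the talk. It also shows a task that does not satisfy the property, because a match can straddle the region boundary.

```python
from collections import Counter

def map_task(region):
    """M: count how often each base occurs in a region."""
    return Counter(region)

def reduce_fn(a, b):
    """The reduction: merge two partial counts by addition."""
    return a + b

R1, R2 = "AGCG", "TACC"
R = R1 + R2  # R is the concatenation of R1 and R2

# Base counting satisfies M(R) = M(R1) (+) M(R2).
assert map_task(R) == reduce_fn(map_task(R1), map_task(R2))

def count_motif(region, motif="GT"):
    """A task that is NOT safe to split naively: counts motif occurrences."""
    return region.count(motif)

# The motif "GT" spans the boundary between R1 and R2, so the partial
# results miss it.
print(count_motif(R), count_motif(R1) + count_motif(R2))
```

Tasks of the second kind would need overlapping regions or boundary handling before they fit this model.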

Page 12: Domain-Specific Data Chunks

• Heuristic: The data in the same genomic location/region can be related and most likely will be processed together for many types of genomic data analysis

• Construct data chunks according to genomic region
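One simple way to construct chunks by genomic region is to bucket records by chromosome and fixed-width position window. This is a sketch under assumptions: the record layout `(chromosome, position, payload)`, the function name `build_chunks`, and the fixed `region_size` are all hypothetical, and the slides do not state how RE-PAGE chooses region boundaries.

```python
from collections import defaultdict

def build_chunks(records, region_size):
    """Group records into chunks keyed by (chromosome, region index).

    records: iterable of (chromosome, position, payload) tuples
    region_size: width of each genomic region in bases (assumed fixed)
    """
    chunks = defaultdict(list)
    for chrom, pos, payload in records:
        # Records that fall in the same window end up in the same chunk.
        chunks[(chrom, pos // region_size)].append((chrom, pos, payload))
    return dict(chunks)

records = [
    ("chr1", 10, "read-a"),
    ("chr1", 950, "read-b"),
    ("chr1", 1200, "read-c"),
    ("chr2", 10, "read-d"),
]
for key, chunk in sorted(build_chunks(records, region_size=1000).items()):
    print(key, len(chunk))
```

Since coverage differs across the genome, chunks built this way naturally have different sizes, which is the situation the scheduling schemes have to handle.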

Page 13: Proposed Replication Method

• Needed to increase data locality
• Replicating all chunks onto all nodes is not feasible.
• Depending on the analysis we want to perform, some genomic regions can be more important than others.
• General idea: Replicate important regions more than others.
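The general idea can be sketched as a replication factor that grows with a per-chunk importance score. Everything below is an assumption made for illustration: the slides do not define how importance is measured or how replicas are placed, so the score in [0, 1], the linear mapping, the least-loaded placement rule, and the names `replication_factor` and `place_replicas` are all hypothetical.

```python
import heapq

def replication_factor(importance, min_replicas=1, max_replicas=4):
    """Map an importance score in [0, 1] to a replica count (assumed linear)."""
    importance = max(0.0, min(1.0, importance))
    return min_replicas + round(importance * (max_replicas - min_replicas))

def place_replicas(chunks, num_nodes, min_replicas=1, max_replicas=4):
    """Place each chunk's replicas on the currently least-loaded nodes.

    chunks: dict chunk_id -> (size_mb, importance)
    Returns dict chunk_id -> sorted list of node ids holding a replica.
    """
    load = [0.0] * num_nodes  # MB stored per node
    placement = {}
    # Place large chunks first so they spread across nodes.
    for chunk_id, (size, importance) in sorted(chunks.items(),
                                               key=lambda kv: -kv[1][0]):
        k = min(num_nodes, replication_factor(importance, min_replicas, max_replicas))
        nodes = heapq.nsmallest(k, range(num_nodes), key=lambda n: (load[n], n))
        for n in nodes:
            load[n] += size
        placement[chunk_id] = sorted(nodes)
    return placement

# Hypothetical chunks: one important region, one unimportant region.
chunks = {"important-region": (60, 1.0), "other-region": (40, 0.0)}
print(place_replicas(chunks, num_nodes=4))
```

More replicas of an important chunk mean more nodes can process it locally, which supports the goal of avoiding data transfers.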

Page 14: Proposed Scheduling Schemes

• Problem definition
– Chunks can vary in size and in number of replicas
– Tasks are data intensive: data transfer costs outweigh data processing costs

• General approach
– Avoid remote processing
– Take advantage of variety in replication factors and data sizes

• Master & worker approach

• We propose 3 scheduling schemes
– Largest Chunk First (LCF)
– Help the Busiest Node (HBN)
– Effective Memory Management (EMM)
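As an illustration of the first scheme, the following small simulation gives each idle worker the largest unprocessed chunk for which it holds a local replica, and never assigns remote chunks. This is one plausible reading of "Largest Chunk First" under the "avoid remote processing" rule, not the authors' implementation; `simulate_lcf` and its parameters are hypothetical names.

```python
import heapq

def simulate_lcf(chunk_sizes, replicas, num_nodes, speed_mb_per_s=1.0):
    """Simulate a master handing out chunks, largest local chunk first.

    chunk_sizes: dict chunk_id -> size in MB
    replicas: dict chunk_id -> set of node ids holding a copy
    Returns (makespan in seconds, dict chunk_id -> node that processed it).
    """
    remaining = set(chunk_sizes)
    # Min-heap of (time at which the node becomes idle, node id).
    idle = [(0.0, n) for n in range(num_nodes)]
    heapq.heapify(idle)
    assignment = {}
    makespan = 0.0
    while idle and remaining:
        now, node = heapq.heappop(idle)
        local = [c for c in remaining if node in replicas[c]]
        if not local:
            continue  # no local work left for this node; remote tasks are prohibited
        chunk = max(local, key=lambda c: (chunk_sizes[c], c))
        remaining.remove(chunk)
        assignment[chunk] = node
        finish = now + chunk_sizes[chunk] / speed_mb_per_s
        makespan = max(makespan, finish)
        heapq.heappush(idle, (finish, node))
    return makespan, assignment

sizes = {"a": 100, "b": 60, "c": 40}
where = {"a": {0, 1}, "b": {0, 1}, "c": {1}}
print(simulate_lcf(sizes, where, num_nodes=2))
```

In this toy input, node 0 takes the 100 MB chunk while node 1 processes the 60 MB and 40 MB chunks, so both nodes finish at the same time without any data transfer.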

Page 15: Experiments (1)

Computation power: 32 nodes (256 cores)
Average data chunk size: 32MB
Replication factor: 3
Number of chunks: 2000

Figure panels: Varying STD of Data Blocks; Varying Computation Speed

Average size of chunks in real genomic data: 68MB
STD of chunk sizes in real genomic data: 63MB
Processing speed: 1MB/sec

STD of chunk sizes: 24MB

Page 16: Experiments (2)

Comparison with a Centralized Approach

Computation power: 32 nodes (256 cores)
Replication factor: 3
Application: Coverage Analyzer

Page 17: Experiments (3)

Parallel Scalability

Application: Coverage Analyzer
Data size: 15 SAM files (47 GB)
Replication factor: 3

Application: Unified Genotyper
Data size: 40 BAM files (51 GB)
Replication factor: 3 (only RE-PAGE)

Values annotated on the plots: 4.2x, 2.2x, 7.1x, 9.9x

Page 18: Summary

• RE-PAGE for developing parallel data-intensive genomic applications
– Programming
    • Employs executables of genomic applications
    • Can parallelize a wide range of applications
– Performance
    • Keeps data in a distributed file system
    • Minimizes data transfer
    • Employs an intelligent replication method

• RE-PAGE outperforms Hadoop and GATK and has good parallel scalability results

• Observation: Prohibiting remote tasks increases performance if chunks have varying sizes and tasks are data intensive.

Page 19: Thank you!