PAGE: A Framework for Easy Parallelization of Genomic Applications

Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University
IPDPS 2014, Phoenix, Arizona

Dec 29, 2015




Motivation

• The sequencing costs are decreasing

*Adapted from genome.gov/sequencingcosts

• Big data problem
  – The 1000 Genomes Project has already produced 200 TB of data
  – Parallel processing is inevitable!

*Adapted from https://www.nlm.nih.gov/about/2015CJ.html

Typical Analysis on Genomic Data

• Single Nucleotide Polymorphism (SNP) calling
• A single SNP may cause Mendelian disease!

Reference (positions 1 to 8): A G C G T A C C

Alignment File-1
  Read-1: A G C G
  Read-2: G C G G
  Read-3: G C G T A
  Read-4: C G T T C C

Alignment File-2
  Read-1: A G A G
  Read-2: A G A G T
  Read-3: G A G T
  Read-4: G T T C C

*Adapted from Wikipedia

Outline

• Motivation
• Existing Solutions for Implementation
• Our Work
• Experimental Evaluation
• Conclusion

Existing Solutions for Implementation

• Serial tools
  – SamTools, VCFTools, BedTools: file merging, sorting, etc.
  – VarScan: SNP calling
• Parallel implementations
  – Turboblast: searching local alignments
  – SEAL: read mapping and duplicate removal
  – Biodoop: statistical analysis
• Middleware systems
  – Hadoop
    • Not designed for the specific needs of genetic data
    • Limited programmability
  – Genome Analysis Tool Kit (GATK)
    • Designed for genetic data processing
    • Provides special data traversal patterns
    • Limited parallelization for some of its tools


Our Goal

• We want to develop a middleware system that
  – Is specific to parallel genetic data processing
  – Allows parallelization of a variety of algorithms for genetic data
  – Works with different popular genetic data formats
  – Allows the use of existing programs

Challenges

• Load imbalance due to the nature of genomic data
  – It is not just an array of A, G, C and T characters
  – [Figure: coverage variance across genome positions]
• High overhead of tasks
• I/O contention

Our Work

• PAGE: a Map-Reduce-like middleware for easy parallelization of genomic applications
• Mappers and reducers are executable programs
  – Allows us to exploit existing applications
  – No restriction on programming language

Intra-dependent Processing

• Each file is processed independently

[Figure: each input file File-i (File-1 to File-m) is split into Region-1 to Region-n. One map task per region produces the intermediate outputs O-i1 to O-in, and a reduce task combines them into Output-i.]

Inter-dependent Processing

• Each map task processes a particular region of ALL files

[Figure: the input files are split into Region-1 to Region-n. The map task for Region-k produces the intermediate output Ok, and a single reduce task combines O1 to On into the final output.]
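
The difference between the intra-dependent and inter-dependent execution models can be sketched by listing the tasks each one would create. This is only an illustration under assumptions: the function names, the tuple layout and the file names are hypothetical and are not part of PAGE's interface.

```python
# Illustrative sketch only: these helpers are hypothetical, not PAGE's API.

def intra_dependent_plan(files, regions):
    """Each file is processed independently.

    Every (file, region) pair gets its own map task, and every file
    gets its own reduce task, so m files and n regions give
    m * n map tasks and m reduce tasks.
    """
    maps = [((name,), region) for name in files for region in regions]
    return maps, len(files)


def inter_dependent_plan(files, regions):
    """Each map task processes one region of ALL files.

    n regions give n map tasks and a single reduce task.
    """
    maps = [(tuple(files), region) for region in regions]
    return maps, 1


if __name__ == "__main__":
    files = ["File-1", "File-2"]
    regions = ["Region-1", "Region-2", "Region-3"]
    for plan in (intra_dependent_plan, inter_dependent_plan):
        maps, reduces = plan(files, regions)
        print(plan.__name__, len(maps), "map tasks,", reduces, "reduce tasks")
```

With the same inputs, the intra-dependent model creates more, smaller map tasks, while the inter-dependent model creates one map task per region that reads every file.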

What Can PAGE Parallelize?

• PAGE can parallelize all applications that have the following property
  – M is the map task
  – R, R1 and R2 are three regions such that R = concatenation of R1 and R2
  – M(R) = M(R1) ⊕ M(R2), where ⊕ is the reduction function

[Figure: region R split into the adjacent regions R1 and R2]
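
The property M(R) = M(R1) ⊕ M(R2) can be checked on a toy example. The sketch below is an assumption-laden illustration, not PAGE code: the map task simply reports positions where a made-up sample sequence differs from the reference, and ⊕ is list concatenation. The sample string and all function names are hypothetical.

```python
# Toy check of M(R) = M(R1) (+) M(R2). All names and data are illustrative.

REFERENCE = "AGCGTACC"  # reference sequence from the SNP example
SAMPLE = "AGAGTTCC"     # hypothetical consensus sequence of one sample


def map_task(start, end):
    """M: list (1-based position, reference base, sample base) mismatches
    in the half-open region [start, end)."""
    return [
        (pos + 1, REFERENCE[pos], SAMPLE[pos])
        for pos in range(start, end)
        if REFERENCE[pos] != SAMPLE[pos]
    ]


def reduce_task(left, right):
    """The reduction function: concatenate results of adjacent regions."""
    return left + right


def satisfies_property(start, split, end):
    """True if mapping R gives the same result as reducing M(R1) and M(R2)."""
    whole = map_task(start, end)
    parts = reduce_task(map_task(start, split), map_task(split, end))
    return whole == parts


if __name__ == "__main__":
    print(map_task(0, len(REFERENCE)))
    print(all(satisfies_property(0, s, len(REFERENCE)) for s in range(9)))
```

A map task whose result at one position depends on bases far outside its region would fail this check, which is why the property matters when deciding whether an application fits the model.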

Data Partitioning

• Data is NOT packaged into equal-size data blocks as in Hadoop
  – Each application has a different way of reading the data
  – Equal-size data block packaging ignores nucleotide base location information
• The genome is divided into regions and each map task is assigned to a region
  – Takes location information into account
  – The map task is responsible for accessing its particular region of the input files
    • This is a common feature of many genomic tools (GATK, SamTools)

Genome Partition

• PAGE provides two data partitioning methods
  – By-locus partitioning: chromosomes are divided into regions
  – By-chromosome partitioning: chromosomes preserve their unity

[Figure: chromosomes Chr-1 to Chr-6 shown twice, once for each partitioning method]
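
The two partitioning methods can be sketched as follows. This is a minimal illustration under assumptions: the slides do not say how PAGE chooses region boundaries, so the fixed `region_size` parameter, the function names and the chromosome lengths are all hypothetical.

```python
# Illustrative sketch of the two partitioning methods; not PAGE's implementation.

def by_chromosome(chrom_lengths):
    """By-chromosome partitioning: one region per whole chromosome."""
    return [(name, 1, length) for name, length in chrom_lengths.items()]


def by_locus(chrom_lengths, region_size):
    """By-locus partitioning: cut each chromosome into regions.

    Assumption: regions have a fixed maximum size, and a region never
    spans two chromosomes.
    """
    regions = []
    for name, length in chrom_lengths.items():
        for start in range(1, length + 1, region_size):
            end = min(start + region_size - 1, length)
            regions.append((name, start, end))
    return regions


def to_region_string(region):
    """Format a region as 'chr:start-end', the form many genomic tools accept."""
    name, start, end = region
    return f"{name}:{start}-{end}"


if __name__ == "__main__":
    lengths = {"Chr-1": 2500, "Chr-2": 1000}  # made-up lengths
    print([to_region_string(r) for r in by_chromosome(lengths)])
    print([to_region_string(r) for r in by_locus(lengths, 1000)])
```

By-locus partitioning yields more regions and therefore more map tasks, while by-chromosome partitioning suits tools that need a whole chromosome at once.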

Task Scheduling

PAGE provides two types of scheduling schemes.

Static
• Each processor is responsible for regions of equal length.
• All map tasks must finish before the reduce tasks are executed.

Dynamic
• Map and reduce tasks are assigned by a master process.
• Reduce tasks can start once enough intermediate results are available.
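
To see why the choice of scheme matters when regions of equal length take unequal time, the sketch below compares the finishing time of the map phase under both schemes. It is a simplified model under assumptions, not PAGE's scheduler: task costs are made up, reduce tasks and scheduling overhead are ignored, and all function names are hypothetical.

```python
# Simplified model of static vs dynamic assignment of map tasks.
import heapq
from math import ceil


def static_makespan(costs, workers):
    """Static: each worker gets an equal-sized contiguous block of regions.

    The map phase ends when the most heavily loaded worker finishes.
    """
    if not costs:
        return 0
    chunk = ceil(len(costs) / workers)
    loads = [sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk)]
    return max(loads)


def dynamic_makespan(costs, workers):
    """Dynamic: a master hands the next region to the first idle worker."""
    finish_times = [0] * workers  # a list of zeros is already a valid heap
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


if __name__ == "__main__":
    # One expensive region (for example, very high coverage) and seven cheap ones.
    costs = [9, 1, 1, 1, 1, 1, 1, 1]
    print("static:", static_makespan(costs, 2))
    print("dynamic:", dynamic_makespan(costs, 2))
```

In this model the dynamic scheme finishes sooner because the worker that is not stuck on the expensive region keeps picking up the cheap ones. In practice the master process adds overhead that the model leaves out.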

Applications Developed Using PAGE

• We parallelized 4 applications
  – VarScan: SNP detection
  – Realigner Target Creator: detects insertions/deletions in alignment files
  – Indel Realigner: applies local realignment to improve the quality of alignment files
  – Unified Genotyper: SNP detection

Sample Application Development with PAGE

• Serial execution command of the VarScan software
  – samtools mpileup -b file_list -f reference | java -jar VarScan.jar mpileup2snp
• To parallelize VarScan with PAGE, the user needs to define:
  – Genome partition: By-locus
  – Scheduling scheme: Dynamic (or Static)
  – Execution model: Inter-dependent
  – Map command: samtools mpileup -b file_list -r regionloc -f reference | java -jar VarScan.jar mpileup2snp >outputloc
  – Reduction: the cat bash shell command
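
One way to picture how the map command template turns into concrete tasks is to substitute a region and an output file for the `regionloc` and `outputloc` placeholders. The sketch below only builds command strings and does not run them. How PAGE performs the substitution is not described in the slides, so the replacement logic, the output file names and the function names are assumptions.

```python
# Builds (but does not execute) per-region map commands and a cat reduction.
# The substitution scheme and output file names are hypothetical.

MAP_TEMPLATE = (
    "samtools mpileup -b file_list -r regionloc -f reference"
    " | java -jar VarScan.jar mpileup2snp >outputloc"
)


def map_commands(regions, out_prefix="part"):
    """Return one map command per region plus the list of output files."""
    commands = []
    outputs = []
    for index, region in enumerate(regions):
        output = f"{out_prefix}-{index:04d}.snp"
        command = MAP_TEMPLATE.replace("regionloc", region)
        command = command.replace("outputloc", output)
        commands.append(command)
        outputs.append(output)
    return commands, outputs


def reduce_command(outputs, final_output="all.snp"):
    """Reduction with cat: concatenate the per-region outputs in order."""
    return "cat " + " ".join(outputs) + " > " + final_output


if __name__ == "__main__":
    commands, outputs = map_commands(["chr1:1-1000", "chr1:1001-2000"])
    for command in commands:
        print(command)
    print(reduce_command(outputs))
```

Because the regions are passed in genome order and the output names are zero-padded, concatenating the files in list order keeps the results sorted by position.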


Experiments

• Experimental setup
  – In our cluster, each node has
    • 12 GB memory
    • 8 cores (2.53 GHz)
  – We obtained the data from the 1000 Genomes Project
  – We evaluated PAGE with 4 applications
  – We compared PAGE with Hadoop Streaming and GATK

Comparison with GATK

• Indel Realigner tool of GATK

[Figure: two panels, Scalability and Data Size Impact. Labels: data size 11 GB; # of cores 128. Annotated speedups: 3.3x (GATK) and 9x (PAGE).]

Comparison with GATK

• Unified Genotyper tool of GATK

[Figure: two panels, Scalability and Data Size Impact. Labels: data size 34 GB; # of cores 128. Annotated speedups: 10.9x (GATK) and 12.8x (PAGE).]

Comparison with Hadoop Streaming

• VarScan application

[Figure: two panels, Scalability and Data Size Impact. Labels: data size 52 GB; # of cores 128. Annotated speedups: 6.9x (Hadoop Streaming) and 12.7x (PAGE).]

Summary of Experimental Results

Speedup when the computing power is increased by 16 times:

                   Indel Realigner   Unified Genotyper   VarScan   Realigner Target Creator
PAGE               9x                12.8x               12.7x     14.1x
GATK               3.3x              10.9x               -         -
Hadoop Streaming   -                 -                   6.9x      -
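
The figures in the summary can also be read as parallel efficiency, that is, the achieved speedup divided by the 16-fold increase in computing power. The short sketch below does this arithmetic. It assumes the table entries are speedups relative to a run with 16 times less computing power; the dictionary layout and function names are illustrative.

```python
# Parallel efficiency = speedup / increase in computing power.
# Values are copied from the summary table; names are illustrative.

SCALE_UP = 16

SPEEDUPS = {
    "PAGE": {
        "Indel Realigner": 9.0,
        "Unified Genotyper": 12.8,
        "VarScan": 12.7,
        "Realigner Target Creator": 14.1,
    },
    "GATK": {"Indel Realigner": 3.3, "Unified Genotyper": 10.9},
    "Hadoop Streaming": {"VarScan": 6.9},
}


def efficiency(speedup, scale_up=SCALE_UP):
    """Fraction of the ideal (linear) speedup that was achieved."""
    return speedup / scale_up


def efficiency_table(speedups=SPEEDUPS):
    """Efficiency for every (system, application) pair that was measured."""
    return {
        system: {app: efficiency(value) for app, value in apps.items()}
        for system, apps in speedups.items()
    }


if __name__ == "__main__":
    for system, apps in efficiency_table().items():
        for app, value in apps.items():
            print(f"{system:17s} {app:25s} {value:.0%}")
```

For example, a 9x speedup on 16 times the computing power corresponds to an efficiency of 9 / 16, a little over half of the ideal.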

Conclusion

• We developed a middleware that
  – Easily parallelizes genomic applications
  – Has high applicability
    • No restriction on programming language or data format
    • Allows the use of existing applications
  – Lets the user control the parallel execution while hiding the details
    • Alternative scheduling schemes, execution models and data partitioning types
  – Scales well


Thank you for listening …

Questions