
Cluster-based SNP Calling on Large Scale Genome Sequencing Data

Mucahid Kutlu, Gagan Agrawal

Department of Computer Science and Engineering

The Ohio State University

CCGrid 2014, Chicago, IL


What is an SNP?

• Stands for Single-Nucleotide Polymorphism

• A DNA sequence variation that occurs when a single nucleotide differs between members of a biological species.

• Essential for medical research and for developing personalized medicine.

• A single SNP may cause a Mendelian disease.

*Adapted from Wikipedia


Motivation

• The sequencing costs are decreasing


*Adapted from genome.gov/sequencingcosts

Motivation

• Big data problem
– The 1000 Human Genome Project has already produced 200 TB of data
– Parallel processing is inevitable!

*Adapted from https://www.nlm.nih.gov/about/2015CJ.html

Outline

• Motivation
• Parallel SNP Calling
• Proposed Scheduling Schemes
• Experiments
• Conclusion


General Idea of SNP Calling Algorithms

Alignment File-1 (locations 1-8)
Read-1: A G C G
Read-2: G C G G
Read-3: G C G T A
Read-4: C G T T C C

Reference: A G C G T A C C

Alignment File-2 (locations 1-8)
Read-1: A G A G
Read-2: A G A G T
Read-3: G A G T
Read-4: G T T C C

Two main observations:
• In order to detect an SNP at a certain location, we have to check the alignments in ALL genomes at that location.
• The existence of an SNP at one location is independent of the others.
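The two observations can be illustrated with a minimal Python sketch. It is not VarScan's calling rule: the start offsets of the reads, the `min_fraction` threshold, and the function names `pileup` and `call_snp` are all assumptions made for illustration. It pools the bases from every alignment file at one location and decides about that location without looking at any other.

```python
from collections import Counter

def pileup(reads, location):
    """Collect the bases that the reads place at a 1-based location.

    Each read is a (start, bases) pair; `start` is the assumed 1-based
    location of the read's first base.
    """
    bases = []
    for start, seq in reads:
        offset = location - start
        if 0 <= offset < len(seq):
            bases.append(seq[offset])
    return bases

def call_snp(reference, alignment_files, location, min_fraction=0.5):
    """Return the variant base at `location`, or None if none is called.

    Observation 1: bases from ALL alignment files are pooled.
    Observation 2: only this location is inspected.
    The fraction threshold is a toy rule, not a real caller's statistics.
    """
    observed = []
    for reads in alignment_files:
        observed.extend(pileup(reads, location))
    ref_base = reference[location - 1]
    variants = Counter(b for b in observed if b != ref_base)
    if not variants:
        return None
    base, count = variants.most_common(1)[0]
    return base if count / len(observed) >= min_fraction else None

# Reads from the slide, with hypothetical start offsets.
reference = "AGCGTACC"
file_1 = [(1, "AGCG"), (2, "GCGG"), (2, "GCGTA"), (3, "CGTTCC")]
file_2 = [(1, "AGAG"), (1, "AGAGT"), (2, "GAGT"), (4, "GTTCC")]

# Each location is an independent task, so this loop could be split
# across processes in any order.
calls = [call_snp(reference, [file_1, file_2], loc) for loc in range(1, 9)]
```

Because every call depends only on its own location, a list of locations can be divided among workers without any exchange of intermediate results.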


Parallel SNP Calling

How to distribute data among nodes?

[Figure: genome files divided among Processors 1-4 under three schemes: location-based, sample-based, and checkerboard. Annotation: requires communication among processes.]


Challenges

• Load imbalance due to the nature of genomic data
– It is not just an array of A, G, C and T characters
• I/O contention
• High overhead of random access to a particular region

[Figure: coverage variance]


Histogram Showing Coverage Variance

• Chromosome: 1
• Locations: 1-200M
• Number of samples: 256
• Interval size: 1M


Proposed Scheduling Schemes

• Dynamic Scheduling
• Static Scheduling
• Combined Scheduling

Each scheduling scheme uses location-based data division. That is, the genome is divided into regions and each task is responsible for a region.


Dynamic Scheduling

• Master & Worker approach
• Tasks are assigned dynamically
• Two types of data chunks are used
– Big chunk: covers B locations
– Small chunk: covers S locations
– B > S
• Big chunks are assigned first, then small chunks are assigned

[Figure: Alignment File-1 and Alignment File-2 divided into chunks of B locations]
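As a rough illustration of this scheme, the sketch below builds a chunk list (big chunks first, then small ones) and simulates a master that hands the next chunk to whichever worker becomes idle first. It is a single-process simulation, not the MPI implementation from the talk. The `big_fraction` parameter (how much of the range is covered by big chunks) and the cost function are assumptions, since the slides do not state them.

```python
import heapq

def make_chunks(total, big, small, big_fraction=0.5):
    """Split locations [0, total) into big chunks followed by small chunks."""
    assert big > small > 0
    chunks, start = [], 0
    big_end = int(total * big_fraction)  # hypothetical split point
    while start < big_end:
        end = min(start + big, big_end)
        chunks.append((start, end))
        start = end
    while start < total:
        end = min(start + small, total)
        chunks.append((start, end))
        start = end
    return chunks

def simulate_master_worker(chunks, cost, workers):
    """Give each chunk, in order, to the worker that becomes idle first.

    Returns (makespan, assignment), where assignment maps a worker id
    to the list of chunks it processed.
    """
    idle = [(0.0, w) for w in range(workers)]  # (time worker is free, id)
    heapq.heapify(idle)
    assignment = {w: [] for w in range(workers)}
    for chunk in chunks:
        free_at, w = heapq.heappop(idle)
        assignment[w].append(chunk)
        heapq.heappush(idle, (free_at + cost(chunk), w))
    return max(t for t, _ in idle), assignment

# Toy run: cost is simply the chunk length (an assumption).
chunks = make_chunks(total=100, big=25, small=10)
makespan, assignment = simulate_master_worker(
    chunks, cost=lambda c: c[1] - c[0], workers=2)
```

Handing out big chunks first keeps the number of master requests low early on, while the small chunks at the end let workers finish at nearly the same time.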

Static Scheduling

• Pre-processing step
– We count the number of alignments for each region and generate a histogram
• Estimated cost
– We use an estimation function and our histogram for data partitioning
– k: histogram interval
– TR: cost of accessing/reading the region
– TP: cost of processing an alignment
– N(l): number of alignments at location l
– Each task is responsible for regions having the same estimated cost
• Tasks are scheduled statically; no master & worker approach is used

[Figure: Alignment File-1 and Alignment File-2 divided into regions]
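The partitioning step can be sketched as follows. The estimation function itself is not reproduced in the slide text, so the linear form used here (read cost plus per-alignment cost times the alignment count of the interval) is an assumption built from the listed symbols. The names `estimated_cost` and `partition_equal_cost` are hypothetical.

```python
def estimated_cost(count, t_read, t_proc):
    """Assumed cost of one histogram interval: T_R + T_P * (alignments in it)."""
    return t_read + t_proc * count

def partition_equal_cost(histogram, tasks, t_read=1.0, t_proc=0.01):
    """Group consecutive histogram intervals into `tasks` contiguous regions
    whose estimated costs are as close to equal as the interval size allows.

    Returns a list of (first_interval, last_interval_exclusive) pairs.
    """
    costs = [estimated_cost(n, t_read, t_proc) for n in histogram]
    total = sum(costs)
    bounds, start, running = [], 0, 0.0
    for i, c in enumerate(costs):
        running += c
        # Close the current region once the cumulative cost reaches
        # the next equal-share target; the last region takes the rest.
        target = total * (len(bounds) + 1) / tasks
        if running >= target and len(bounds) < tasks - 1:
            bounds.append((start, i + 1))
            start = i + 1
    bounds.append((start, len(costs)))
    return bounds

# Toy histogram: two densely covered intervals in the middle.
histogram = [10, 10, 10, 10, 40, 40, 10, 10, 10, 10]
regions = partition_equal_cost(histogram, tasks=4, t_read=0, t_proc=1)
```

In this toy run each dense interval becomes its own region while four sparse intervals are grouped together, which is the effect the histogram is meant to achieve: regions differ in length but not in estimated work.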


Combined Scheduling

• Combination of Static and Dynamic Scheduling

• We use small and big chunks as in dynamic scheduling

• The sizes of the chunks are determined according to the histogram

• Master-Worker approach

[Figure: Alignment File-1 and Alignment File-2 divided into big chunks followed by small chunks]
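One possible reading of "chunk sizes determined according to the histogram" is sketched below: chunk boundaries are placed so that each big chunk carries roughly `big_cost` alignments and each small chunk roughly `small_cost`, instead of a fixed number of locations. The slides do not give the exact rule, so the use of raw alignment counts as the cost, the `big_share` parameter, and the function name are all assumptions. The resulting chunks would then be handed out by a master to idle workers, big ones first.

```python
def cost_sized_chunks(histogram, big_cost, small_cost, big_share=0.5):
    """Cut histogram intervals into chunks of roughly equal estimated work.

    The first `big_share` of the total work is cut into chunks of about
    `big_cost` alignments, the rest into chunks of about `small_cost`.
    Returns a list of (first_interval, last_interval_exclusive, work).
    """
    assert big_cost > small_cost > 0
    total = sum(histogram)
    chunks, start, acc, done = [], 0, 0, 0
    for i, n in enumerate(histogram):
        acc += n
        target = big_cost if done < total * big_share else small_cost
        if acc >= target:
            chunks.append((start, i + 1, acc))
            done += acc
            start, acc = i + 1, 0
    if start < len(histogram):  # leftover intervals form a final chunk
        chunks.append((start, len(histogram), acc))
    return chunks

# Toy histogram: two densely covered intervals in the middle.
histogram = [5, 5, 5, 5, 20, 20, 5, 5, 5, 5]
chunks = cost_sized_chunks(histogram, big_cost=20, small_cost=10)
```

A dense interval can exceed the small-chunk target on its own, as the third chunk in this toy run does; a finer histogram interval would reduce that effect.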


Parameters of Scheduling Schemes

• Our proposed scheduling schemes have user-defined parameters
– Dynamic Scheduling
  • Length of big and small chunks
– Static Scheduling
  • Histogram interval size
  • Estimation function parameters
– Combined Scheduling
  • All parameters for dynamic and static scheduling
• All parameters can be determined with an offline training phase


Experiments

• Local cluster with nodes
– 2 quad-core 2.53 GHz Xeon(R) processors with 12 GB RAM
• We obtained genomes of 256 samples from the 1000 Human Genome Project
• The data is replicated to all local disks unless noted otherwise
• Parallel implementation:
– We implemented VarScan in the C programming language
  • We also modified VarScan such that BAM files can be read directly
– We used the MPI library for parallelization


Experiments: Scalability

First 192M locations of Chr. 1

Scheduling Scheme    Scalability
Basic                8.4x
Dynamic              10.9x
Static               19.7x
Combined             23.5x


Experiments: Data Size Impact

128 cores are allocated


Experiments: I/O Contention Impact

128 cores are allocated

Scheduling Scheme    I/O Contention Impact (sec)
Basic                174
Dynamic              229
Static               251
Combined             220


Comparison with Hadoop

- The first 192M locations of Chr. 2 in 512 samples are analyzed

- Lower (dark) portions of the bars show pre-processing time.


Scheduling With Replication

• Data-intensive processing motivates new schemes
• Replicate each chunk a fixed/variable number of times
• Dynamic scheduling while processing only local chunks
• Interesting new tradeoffs
• Under submission


Other Work

• PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications (IPDPS 2014)

• Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language


PAGE vs. State-of-the-Art

• A middleware system
– Specific to parallel genetic data processing
– Allows parallelization of a variety of genetic algorithms
– Able to work with different popular genetic data formats
– Allows use of existing programs


Conclusion

• We have developed a methodology for parallel identification of variants in large-scale genome sequencing data.

• Coverage variance and I/O contention are the two main problems
• We proposed 3 scheduling schemes
• Combined scheduling gives the best results
• Our approach has good speedup and outperforms Hadoop
