
Cluster-based SNP Calling on Large Scale Genome Sequencing Data

Mucahid Kutlu, Gagan Agrawal

Department of Computer Science and Engineering

The Ohio State University

CCGrid 2014, Chicago, IL


What is an SNP?

• Stands for Single-Nucleotide Polymorphism

• A DNA sequence variation that occurs when a single nucleotide differs between members of a biological species.

• Essential for medical research and for developing personalized medicine.

• A single SNP may cause a Mendelian disease.

*Adapted from Wikipedia


Motivation

• Sequencing costs are decreasing


*Adapted from genome.gov/sequencingcosts


Motivation

• Big data problem
– The 1000 Genomes Project has already produced 200 TB of data
– Parallel processing is inevitable!

*Adapted from https://www.nlm.nih.gov/about/2015CJ.html


Outline

• Motivation
• Parallel SNP Calling
• Proposed Scheduling Schemes
• Experiments
• Conclusion


General Idea of SNP Calling Algorithms

Reference (locations 1-8): A G C G T A C C

Alignment File-1
Read-1: A G C G
Read-2: G C G G
Read-3: G C G T A
Read-4: C G T T C C

Alignment File-2
Read-1: A G A G
Read-2: A G A G T
Read-3: G A G T
Read-4: G T T C C

Two main observations:

• To detect an SNP at a certain location, we have to check the alignments in ALL genomes at that location.

• The existence of an SNP at one location is independent of the others.
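The two observations can be illustrated with a minimal sketch that pools base counts from every alignment file and then tests each location on its own. This is not VarScan's actual procedure. The read start offsets, the function names, and the thresholds `min_alt_reads` and `min_alt_fraction` are assumptions made for illustration; the offsets were chosen so that each read lines up with the reference shown on the slide.

```python
from collections import Counter

REFERENCE = "AGCGTACC"

# (start location, bases) per read; the start offsets are ASSUMED,
# since the slide text does not preserve where each read aligns.
FILE_1 = [(1, "AGCG"), (2, "GCGG"), (2, "GCGTA"), (3, "CGTTCC")]
FILE_2 = [(1, "AGAG"), (1, "AGAGT"), (2, "GAGT"), (4, "GTTCC")]


def pileup(reads, length):
    """Count the bases observed at each location in one alignment file."""
    counts = [Counter() for _ in range(length)]
    for start, bases in reads:
        for offset, base in enumerate(bases):
            counts[start - 1 + offset][base] += 1
    return counts


def call_snps(reference, files, min_alt_reads=2, min_alt_fraction=0.3):
    """Naive caller: pool counts over ALL files, then test each location."""
    total = [Counter() for _ in reference]
    for reads in files:
        for pos, counts in enumerate(pileup(reads, len(reference))):
            total[pos].update(counts)

    snps = []
    for pos, counts in enumerate(total):  # each location is independent
        depth = sum(counts.values())
        for base, n in counts.items():
            if (base != reference[pos]
                    and n >= min_alt_reads
                    and n / depth >= min_alt_fraction):
                snps.append((pos + 1, reference[pos], base))
    return snps


if __name__ == "__main__":
    print(call_snps(REFERENCE, [FILE_1, FILE_2]))
```

Because every location is tested independently, the loop over locations is the part that can be split across processors. The result depends entirely on the illustrative thresholds and the assumed offsets.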


Parallel SNP Calling

How to distribute data among nodes?

[Figure: genome files divided among Processors 1-4 in three ways: location-based, sample-based, and checkerboard. Annotation: requires communication among processes.]


Challenges

• Load imbalance due to the nature of genomic data
– It is not just an array of A, G, C and T characters
• I/O contention
• High overhead of random access to a particular region

[Figure: Coverage Variance]


Histogram Showing Coverage Variance

• Chromosome: 1
• Locations: 1-200M
• Number of samples: 256
• Interval size: 1M
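A histogram of this kind can be built by counting how many alignments start in each fixed-size interval. The sketch below assumes the alignment start locations are already available as a list of 1-based integers; the function name and its arguments are hypothetical, and reading them from BAM files is left out.

```python
def coverage_histogram(alignment_starts, num_locations, interval_size):
    """Count alignments per interval of `interval_size` locations.

    Locations are 1-based; alignments starting outside
    1..num_locations are ignored.
    """
    num_bins = (num_locations + interval_size - 1) // interval_size
    histogram = [0] * num_bins
    for start in alignment_starts:
        if 1 <= start <= num_locations:
            histogram[(start - 1) // interval_size] += 1
    return histogram


if __name__ == "__main__":
    # Toy input: 30 locations, interval size 10.
    print(coverage_histogram([1, 5, 10, 11, 25], 30, 10))
```

With the slide's settings, `num_locations` would be 200M and `interval_size` 1M, giving 200 bins per sample.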


Proposed Scheduling Schemes

• Dynamic Scheduling
• Static Scheduling
• Combined Scheduling

Each scheduling scheme uses location-based data division. That is, the genome is divided into regions and each task is responsible for a region.


Dynamic Scheduling

• Master & Worker approach
• Tasks are assigned dynamically
• Two types of data chunks are used
– Big chunk: covers B locations
– Small chunk: covers S locations
– B > S
• Big chunks are assigned first, then small chunks are assigned

[Figure: Alignment File-1 and Alignment File-2, with chunks of length B marked]
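The dynamic scheme can be sketched as a task queue that holds big chunks first and small chunks afterwards, with workers pulling the next chunk whenever they are idle. The sketch uses Python threads and a shared queue as a stand-in for the master; it is not the MPI implementation. The `big_fraction` parameter, which decides how much of the genome is covered by big chunks, is a hypothetical knob that the slides do not specify.

```python
import queue
import threading


def make_chunks(num_locations, big, small, big_fraction=0.75):
    """Return 1-based inclusive (start, end) chunks: big ones, then small ones."""
    assert big > small > 0
    big_limit = int(num_locations * big_fraction)  # assumed split point
    chunks, start = [], 1
    while start <= num_locations:
        size = big if start + big - 1 <= big_limit else small
        end = min(start + size - 1, num_locations)
        chunks.append((start, end))
        start = end + 1
    return chunks


def run_dynamic(chunks, num_workers, process_chunk):
    """Workers repeatedly take the next unassigned chunk until none remain."""
    tasks = queue.Queue()
    for chunk in chunks:
        tasks.put(chunk)  # FIFO order: big chunks are handed out first

    done, lock = [], threading.Lock()

    def worker(worker_id):
        while True:
            try:
                chunk = tasks.get_nowait()
            except queue.Empty:
                return
            result = process_chunk(chunk)
            with lock:
                done.append((worker_id, chunk, result))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


if __name__ == "__main__":
    chunks = make_chunks(100, big=20, small=5)
    results = run_dynamic(chunks, 4, lambda c: c[1] - c[0] + 1)
    print(len(chunks), sum(r for _, _, r in results))
```

Handing out small chunks last lets workers that finish early pick up short tasks, which evens out the finishing times.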


Static Scheduling

• Pre-processing step
– We count the number of alignments for each region and generate a histogram
• Estimated cost
– We use an estimation function and our histogram for data partitioning
– k: histogram interval
– TR: cost of accessing/reading the region
– TP: cost of processing an alignment
– N(l): number of alignments at location l
– Each task is responsible for regions having the same estimated cost
• Tasks are scheduled statically; there is no master & worker approach

[Figure: Alignment File-1 and Alignment File-2 divided into regions]
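A minimal sketch of histogram-based static partitioning follows. It assumes the estimated cost of interval k is TR + TP × (number of alignments in k), which is one form consistent with the symbols above and not necessarily the authors' exact estimation function. The function names and the default values of `t_read` and `t_process` are hypothetical.

```python
def estimated_costs(histogram, t_read, t_process):
    """Assumed cost model: fixed read cost plus per-alignment processing cost."""
    return [t_read + t_process * n for n in histogram]


def partition_static(histogram, num_tasks, t_read=1.0, t_process=0.001):
    """Split intervals into contiguous parts of roughly equal estimated cost.

    Returns 0-based inclusive (first_interval, last_interval) pairs.
    Fewer than `num_tasks` parts are returned when a single interval
    dominates the total cost.
    """
    costs = estimated_costs(histogram, t_read, t_process)
    total = sum(costs)
    parts, start, running = [], 0, 0.0
    for k, cost in enumerate(costs):
        running += cost
        # Close the current part once the cumulative cost reaches its share.
        share = total * (len(parts) + 1) / num_tasks
        if len(parts) < num_tasks - 1 and running >= share:
            parts.append((start, k))
            start = k + 1
    if start < len(costs):
        parts.append((start, len(costs) - 1))
    return parts


if __name__ == "__main__":
    histogram = [100, 100, 400, 100, 100, 400, 100, 100]
    print(partition_static(histogram, 3, t_read=0.0, t_process=1.0))
```

Each part is assigned to one process up front, so no master is needed at run time; the balance is only as good as the cost estimate and the interval granularity.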


Combined Scheduling

• Combination of Static and Dynamic Scheduling

• We use small and big chunks as in dynamic scheduling

• The sizes of the chunks are determined according to the histogram

• Master-Worker approach

[Figure: Alignment File-1 and Alignment File-2 divided into big chunks followed by small chunks]
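One way to picture the combined scheme is to cut chunks by estimated cost instead of by a fixed number of locations: intervals are accumulated until a cost budget is reached, using a large budget at first and a small one later. The sketch below is an assumption about how this could be done, not the authors' implementation; `big_cost`, `small_cost`, and `big_fraction` are hypothetical parameters.

```python
def make_cost_chunks(costs, big_cost, small_cost, big_fraction=0.75):
    """Group histogram intervals into chunks of roughly equal estimated cost.

    `costs[k]` is the estimated cost of interval k. Chunks are 0-based
    inclusive (first_interval, last_interval) pairs; chunks built with the
    big budget come first, followed by chunks built with the small budget.
    """
    assert big_cost > small_cost > 0
    total = sum(costs)
    chunks, start, running, assigned = [], 0, 0.0, 0.0
    for k, cost in enumerate(costs):
        running += cost
        # Use the big budget until the assumed share of work is handed out.
        budget = big_cost if assigned < total * big_fraction else small_cost
        if running >= budget:
            chunks.append((start, k))
            assigned += running
            start, running = k + 1, 0.0
    if start < len(costs):
        chunks.append((start, len(costs) - 1))
    return chunks


if __name__ == "__main__":
    print(make_cost_chunks([1] * 12, big_cost=4, small_cost=2, big_fraction=0.5))
```

The resulting chunk list would then be handed out by a master to idle workers, as in dynamic scheduling.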


Parameters of Scheduling Schemes

• Our proposed scheduling schemes have user-defined parameters
– Dynamic Scheduling
  • Length of big and small chunks
– Static Scheduling
  • Histogram interval size
  • Estimation function parameters
– Combined Scheduling
  • All parameters for dynamic and static scheduling
• All parameters can be determined with an offline training phase


Experiments

• Local cluster with nodes
• 2 quad-core 2.53 GHz Xeon(R) processors with 12 GB RAM
• We obtained genomes of 256 samples from the 1000 Genomes Project
• The data is replicated to all local disks unless noted otherwise
• Parallel implementation:
– We implemented VarScan in the C programming language
  • We also modified VarScan so that BAM files can be read directly
– We used the MPI library for parallelization


Experiments: Scalability

Scheduling Scheme   Scalability
Basic               8.4x
Dynamic             10.9x
Static              19.7x
Combined            23.5x

First 192M locations of Chr. 1


Experiments: Data Size Impact

128 cores are allocated


Experiments: I/O Contention Impact

128 cores are allocated

Scheduling Scheme   I/O Contention Impact (sec)
Basic               174
Dynamic             229
Static              251
Combined            220


Comparison with Hadoop

- The first 192M locations of Chr. 2 in 512 samples are analyzed

- Lower (dark) portions of the bars show pre-processing time.


Scheduling With Replication

• Data-intensive processing motivates new schemes
• Replicate each chunk a fixed/variable number of times
• Dynamic scheduling while processing only local chunks
• Interesting new tradeoffs
• Under submission


Other Work

• PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications (IPDPS 2014)

• Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language


PAGE vs. State-of-the-Art

• A middleware system
– Specific to parallel genetic data processing
– Allows parallelization of a variety of genetic algorithms
– Works with different popular genetic data formats
– Allows use of existing programs


Conclusion

• We have developed a methodology for parallel identification of variants in large-scale genome sequencing data.

• Coverage variance and I/O contention are the two main problems

• We proposed 3 scheduling schemes

• Combined scheduling gives the best results

• Our approach has good speedup and outperforms Hadoop
