Cluster-based SNP Calling on Large Scale Genome Sequencing Data
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
CCGrid 2014, Chicago, IL
Feb 17, 2016
What is an SNP?
• Stands for Single-Nucleotide Polymorphism
• A DNA sequence variation that occurs when a single nucleotide differs among members of a biological species.
• Essential for medical research and for developing personalized medicine.
• A single SNP may cause a Mendelian disease.
*Adapted from Wikipedia
Motivation
• The sequencing costs are decreasing
*Adapted from genome.gov/sequencingcosts
• Big data problem
– The 1000 Human Genome Project has already produced 200 TB of data
– Parallel processing is inevitable!
*Adapted from https://www.nlm.nih.gov/about/2015CJ.html
Motivation
Outline
• Motivation
• Parallel SNP Calling
• Proposed Scheduling Schemes
• Experiments
• Conclusion
General Idea of SNP Calling Algorithms
[Figure: two alignment files, each with four reads aligned against the reference sequence A G C G T A C C at positions 1–8.]
Two main observations:
• In order to detect an SNP at a certain location, we have to check the alignments in ALL genomes at that location.
• The existence of an SNP at one location is independent of other locations.
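The second observation means each location can be checked on its own. A minimal sketch of such a per-location check (hypothetical names and threshold logic, not the actual VarScan implementation):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-location variant check: given the bases that all
 * reads (across all samples) aligned to one reference position, call
 * an SNP when the fraction of non-reference bases reaches a threshold.
 * This is a sketch, not the actual VarScan logic. */
static int is_snp(char ref, const char *bases, size_t n, double min_freq)
{
    size_t alt = 0;
    for (size_t i = 0; i < n; i++)
        if (bases[i] != ref)
            alt++;
    return n > 0 && (double)alt / (double)n >= min_freq;
}
```

Because this check touches only one position, any number of locations can be processed in parallel without coordination, which is what makes location-based division attractive.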
Parallel SNP Calling
How to distribute data among nodes?
[Figure: three ways of dividing the genome files among four processors: location-based (each processor handles a range of locations across all files), sample-based (each processor handles whole files), and checkerboard (a mix of both). Divisions that split a location's samples across processors require communication among processes.]
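Location-based division can be expressed as a simple region computation per process; a sketch (hypothetical helper, assuming contiguous ranges of nearly equal length):

```c
#include <assert.h>

/* Location-based division (sketch): split [0, genome_len) into nprocs
 * contiguous regions, one per process. Remainder locations go to the
 * first few processes, so region sizes differ by at most one. */
static void region_for_rank(long genome_len, int nprocs, int rank,
                            long *begin, long *end)
{
    long base = genome_len / nprocs;
    long extra = genome_len % nprocs;
    *begin = rank * base + (rank < extra ? rank : extra);
    *end = *begin + base + (rank < extra ? 1 : 0);
}
```

Because every process sees all samples for its locations, no inter-process communication is needed during SNP detection, unlike sample-based or checkerboard division.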
Challenges
• Load imbalance due to the nature of genomic data
– It is not just an array of A, G, C and T characters
• I/O contention
• High overhead of random access to a particular region
[Figure: coverage variance — the number of aligned reads varies from location to location.]
Histogram Showing Coverage Variance
• Chromosome: 1
• Locations: 1–200M
• Number of samples: 256
• Interval size: 1M
Outline
• Motivation
• Parallel SNP Calling
• Proposed Scheduling Schemes
• Experiments
• Conclusion
Proposed Scheduling Schemes
• Dynamic Scheduling
• Static Scheduling
• Combined Scheduling
Each scheduling scheme uses location-based data division: the genome is divided into regions, and each task is responsible for one region.
Dynamic Scheduling
• Master & worker approach
• Tasks are assigned dynamically
• Two types of data chunks are used
– Big chunk: covers B locations
– Small chunk: covers S locations
– B > S
• Big chunks are assigned first, then small chunks are assigned
[Figure: big and small chunks laid over Alignment File-1 and Alignment File-2.]
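The master's task list for this scheme can be sketched as follows (hypothetical helper; the slide does not specify how the genome is split between the big-chunk and small-chunk parts, so `big_part` is an assumed parameter):

```c
#include <assert.h>
#include <stddef.h>

/* Dynamic-scheduling chunk list (sketch): the first big_part locations
 * are cut into big chunks of B locations, the remainder into small
 * chunks of S locations (B > S). The master hands chunks out in list
 * order, so big chunks are assigned first. Returns the number of
 * chunks written into starts[]/lens[]. */
static size_t make_chunks(long genome_len, long big_part, long B, long S,
                          long *starts, long *lens)
{
    size_t n = 0;
    long pos = 0;
    for (; pos < big_part; pos += B, n++) {
        starts[n] = pos;
        lens[n] = (big_part - pos < B) ? big_part - pos : B;
    }
    for (; pos < genome_len; pos += S, n++) {
        starts[n] = pos;
        lens[n] = (genome_len - pos < S) ? genome_len - pos : S;
    }
    return n;
}
```

Assigning big chunks first keeps scheduling overhead low early on, while the small chunks at the end smooth out load imbalance among workers.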
Static Scheduling
• Pre-processing step
– We count the number of alignments for each region and generate a histogram
• Estimated cost
– We use an estimation function and our histogram for data partitioning
– k: histogram interval
– TR: cost of accessing/reading the region
– TP: cost of processing an alignment
– N(l): number of alignments at location l
– Each task is responsible for regions having the same estimated cost
• Tasks are scheduled statically; no master-worker approach
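Using the parameters above, a plausible form of the cost estimate for a region spanning histogram intervals [lo, hi) is TR plus TP times the number of alignments it contains. A sketch under that assumption (the exact estimation function in the paper may differ):

```c
#include <assert.h>
#include <stddef.h>

/* Static-scheduling cost estimate (sketch): TR is the cost of
 * accessing/reading the region, TP the cost of processing one
 * alignment, and hist[i] the alignment count in histogram interval i.
 * Estimated cost = TR + TP * (total alignments in [lo, hi)). */
static double region_cost(double TR, double TP, const long *hist,
                          size_t lo, size_t hi)
{
    long alignments = 0;
    for (size_t i = lo; i < hi; i++)
        alignments += hist[i];
    return TR + TP * (double)alignments;
}
```

The partitioner can then grow each region until its estimated cost reaches the target (total cost divided by the number of tasks), so all tasks get regions of roughly equal estimated cost.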
Combined Scheduling
• Combination of static and dynamic scheduling
• We use small and big chunks as in dynamic scheduling
• The sizes of the chunks are determined according to the histogram
• Master-worker approach
[Figure: histogram-sized big and small chunks over Alignment File-1 and Alignment File-2.]
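One way to size chunks from the histogram, sketched below, is to grow each chunk interval by interval until it holds roughly a target number of alignments, so heavily covered regions yield shorter chunks (hypothetical helper; the slide does not give the exact sizing rule):

```c
#include <assert.h>
#include <stddef.h>

/* Combined-scheduling chunk sizing (sketch): instead of fixed chunk
 * lengths B and S, extend the chunk starting at histogram interval lo
 * until it accumulates at least `target` alignments. Returns the index
 * one past the last interval included in the chunk. */
static size_t next_chunk_end(const long *hist, size_t n, size_t lo,
                             long target)
{
    long sum = 0;
    size_t i = lo;
    while (i < n) {
        sum += hist[i++];
        if (sum >= target)
            break;
    }
    return i;
}
```

Sizing chunks by alignment count rather than by location count addresses the coverage-variance problem directly: each dynamically assigned chunk carries roughly the same amount of work.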
Parameters of Scheduling Schemes
• Our proposed scheduling schemes have user-defined parameters
– Dynamic scheduling
• Lengths of big and small chunks
– Static scheduling
• Histogram interval size
• Estimation function parameters
– Combined scheduling
• All parameters for dynamic and static scheduling
• All parameters can be determined with an offline training phase
Outline
• Motivation
• Parallel SNP Calling
• Proposed Scheduling Schemes
• Experiments
• Conclusion
Experiments
• Local cluster; each node has 2 quad-core 2.53 GHz Xeon(R) processors with 12 GB RAM
• We obtained genomes of 256 samples from the 1000 Human Genome Project
• The data is replicated to all local disks unless noted otherwise
• Parallel implementation:
– We implemented VarScan in the C programming language
• We also modified VarScan so that BAM files can be read directly
– Used the MPI library for parallelization
Experiments: Scalability

Scheduling Scheme   Scalability
Basic               8.4x
Dynamic             10.9x
Static              19.7x
Combined            23.5x

First 192M locations of Chr. 1
Experiments: Data Size Impact
128 cores are allocated
Experiments: I/O Contention Impact

128 cores are allocated

Scheduling Scheme   I/O Contention Impact (sec)
Basic               174
Dynamic             229
Static              251
Combined            220
Comparison with Hadoop
- The first 192M locations of Chr. 2 in 512 samples are analyzed
- Lower (dark) portions of the bars show pre-processing time.
IPDPS'14
Scheduling With Replication
• Data-intensive processing motivates new schemes
• Replicate each chunk a fixed/variable number of times
• Dynamic scheduling while processing only local chunks
• Interesting new tradeoffs
• Under submission
Other Work
• PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications (IPDPS 2014)
• Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language
PAGE vs. State-of-the-Art
• A middleware system
– Specific to parallel genetic data processing
– Allows parallelization of a variety of genetic algorithms
– Works with different popular genetic data formats
– Allows use of existing programs
Conclusion
• We have developed a methodology for parallel identification of variants in large-scale genome sequencing data
• Coverage variance and I/O contention are the two main problems
• We proposed 3 scheduling schemes
• Combined scheduling gives the best results
• Our approach has good speedup and outperforms Hadoop