Cluster-based SNP Calling on Large Scale Genome Sequencing Data Mucahid Kutlu Gagan Agrawal Department of Computer Science and Engineering The Ohio State University CCGrid 2014, Chicago, IL

Dec 30, 2015


Cluster-based SNP Calling on Large Scale Genome Sequencing Data

Mucahid Kutlu, Gagan Agrawal

Department of Computer Science and Engineering, The Ohio State University

CCGrid 2014, Chicago, IL


What is an SNP?

• Stands for Single-Nucleotide Polymorphism

• A DNA sequence variation that occurs when a single nucleotide differs between members of a biological species.

• Essential for medical research and for developing personalized medicine.

• A single SNP may cause a Mendelian disease.

*Adapted from Wikipedia


Motivation

• The sequencing costs are decreasing


*Adapted from genome.gov/sequencingcosts


Motivation

• Big data problem
– The 1000 Genomes Project has already produced 200 TB of data
– Parallel processing is inevitable!

*Adapted from https://www.nlm.nih.gov/about/2015CJ.html


Outline

• Motivation
• Parallel SNP Calling
• Proposed Scheduling Schemes
• Experiments
• Conclusion


General Idea of SNP Calling Algorithms

Reference (locations 1-8): A G C G T A C C

Alignment File-1
Read-1: A G C G
Read-2: G C G G
Read-3: G C G T A
Read-4: C G T T C C

Alignment File-2
Read-1: A G A G
Read-2: A G A G T
Read-3: G A G T
Read-4: G T T C C

[Figure: the reads of both alignment files laid out against the reference, with locations marked ✖ ✓ ✖]

Two main observations:

• To detect an SNP at a certain location, we have to check the alignments in ALL genomes at that location.

• The existence of an SNP at one location is independent of the others.
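The two observations can be illustrated with a minimal Python sketch. This is not VarScan's algorithm: the read layout (`(start, bases)` tuples), the function names, the `min_fraction` threshold, and the toy reads are all assumptions made for illustration, and the toy reads are not the ones in the slide's figure.

```python
from collections import Counter

def pileup(reads, position):
    """Return the bases that one sample's reads place at a 1-based location."""
    bases = []
    for start, seq in reads:          # each read: (1-based start, base string)
        offset = position - start
        if 0 <= offset < len(seq):
            bases.append(seq[offset])
    return bases

def call_snp(reference, samples, position, min_fraction=0.5):
    """Return the best-supported non-reference base at `position`, or None.

    Observation 1: the decision pools the alignments of ALL samples.
    The `min_fraction` rule is an assumed, simplified calling criterion.
    """
    ref_base = reference[position - 1]
    counts = Counter()
    for reads in samples:
        counts.update(pileup(reads, position))
    total = sum(counts.values())
    alternatives = [(n, base) for base, n in counts.items() if base != ref_base]
    if total == 0 or not alternatives:
        return None
    n, base = max(alternatives)
    return base if n / total >= min_fraction else None

def call_all(reference, samples):
    """Observation 2: every location is decided independently, so this loop
    could be split by location across processes without coordination."""
    calls = {}
    for position in range(1, len(reference) + 1):
        base = call_snp(reference, samples, position)
        if base is not None:
            calls[position] = base
    return calls

# Hypothetical toy data: two samples aligned to an 8-base reference.
REFERENCE = "AGCGTACC"
SAMPLE_1 = [(1, "AGCG"), (2, "GCGT")]
SAMPLE_2 = [(1, "AGAG"), (1, "AGAGT"), (2, "GAGT")]
print(call_all(REFERENCE, [SAMPLE_1, SAMPLE_2]))
```

With these toy reads, the only call is an A at location 3, and it appears only when both samples are pooled; sample 1 alone shows no variant there.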


Parallel SNP Calling

How to distribute data among nodes?

[Figure: genome files partitioned among Processors 1-4 under three schemes: Location-based, Sample-based, and Checkerboard. Figure note: requires communication among processes.]


Challenges

• Load imbalance due to the nature of genomic data
– It is not just an array of A, G, C and T characters
• I/O contention
• High overhead of random access to a particular region

[Figure: Coverage Variance]


Histogram Showing Coverage Variance

• Chromosome: 1
• Locations: 1-200M
• Number of samples: 256
• Interval size: 1M



Proposed Scheduling Schemes

• Dynamic Scheduling
• Static Scheduling
• Combined Scheduling

…Each scheduling scheme uses location-based data division. That is, the genome is divided into regions and each task is responsible for a region.


Dynamic Scheduling

• Master & Worker approach
• Tasks are assigned dynamically
• Two types of data chunks are used
– Big chunk: covers B locations
– Small chunk: covers S locations
– B > S
• Big chunks are assigned first, then small chunks are assigned

[Figure: Alignment File-1 and Alignment File-2 divided into chunks of B locations]
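The big-then-small assignment can be sketched as a small simulation in Python. This is a sketch under stated assumptions, not the authors' MPI implementation: the share of the range given to big chunks (`big_fraction`), the function names, and the use of a priority queue to model "the next idle worker asks the master for a task" are all hypothetical.

```python
import heapq

def make_chunks(start, end, big, small, big_fraction=0.75):
    """Split [start, end) into big chunks followed by small chunks.

    `big_fraction`, the share of the range covered by big chunks, is an
    assumed parameter; the slides only state that B > S.
    """
    assert big > small > 0
    chunks = []
    big_end = start + int((end - start) * big_fraction)
    pos = start
    while pos + big <= big_end:       # big chunks first
        chunks.append((pos, pos + big))
        pos += big
    while pos < end:                  # small chunks fill the remainder
        chunks.append((pos, min(pos + small, end)))
        pos += small
    return chunks

def simulate_master_worker(chunks, n_workers, cost):
    """Hand each chunk, in queue order, to the worker that becomes idle first.

    Returns (makespan, chunks assigned to each worker).
    """
    idle = [(0.0, w) for w in range(n_workers)]   # (time worker is free, id)
    heapq.heapify(idle)
    assigned = [[] for _ in range(n_workers)]
    for chunk in chunks:
        free_at, worker = heapq.heappop(idle)
        assigned[worker].append(chunk)
        heapq.heappush(idle, (free_at + cost(chunk), worker))
    makespan = max(t for t, _ in idle)
    return makespan, assigned

chunks = make_chunks(0, 100, big=20, small=5)
makespan, assigned = simulate_master_worker(
    chunks, n_workers=2, cost=lambda c: c[1] - c[0])
print(len(chunks), makespan)
```

The small chunks at the end of the queue let workers that finish early pick up short tasks, which evens out the finishing times when chunk costs differ.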


Static Scheduling

• Pre-processing step
– We count the number of alignments for each region and generate a histogram
• Estimated cost
– We use an estimation function and our histogram for data partitioning.
– k: histogram interval
– TR: cost of accessing/reading the region
– TP: cost of processing an alignment
– N(l): number of alignments at location l
– Each task is responsible for regions having the same estimated cost.
• Tasks are scheduled statically; there is no master-worker approach.

[Figure: Alignment File-1 and Alignment File-2 divided into regions]


Combined Scheduling

• Combination of Static and Dynamic Scheduling

• We use small and big chunks as in dynamic scheduling

• The sizes of the chunks are determined according to the histogram

• Master-Worker approach

[Figure: Alignment File-1 and Alignment File-2 divided into big chunks followed by small chunks]
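One way to read "chunk sizes determined by the histogram" is sketched below in Python. It is an interpretation under assumptions, not the authors' method: chunk cost is taken to be the alignment count, and the cost budgets `big_cost` and `small_cost` and the `big_fraction` split are hypothetical parameters. The resulting chunks would then be handed out by a master to idle workers, as in dynamic scheduling.

```python
def histogram_chunks(hist, k, big_cost, small_cost, big_fraction=0.75):
    """Turn a coverage histogram into chunks of similar estimated cost.

    hist[i] is the number of alignments in interval i, each k locations wide.
    Chunks within the first `big_fraction` of the total cost aim for
    `big_cost` alignments; later chunks aim for `small_cost`.
    Returns (start_location, end_location_exclusive) pairs.
    """
    assert big_cost > small_cost > 0
    total = sum(hist)
    chunks, start, acc, seen = [], 0, 0, 0
    for i, n in enumerate(hist):
        acc += n                      # cost accumulated in the open chunk
        seen += n                     # cost seen since the beginning
        budget = big_cost if seen <= total * big_fraction else small_cost
        if acc >= budget:
            chunks.append((start * k, (i + 1) * k))
            start, acc = i + 1, 0
    if start < len(hist):             # leftover intervals form a final chunk
        chunks.append((start * k, len(hist) * k))
    return chunks

hist = [5, 5, 5, 5, 20, 20, 10, 10]
print(histogram_chunks(hist, k=10, big_cost=20, small_cost=10, big_fraction=0.5))
```

In the toy histogram, the low-coverage start of the range becomes one long chunk of 40 locations, while each high-coverage interval becomes its own chunk of 10 locations.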


Parameters of Scheduling Schemes

• Our proposed scheduling schemes have user-defined parameters
– Dynamic Scheduling: length of big and small chunks
– Static Scheduling: histogram interval size, estimation function parameters
– Combined Scheduling: all parameters for dynamic and static scheduling
• All parameters can be determined with an offline training phase



Experiments

• Local cluster with nodes
• 2 quad-core 2.53 GHz Xeon(R) processors with 12 GB RAM
• We obtained genomes of 256 samples from the 1000 Genomes Project
• The data is replicated to all local disks unless noted otherwise
• Parallel implementation:
– We implemented VarScan in the C programming language
– We also modified VarScan so that BAM files can be read directly
– We used the MPI library for parallelization


Experiments: Scalability

Scheduling Scheme   Scalability
Basic               8.4x
Dynamic             10.9x
Static              19.7x
Combined            23.5x

First 192M locations of Chr. 1


Experiments: Data Size Impact

128 cores are allocated


Experiments: I/O Contention Impact

128 cores are allocated

Scheduling Scheme   I/O Contention Impact (sec)
Basic               174
Dynamic             229
Static              251
Combined            220


Comparison with Hadoop

- The first 192M locations of Chr. 2 in 512 samples are analyzed

- Lower (dark) portions of the bars show pre-processing time.


IPDPS'14 22

Scheduling With Replication

• Data-intensive processing motivates new schemes
• Replicate each chunk a fixed/variable number of times
• Dynamic scheduling while processing only local chunks
• Interesting new tradeoffs
• Under submission


Other Work

• PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications (IPDPS 2014)

• Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language
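The idea of treating mappers and reducers as executable programs can be sketched in Python with `subprocess`. This is not PAGE's interface, which is not described in this transcript: the `run_program` and `map_reduce` helpers, the stdin/stdout convention, and the stand-in mapper and reducer commands are all hypothetical.

```python
import subprocess
import sys

def run_program(argv, stdin_text):
    """Run an external program, feed it text on stdin, return its stdout."""
    result = subprocess.run(argv, input=stdin_text, capture_output=True,
                            text=True, check=True)
    return result.stdout

def map_reduce(partitions, mapper_argv, reducer_argv):
    """Run the mapper once per partition, then the reducer on all outputs.

    Both stages are arbitrary executables, so they can be existing tools
    written in any language. The mapper runs are independent and are run
    sequentially here only to keep the sketch short.
    """
    mapped = [run_program(mapper_argv, part) for part in partitions]
    return run_program(reducer_argv, "".join(mapped))

# Stand-in executables (assumptions): the mapper counts non-empty input
# lines, and the reducer sums the per-partition counts.
MAPPER = [sys.executable, "-c",
          "import sys; print(sum(1 for line in sys.stdin if line.strip()))"]
REDUCER = [sys.executable, "-c",
           "import sys; print(sum(int(line) for line in sys.stdin))"]

print(map_reduce(["read1\nread2\n", "read3\n"], MAPPER, REDUCER).strip())
```

Because each stage only needs to read stdin and write stdout, an existing command-line analysis tool could be substituted for either stand-in without changing `map_reduce`.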


PAGE vs. State-of-the-Art

• A middleware system
– Specific to parallel genetic data processing
– Allows parallelization of a variety of genetic algorithms
– Able to work with different popular genetic data formats
– Allows use of existing programs


Conclusion

• We have developed a methodology for parallel identification of variants in large-scale genome sequencing data.

• Coverage variance and I/O contention are the two main problems.

• We proposed 3 scheduling schemes.

• Combined scheduling gives the best results.

• Our approach has good speedup and outperforms Hadoop.