PARALLEL PROCESSING OF LARGE SCALE GENOMIC DATA Candidacy Examination 08/26/2014 Mucahid Kutlu.


Motivation

• Sequencing costs are decreasing
• Big data problem

*Adapted from genome.gov/sequencingcosts
*Adapted from https://www.nlm.nih.gov/about/2015CJ.html

Parallel processing is inevitable!

Typical Analysis on Genomic Data

• Single Nucleotide Polymorphism (SNP) calling


Reference (positions 1-8): A G C G T A C C

Alignment File-1
Read-1: A G C G
Read-2: G C G G
Read-3: G C G T A
Read-4: C G T T C C

Alignment File-2
Read-1: A G A G
Read-2: A G A G T
Read-3: G A G T
Read-4: G T T C C

*Adapted from Wikipedia

A single SNP may cause Mendelian disease!
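To make the idea of SNP calling concrete, here is a minimal sketch of a naive pileup-based detector. It is not VarScan's or GATK's algorithm. The `aligned_read` type, the `call_snps` function, the depth and fraction thresholds, and the read start offsets are all assumptions introduced for illustration; the slide does not state where each read aligns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One aligned read: 0-based start offset on the reference plus its bases.
 * The offsets are an assumption; the slide does not state them. */
typedef struct {
    size_t start;
    const char *bases;
} aligned_read;

/* Map a nucleotide to a counter index, or -1 for anything else. */
static int base_index(char base)
{
    switch (base) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Naive SNP detection (hypothetical helper). A position is reported when at
 * least min_depth reads cover it and some non-reference base accounts for at
 * least min_fraction of those reads. Stores 1-based positions in
 * snp_positions and returns how many were stored. */
size_t call_snps(const char *reference, const aligned_read *reads,
                 size_t num_reads, size_t min_depth, double min_fraction,
                 size_t *snp_positions, size_t max_snps)
{
    assert(reference != NULL && reads != NULL && snp_positions != NULL);
    size_t ref_len = strlen(reference);
    size_t found = 0;

    for (size_t pos = 0; pos < ref_len && found < max_snps; pos++) {
        size_t counts[4] = {0, 0, 0, 0};
        size_t depth = 0;

        /* Pile up the bases of every read that covers this position. */
        for (size_t r = 0; r < num_reads; r++) {
            size_t len = strlen(reads[r].bases);
            if (pos < reads[r].start || pos >= reads[r].start + len)
                continue;
            int idx = base_index(reads[r].bases[pos - reads[r].start]);
            if (idx >= 0) {
                counts[idx]++;
                depth++;
            }
        }
        if (depth == 0 || depth < min_depth)
            continue;

        /* Report the position if any non-reference base is frequent enough. */
        int ref_idx = base_index(reference[pos]);
        for (int alt = 0; alt < 4; alt++) {
            if (alt != ref_idx &&
                (double)counts[alt] / (double)depth >= min_fraction) {
                snp_positions[found++] = pos + 1;
                break;
            }
        }
    }
    return found;
}
```

With the reads of Alignment File-2 placed at assumed offsets 0, 0, 1 and 3, a minimum depth of 2 and a minimum fraction of 0.5, the sketch reports position 3, where every covering read shows A against the reference C.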


Existing Solutions for Implementation

• Serial tools
  – SamTools, VCFTools, BedTools: file merging, sorting, etc.
  – VarScan: SNP calling
• Parallel implementations
  – Turboblast: searching local alignments
  – SEAL: read mapping and duplicate removal
  – Biodoop: statistical analysis
• Middleware systems
  – Hadoop
    • Not designed for the specific needs of genetic data
    • Limited programmability
  – Genome Analysis Tool Kit (GATK)
    • Designed for genetic data processing
    • Provides special data traversal patterns
    • Limited parallelization for some of its tools


Main Goal of My Thesis


• We want to develop middleware systems that
  – Are specific to parallel genetic data processing
  – Allow parallelization of a variety of genetic algorithms
  – Work with different popular genetic data formats
  – Ease programming, since most developers are biologists, not computer scientists

Papers During My PhD Study

• Mucahid Kutlu, Gagan Agrawal, "Cluster-based SNP Calling on Large-Scale Genome Sequencing Data," the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2014) (accepted, 19.1% acceptance rate)
• Mucahid Kutlu, Gagan Agrawal, "PAGE: A Framework for Easy PArallelization of GEnomic Applications," the 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2014) (accepted, 21.1% acceptance rate)
• Mucahid Kutlu, Gagan Agrawal and Oguz Kurt, "Fault tolerant parallel data-intensive algorithms," High Performance Computing (HiPC), 2012 (25.1% acceptance rate)
• Mucahid Kutlu, Gagan Agrawal and Oguz Kurt, "Fault tolerant parallel data-intensive algorithms," High Performance and Distributed Computing (HPDC), 2012 (poster paper)
• "RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications" (to be submitted)


Outline

• Motivation & Background
• Current Work
  – PAGE: A Framework for Easy PArallelization of GEnomic Applications
  – RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications
• Future Work


Our Work

• PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications

• Mappers and reducers are executable programs
  – Allows us to exploit existing applications
  – No restriction on programming language


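Intra-dependent Processing

• Each file is processed independently

[Figure: each input file (File-1 ... File-m) is divided into Region-1 ... Region-n; one map task per region produces intermediate outputs (O-11 ... O-1n for File-1, O-m1 ... O-mn for File-m), and a reduce task per file combines them into Output-1 ... Output-m.]

Inter-dependent Processing

• Each map task processes a particular region of ALL files

[Figure: the input files are divided into Region-1 ... Region-k ... Region-n; the map task for each region produces O1 ... Ok ... On, and a single reduce task combines them into the final output.]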
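The difference between the two execution models can be sketched as the list of map tasks each one generates. This is only an illustration: the `map_task` type, the `ALL_FILES` marker and both planning functions are hypothetical names, not part of PAGE.

```c
#include <assert.h>
#include <stddef.h>

#define ALL_FILES (-1)   /* hypothetical marker: the task reads every file */

typedef struct {
    int file;    /* index of the input file, or ALL_FILES */
    int region;  /* index of the genomic region */
} map_task;

/* Intra-dependent model: one map task per (file, region) pair.
 * Returns the number of tasks written (at most capacity). */
size_t plan_intra_dependent(int num_files, int num_regions,
                            map_task *tasks, size_t capacity)
{
    assert(tasks != NULL && num_files >= 0 && num_regions >= 0);
    size_t count = 0;
    for (int f = 0; f < num_files; f++) {
        for (int r = 0; r < num_regions; r++) {
            if (count == capacity)
                return count;
            tasks[count].file = f;
            tasks[count].region = r;
            count++;
        }
    }
    return count;
}

/* Inter-dependent model: one map task per region, covering all files.
 * Returns the number of tasks written (at most capacity). */
size_t plan_inter_dependent(int num_regions, map_task *tasks, size_t capacity)
{
    assert(tasks != NULL && num_regions >= 0);
    size_t count = 0;
    for (int r = 0; r < num_regions && count < capacity; r++) {
        tasks[count].file = ALL_FILES;
        tasks[count].region = r;
        count++;
    }
    return count;
}
```

For 3 files and 4 regions, the intra-dependent plan holds 12 map tasks and the inter-dependent plan holds 4.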


Data Partitioning

• Data is NOT packaged into equal-size data blocks as in Hadoop
  – Each application has a different way of reading the data
  – Equal-size data block packaging ignores nucleotide base location information
• The genome is divided into regions, and each map task is assigned a region
  – Takes location information into account
  – The map task is responsible for accessing its particular region of the input files
    • This is a common feature of many genomic tools (GATK, SamTools)


Genome Partition

• PAGE provides two data partitioning methods
  – By-locus partitioning: chromosomes are divided into regions
  – By-chromosome partitioning: chromosomes preserve their unity

[Figure: two rows of chromosomes Chr-1 ... Chr-6, one for each partitioning method]
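The two partitioning methods can be sketched as follows. The types, function names and the fixed `region_len` parameter are assumptions for illustration; the slides do not say how PAGE chooses region boundaries. Regions are formatted in the 1-based, inclusive `chr:start-end` form that samtools accepts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    const char *name;
    long length;           /* chromosome length in bases */
} chromosome;

typedef struct {
    const char *chrom;
    long start;            /* 1-based, inclusive */
    long end;              /* 1-based, inclusive */
} region;

/* By-chromosome partitioning: every chromosome stays whole. */
size_t partition_by_chromosome(const chromosome *chroms, size_t num_chroms,
                               region *out, size_t capacity)
{
    assert(chroms != NULL && out != NULL);
    size_t count = 0;
    for (size_t c = 0; c < num_chroms && count < capacity; c++) {
        out[count].chrom = chroms[c].name;
        out[count].start = 1;
        out[count].end = chroms[c].length;
        count++;
    }
    return count;
}

/* By-locus partitioning: chromosomes are cut into regions of at most
 * region_len bases (a fixed length is an assumption of this sketch). */
size_t partition_by_locus(const chromosome *chroms, size_t num_chroms,
                          long region_len, region *out, size_t capacity)
{
    assert(chroms != NULL && out != NULL && region_len > 0);
    size_t count = 0;
    for (size_t c = 0; c < num_chroms; c++) {
        for (long start = 1; start <= chroms[c].length; start += region_len) {
            if (count == capacity)
                return count;
            long end = start + region_len - 1;
            if (end > chroms[c].length)
                end = chroms[c].length;
            out[count].chrom = chroms[c].name;
            out[count].start = start;
            out[count].end = end;
            count++;
        }
    }
    return count;
}

/* Write "chrom:start-end" into buf; returns snprintf's result. */
int format_region(const region *r, char *buf, size_t buf_size)
{
    assert(r != NULL && buf != NULL);
    return snprintf(buf, buf_size, "%s:%ld-%ld", r->chrom, r->start, r->end);
}
```

Each resulting region string could then be handed to a map task as the region it is responsible for.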

Challenges

• Load imbalance due to the nature of genomic data
  – It is not just an array of A, G, C and T characters
• High overhead of tasks
• I/O contention

[Figure: coverage variance]

Task Scheduling

PAGE provides two types of scheduling schemes.

Static
• Each processor is responsible for regions of equal length.
• All map tasks must finish before reduce tasks start executing.

Dynamic
• Map and reduce tasks are assigned by a master process.
• Reduce tasks can start once enough intermediate results are available.
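A small simulation can show why dynamic scheduling helps when regions of equal length have unequal cost (for example, because of coverage variance). This sketch models only map tasks and is not PAGE's scheduler; `static_makespan`, `dynamic_makespan` and `MAX_WORKERS` are hypothetical names, and the dynamic variant assumes a master that hands the next task to whichever worker becomes free first.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORKERS 64

/* Static scheme: each worker gets a contiguous, equal-count block of
 * regions. Returns the finish time of the slowest worker. */
double static_makespan(const double *cost, size_t num_tasks, size_t num_workers)
{
    assert(cost != NULL && num_workers > 0 && num_workers <= MAX_WORKERS);
    double worst = 0.0;
    for (size_t w = 0; w < num_workers; w++) {
        size_t first = w * num_tasks / num_workers;
        size_t last = (w + 1) * num_tasks / num_workers;
        double busy = 0.0;
        for (size_t t = first; t < last; t++)
            busy += cost[t];
        if (busy > worst)
            worst = busy;
    }
    return worst;
}

/* Dynamic scheme: tasks are handed out in order to the worker that
 * becomes free first. Returns the finish time of the slowest worker. */
double dynamic_makespan(const double *cost, size_t num_tasks, size_t num_workers)
{
    assert(cost != NULL && num_workers > 0 && num_workers <= MAX_WORKERS);
    double free_at[MAX_WORKERS] = {0};
    for (size_t t = 0; t < num_tasks; t++) {
        size_t next = 0;
        for (size_t w = 1; w < num_workers; w++)
            if (free_at[w] < free_at[next])
                next = w;
        free_at[next] += cost[t];
    }
    double worst = 0.0;
    for (size_t w = 0; w < num_workers; w++)
        if (free_at[w] > worst)
            worst = free_at[w];
    return worst;
}
```

With task costs 5, 5, 1, 1 on two workers, the static split finishes at time 10 while the dynamic assignment finishes at time 6.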

Sample Application Development with PAGE

• Serial execution command of the VarScan software
  – samtools mpileup -b file_list -f reference | java -jar VarScan.jar mpileup2snp
• To parallelize VarScan with PAGE, the user needs to define:
  – Genome partition: By-locus
  – Scheduling scheme: Dynamic (or Static)
  – Execution model: Inter-dependent
  – Map command: samtools mpileup -b file_list -r regionloc -f reference | java -jar VarScan.jar mpileup2snp >outputloc
  – Reduction: the cat bash shell command
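One plausible way a middleware could turn the map command above into a per-region command is plain text substitution of the `regionloc` and `outputloc` placeholders. Whether PAGE does it this way is an assumption; `substitute` and `build_map_command` are hypothetical helpers, and the region and output path in the usage note are made-up values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy text into out, replacing every occurrence of key with value.
 * Returns 0 on success or -1 if the result does not fit in out_size bytes. */
int substitute(const char *text, const char *key, const char *value,
               char *out, size_t out_size)
{
    assert(text != NULL && key != NULL && value != NULL && out != NULL);
    assert(key[0] != '\0');
    if (out_size == 0)
        return -1;

    size_t key_len = strlen(key);
    size_t value_len = strlen(value);
    size_t used = 0;

    while (*text != '\0') {
        if (strncmp(text, key, key_len) == 0) {
            /* Keep one byte free for the terminating NUL. */
            if (used + value_len >= out_size)
                return -1;
            memcpy(out + used, value, value_len);
            used += value_len;
            text += key_len;
        } else {
            if (used + 1 >= out_size)
                return -1;
            out[used++] = *text++;
        }
    }
    out[used] = '\0';
    return 0;
}

/* Fill in the region and output placeholders of a map command template. */
int build_map_command(const char *template_cmd, const char *region,
                      const char *output_path, char *out, size_t out_size)
{
    char with_region[1024];
    if (substitute(template_cmd, "regionloc", region,
                   with_region, sizeof with_region) != 0)
        return -1;
    return substitute(with_region, "outputloc", output_path, out, out_size);
}
```

For example, with region `chr1:1-1000` and output path `out/part-0.vcf`, the template becomes `samtools mpileup -b file_list -r chr1:1-1000 -f reference | java -jar VarScan.jar mpileup2snp >out/part-0.vcf`. The reduction step would then concatenate the per-region outputs.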


Experiments

• Experimental setup
  – In our cluster, each node has
    • 12 GB memory
    • 8 cores (2.53 GHz)
  – We obtained the data from the 1000 Genomes Project
  – We evaluated PAGE with 4 applications
    • VarScan: SNP detection
    • Realigner Target Creator: detects insertions/deletions in alignment files
    • Indel Realigner: applies local realignment to improve the quality of alignment files
    • Unified Genotyper: SNP detection


Comparison with GATK

• Unified Genotyper tool of GATK

[Figure: two panels, "Scalability" and "Data Size Impact"; plot annotations 10.9x and 12.8x; data size: 34 GB, # of cores: 128]

Comparison with Hadoop Streaming

• VarScan application

[Figure: two panels, "Scalability" and "Data Size Impact"; plot annotations 6.9x and 12.7x; data size: 52 GB, # of cores: 128]



RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications

• In this study, we improve our middleware PAGE in several respects
• Main goal: less I/O contention
• Main approach:
  – Utilizing distributed disks
  – Intelligent replication technique
  – Scheduling scheme that minimizes network traffic


Execution Model


Allowing Remote Processing or Not?


Advantages
• Better workload balance

Disadvantages
• As the number of nodes increases, network traffic increases
• Data transfer has a greater effect as computation becomes more data intensive
• Transferring data can be problematic for large-scale data

Proposed Scheduling Schemes

• General idea: replicate data and prohibit remote processing
  – Replication increases the number of local tasks available to each node, which helps reduce workload imbalance
• Data chunks can have varying sizes and varying replication factors
• Master & worker approach
• We propose 3 scheduling schemes
  – Factoring
  – Help the busiest node (HBN)
  – Effective memory management (EMM)

Proposed Replication Method

• Replicating all chunks onto all nodes is not feasible.

• Depending on the analysis we want to perform, some genomic regions can be more important than others for the target analysis.

• General Idea: Replicate important regions more than others.
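The general idea can be illustrated with a simple rule that scales each chunk's replication factor with its importance. This is a sketch only, not RE-PAGE's replication method: the importance scores, the linear scaling and the `assign_replication` function are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Give every chunk between min_copies and max_copies replicas, scaled
 * linearly with its importance relative to the most important chunk.
 * Importance scores are assumed to be non-negative. */
void assign_replication(const double *importance, size_t num_chunks,
                        int min_copies, int max_copies, int *copies)
{
    assert(importance != NULL && copies != NULL);
    assert(min_copies >= 1 && max_copies >= min_copies);

    /* Find the largest importance score to normalise against. */
    double top = 0.0;
    for (size_t i = 0; i < num_chunks; i++)
        if (importance[i] > top)
            top = importance[i];

    for (size_t i = 0; i < num_chunks; i++) {
        double share = (top > 0.0) ? importance[i] / top : 0.0;
        if (share < 0.0)
            share = 0.0;
        /* Round to the nearest whole number of extra copies. */
        copies[i] = min_copies + (int)(share * (max_copies - min_copies) + 0.5);
    }
}
```

With importance scores 1.0, 0.5 and 0.0 and a range of 1 to 5 copies, the chunks receive 5, 3 and 1 replicas respectively.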


Replication & Distribution


Scheduling Scheme Evaluation


• Works on real data
• 32 nodes (256 cores)
• 20 BAM files (21 GB)

• All 3 scheduling schemes are better than random scheduling

• Factoring is the best among all for all experiments

Work Stealing vs. Our Approach

• Synthetic application
• Fixed data chunk size, varying execution time
• Performance comparison is shown as the ratio Work Stealing / Our approach

• As processing becomes more data intensive, our approach gives better results!


Data Size Impact


• Unified Genotyper
• 32 nodes (256 cores)
• As data size increases, WS-3 becomes better than WS-1
• As data size increases, RE-PAGE becomes better than WS-3

[Figure: plot annotations +3%, +7%, +4%, -1%]

Scalability Evaluation

[Figure: two panels, "Coverage Analyzer" and "Unified Genotyper"; plot annotations 4.2x, 7.1x, 2.2x, 9.9x]



Future Work

• An API to Develop Parallel Genomic Applications for Memory-Constrained Architectures

• Processing Compressed Genomic Data


API for Memory-Constrained Architectures

• So far, we have used only CPUs

• Co-processors can also be useful for genomic applications

• The trend in computing technologies
  – More cores, smaller memory
  – Intel Many Integrated Core (MIC) architecture


Proposed Work

• An API that helps users implement parallel genomic applications on memory-constrained architectures

• In this work, executables are not used; the developer writes map and reduce functions in the C programming language

• The middleware helps the developer in 3 ways
  – Reading data from BAM and FASTA files
  – Memory utilization
  – Parallel execution and task scheduling


Execution Flow


[Figure: input data is compressed; each map task reads compressed data and produces an intermediate result; a reduce task combines the intermediate results into the final result.]

Data Reading

• The middleware reads the data from files and generates genome matrices, which are the compressed inputs of map tasks.

• The genome matrix can be of two types
  – Sequence based: each row keeps a sequence
  – Location based: keeps the data in mpileup format; each row of the matrix keeps information for a different location


Genome Matrices

[Figure: a sequence-based genome matrix and a location-based genome matrix]


Optimization of Memory Utilization

• To decrease memory usage, we apply two techniques:
  – Selective loading
  – Transparent compression


Selective Loading

• Each read sequence in SAM/BAM files consists of 11 mandatory sections and 1 optional section
  – Sequence ID, location, base sequence, strand and others

• Many applications do not need all of them.
  – For counting bases, sequence IDs can be ignored

• We load only the parts we need
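A minimal sketch of selective loading on a text SAM record: the caller passes a bit mask of the mandatory columns it needs, and only those are recorded. The column order follows the SAM format (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL); the `sam_view` type and `select_sam_fields` function are hypothetical and not part of the proposed API. BAM files are binary and would need a real parser.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The 11 mandatory SAM columns, in file order. */
enum sam_field {
    SAM_QNAME = 0, SAM_FLAG, SAM_RNAME, SAM_POS, SAM_MAPQ, SAM_CIGAR,
    SAM_RNEXT, SAM_PNEXT, SAM_TLEN, SAM_SEQ, SAM_QUAL, SAM_NUM_FIELDS
};

/* Pointers into the original line for the selected columns only. */
typedef struct {
    const char *start[SAM_NUM_FIELDS];   /* NULL if the column was skipped */
    size_t length[SAM_NUM_FIELDS];
} sam_view;

/* Record only the columns whose bit is set in wanted_mask.
 * Returns 0 on success, -1 if the line has fewer than 11 columns. */
int select_sam_fields(const char *line, unsigned wanted_mask, sam_view *view)
{
    assert(line != NULL && view != NULL);
    const char *cursor = line;

    for (int f = 0; f < SAM_NUM_FIELDS; f++) {
        size_t len = strcspn(cursor, "\t\n");
        if (wanted_mask & (1u << f)) {
            view->start[f] = cursor;
            view->length[f] = len;
        } else {
            view->start[f] = NULL;
            view->length[f] = 0;
        }
        cursor += len;
        /* Every column except the last must be followed by a tab. */
        if (f + 1 < SAM_NUM_FIELDS) {
            if (*cursor != '\t')
                return -1;
            cursor++;
        }
    }
    return 0;
}
```

For counting bases, a caller might request only `SAM_POS` and `SAM_SEQ`; a real loader would then copy just those two columns into memory and skip the sequence ID entirely.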


Transparent Compression

• Main idea: the genome matrices keep the data in compressed format, but the developer can access the data through our API as if it were uncompressed.

• Compression Technique: Will be investigated
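Since the compression technique is still open, the following only illustrates what "transparent" could mean, using a stand-in technique: 2-bit packing of the four bases. The `packed_sequence` type and the `pack_sequence`, `get_base` and `free_sequence` functions are hypothetical. The sketch rejects any character other than A, C, G and T, so ambiguous bases such as N and quality values are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* A base sequence stored at 2 bits per base (4 bases per byte). */
typedef struct {
    uint8_t *packed;
    size_t num_bases;
} packed_sequence;

static int base_to_code(char base)
{
    switch (base) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Pack num_bases characters; returns 0 on success, -1 on an unsupported
 * base or allocation failure (out->packed is NULL in that case). */
int pack_sequence(const char *bases, size_t num_bases, packed_sequence *out)
{
    assert(bases != NULL && out != NULL);
    size_t num_bytes = (num_bases + 3) / 4;
    out->num_bases = 0;
    out->packed = calloc(num_bytes > 0 ? num_bytes : 1, 1);
    if (out->packed == NULL)
        return -1;

    for (size_t i = 0; i < num_bases; i++) {
        int code = base_to_code(bases[i]);
        if (code < 0) {
            free(out->packed);
            out->packed = NULL;
            return -1;
        }
        out->packed[i / 4] |= (uint8_t)(code << (2 * (i % 4)));
    }
    out->num_bases = num_bases;
    return 0;
}

/* Accessor that hides the packing: callers see plain base characters. */
char get_base(const packed_sequence *seq, size_t index)
{
    static const char letters[4] = {'A', 'C', 'G', 'T'};
    assert(seq != NULL && seq->packed != NULL && index < seq->num_bases);
    return letters[(seq->packed[index / 4] >> (2 * (index % 4))) & 3u];
}

void free_sequence(packed_sequence *seq)
{
    assert(seq != NULL);
    free(seq->packed);
    seq->packed = NULL;
    seq->num_bases = 0;
}
```

Code that calls `get_base` never sees the packed bytes, which is the property the main idea above asks for: the storage format can change without touching the map function.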


Sample Map Task

    void *map_coveragedepth(location_based_genome_matrix gm)   /* input genome matrix */
    {
        int i, j, position, chromosome;   /* chromosome is undeclared on the slide; int is assumed */
        char *sequence;
        reduce_object *total = NULL;      /* reduce object; its allocation is not shown on the slide */

        for (i = 0; i < gm.number_of_results; i++) {
            /* the get_*, count_* and add_* calls are methods we provide */
            position = getPosition_from_lbgm(gm.code[i], selected_parts);
            chromosome = get_chromosome_from_lbgm(gm.code[i], selected_parts);
            for (j = 0; j < gm.num_samples; j++) {
                sequence = get_base_sequence_for_sample_n(gm.code[i], selected_parts,
                                                          gm.num_samples, j);
                count_num_bases(sequence);
                add_results_to_reduce_object(total, position, chromosome, sequence);
            }
        }
        return (void *)total;
    }


Open Questions

• How to schedule map and reduce tasks?

• How to keep the intermediate results in memory?
  – The location-based genome matrix structure helps reduce the size of the intermediate results.
    • Many applications (e.g. SNP calling) need no iterative computation
    • Reduction is just concatenation of the intermediate results, so they can be written to disk as they are produced.


A middleware for processing compressed genomic data

• Compression is useful for archiving; however, it decreases processing performance

• There is an enormous number of compression methods for genomic data
  – No need for another compression method

• Our goal: a middleware that helps users process compressed data without fully decompressing it.


Execution Model



THANKS!
