PARALLEL PROCESSING OF LARGE SCALE GENOMIC DATA Candidacy Examination 08/26/2014 Mucahid Kutlu


Dec 31, 2015

Page 1: PARALLEL PROCESSING OF LARGE SCALE GENOMIC DATA

Candidacy Examination, 08/26/2014

Mucahid Kutlu

Page 2: Motivation

• The sequencing costs are decreasing
• Big data problem

[Figures adapted from genome.gov/sequencingcosts and https://www.nlm.nih.gov/about/2015CJ.html]

Parallel processing is inevitable!

Page 3: Typical Analysis on Genomic Data

• Single Nucleotide Polymorphism (SNP) calling

Positions:  1 2 3 4 5 6 7 8
Reference:  A G C G T A C C

Alignment File-1
Read-1: A G C G
Read-2: G C G G
Read-3: G C G T A
Read-4: C G T T C C

Alignment File-2
Read-1: A G A G
Read-2: A G A G T
Read-3: G A G T
Read-4: G T T C C

(Adapted from Wikipedia)

A single SNP may cause Mendelian disease!
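
As a rough illustration of the SNP-calling idea on this slide, the sketch below counts, for each reference position, how many aligned reads cover it and how many of them disagree with the reference. The `aligned_read` struct, the 0-based `start` offsets, the function names and the "every covering read disagrees" rule are all assumptions made for the example. Real callers such as VarScan apply far more careful criteria.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical read record: the bases plus the 0-based reference offset
 * at which the read was aligned (offsets are assumed for this sketch). */
typedef struct {
    const char *bases;
    size_t start;
} aligned_read;

/* For every reference position, count the covering reads (depth) and the
 * reads whose base differs from the reference (mismatch).
 * depth and mismatch must each hold strlen(reference) entries. */
void count_mismatches(const char *reference,
                      const aligned_read *reads, size_t num_reads,
                      int *depth, int *mismatch)
{
    assert(reference && reads && depth && mismatch);
    size_t ref_len = strlen(reference);
    memset(depth, 0, ref_len * sizeof *depth);
    memset(mismatch, 0, ref_len * sizeof *mismatch);

    for (size_t r = 0; r < num_reads; r++) {
        size_t len = strlen(reads[r].bases);
        for (size_t k = 0; k < len; k++) {
            size_t pos = reads[r].start + k;
            if (pos >= ref_len)
                break;                  /* read runs past the reference */
            depth[pos]++;
            if (reads[r].bases[k] != reference[pos])
                mismatch[pos]++;
        }
    }
}

/* Naive rule for illustration only: a position is a SNP candidate when
 * it is covered and every covering read disagrees with the reference. */
int is_snp_candidate(const int *depth, const int *mismatch, size_t pos)
{
    return depth[pos] > 0 && mismatch[pos] == depth[pos];
}
```

Calling `count_mismatches` once per alignment file and then testing each position with `is_snp_candidate` gives a per-file list of candidate positions.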

Page 4: Existing Solutions for Implementation

• Serial tools
  – SamTools, VCFTools, BedTools: file merging, sorting, etc.
  – VarScan: SNP calling
• Parallel implementations
  – Turboblast: searching local alignments
  – SEAL: read mapping and duplicate removal
  – Biodoop: statistical analysis
• Middleware systems
  – Hadoop
    • Not designed for the specific needs of genetic data
    • Limited programmability
  – Genome Analysis Tool Kit (GATK)
    • Designed for genetic data processing
    • Provides special data traversal patterns
    • Limited parallelization for some of its tools

Page 5: Main Goal of My Thesis

• We want to develop middleware systems that
  – Are specific to parallel genetic data processing
  – Allow parallelization of a variety of genetic algorithms
  – Work with different popular genetic data formats
  – Ease programming, since most developers are biologists, not computer scientists

Page 6: Papers During My PhD Study

• Mucahid Kutlu, Gagan Agrawal, "Cluster-based SNP Calling on Large-Scale Genome Sequencing Data," the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2014) (accepted, 19.1% acceptance rate)
• Mucahid Kutlu, Gagan Agrawal, "PAGE: A Framework for Easy PArallelization of GEnomic Applications," the 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2014) (accepted, 21.1% acceptance rate)
• Mucahid Kutlu, Gagan Agrawal and Oguz Kurt, "Fault tolerant parallel data-intensive algorithms," High Performance Computing (HiPC), 2012 (25.1% acceptance rate)
• Mucahid Kutlu, Gagan Agrawal and Oguz Kurt, "Fault tolerant parallel data-intensive algorithms," High Performance and Distributed Computing (HPDC), 2012 (poster paper)
• "RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications" (to be submitted)

Page 7: Outline

• Motivation & Background
• Current Work
  – PAGE: A Framework for Easy PArallelization of GEnomic Applications
  – RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications
• Future Work

Page 8: Our Work

• PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications
• Mappers and reducers are executable programs
  – Allows us to exploit existing applications
  – No restriction on programming language

Page 9: Intra-dependent Processing

• Each file is processed independently

[Diagram: each input file (File-1, File-2, ..., File-m) is split into Region-1 ... Region-n; the map tasks for File-1 produce O-11 ... O-1n, which a reduce task combines into Output-1, and likewise up to File-m with O-m1 ... O-mn and Output-m.]

Page 10: Inter-dependent Processing

• Each map task processes a particular region of ALL files

[Diagram: the input files are divided into Region-1, ..., Region-k, ..., Region-n; the map task for each region produces an intermediate output (O1, ..., Ok, ..., On), and a reduce task combines them into the final output.]
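
To make the difference between the two execution models concrete, the sketch below counts the tasks each model would create for a given number of files and regions. It reads the two diagrams as "one map task per (file, region) pair and one reduce per file" for intra-dependent processing and "one map task per region and a single reduce" for inter-dependent processing. The type and function names are hypothetical and are not part of PAGE.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { INTRA_DEPENDENT, INTER_DEPENDENT } exec_model;

typedef struct {
    size_t map_tasks;
    size_t reduce_tasks;
} task_counts;

/* Intra-dependent: every file is split into regions and reduced on its
 * own, giving files * regions map tasks and one reduce task per file.
 * Inter-dependent: one map task per region reads that region from all
 * files, and a single reduce task combines the regional outputs. */
task_counts count_tasks(exec_model model, size_t num_files, size_t num_regions)
{
    assert(num_files > 0 && num_regions > 0);
    task_counts c;
    if (model == INTRA_DEPENDENT) {
        c.map_tasks = num_files * num_regions;
        c.reduce_tasks = num_files;
    } else {
        c.map_tasks = num_regions;
        c.reduce_tasks = 1;
    }
    return c;
}
```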

Page 11: Data Partitioning

• Data is NOT packaged into equal-size data blocks as in Hadoop
  – Each application has a different way of reading the data
  – Equal-size data block packaging ignores nucleotide base location information
• The genome is divided into regions, and each map task is assigned to a region
  – Takes location information into account
  – The map task is responsible for accessing its particular region of the input files
    • This is a common feature of many genomic tools (GATK, SamTools)

Page 12: Genome Partition

• PAGE provides two data partitioning methods
  – By-locus partitioning: Chromosomes are divided into regions
  – By-chromosome partitioning: Chromosomes preserve their unity

[Figure: two rows showing Chr-1 through Chr-6]
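
The two partitioning methods can be sketched as follows. This is a minimal illustration, assuming a hypothetical `region` descriptor with 1-based inclusive coordinates and a fixed region length for by-locus partitioning. The slides do not say how PAGE chooses region boundaries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical region descriptor: 1-based, inclusive coordinates. */
typedef struct {
    int  chrom;   /* index of the chromosome */
    long start;
    long end;
} region;

/* By-locus: cut each chromosome into pieces of at most region_len bases.
 * Returns the number of regions written (never more than max_out). */
size_t partition_by_locus(const long *chrom_len, int num_chroms,
                          long region_len, region *out, size_t max_out)
{
    assert(chrom_len && out && region_len > 0);
    size_t n = 0;
    for (int c = 0; c < num_chroms; c++) {
        for (long s = 1; s <= chrom_len[c]; s += region_len) {
            if (n == max_out)
                return n;
            long e = s + region_len - 1;
            if (e > chrom_len[c])
                e = chrom_len[c];       /* last piece may be shorter */
            out[n].chrom = c;
            out[n].start = s;
            out[n].end = e;
            n++;
        }
    }
    return n;
}

/* By-chromosome: every chromosome stays whole and forms one region. */
size_t partition_by_chromosome(const long *chrom_len, int num_chroms,
                               region *out, size_t max_out)
{
    assert(chrom_len && out);
    size_t n = 0;
    for (int c = 0; c < num_chroms && n < max_out; c++) {
        out[n].chrom = c;
        out[n].start = 1;
        out[n].end = chrom_len[c];
        n++;
    }
    return n;
}
```

By-locus partitioning yields more, smaller map tasks, which gives a scheduler more freedom. By-chromosome partitioning suits tools that need a chromosome in one piece.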

Page 13: Challenges

• Load imbalance due to the nature of genomic data
  – It is not just an array of A, G, C and T characters
• High overhead of tasks
• I/O contention

[Figure: Coverage Variance]

Page 14: Task Scheduling

PAGE provides two types of scheduling schemes.

Static
• Each processor is responsible for regions of equal length.
• All map tasks must finish before the reduce tasks start.

Dynamic
• Map and reduce tasks are assigned by a master process.
• Reduce tasks can start once enough intermediate results are available.
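
The effect of the two schemes on load balance can be illustrated with a small simulation of map tasks only. The sketch assumes each task has a known cost, that static scheduling gives every worker a contiguous block with the same number of equal-length regions, and that dynamic scheduling hands the next task to whichever worker becomes free first. These are modelling assumptions for the example, not a description of PAGE's scheduler.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORKERS 64

/* Finishing time of the most heavily loaded worker. */
static double slowest(const double *busy, size_t num_workers)
{
    double worst = 0.0;
    for (size_t w = 0; w < num_workers; w++)
        if (busy[w] > worst)
            worst = busy[w];
    return worst;
}

/* Static: tasks are split into contiguous blocks of equal task count,
 * regardless of how expensive each task turns out to be. */
double static_makespan(const double *cost, size_t num_tasks, size_t num_workers)
{
    assert(cost && num_workers > 0 && num_workers <= MAX_WORKERS);
    double busy[MAX_WORKERS] = {0};
    size_t per_worker = (num_tasks + num_workers - 1) / num_workers;
    for (size_t i = 0; i < num_tasks; i++)
        busy[i / per_worker] += cost[i];
    return slowest(busy, num_workers);
}

/* Dynamic: a master gives the next task to the worker that is free first. */
double dynamic_makespan(const double *cost, size_t num_tasks, size_t num_workers)
{
    assert(cost && num_workers > 0 && num_workers <= MAX_WORKERS);
    double busy[MAX_WORKERS] = {0};
    for (size_t i = 0; i < num_tasks; i++) {
        size_t idle = 0;
        for (size_t w = 1; w < num_workers; w++)
            if (busy[w] < busy[idle])
                idle = w;
        busy[idle] += cost[i];
    }
    return slowest(busy, num_workers);
}
```

When high-coverage regions cluster together, the static split leaves one worker with most of the work, while the dynamic scheme spreads it out at the price of master coordination.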

Page 15: Sample Application Development with PAGE

• Serial execution command of the VarScan software
  – samtools mpileup -b file_list -f reference | java -jar VarScan.jar mpileup2snp
• To parallelize VarScan with PAGE, the user needs to define:
  – Genome Partition: By-Locus
  – Scheduling Scheme: Dynamic (or Static)
  – Execution Model: Inter-dependent
  – Map command: samtools mpileup -b file_list -r regionloc -f reference | java -jar VarScan.jar mpileup2snp > outputloc
  – Reduction: the cat bash shell command
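
One way to picture how the `regionloc` and `outputloc` placeholders in the map command get filled in for each region is the sketch below. The function name and the output file naming are hypothetical, `file_list` and `reference` are left as the literal placeholders from the slide, and the `chrom:start-end` region syntax is the form `samtools` accepts for `-r`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes the map command for one genomic region into buf.
 * Returns what snprintf returns: the length the full command needs,
 * so a value >= buf_size means the command was truncated. */
int build_map_command(char *buf, size_t buf_size,
                      const char *chrom, long start, long end,
                      const char *output_path)
{
    assert(buf && chrom && output_path);
    assert(strlen(chrom) > 0 && start <= end);
    return snprintf(buf, buf_size,
                    "samtools mpileup -b file_list -r %s:%ld-%ld -f reference"
                    " | java -jar VarScan.jar mpileup2snp > %s",
                    chrom, start, end, output_path);
}
```

With a `cat` reduction, the per-region output files only need to be concatenated in region order to produce the final result.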

Page 16: Experiments

• Experimental setup
  – In our cluster, each node has
    • 12 GB memory
    • 8 cores (2.53 GHz)
  – We obtained the data from the 1000 Genomes Project
  – We evaluated PAGE with 4 applications
    • VarScan: SNP detection
    • Realigner Target Creator: Detects insertions/deletions in alignment files
    • Indel Realigner: Applies local realignment to improve the quality of alignment files
    • Unified Genotyper: SNP detection

Page 17: Comparison with GATK

• Unified Genotyper tool of GATK

[Figure panels: Scalability (Data Size: 34 GB), Data Size Impact (# of cores: 128). Annotations: 10.9x, 12.8x]

Page 18: Comparison with Hadoop Streaming

• VarScan application

[Figure panels: Scalability (Data Size: 52 GB), Data Size Impact (# of cores: 128). Annotations: 6.9x, 12.7x]

Page 19: Outline

• Motivation & Background
• Current Work
  – PAGE: A Framework for Easy PArallelization of GEnomic Applications
  – RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications
• Future Work

Page 20: RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications

• In this study, we improve our middleware PAGE in several respects
• Main goal: Less I/O contention
• Main approach:
  – Utilizing distributed disks
  – Intelligent replication technique
  – Scheduling scheme that minimizes network traffic

Page 21: Execution Model

Page 22: Allowing Remote Processing or Not?

Advantages
• Better workload balance

Disadvantages
• As the number of nodes increases, network traffic increases
• Data transfer has a larger impact as computation becomes more data intensive
• Data transfer can be problematic for large-scale data

Page 23: Proposed Scheduling Schemes

• General idea: Replicate data and prohibit remote processing
  – Replication increases the number of local tasks for each node, which helps reduce workload imbalance
• Data chunks can have varying sizes and varying replication factors
• Master & worker approach
• We propose 3 scheduling schemes
  – Factoring
  – Help the busiest node (HBN)
  – Effective memory management (EMM)

Page 24: Proposed Replication Method

• Replicating all chunks onto all nodes is not feasible.
• Depending on the analysis we want to perform, some genomic regions can be more important than others.
• General idea: Replicate important regions more than others.
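
A minimal sketch of "replicate important regions more than others" is shown below. It assumes each data chunk already has a non-negative importance score and scales the replication factor linearly between 1 and a chosen maximum. Both the scoring and the linear rule are illustrative assumptions, since the slides do not specify how importance is measured or mapped to replica counts.

```c
#include <assert.h>
#include <stddef.h>

/* Gives every chunk between 1 and max_replicas copies, in proportion to
 * its importance relative to the most important chunk.
 * importance[] values are hypothetical, non-negative scores. */
void assign_replication(const double *importance, size_t num_chunks,
                        int max_replicas, int *replicas)
{
    assert(importance && replicas && max_replicas >= 1);

    double top = 0.0;
    for (size_t i = 0; i < num_chunks; i++) {
        assert(importance[i] >= 0.0);
        if (importance[i] > top)
            top = importance[i];
    }

    for (size_t i = 0; i < num_chunks; i++) {
        int r = 1;                       /* every chunk is stored at least once */
        if (top > 0.0)
            r += (int)((max_replicas - 1) * (importance[i] / top));
        replicas[i] = r;
    }
}
```

Capping `max_replicas` well below the node count keeps the storage cost bounded, which matches the observation that replicating every chunk everywhere is not feasible.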

Page 25: Replication & Distribution

Page 26: Scheduling Scheme Evaluation

• Works on real data
• 32 nodes (256 cores)
• 20 BAM files (21 GB)
• All 3 scheduling schemes are better than random scheduling
• Factoring is the best of the three in all experiments

Page 27: Work Stealing vs. Our Approach

• Synthetic application
• Fixed data chunk size, varying execution time
• Performance comparison is shown as the ratio: Work Stealing / Our approach
• As processing becomes more data intensive, our approach gives better results!

Page 28: Data Size Impact

• Unified Genotyper
• 32 nodes (256 cores)
• As data size increases, WS-3 becomes better than WS-1
• As data size increases, RE-PAGE becomes better than WS-3

[Figure annotations: +3%, +7%, +4%, -1%]

Page 29: Scalability Evaluation

[Figure panels: Coverage Analyzer, Unified Genotyper. Annotations: 4.2x, 7.1x, 2.2x, 9.9x]

Page 30: Outline

• Motivation & Background
• Current Work
  – PAGE: A Framework for Easy PArallelization of GEnomic Applications
  – RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications
• Future Work

Page 31: Future Work

• An API to Develop Parallel Genomic Applications for Memory-Constrained Architectures
• Processing Compressed Genomic Data

Page 32: API for Memory-Constrained Architectures

• We have employed only CPUs so far
• Co-processors can also be useful for genomic applications
• The trend in computing technologies
  – More cores, smaller memory
  – Intel Many Integrated Core (MIC) architecture

Page 33: Proposed Work

• An API that helps users implement parallel genomic applications on memory-constrained architectures
• In this work, executables are not used; the developer needs to write map-reduce functions in the C programming language
• The middleware helps the developer in 3 ways
  – Data reading from BAM and FASTA files
  – Memory utilization
  – Parallel execution and task scheduling

Page 34: Execution Flow

[Diagram: Input Data → Compress → Compressed Data → Map → Intermediate Result, drawn for two parallel pipelines; the intermediate results feed Reduce, which produces the Result.]

Page 35: Data Reading

• The middleware reads the data from files and generates genome matrices, which are the compressed inputs of map tasks.
• The genome matrix can be one of two types
  – Sequence Based: Each row keeps a sequence
  – Location Based: Keeps the data in mpileup format; each row of the matrix keeps information for a different location

Page 36: Genome Matrices

[Figure panels: Sequence Based, Location Based]

Page 37: Optimization of Memory Utilization

• In order to decrease memory usage, we apply two techniques:
  – Selective Loading
  – Transparent Compression

Page 38: Selective Loading

• Each read sequence in SAM/BAM files consists of 11 mandatory fields and 1 optional section
  – Sequence ID, location, base sequence, strand and others
• For many applications, we do not need all of them.
  – For counting bases, sequence IDs can be ignored
• We load only the parts we need
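
Selective loading can be pictured with the following sketch, which copies only the requested columns out of one tab-separated SAM text line and leaves the others empty. The enum lists the 11 mandatory SAM columns in file order. The function name, the bit-mask interface and the fixed `FIELD_CAP` buffer size are assumptions for illustration and are not the proposed API. BAM input would first need decoding by a library such as htslib.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FIELD_CAP 256   /* illustrative per-field buffer size */

/* The 11 mandatory SAM columns, in file order. */
enum sam_field {
    QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL,
    SAM_NUM_FIELDS
};

/* Copies the columns whose bit is set in `wanted` into out[field];
 * columns that are not wanted stay as empty strings.
 * Returns the number of mandatory columns seen on the line. */
int load_selected_fields(const char *line, unsigned wanted,
                         char out[SAM_NUM_FIELDS][FIELD_CAP])
{
    assert(line && out);
    for (int i = 0; i < SAM_NUM_FIELDS; i++)
        out[i][0] = '\0';

    int f = 0;
    const char *p = line;
    while (f < SAM_NUM_FIELDS) {
        size_t len = strcspn(p, "\t\n");        /* length of this column */
        if (wanted & (1u << f)) {
            size_t n = len < FIELD_CAP - 1 ? len : FIELD_CAP - 1;
            memcpy(out[f], p, n);               /* truncates very long columns */
            out[f][n] = '\0';
        }
        f++;
        p += len;
        if (*p != '\t')
            break;                              /* end of the line */
        p++;                                    /* skip the tab */
    }
    return f;
}
```

For a base-counting application, a mask of `(1u << POS) | (1u << SEQ)` would skip the read names and quality strings entirely.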

Page 39: Transparent Compression

• Main idea: The genome matrices keep the data in compressed format, but the developer can access the data through our API as if it were uncompressed.
• Compression technique: To be investigated
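
Since the compression technique is left open, the sketch below uses plain 2-bit packing of bases purely as a stand-in to show what "transparent" access could look like: the data is stored packed, and an accessor returns individual bases as if the sequence were uncompressed. All names are hypothetical, and only A, C, G and T are handled. Ambiguity codes such as N would need a different encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 2-bit code for a base. Any character other than A, C or G is stored
 * as T, which is a limitation of this illustration. */
static unsigned base_to_code(char b)
{
    switch (b) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    default:  return 3;
    }
}

/* Packs n bases into `packed`, four bases per byte.
 * `packed` must hold at least (n + 3) / 4 bytes. */
void pack_bases(const char *bases, size_t n, uint8_t *packed)
{
    assert(bases && packed);
    for (size_t i = 0; i < (n + 3) / 4; i++)
        packed[i] = 0;
    for (size_t i = 0; i < n; i++)
        packed[i / 4] |= (uint8_t)(base_to_code(bases[i]) << (2 * (i % 4)));
}

/* Accessor: returns base i as a character without unpacking the rest. */
char get_base(const uint8_t *packed, size_t i)
{
    assert(packed);
    return "ACGT"[(packed[i / 4] >> (2 * (i % 4))) & 3u];
}
```

A map function written against `get_base` never sees the packed layout, so the storage format can change without touching application code.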

Page 40: Sample Map Task

void *map_coveragedepth(location_based_genome_matrix gm)
{
    int i, j, position;
    char *chromosome;       /* type assumed; the slide does not declare it */
    char *sequence;
    reduce_object *total;   /* reduce object; its creation is not shown on the slide */

    /* One row of the location-based genome matrix per genomic location */
    for (i = 0; i < gm.number_of_results; i++) {
        position = getPosition_from_lbgm(gm.code[i], selected_parts);
        chromosome = get_chromosome_from_lbgm(gm.code[i], selected_parts);

        /* Visit the bases that each sample contributes at this location */
        for (j = 0; j < gm.num_samples; j++) {
            sequence = get_base_sequence_for_sample_n(gm.code[i], selected_parts, gm.num_samples, j);
            count_num_bases(sequence);
            add_results_to_reduce_object(total, position, chromosome, sequence);
        }
    }
    return (void *)total;
}

Callouts on the slide: input genome matrix (gm), reduce object (total), methods we provide (the API calls inside the loops).

Page 41: Open Questions

• How to schedule map and reduce tasks?
• How to keep the intermediate results in memory?
  – The location-based genome matrix structure is useful for reducing the intermediate results.
    • Many applications (e.g. SNP calling) do not need iterative computation
    • Reduction is just concatenation of the intermediate results, so they can be written to disk as they are produced.

Page 42: A Middleware for Processing Compressed Genomic Data

• Compression is useful for archiving; however, it decreases performance
• There is an enormous number of compression methods for genomic data
  – No need for another compression method
• Our goal: A middleware that helps users process compressed data without fully decompressing it.

Page 43: Execution Model

Page 44: THANKS!