Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018

Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018

Page 1:

Experiencing Apache Spark in Genomics

Zhong Wang, Ph.D.
Group Lead, Genome Analysis
02/20/2018

Page 2:

A metagenome is the genome of a microbial community

Page 3:

Microbial communities are “dark matter”

Number of species:
• Cow: ~6,000
• Human: ~1,000
• Soil: >100,000

>90% of the species haven’t been seen before

Page 4:

Metagenome sequencing

• Steps: Harvest microbes → Extract DNA → Shear & sequence → Assembly
• Products: Microbes → Genomic DNA → Short reads → Reconstructed genomes

Page 5:

Metagenome assembly

Library of books → Shredded library → “Reconstructed” library

• Genome ~= Book
• Metagenome ~= Library
• Sequencing ~= Sampling the pieces

Page 6:

Scale is an enemy

[Figure: data size in gigabases (Gb), axis from 1 to 1,000,000 in powers of ten, for Common, Human, Cow, and Soil]

Page 7:

Complexity is another…

• Remove contaminants, sequencing errors
• Overlap graph, de Bruijn graph
• Contigs or clusters
• Repetitive elements
• Homologous genes
• Horizontally transferred genes
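For readers unfamiliar with the de Bruijn graph named on this slide, here is a minimal Python sketch of how one can be built from reads. It is an illustration only, not the speaker's implementation: it assumes error-free reads, and the function name `build_de_bruijn`, the k-mer length, and the toy reads are all hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read; each k-mer is one edge
        # from its first k-1 bases to its last k-1 bases.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (illustrative data, not from the talk).
graph = build_de_bruijn(["ACGT", "CGTAC"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In a graph like this, a repeated sequence shows up as a node with more than one outgoing edge, which is one reason repetitive elements make assembly ambiguous.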

Page 8:

The ideal solution and the failed ones

Ideal solution
• Easy to develop
• Robust
• Scales to big data
• Efficient

BigMem
• Easy to develop
• Expensive
• Does not scale

MPI
• Fast
• Hard to develop
• Not robust

Hadoop
• Easy to develop
• Scales
• Slow

Page 9:

Addressing big data: Apache Spark

• New scalable programming paradigm
• Compatible with Hadoop-supported storage systems
• Improves efficiency through:
  • In-memory computing primitives
  • General computation graphs
• Improves usability through:
  • Rich APIs in Java, Scala, Python
  • Interactive shell

• Scale to big data
• Efficient
• Easy to develop
• Robust?

Page 10:

Goal: Metagenome read clustering

Read clustering can reduce the metagenome problem to single-genome problems

• Parallel processing
• Individualized optimization

Reads → Read clusters

Page 11:

Algorithm

Read graph containing all reads
• Node: read
• Edge: number of k-mers two reads share

Steps
1. K-mer mapping reads (KMR)
2. Graph construction and edge reduction (Edges)
3. Graph partitioning: LPA
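The three stages on this slide can be sketched on a single machine in plain Python. This is a toy illustration under several assumptions, not the Spark implementation from the talk: LPA is taken to mean the label propagation algorithm, edge reduction is approximated by a hypothetical `min_shared` threshold (the slide does not say how edges are reduced), and all function names and reads are invented for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kmers(read, k):
    """Return the set of distinct k-mers in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def kmer_mapping_reads(reads, k):
    """KMR step: map each k-mer to the IDs of the reads that contain it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for kmer in kmers(read, k):
            index[kmer].add(read_id)
    return index


def build_edges(index, min_shared):
    """Edges step: weight each read pair by its shared k-mer count, then
    drop pairs below min_shared (an assumed stand-in for edge reduction)."""
    weights = Counter()
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            weights[(a, b)] += 1
    return {pair: w for pair, w in weights.items() if w >= min_shared}


def label_propagation(n_reads, edges, max_iter=20):
    """LPA step: each read repeatedly adopts the label with the largest
    total edge weight among its neighbours (ties go to the smaller label).
    Labels are updated in place, so later reads see earlier updates."""
    neighbours = defaultdict(dict)
    for (a, b), w in edges.items():
        neighbours[a][b] = w
        neighbours[b][a] = w
    labels = list(range(n_reads))
    for _ in range(max_iter):
        changed = False
        for node in range(n_reads):
            if not neighbours[node]:
                continue  # isolated reads keep their own label
            votes = Counter()
            for other, w in neighbours[node].items():
                votes[labels[other]] += w
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Toy reads: the first three overlap each other, as do the last two.
reads = ["ACGTACGA", "CGTACGAT", "GTACGATT", "TTGGCCAA", "TGGCCAAT"]
index = kmer_mapping_reads(reads, k=4)
edges = build_edges(index, min_shared=2)
labels = label_propagation(len(reads), edges)
print(labels)
```

On this toy input the first three reads end up with one label and the last two with another, so each cluster could then be handled as a separate single-genome problem. The pairwise loop in `build_edges` is quadratic in the number of reads sharing a k-mer, which hints at why a distributed framework is attractive at the scale discussed in the talk.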

Page 12:

Testing datasets

                      Human Alzheimer transcriptome   Cow rumen metagenome
Data type             Transcriptome                   Metagenome
# species             ~20,000                         >=10,000
Repetitive content    medium                          high
# known species       high                            low
Read type             PacBio                          Illumina
Read length (bases)   0.3-30,000                      2x150
# reads               2 million                       1.2 billion
Data size             7.6 GB                          1 TB

Page 13:

High accuracy on a controlled dataset

Page 14:

Hardware and software environments

                              OTC         EMR         Bridge
Nodes                         20          20          8
Cores per node (total)        8 (160)     8 (160)     28 (224)
Memory per node (total), GB   64 (1280)   61 (1220)   128 (1024)
Hadoop version                2.7.3       2.7.3       2.7.2
Spark version                 2.1.1       2.2.0       2.1.0

Page 15:

Cow rumen: scale up to big data

[Figure: execution time (mins, axis 0 to 800) versus data size (GB; 20, 40, 60, 80, 100) on OTC, with series KMR, Edges, LPA, and Total]

Page 16:

Increasing nodes

[Figure, two panels: execution time (mins) versus number of nodes, each with series KMR, Edges, LPA, and Total]
• 50G Cow Rumen on EMR: 25, 50, 75, and 100 nodes (time axis 0 to 500)
• 10G Cow Rumen on EMR: 5, 10, 15, and 20 nodes (time axis 0 to 160)

Page 17:

Fine tune parallelism

[Figure: execution time (mins, axis 0 to 350) versus Spark default parallelism (log10, axis 1 to 8), with series 50G and 20G]
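A sweep like the one plotted here could be scripted as follows. This is a hedged sketch: it assumes the axis is the base-10 logarithm of Spark's `spark.default.parallelism` setting (the slide gives no further detail), and `cluster_reads.py` is a hypothetical application script, not something named in the talk. The code only builds the `spark-submit` command lines; it does not run them.

```python
import shlex


def sweep_commands(app, exponents):
    """Build one spark-submit command per parallelism value 10**e."""
    commands = []
    for e in exponents:
        parallelism = 10 ** e
        args = [
            "spark-submit",
            "--conf", f"spark.default.parallelism={parallelism}",
            app,
        ]
        commands.append(shlex.join(args))
    return commands


# Exponents 1 to 8 mirror the axis ticks; the script name is hypothetical.
commands = sweep_commands("cluster_reads.py", range(1, 9))
for cmd in commands:
    print(cmd)
```

Recording the wall-clock time of each run against its parallelism value would give the data for a plot of this kind.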

Page 18:

Dataset complexity vs performance

[Figure: execution time (mins, axis 0 to 160) broken down by stage (KMR, Edges, LPA) for Human Iso-Seq Alzheimer (PacBio) and Cow Rumen (Illumina); data labels 146.33 and 44.5]

Page 19:

Platform comparison: Cloud vs HPC

                              OTC         EMR         Bridge
Nodes                         20          20          8
Cores per node (total)        8 (160)     8 (160)     28 (224)
Memory per node (total), GB   64 (1280)   61 (1220)   128 (1024)
Time (min)                    106         105         126
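One rough way to read this table is to multiply total cores by run time. The sketch below does that arithmetic with the figures from the table; core-minutes is a derived metric that does not appear in the talk, and it ignores differences in CPU generation, memory, network, and Spark version between the platforms.

```python
# Total cores and run times copied from the platform comparison table.
platforms = {
    "OTC": {"total_cores": 160, "minutes": 106},
    "EMR": {"total_cores": 160, "minutes": 105},
    "Bridge": {"total_cores": 224, "minutes": 126},
}


def core_minutes(total_cores, minutes):
    """Core-minutes consumed: total cores multiplied by wall-clock minutes."""
    return total_cores * minutes


usage = {name: core_minutes(**p) for name, p in platforms.items()}
for name, value in usage.items():
    print(f"{name}: {value} core-minutes")
```

By this measure the two cloud setups come out nearly identical, while the HPC run uses more cores for a somewhat longer time.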

Page 20:

Overall impression of Spark

✓ Easy to develop
✓ Robust
✓ Scale to big data
✓ Flexible (cloud, HPC)
? Efficient
  ✓ vs. Hadoop/Pig
  ▪ vs. MPI?
? Accuracy
  ✓ Long reads
  ▪ Short reads need optimization

Page 21:

Acknowledgements

Spark Team
• Lizhen Shi @ FSU
• Xiandong Meng

Lisa Gerhardt, Evan Racah @ NERSC

Yong Qin, Gary Jung, Greg Kurtzer, Bernard Li @ HPC

Philip Blood, Bryon Gill @ PSC