Experiencing Apache Spark in Genomics
Zhong Wang, Ph.D.
Group Lead, Genome Analysis
02/20/2018
A metagenome is the genome of a microbial community
Microbial communities are “dark matter”
Number of species:
• Cow: ~6,000
• Human: ~1,000
• Soil: >100,000
>90% of the species haven’t been seen before
Metagenome sequencing
Steps: Harvest microbes → Extract DNA → Shear & sequence → Assembly
Products: Microbes → Genomic DNA → Short reads → Reconstructed genomes
Metagenome assembly
Library of books → Shredded library → “Reconstructed” library
Genome ~= Book
Metagenome ~= Library
Sequencing ~= Sampling the pieces
Scale is an enemy
[Figure: data size in gigabases (Gb), log scale from 1 to 1,000,000, for Common, Human, Cow, and Soil datasets]
Complexity is another…
• Remove contaminants, sequencing errors
• Overlap graph, de Bruijn graph
• Contigs or clusters
• Repetitive elements
• Homologous genes
• Horizontally transferred genes
The ideal solution and the failed ones
Ideal: Easy to develop, Robust, Scale to big data, Efficient
BigMem
• Easy to develop
• Expensive
• Does not scale
MPI
• Fast
• Hard to develop
• Not robust
Hadoop
• Easy to develop
• Scales
• Slow
Addressing big data: Apache Spark
• New scalable programming paradigm
• Compatible with Hadoop-supported storage systems
• Improves efficiency through:
  • In-memory computing primitives
  • General computation graphs
• Improves usability through:
  • Rich APIs in Java, Scala, Python
  • Interactive shell
Scale to big data
Efficient
Easy to develop
Robust?
Goal: Metagenome read clustering
Read clustering can reduce the metagenome problem to a set of single-genome problems
• Parallel processing
• Individualized optimization
Reads → Read clusters
Algorithm
Node: read
Edge: number of k-mers two reads share
Read graph containing all reads → Graph partitioning: LPA (label propagation algorithm)
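The read graph described above can be sketched in plain Python. This is a minimal single-machine illustration, not the Spark implementation from the talk: the function names, the `min_shared` threshold, and the toy reads are hypothetical, and reverse complements are ignored for brevity.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers in a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_read_graph(reads, k, min_shared=1):
    """Build {(read_a, read_b): number of shared k-mers} for a list of reads."""
    # Map each k-mer to the ids of the reads that contain it.
    kmer_to_reads = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for kmer in kmers(seq, k):
            kmer_to_reads[kmer].add(read_id)

    # Every pair of reads that shares a k-mer gains one unit of edge weight.
    edge_weights = defaultdict(int)
    for read_ids in kmer_to_reads.values():
        for a, b in combinations(sorted(read_ids), 2):
            edge_weights[(a, b)] += 1

    # Edge reduction (assumed form): drop edges below the threshold.
    return {edge: w for edge, w in edge_weights.items() if w >= min_shared}


toy_reads = ["ACGTAC", "CGTACG", "TTTTTT"]
graph = build_read_graph(toy_reads, k=4)
```

In this toy input the first two reads share the k-mers CGTA and GTAC, so they are joined by an edge of weight 2, while the third read stays unconnected. Pairing reads per k-mer is quadratic in the number of reads that share a k-mer, which is one reason a distributed framework is attractive at metagenome scale.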
Pipeline: Kmer-mapping reads (KMR) → Graph Construction and Edge Reduction (Edges) → LPA
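The final LPA step can be illustrated with a small label propagation sketch in plain Python. It assumes a weighted, undirected edge dictionary as input; the function name, the fixed update order, the smallest-label tie-break, and the toy graph are choices made here for determinism, not details confirmed by the talk.

```python
from collections import defaultdict


def label_propagation(edges, num_nodes, max_iters=20):
    """Cluster nodes of a weighted undirected graph by label propagation.

    edges: dict mapping (a, b) -> weight; nodes are numbered 0..num_nodes-1.
    Returns a list where labels[i] is the cluster label of node i.
    """
    neighbors = defaultdict(dict)
    for (a, b), w in edges.items():
        neighbors[a][b] = w
        neighbors[b][a] = w

    labels = list(range(num_nodes))  # every read starts in its own cluster
    for _ in range(max_iters):
        changed = False
        for node in range(num_nodes):
            if not neighbors[node]:
                continue  # an isolated read keeps its own label
            # Sum edge weights per neighboring label.
            votes = defaultdict(int)
            for nb, w in neighbors[node].items():
                votes[labels[nb]] += w
            # Heaviest label wins; ties go to the smallest label.
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break  # converged: no label changed in a full pass
    return labels


toy_edges = {(0, 1): 3, (1, 2): 3, (0, 2): 3, (3, 4): 2}
clusters = label_propagation(toy_edges, num_nodes=6)
```

Here reads 0 to 2 end up with one shared label, reads 3 and 4 with another, and read 5 keeps its own. Labels are updated in place during each pass (asynchronous updates), which avoids the oscillation that fully synchronous updates can show on some graphs.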
Testing datasets
Dataset               Human Alzheimer transcriptome   Cow rumen metagenome
Data type             Transcriptome                   Metagenome
# species             ~20,000                         >=10,000
Repetitive content    medium                          high
# known species       high                            low
Read type             PacBio                          Illumina
Read length (bases)   0.3-30,000                      2x150
# reads               2 million                       1.2 billion
Data size             7.6 GB                          1 TB
High accuracy on a controlled dataset
Hardware and software environments
Platform                      OTC         EMR         Bridge
Nodes                         20          20          8
Cores per node (total)        8 (160)     8 (160)     28 (224)
Memory per node, GB (total)   64 (1280)   61 (1220)   128 (1024)
Hadoop                        2.7.3       2.7.3       2.7.2
Spark                         2.1.1       2.2.0       2.1.0
Cow rumen: scale up to big data
[Figure: execution time (mins, 0 to 800) vs. data size (20 to 100 GB) on OTC; series: KMR, Edges, LPA, Total]
Increasing nodes
[Figure: 50G Cow Rumen on EMR; execution time (mins, 0 to 500) vs. number of nodes (25 to 100); series: KMR, Edges, LPA, Total]
[Figure: 10G Cow Rumen on EMR; execution time (mins, 0 to 160) vs. number of nodes (5 to 20); series: KMR, Edges, LPA, Total]
Fine tune parallelism
[Figure: execution time (mins, 0 to 350) vs. Spark default parallelism (log10, 1 to 8); series: 50G, 20G]
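One way to run such a tuning sweep is to generate the `spark-submit` arguments for each candidate setting. The property `spark.default.parallelism` and the `--conf` flag are standard Spark; the helper name and the candidate values below are hypothetical, since the slides do not list the exact settings that were tested.

```python
def parallelism_args(parallelism):
    """Return spark-submit arguments that set the default parallelism."""
    if parallelism < 1:
        raise ValueError("parallelism must be a positive integer")
    return ["--conf", f"spark.default.parallelism={parallelism}"]


# Hypothetical candidate values for a sweep, spaced by powers of ten.
candidates = [10 ** exponent for exponent in range(1, 5)]
sweep = {p: parallelism_args(p) for p in candidates}
```

Each entry in `sweep` can be appended to an otherwise fixed `spark-submit` command line, so that only the parallelism changes between timed runs.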
Dataset complexity vs performance
[Figure: stacked bars of execution time (mins, 0 to 160) by stage (KMR, Edges, LPA) for Human Iso-Seq Alzheimer (PacBio) and Cow Rumen (Illumina); labeled totals: 146.33 and 44.5]
Platform comparison: Cloud vs HPC
Platform                      OTC         EMR         Bridge
Nodes                         20          20          8
Cores per node (total)        8 (160)     8 (160)     28 (224)
Memory per node, GB (total)   64 (1280)   61 (1220)   128 (1024)
Time (min)                    106         105         126
Overall impression of Spark
✓ Easy to develop
✓ Robust
✓ Scale to big data
✓ Flexible (cloud, HPC)
? Efficient
  ✓ vs. Hadoop/Pig
  ▪ vs. MPI?
? Accuracy
  ✓ Long reads
  ▪ Short reads need optimization
Acknowledgements
Spark Team
Lizhen Shi @ FSU
Xiandong Meng
Lisa Gerhardt, Evan Racah @ NERSC
Yong Qin, Gary Jung, Greg Kurtzer, Bernard Li @ HPC
Philip Blood, Bryon Gill @ PSC