GPU ACCELERATION OF BIOINFORMATICS PIPELINES
Jonathan Cohen and Mark Berger, NVIDIA
Jan 18, 2015
Agenda
GPU Programming in 10 slides – Cohen (10 minutes)
GPUs for Bioinformatics – Cohen (10 minutes)
Experiences porting SeqAn to CUDA – Siragusa (15 minutes)
Resources – Berger (5 minutes)
Discussion, Q&A – All (20 minutes)
GPU Programming in Ten Slides
CUDA – Programming for Throughput
CPU threads:
Large amount of memory per thread
Full-featured instruction set
1-16 execute simultaneously
CUDA threads:
Lightweight footprint
Full-featured instruction set
10,000 execute simultaneously
CPU (host): executes functions
Run few threads, each one very fast
GPU (device): executes kernels
Run many threads, each one slow => total throughput high
CUDA Kernels: Parallel Threads
A kernel is an array of threads, executed in parallel
All threads execute the same code
Each thread has an ID, used to select input/output data and make control decisions
float x = input[threadID];
float y = func(x);
output[threadID] = y;
CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed as a grid of blocks of threads
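The grid/block/thread hierarchy can be illustrated without a GPU. The following is a minimal sketch that emulates a one-dimensional launch serially on the CPU: two nested loops stand in for the blocks and the threads within each block, and each (block, thread) pair is flattened into a global index with the same arithmetic a CUDA thread would use (blockIdx.x * blockDim.x + threadIdx.x). The names run_grid, blockId, threadInBlock and the body of func are assumptions made for illustration; they do not come from the slides.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the per-element work ("func" on the earlier slide).
static float func(float x) { return 2.0f * x + 1.0f; }

// Serial emulation of a 1-D grid launch: every (block, thread) pair runs the
// same body, and the pair is flattened into one global thread ID.
std::vector<float> run_grid(const std::vector<float>& input, int threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    const int n = static_cast<int>(input.size());
    // Round up so a final, partially filled block still covers the tail.
    const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    std::vector<float> output(input.size(), 0.0f);

    for (int blockId = 0; blockId < numBlocks; ++blockId) {
        for (int threadInBlock = 0; threadInBlock < threadsPerBlock; ++threadInBlock) {
            // On a GPU this would be blockIdx.x * blockDim.x + threadIdx.x.
            const int threadID = blockId * threadsPerBlock + threadInBlock;
            if (threadID < n) {  // threads past the end of the data do nothing
                output[threadID] = func(input[threadID]);
            }
        }
    }
    return output;
}
```

Calling run_grid on five elements with threadsPerBlock = 2 uses three blocks; the last block has one idle thread, which is why the bounds check is needed. On a real GPU the two loops disappear and the iterations run concurrently.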
Accelerated Computing: Multi-core plus Many-cores
CPU: Optimized for Serial Tasks
GPU Accelerator: Optimized for Many Parallel Tasks
3-10X+ Compute Throughput
7X Memory Bandwidth
5X Energy Efficiency
How GPU Acceleration Works
Application Code = GPU + CPU
GPU: Compute-Intensive Functions (5% of Code)
CPU: Rest of Sequential CPU Code
Hello World in CUDA
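#include <cstdio>

__global__
void parallel_hello_world()
{
    printf("Hello, world. This is thread %d, block %d!\n",
           threadIdx.x, blockIdx.x);
}

int main()
{
    // Launch a grid of 128 blocks with 128 threads each.
    parallel_hello_world<<<128,128>>>();
    // Wait for the kernel to finish so its printf output is flushed before exit.
    cudaDeviceSynchronize();
    return 0;
}
> nvcc -o hello_world -arch=sm_30 main.cu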
> ./hello_world
Hello, world. This is thread 0, block 0!
Hello, world. This is thread 1, block 0!
...
GPUs for Bioinformatics
Life Technologies
Ion Proton
3 GPUs per Device
S3229 - GPU Accelerated Signal Processing in Ion Proton
Whole Genome Sequencer
Mohit Gupta ( Life Technologies )
Jakob Siegel ( Life Technologies )
https://registration.gputechconf.com/form/session-listing
BGI & NVIDIA
Joint Innovation Lab
SOAP3 Aligner
S3257 - Tackling Big Data in Genomics with GPU
BingQiang Wang (Beijing Genomics Institute)
https://registration.gputechconf.com/form/session-listing
CUDASW++
From Bertil Schmidt’s group: http://cudasw.sourceforge.net/homepage.htm
Y. Liu, A. Wirawan, B. Schmidt: "CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions". BMC Bioinformatics, 2013, 14:117.
Performance comparisons on the Swiss-Prot database:
“On GTX680 (GTX690), CUDASW++ 3.0 yields an average performance of 109.4 (169.7) GCUPS, with a maximum of 119.0 (185.6) GCUPS.”
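To make the GCUPS figures concrete: Smith-Waterman fills a dynamic-programming matrix with one cell per (query residue, database residue) pair, and GCUPS counts billions of such cell updates per second. The sketch below is a plain serial C++ version, not the CUDASW++ implementation. The scoring values, the linear gap penalty, and the names Scoring, smith_waterman_score and gcups are all assumptions chosen for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative scoring parameters (assumed; not taken from CUDASW++).
struct Scoring {
    int match;
    int mismatch;
    int gap;  // linear gap penalty, applied per gapped position
};

// Best local-alignment score between a and b (Smith-Waterman, linear gaps).
// Only two rows of the DP matrix are kept, since each row depends on the previous one.
int smith_waterman_score(const std::string& a, const std::string& b,
                         Scoring s = Scoring{2, -1, -1})
{
    std::vector<int> prev(b.size() + 1, 0), curr(b.size() + 1, 0);
    int best = 0;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = 0;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? s.match : s.mismatch);
            const int up   = prev[j] + s.gap;
            const int left = curr[j - 1] + s.gap;
            // Clamping at 0 is what makes the alignment local rather than global.
            curr[j] = std::max(std::max(0, diag), std::max(up, left));
            best = std::max(best, curr[j]);
        }
        std::swap(prev, curr);
    }
    return best;
}

// GCUPS = (query length x total database residues) / seconds / 1e9.
double gcups(double queryLength, double databaseResidues, double seconds)
{
    assert(seconds > 0.0);
    return queryLength * databaseResidues / seconds / 1e9;
}
```

As a usage example under these assumptions, a 1,000-residue query searched against a database of 10^9 residues in 10 seconds corresponds to 100 GCUPS, which is the same order of magnitude as the figures quoted above.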
NVIDIA GPU Life Science Focus
Molecular Dynamics: All codes are available
AMBER, CHARMM, DESMOND, DL_POLY,
GROMACS, LAMMPS, NAMD
Great multi-GPU performance
GPU codes: Abalone, ACEMD, HOOMD-Blue
Focus: scaling to large numbers of GPUs
Quantum Chemistry: key codes ported or optimizing
Active GPU acceleration projects:
VASP, NWChem, Gaussian, GAMESS, ABINIT,
Quantum Espresso, BigDFT, CP2K, GPAW, etc.
GPU code: TeraChem
Analytical and Medical Imaging Instruments
NVBIO
A GPU based C++ framework for
High Throughput Sequence Analysis
Short & Long Read Alignment
Variant Calling
Compression
…
Overall Design:
flexibility & customizability – a templated library
parallelism at every level
optimize throughput, server-like design
optimize the whole pipeline, not just a single component
(e.g. including data transfers, SAM, BAM, CRAM I/O, …)
A modular library
FM-index, Suffix Trie, Radix Tree, Sorted Dictionary
DP Alignment: Edit Distance, Smith-Waterman, Needleman-Wunsch, Gotoh, Banded/Full DP
Text Search: Tries, Exact Search, Backtracking
Sequence I/O: FASTQ, FASTA
Alignment I/O: SAM, BAM, CRAM
Support Tools: HTML report generators
GPU: O(1k-10k) threads
CPU: O(10-100) threads
nvBowtie2 - Real Datasets
Dataset                                       Accession   Mode        Speedup  Alignment rate  Disagreement
Ion Proton 100M x 175bp (8-350)               -           end-to-end  4.3x     +0.5%           0.002%
Illumina Genome Analyzer II 10M x 100bp x 2   ERR161544   end-to-end  2.4x     =               0.006%
Ion Proton 100M x 175bp (8-350)               -           local       7.6x     -0.6%           0.03%
Illumina Genome Analyzer II 10M x 100bp x 2   ERR161544   local       2.6x     =               0.022%
TT32
NVBIO: efficient sequences analysis on GPUs
Jacopo Pantaleoni Tuesday 2:10 pm, Hall 9
GPU Technology Conference
https://registration.gputechconf.com/form/session-listing
Tag: “Bioinformatics and Genomics”
http://www.gputechconf.com/page/home.html
Google: “GPU Technology Conference”
Resources
3 Ways to Accelerate Applications
Libraries: “Drop-in” Acceleration
OpenACC Directives: Easily Accelerate Applications
Programming Languages: Maximum Flexibility
GPU Accelerated Libraries: “Drop-in” Acceleration for your Applications
Linear Algebra: FFT, BLAS, SPARSE, Matrix
Numerical & Math: RAND, Statistics
Data Struct. & AI: Sort, Scan, Zero Sum
Visual Processing: Image & Video
Libraries shown: NVIDIA cuFFT, cuBLAS, cuSPARSE; NVIDIA Math Lib; NVIDIA cuRAND; NVIDIA NPP; NVIDIA Video Encode; GPU AI – Board Games; GPU AI – Path Finding
OpenACC: Open, Simple, Portable
• Open Standard
• Easy, Compiler-Driven Approach
• Portable on GPUs and Xeon Phi
main() {
    …
    <serial code>
    …
    #pragma acc kernels    // compiler hint
    {
        <compute intensive code>
    }
    …
}
CAM-SE Climate 6x Faster on GPU 2x Faster on CPU only
Top Kernel: 50% of Runtime
Available from:
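The directive pattern on the OpenACC slide can be filled in with a concrete loop. The sketch below applies it to a SAXPY-style update. The saxpy function and its parameter names are assumptions for illustration. The copyin and copy data clauses and the loop independent directive are standard OpenACC, but whether the loop is actually offloaded depends on the compiler and its flags; a compiler without OpenACC support ignores the pragmas and runs the loop serially with the same result.

```cpp
#include <cassert>
#include <vector>

// y[i] = a * x[i] + y[i] for every element.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y)
{
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();
    float* yp = y.data();

    // Compiler hint: the region below is a candidate for offloading.
    // copyin: xp is read-only on the device; copy: yp is read and written back.
    #pragma acc kernels copyin(xp[0:n]) copy(yp[0:n])
    {
        // Tell the compiler the iterations do not depend on each other.
        #pragma acc loop independent
        for (int i = 0; i < n; ++i) {
            yp[i] = a * xp[i] + yp[i];
        }
    }
}
```

Because the directives are only hints, the same source serves as both the CPU and the GPU version, which is the portability argument made on the slide.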
GPU Programming Languages
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: Thrust, CUDA C++
Python: PyCUDA, Anaconda Accelerate
C#: GPU.NET
Numerical analytics: R, MATLAB, Mathematica, LabVIEW
Reaching New Developers - CUDA Python
Python Productivity + GPU Performance
Easy to Learn
Powerful Libraries
Popular in New Developers
HPC & Data Analytics
Data from CodeEval.com, based on 100k+ code samples
Easiest Way to Learn CUDA
50K Registered
127 Countries
Learn from the Best
Anywhere, Any Time
It’s Free!
Engage with an Active Community
Feedback/Discussion