GPU-based Computing
Dec 13, 2015
Tesla C870 GPU
• Constant cache: 8 KB / multiprocessor
• Texture cache: 8 KB / multiprocessor
• Shared memory: 16 KB / multiprocessor
• Registers: 8,192 / multiprocessor
• Global memory: 1.5 GB per GPU
• Up to 768 threads per multiprocessor (about 21 bytes of shared memory and 10 registers per thread)
Case Study: Irregular Terrain Model
• each profile is 1 KB
• 16*16 threads
• 128*16 threads, 45 regs
• 192*16 threads, 37 regs
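A plausible reading of these configurations, assuming the Tesla C870 figures above (8,192 registers per multiprocessor, 16 multiprocessors): at 45 registers per thread only 128 threads fit on a multiprocessor (128 × 45 = 5,760 registers, whereas 192 × 45 = 8,640 would exceed 8,192), while at 37 registers per thread 192 threads fit (192 × 37 = 7,104), i.e. 192 × 16 = 3,072 threads across the GPU, matching the figure quoted later.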
OpenMP vs CUDA
#pragma omp parallel for shared(A) private(i,j)
for (i = 0; i < 32; i++)
    for (j = 0; j < 32; j++)
        value = some_function(A[i][j]);
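For comparison, a minimal CUDA sketch of the same 32x32 computation (the kernel name, the output array, and the launch configuration are illustrative assumptions; some_function is assumed to be available as a __device__ function): one thread handles one (i, j) element instead of the loop nest.

// Hypothetical CUDA counterpart: one thread per (i, j) element.
__global__ void compute_kernel(const float *A, float *value)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    if (i < 32 && j < 32)
        value[i * 32 + j] = some_function(A[i * 32 + j]);
}

// Launch: 2x2 blocks of 16x16 threads cover the 32x32 iteration space.
// dim3 block(16, 16);
// dim3 grid(2, 2);
// compute_kernel<<<grid, block>>>(d_A, d_value);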
Tesla C870 GPU vs. CELL BE: ITM Execution Time for 256K Profiles
GPU (1.35 GHz):      0.138 s
CELL BE (2.8 GHz):   2.067 s
GPP (2.4 GHz):      17.729 s
3072 threads on GPU
15x faster than CELL BE
128x faster than GPP
Tesla C1060: 30 cores and double the registers ($1,500), delivering 8x the GFLOPS of an Intel Xeon W5590 quad core ($1,600)
• 32 cores, 1536 threads per core
• No complex mechanisms
   – speculation, out-of-order execution, superscalar, instruction-level parallelism, branch predictors
• Column access to memory
   – faster search, read
• First GPU with error-correcting code
   – registers, shared memory, caches, DRAM
• Languages supported
   – C, C++, FORTRAN, Java, Matlab, and Python
• IEEE 754-2008 double-precision floating point
   – fused multiply-add operations
• Streaming and task switching (25 microseconds)
   – launching multiple kernels (up to 16) simultaneously (see the stream sketch below)
• Visualization and computing
Fermi: Next Generation GPU
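As a rough illustration of the concurrent-kernel feature, a minimal sketch using CUDA streams (the kernel names and the work they do are placeholders, not from the slides): independent kernels launched into separate streams can overlap on a Fermi-class GPU.

// Two independent, trivial kernels (placeholders for real work).
__global__ void kernelA(float *x) { if (threadIdx.x == 0) x[0] += 1.0f; }
__global__ void kernelB(float *y) { if (threadIdx.x == 0) y[0] += 2.0f; }

// Launching them into separate streams lets the hardware run them
// concurrently instead of serializing them back-to-back.
void launch_concurrent(float *d_x, float *d_y)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    kernelA<<<1, 32, 0, s1>>>(d_x);
    kernelB<<<1, 32, 0, s2>>>(d_y);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}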
H(i,j) = max { H(i-1,j-1) + S(i,j), H(i-1,j) - G, H(i,j-1) - G, 0 }   // gap penalty G = 10
Sequence Alignment
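A minimal C sketch of this recurrence (the function and parameter names, and the 'A'-based indexing into a 26x26 substitution matrix, are assumptions for illustration; the slides' GPU implementation is not reproduced here):

// Fill the Smith-Waterman score matrix H ((m+1) x (n+1), row-major) for
// sequences a (length m) and b (length n), using substitution scores S
// and a linear gap penalty G = 10.
#define G 10

static int max4(int w, int x, int y, int z)
{
    int v = w;
    if (x > v) v = x;
    if (y > v) v = y;
    if (z > v) v = z;
    return v;
}

void smith_waterman(const char *a, int m, const char *b, int n,
                    const int S[26][26], int *H)
{
    for (int i = 0; i <= m; i++) H[i * (n + 1)] = 0;   // first column
    for (int j = 0; j <= n; j++) H[j] = 0;             // first row
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            H[i * (n + 1) + j] =
                max4(H[(i - 1) * (n + 1) + (j - 1)] + S[a[i - 1] - 'A'][b[j - 1] - 'A'],
                     H[(i - 1) * (n + 1) + j] - G,
                     H[i * (n + 1) + (j - 1)] - G,
                     0);
}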
     A   B   C   D   F   W
A    5  -2  -1  -2  -3  -3
B   -2   5  -3  -3  -4  -5
C   -1  -3  13  -4  -2  -5
D   -2  -3  -4   8  -5  -5
F   -3  -4  -2  -5   8   4
W   -3  -5  -5  -5   4  15
Substitution Matrix
Cost Function
     A   B   C   D   …   Z
A    5  -2  -1  -2   …  -3
B   -2   5  -3  -3   …  -5
C   -1  -3  13  -4   …  -5
D   -2  -3  -4   8   …  -5
…    …   …   …   …   …   …
Z   -3  -5  -5  -5   …  15
Only need space for 26x26 matrix
New Cost Function
     A   R   N   D   …   X
Q   -1   1   0   0   …  -1
U   -1  -1  -1  -1   …  -1
E   -1   0   0   2   …  -1
R   -1   5   0  -2   …  -1
Y   -2  -2  -2  -3   …  -1
…    …   …   …   …   …   …
Previous Methods
Space needed is 23x(Query Length)
Sorted Substitution Table
Computed a new table from the substitution matrix, with the substitution characters for the top row and the query sequence for the column. *Does not use modulo
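A minimal sketch of how a table of this shape could be built (function and parameter names, and the assumed 23-character alphabet, are illustrative; this is not necessarily the exact construction used in the slides): row i holds the scores of query residue i against each substitution character, so an alignment cell only needs the query position and the database character to look up its score.

// Build a query-indexed score table: one row per query position, one
// column per substitution character (an assumed alphabet of 23 symbols).
#define ALPHABET_SIZE 23

void build_query_table(const char *query, int query_len,
                       const int subst[26][26],  // original substitution matrix, 'A'-indexed
                       const char *alphabet,     // the substitution characters (top row)
                       int *table)               // output: query_len x ALPHABET_SIZE, row-major
{
    for (int i = 0; i < query_len; i++)
        for (int c = 0; c < ALPHABET_SIZE; c++)
            table[i * ALPHABET_SIZE + c] = subst[query[i] - 'A'][alphabet[c] - 'A'];
}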
Protein   GPU (1.35 GHz)   SSEARCH (3.2 GHz)            GPU Cycles   SSEARCH Cycles   Cycles
Length    Time (s)         Time (s)            Speedup  (billions)   (billions)       Ratio
    4          1.3              10                8         1.69         32.00        18.96
    8          1.8              12                6.8        2.38         38.40        16.11
   16          2.8              26                9.2        3.83         83.20        21.71
   32          5.8              47                8.1        7.79        150.40        19.30
   64         11.2              99                8.8       15.17        316.80        20.89
  128         22.0             212                9.7       29.64        678.40        22.88
  256         43.8             428                9.8       59.14       1369.60        23.16
  512         92.4             886                9.6      124.78       2835.20        22.72
  768        144.6            1292                8.9      195.15       4134.40        21.19
 1024        279.7            1807                6.5      377.55       5782.40        15.32
Alignment Database: Swissprot (Aug 2008), containing 392,768 sequences. GSW vs SSEARCH.
Results
From Software Perspective
• A kernel is an SPMD-style program
• Programmer organizes threads into thread blocks
• A kernel consists of a grid of one or more blocks
• A thread block is a group of concurrent threads that can cooperate amongst themselves through
   – barrier synchronization
   – a per-block private shared memory space
• Programmer specifies
   – number of blocks
   – number of threads per block
• Each thread block = a virtual multiprocessor
   – each thread has a fixed allocation of registers
   – each block has a fixed allocation of per-block shared memory
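To make the above concrete, a minimal CUDA sketch (array names, the block size of 256, and the doubling operation are illustrative assumptions, not from the slides): the launch specifies the number of blocks and threads per block, and threads within a block cooperate through shared memory and a barrier.

// Each block loads a tile of the input into per-block shared memory,
// synchronizes, then each thread works on its element.
__global__ void scale_kernel(const float *in, float *out, int n)
{
    __shared__ float tile[256];                    // per-block private shared memory
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        tile[threadIdx.x] = in[idx];
    __syncthreads();                               // barrier: the whole block waits here
    if (idx < n)
        out[idx] = 2.0f * tile[threadIdx.x];
}

// Host side: programmer chooses the grid (number of blocks)
// and the block size (threads per block).
// int threadsPerBlock = 256;
// int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
// scale_kernel<<<blocks, threadsPerBlock>>>(d_in, d_out, n);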
Efficiency Considerations
• Avoid execution divergence
   – divergence occurs when threads within a warp follow different execution paths
   – divergence between warps is OK
• Load a block of data into the SM, process it there, and then write the final result back out to external memory
• Coalesce memory accesses
   – access consecutive words instead of using gather/scatter (see the sketch below)
• Create enough parallel work
   – 5K to 10K threads
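A minimal sketch of the coalescing point (kernel and array names are placeholders): consecutive threads reading consecutive words let the hardware combine a warp's loads into a few wide memory transactions, while large-stride access does not.

// Coalesced: thread i reads element i, so consecutive threads in a warp
// touch consecutive words and their loads combine into wide transactions.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Uncoalesced (strided gather): consecutive threads touch words far apart,
// so each load tends to become its own memory transaction.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];   // caller must provide at least n * stride input elements
}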