GPU-based Computing

Page 1: GPU-based Computing

Page 2: Tesla C870 GPU

• 8 KB constant cache per multiprocessor
• 8 KB texture cache per multiprocessor
• 16 KB shared memory per multiprocessor
• 8192 registers per multiprocessor
• 1.5 GB device memory per GPU
• up to 768 threads per multiprocessor (at full occupancy, that is roughly 16 KB / 768 ≈ 21 bytes of shared memory and 8192 / 768 ≈ 10 registers per thread)

Page 3: Tesla C870 Threads

Page 4: Example

B blocks of T threads

Which element am I processing?
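In CUDA, each thread answers this from the built-in block and thread indices; a minimal sketch (the kernel name and the per-element work are illustrative, not from the slides):

    // Launched as example_kernel<<<B, T>>>(data): B blocks of T threads.
    __global__ void example_kernel(float *data)
    {
        // Global element index: block offset plus position within the block.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = 2.0f * data[i];   // illustrative per-element work
    }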

Page 5: Case Study: Irregular Terrain Model

• each profile is 1 KB
• 16×16 threads
• 128×16 threads (45 registers/thread)
• 192×16 threads (37 registers/thread)
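A plausible reading of these configurations, assuming the C870's 8192 registers per multiprocessor: at 45 registers per thread, a 192-thread block would need 192 × 45 = 8640 registers and cannot fit (128 × 45 = 5760 does), whereas a kernel reduced to 37 registers per thread needs only 192 × 37 = 7104, so the larger blocks become schedulable.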

Page 6: GPU Strategies for ITM

Page 7: GPU Strategies for ITM (continued)

Page 8: OpenMP vs CUDA

#pragma omp parallel for shared(A) private(i, j, value)
for (i = 0; i < 32; i++) {
    for (j = 0; j < 32; j++)
        value = some_function(A[i][j]);
}
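For comparison, a hedged sketch of a CUDA counterpart to the same loop nest (the kernel name and launch shape are illustrative; the slides do not show this code):

    // One thread per (i, j) element replaces the doubly nested loop;
    // some_function would need to be a __device__ function, and results
    // are stored per element rather than overwriting a single scalar.
    __global__ void compute(const float (*A)[32], float *value)
    {
        int i = blockIdx.x;     // 32 blocks, one per row
        int j = threadIdx.x;    // 32 threads per block, one per column
        value[i * 32 + j] = some_function(A[i][j]);
    }

    // Host side: compute<<<32, 32>>>(d_A, d_value);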

Page 9: Easy to Learn, Takes Time to Master

Page 10: For further discussion and additional information

Page 11: Tesla C870 GPU vs. CELL BE

[Chart: ITM Execution Time for 256K Profiles; y-axis: execution time (sec), 0 to 20]
GPU 1.35 GHz: 0.138 s
CELL BE 2.8 GHz: 2.067 s
GPP 2.4 GHz: 17.729 s

3072 threads on GPU

15x faster than CELL BE, 128x faster than GPP

Tesla C1060: 30 cores and double the registers ($1,500), with 8x the GFLOPS of an Intel Xeon W5590 quad core ($1,600)

Page 12: Fermi: Next Generation GPU

• 32 cores per multiprocessor, up to 1536 threads per multiprocessor
• No complex mechanisms
– speculation, out-of-order execution, superscalar issue, instruction-level parallelism, branch predictors
• Column access to memory
– faster search and read
• First GPU with error-correcting code (ECC)
– registers, shared memory, caches, DRAM
• Languages supported
– C, C++, Fortran, Java, MATLAB, and Python
• IEEE 754-2008 double-precision floating point
– fused multiply-add operations
• Streaming and task switching (25 microseconds)
– launching multiple kernels (up to 16) simultaneously (see the stream sketch below)
• Visualization and computing
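Concurrent kernel launch is expressed through CUDA streams; a minimal sketch under assumed kernel names (not from the slides):

    __global__ void kernelA(float *x) { /* ... */ }
    __global__ void kernelB(float *y) { /* ... */ }

    void launch_concurrently(float *d_x, float *d_y)
    {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Kernels placed in different streams are eligible to overlap on Fermi.
        kernelA<<<64, 256, 0, s1>>>(d_x);
        kernelB<<<64, 256, 0, s2>>>(d_y);

        cudaDeviceSynchronize();   // wait for both streams to drain
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }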

Page 13: Sequence Alignment

H(i,j) = max { H(i-1,j-1) + S(i,j),  H(i-1,j) - G,  H(i,j-1) - G,  0 }    // gap penalty G = 10

Substitution Matrix

      A    B    C    D    F    W
A     5   -2   -1   -2   -3   -3
B    -2    5   -3   -3   -4   -5
C    -1   -3   13   -4   -2   -5
D    -2   -3   -4    8   -5   -5
F    -3   -4   -2   -5    8    4
W    -3   -5   -5   -5    4   15
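The recurrence above translates directly into a per-cell update. A C-style sketch under assumed names (sub() is a hypothetical helper that reads the substitution matrix; H is the score matrix stored row-major):

    // Score for one cell H[i][j] of the alignment matrix (gap penalty G = 10).
    int cell_score(const int *H, int width, int i, int j,
                   const char *query, const char *subject)
    {
        int diag = H[(i - 1) * width + (j - 1)] + sub(query[i], subject[j]);
        int up   = H[(i - 1) * width + j] - 10;
        int left = H[i * width + (j - 1)] - 10;

        int h = diag;
        if (up   > h) h = up;
        if (left > h) h = left;
        if (0    > h) h = 0;    // local alignment: scores never go negative
        return h;
    }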

Page 14: Cost Function

Previous methods only need space for a 26×26 substitution matrix:

      A    B    C    D    …    Z
A     5   -2   -1   -2    …   -3
B    -2    5   -3   -3    …   -5
C    -1   -3   13   -4    …   -5
D    -2   -3   -4    8    …   -5
…     …    …    …    …    …    …
Z    -3   -5   -5   -5    …   15

New cost function: a sorted substitution table computed from the substitution matrix, with the substitution characters across the top row and the query sequence down the column. Space needed is 23×(query length), but lookups do not use modulo arithmetic:

      A    R    N    D    …    X
Q    -1    1    0    0    …   -1
U    -1   -1   -1   -1    …   -1
E    -1    0    0    2    …   -1
R    -1    5    0   -2    …   -1
Y    -2   -2   -2   -3    …   -1
…     …    …    …    …    …    …
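One plausible way to precompute such a table (a sketch under assumptions; all names are illustrative): for each query position, copy the substitution-matrix row of that query residue, so the alignment inner loop indexes by query position and subject character code directly.

    #define ALPHA 23   // substitution alphabet size from the slide

    // blosum[a][b]: original substitution matrix, indexed by residue codes
    // table[i][b]:  new cost function, row i = scores of query residue i
    //               against every character code b
    void build_sorted_table(const signed char blosum[ALPHA][ALPHA],
                            const int *query_codes, int query_len,
                            signed char table[][ALPHA])
    {
        for (int i = 0; i < query_len; i++)
            for (int b = 0; b < ALPHA; b++)
                table[i][b] = blosum[query_codes[i]][b];
    }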

Page 15: Results

Protein    GPU (1.35 GHz)    SSEARCH (3.2 GHz)    Speedup    GPU Cycles    SSEARCH Cycles    Cycles
Length     Time (s)          Time (s)                        (Billions)    (Billions)        Ratio
    4           1.3              10                  8           1.69           32.00         18.96
    8           1.8              12                  6.8         2.38           38.40         16.11
   16           2.8              26                  9.2         3.83           83.20         21.71
   32           5.8              47                  8.1         7.79          150.40         19.30
   64          11.2              99                  8.8        15.17          316.80         20.89
  128          22.0             212                  9.7        29.64          678.40         22.88
  256          43.8             428                  9.8        59.14         1369.60         23.16
  512          92.4             886                  9.6       124.78         2835.20         22.72
  768         144.6            1292                  8.9       195.15         4134.40         21.19
 1024         279.7            1807                  6.5       377.55         5782.40         15.32

Alignment database: Swiss-Prot (Aug 2008), containing 392,768 sequences. GSW vs. SSEARCH.

Page 16: From Software Perspective

• A kernel is an SPMD-style program
• The programmer organizes threads into thread blocks
• A kernel executes as a grid of one or more blocks
• A thread block is a group of concurrent threads that can cooperate amongst themselves through
– barrier synchronization
– a per-block private shared memory space (a sketch follows this list)
• The programmer specifies
– the number of blocks
– the number of threads per block
• Each thread block = a virtual multiprocessor
– each thread has a fixed allocation of registers
– each block has a fixed allocation of per-block shared memory
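A minimal sketch of this cooperation (illustrative kernel, not from the slides): threads of one block stage data in shared memory and meet at a barrier before consuming each other's writes.

    // Assumes blockDim.x == 256 and the array length is a multiple of 256.
    __global__ void reverse_within_block(float *d)
    {
        __shared__ float tile[256];           // per-block private shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = d[i];             // each thread stages one element
        __syncthreads();                      // barrier: whole block has written

        // Now safe to read elements written by *other* threads in the block.
        d[i] = tile[blockDim.x - 1 - threadIdx.x];
    }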

Page 17: Efficiency Considerations

• Avoid execution divergence
– divergence occurs when threads within a warp follow different execution paths
– divergence between warps is fine
• Load a block of data into the SM, process it there, and then write the final result back out to external memory
• Coalesce memory accesses
– access consecutive words instead of gather/scatter (a sketch follows this list)
• Create enough parallel work
– 5K to 10K threads
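A sketch of the coalescing point (illustrative code, not from the slides): consecutive threads should touch consecutive words so a warp's loads merge into a few wide transactions.

    __global__ void scale(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];   // coalesced: thread k touches word k

        // Anti-pattern (gather/scatter): out[i] = 2.0f * in[i * 37];
        // neighboring threads would hit widely separated words and each
        // load becomes its own memory transaction.
    }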

Page 18: Most Important Factor

• number of thread blocks = “processor count”
• At the thread block (or SM) level, internal communication is cheap
• Focus on decomposing the work among the “p” thread blocks (a grid-stride sketch follows)
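One common way to decompose N items over a fixed count of p thread blocks is a grid-stride loop; a sketch (not from the slides):

    // Launch as process<<<p, T>>>(data, n): each thread strides through the
    // data, so the same p blocks cover any n without relaunching.
    __global__ void process(float *data, int n)
    {
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            data[i] = 2.0f * data[i];   // illustrative per-element work
    }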