Exploiting Parallelism on GPUs

Matt Mukerjee, David Naylor

Parallelism on GPUs
• $100 NVIDIA video card: 192 cores
– (Build Blacklight for ~$2000 ???)
• Incredibly low power
• Ubiquitous
• Question: Use for general computation?
– General Purpose GPU (GPGPU)

GPU Hardware
• Very specific constraints
– Designed to be SIMD (e.g. shaders)
– Zero-overhead thread scheduling
– Little caching (compared to CPUs)
• Constantly stalled on memory access
• MASSIVE # of threads / core
• Much finer-grained threads (“kernels”)

CUDA Architecture

Thread Blocks
• GPUs are SIMD
• How does multithreading work?
• Threads that branch are halted, then run
• Single Instruction Multiple…?

CUDA is an SIMT architecture

• Single Instruction Multiple Thread
• Threads in a block execute the same instruction

Multi-threaded Instruction Unit

Observation
Fitting the data structures needed by the threads in one multiprocessor requires application-specific tuning.
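A rough way to see why this tuning is application-specific is a back-of-the-envelope fit check. The sketch below is plain Python, not GPU code; the 16 KB on-chip capacity and the helper names are assumptions for illustration and do not come from the slides.

```python
# Assumed on-chip memory per multiprocessor (illustrative value only).
SHARED_MEM_BYTES = 16 * 1024


def fits(threads_per_block, bytes_per_thread, shared_mem_bytes=SHARED_MEM_BYTES):
    """True if the block's combined working set fits in on-chip memory."""
    return threads_per_block * bytes_per_thread <= shared_mem_bytes


def max_threads_that_fit(bytes_per_thread, shared_mem_bytes=SHARED_MEM_BYTES):
    """Largest thread count whose combined working set still fits."""
    return shared_mem_bytes // bytes_per_thread


if __name__ == "__main__":
    # Doubling the per-thread footprint halves the usable thread count.
    for per_thread in (64, 128, 256):
        print(per_thread, max_threads_that_fit(per_thread), fits(256, per_thread))
```

Under these assumed numbers, the per-thread footprint (which depends on the application's data structures) directly limits how many threads one multiprocessor can host.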

Example: MapReduce on CUDA

Too big for cache on one SM!

Problem
Only one code branch within a block executes at a time.
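The cost of this can be illustrated with a toy CPU-side model of one block. This is a sketch in plain Python, not real GPU code; the function name `run_block` and the example branch (`v * 2` versus `v + 1`) are made up for illustration. Each "pass" stands for one branch path being issued to the whole block while the threads on the other path sit idle.

```python
def run_block(values, threshold=0):
    """Model one block executing: if v > threshold: v * 2, else: v + 1.

    Returns the per-thread results and the number of passes (branch paths)
    the block had to execute one after the other.
    """
    taken = [v > threshold for v in values]  # per-thread branch outcome
    out = list(values)
    passes = 0

    # Pass 1: threads on the 'then' path run; the others are halted.
    if any(taken):
        passes += 1
        for i, t in enumerate(taken):
            if t:
                out[i] = values[i] * 2

    # Pass 2: the previously halted threads run the 'else' path.
    if not all(taken):
        passes += 1
        for i, t in enumerate(taken):
            if not t:
                out[i] = values[i] + 1

    return out, passes


if __name__ == "__main__":
    print(run_block([1, 2, 3, 4]))    # all threads agree: one pass
    print(run_block([1, -2, 3, -4]))  # threads diverge: two passes
```

In this model a block whose threads all take the same path needs one pass, while a block that diverges needs two, even though each thread does the same amount of useful work.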

Enhancing SIMT

Problem
If two multiprocessors share a cache line, there are more memory accesses than necessary.

Data Reordering
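One way to picture the effect of reordering is to count how many cache lines each block touches before and after the data is laid out in access order. The following is a toy model in plain Python, not the method from the slides; the function names, the block size, and the line size are all assumptions chosen for illustration.

```python
def line_fetches(accesses, threads_per_block, elems_per_line):
    """Total cache-line fetches, counting each line once per block that touches it.

    accesses[i] is the element index read by thread i; consecutive groups of
    threads_per_block threads are assumed to run on separate multiprocessors.
    """
    total = 0
    for start in range(0, len(accesses), threads_per_block):
        block = accesses[start:start + threads_per_block]
        total += len({idx // elems_per_line for idx in block})
    return total


def reorder(accesses):
    """Relocate data in first-access order; return each thread's new element index."""
    new_pos = {}
    for idx in accesses:
        new_pos.setdefault(idx, len(new_pos))
    return [new_pos[idx] for idx in accesses]


if __name__ == "__main__":
    accesses = [0, 4, 1, 5, 2, 6, 3, 7]  # two blocks of four threads, interleaved reads
    print(line_fetches(accesses, 4, 4))           # both blocks touch both lines
    print(line_fetches(reorder(accesses), 4, 4))  # each block touches its own line
```

In this example both blocks initially pull in the same two cache lines (four fetches in total); after reordering, each block reads one private line (two fetches in total).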
