Exploiting Parallelism on GPUs Matt Mukerjee David Naylor

Feb 22, 2016

Transcript
Page 1: Exploiting Parallelism on GPUs

Exploiting Parallelism on GPUs

Matt Mukerjee, David Naylor

Page 2: Exploiting Parallelism on GPUs

Parallelism on GPUs
• $100 NVIDIA video card: 192 cores
  – (Build Blacklight for ~$2000 ???)
• Incredibly low power
• Ubiquitous
• Question: Use for general computation?
  – General Purpose GPU (GPGPU)


Page 3: Exploiting Parallelism on GPUs

GPU Hardware
• Very specific constraints
  – Designed to be SIMD (e.g. shaders)
  – Zero-overhead thread scheduling
  – Little caching (compared to CPUs)
• Constantly stalled on memory access
• MASSIVE # of threads / core
• Much finer-grained threads (“kernels”)
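
The link between "constantly stalled on memory access" and "MASSIVE # of threads / core" can be illustrated with a toy model. The sketch below is not from the slides: it assumes each thread alternates `compute_cycles` of work with a `memory_latency`-cycle stall, and that the core switches to any ready thread at zero cost. The function names and all numbers are hypothetical.

```python
def core_utilization(num_threads, compute_cycles, memory_latency):
    """Fraction of cycles a core spends computing in a toy round-robin model.

    Assumption: every thread repeats (compute_cycles of work, then a
    memory_latency-cycle stall), and switching threads costs nothing.
    """
    period = compute_cycles + memory_latency   # one thread's full cycle
    busy = num_threads * compute_cycles        # work available per period
    return min(1.0, busy / period)


def threads_to_hide_latency(compute_cycles, memory_latency):
    """Smallest thread count that keeps the core busy in this model."""
    period = compute_cycles + memory_latency
    return -(-period // compute_cycles)        # ceiling division
```

With 4 compute cycles per 396-cycle stall (made-up numbers), a single thread keeps the core about 1% busy, and this model needs 100 threads to keep it fully busy.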

Page 4: Exploiting Parallelism on GPUs

CUDA Architecture

Page 5: Exploiting Parallelism on GPUs

Thread Blocks
• GPUs are SIMD
• How does multithreading work?
• Threads that branch are halted, then run
• Single Instruction Multiple...?

Page 6: Exploiting Parallelism on GPUs

CUDA is an SIMT architecture
• Single Instruction Multiple Thread
• Threads in a block execute the same instruction

[Figure label: Multi-threaded Instruction Unit]
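
A minimal sketch of the SIMT idea, written in Python rather than CUDA so it runs anywhere: one instruction stream is issued, and each issued instruction is applied to every thread's private register. The tiny instruction set (`load_tid`, `mul`, `add`) and the name `run_simt` are invented for illustration. On NVIDIA hardware the lockstep unit is the warp (32 threads), a subdivision of a block.

```python
def run_simt(program, num_threads):
    """Issue each instruction once; apply it to every thread's register."""
    regs = [0] * num_threads            # one private register per thread
    issued = 0
    for op, arg in program:             # a single shared instruction stream
        issued += 1
        for tid in range(num_threads):  # hardware does this loop in parallel
            if op == "load_tid":
                regs[tid] = tid
            elif op == "mul":
                regs[tid] *= arg
            elif op == "add":
                regs[tid] += arg
            else:
                raise ValueError(f"unknown op: {op}")
    return regs, issued


program = [("load_tid", None), ("mul", 2), ("add", 1)]
regs, issued = run_simt(program, num_threads=8)
```

Eight threads each compute `2 * tid + 1`, yet only three instructions are issued: the instruction unit is shared, the data is not.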

Page 7: Exploiting Parallelism on GPUs

Observation
Fitting the data structures needed by the threads in one multiprocessor requires application-specific tuning.
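
As a worked example of the kind of tuning this implies, the sketch below computes how many threads can be resident on one multiprocessor when each thread needs a fixed amount of fast on-chip memory. The 48 KiB budget, the per-thread sizes, and the function name are placeholders, not figures from the slides; real limits depend on the GPU.

```python
def max_resident_threads(on_chip_bytes, bytes_per_thread, threads_per_block):
    """Threads that fit on one multiprocessor, in whole blocks.

    Assumption: on-chip memory is the only limit and each thread needs
    bytes_per_thread of it; real GPUs also cap registers and block counts.
    """
    bytes_per_block = bytes_per_thread * threads_per_block
    blocks = on_chip_bytes // bytes_per_block   # whole blocks only
    return blocks * threads_per_block


BUDGET = 48 * 1024  # hypothetical on-chip memory per multiprocessor, in bytes

small = max_resident_threads(BUDGET, bytes_per_thread=32, threads_per_block=256)
large = max_resident_threads(BUDGET, bytes_per_thread=256, threads_per_block=256)
```

Under these assumptions, 32 bytes per thread lets 6 blocks (1536 threads) fit, while 256 bytes per thread means not even one 256-thread block fits, so the block size or the data layout has to change.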

Page 8: Exploiting Parallelism on GPUs

Example: MapReduce on CUDA

Too big for cache on one SM!

Page 9: Exploiting Parallelism on GPUs

Problem
Only one code branch within a block executes at a time.
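
The cost of this can be sketched with a toy model (Python, with invented names such as `run_block`): a block runs each distinct branch in its own pass, masking off the threads that took a different branch. This is an illustration of the stated problem, not a description of any particular GPU's scheduler.

```python
def run_block(inputs, branch_of, bodies):
    """Execute one block; threads on different branches run in separate passes.

    branch_of(x) names the branch a thread takes; bodies maps each branch
    name to the function run on that branch. Both are illustrative.
    """
    outputs = [None] * len(inputs)
    taken = [branch_of(x) for x in inputs]
    passes = 0
    for branch in sorted(set(taken)):          # one branch at a time
        passes += 1
        for tid, x in enumerate(inputs):
            if taken[tid] == branch:           # active mask for this pass
                outputs[tid] = bodies[branch](x)
    return outputs, passes


bodies = {"even": lambda x: x // 2, "odd": lambda x: 3 * x + 1}
parity = lambda x: "even" if x % 2 == 0 else "odd"

mixed_out, mixed_passes = run_block([1, 2, 3, 4], parity, bodies)
uniform_out, uniform_passes = run_block([2, 4, 6, 8], parity, bodies)
```

The mixed block needs two passes and the uniform block one, so with two equally long branches the divergent block takes roughly twice as long in this model.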

Page 10: Exploiting Parallelism on GPUs

Enhancing SIMT

Page 11: Exploiting Parallelism on GPUs

Problem
If two multiprocessors share a cache line, there are more memory accesses than necessary.

Page 12: Exploiting Parallelism on GPUs

Data Reordering
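
The slide gives only a title, so the following is one plausible reading rather than the authors' method: if each multiprocessor's elements are scattered through memory, several multiprocessors fetch the same cache line, and reordering the data so that each multiprocessor's elements are contiguous removes the shared lines. The sketch counts line fetches before and after. `LINE_ELEMS`, the owner assignment, and the function names are assumptions made for illustration.

```python
LINE_ELEMS = 4  # hypothetical: elements per cache line


def line_fetches(owners):
    """Count (multiprocessor, cache line) pairs, i.e. line fetches needed.

    owners[i] is the multiprocessor that reads the element at address i.
    """
    pairs = {(mp, addr // LINE_ELEMS) for addr, mp in enumerate(owners)}
    return len(pairs)


def reorder(data, owners):
    """Stable-sort elements by owner so each owner's data is contiguous."""
    order = sorted(range(len(data)), key=lambda i: owners[i])
    return [data[i] for i in order], [owners[i] for i in order]


data = list("abcdefgh")
owners = [0, 1, 0, 1, 0, 1, 0, 1]   # interleaved between two multiprocessors

before = line_fetches(owners)
new_data, new_owners = reorder(data, owners)
after = line_fetches(new_owners)
```

In this example both multiprocessors touch both cache lines before reordering (4 fetches); afterwards each touches only its own line (2 fetches).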