NVIDIA GPU Architecture for General Purpose Computing
Anthony Lippert, 4/27/09

Source: William & Mary, kemper/cs654/slides/nvidia.pdf

Transcript
Page 1

NVIDIA GPU Architecture

for General Purpose Computing

Anthony Lippert4/27/09

Page 2

Outline

Introduction
GPU Hardware
Programming Model
Performance Results
Supercomputing Products
Conclusion

Page 3

Introduction

GPU: Graphics Processing Unit
Hundreds of cores
Programmable
Can be easily installed in most desktops
Similar price to a CPU
GPU performance follows Moore's Law better than CPU performance

Page 4

Introduction

Motivation:

Page 5

GPU Hardware

Multiprocessor Structure:

Page 6

GPU Hardware

Multiprocessor Structure:
N multiprocessors with M cores each
SIMD: cores share an instruction unit with the other cores in a multiprocessor
Diverging threads may not execute in parallel

Page 7

GPU Hardware

Memory Hierarchy:
Processors have 32-bit registers
Multiprocessors have shared memory, a constant cache, and a texture cache
The constant and texture caches are read-only and have faster access than shared memory

Page 8

GPU Hardware

NVIDIA GTX 280 Specifications:
933 GFLOPS peak performance
10 thread processing clusters (TPCs)
3 multiprocessors per TPC
8 cores per multiprocessor
16,384 registers per multiprocessor
16 KB shared memory per multiprocessor
64 KB constant cache per multiprocessor
6-8 KB texture cache per multiprocessor
1.3 GHz clock rate
Single- and double-precision floating-point support
1 GB GDDR3 dedicated memory

Page 9

GPU Hardware

[Block diagram: thread scheduler, thread processing clusters, Atomic/Tex L2 cache, memory]

Page 10

GPU Hardware

Thread Scheduler:
Hardware-based
Manages scheduling of threads across the thread processing clusters
Nearly 100% utilization: if a thread is waiting on a memory access, the scheduler can perform a zero-cost, immediate context switch to another thread
Up to 30,720 threads on the chip

Page 11

GPU Hardware

Thread Processing Cluster:
IU - instruction unit
TF - texture filtering

Page 12

GPU Hardware

Atomic/Tex L2:
Level 2 cache, shared by all thread processing clusters
Atomic:
− Ability to perform read-modify-write operations to memory
− Allows granular access to memory locations
− Provides parallel reductions and parallel data structure management

Page 13

GPU Hardware

Page 14

GPU Hardware

GT200 Power Features:
Dynamic power management; power consumption is based on utilization
− Idle/2D power mode: 25 W
− Blu-ray DVD playback mode: 35 W
− Full 3D performance mode: 236 W worst case
− HybridPower mode: 0 W
On an nForce motherboard, when not performing computation, the GPU can be powered off and work can be diverted to the motherboard GPU (mGPU)

Page 15

GPU Hardware

10 thread processing clusters (TPCs)
3 multiprocessors per TPC
8 cores per multiprocessor
ROP - raster operation processors (for graphics)
1024 MB frame buffer for displaying images
Texture (L2) cache

Page 16

Programming Model

Past: The GPU was intended for graphics only, not general purpose computing. The programmer needed to rewrite the program in a graphics language, such as OpenGL. Complicated.
Present: NVIDIA developed CUDA, a language for general purpose GPU computing. Simple.

Page 17

Programming Model

CUDA: Compute Unified Device Architecture
Extension of the C language
Used to control the device
The programmer specifies CPU and GPU functions
− The host code can be C++
− Device code may only be C
The programmer specifies thread layout

Page 18

Programming Model

Thread Layout:
Threads are organized into blocks
Blocks are organized into a grid
A multiprocessor executes one block at a time
A warp is the set of threads executed in parallel
32 threads in a warp

Page 19

Programming Model

Heterogeneous Computing:
− The GPU and CPU execute different types of code
− The CPU runs the main program, sending tasks to the GPU in the form of kernel functions
− Multiple kernel functions may be declared and called
− Only one kernel may be called at a time

Page 20

Programming Model: GPU vs. CPU Code

D. Kirk. Parallel Computing: What has changed lately? Supercomputing, 2007

Page 21

Performance Results

Page 22

Supercomputing Products

Tesla C1060 GPU: 933 GFLOPS
nForce motherboard
Tesla S1070 blade: 4.14 TFLOPS

Page 23

Supercomputing Products

Tesla C1060:
Similar to the GTX 280
No video connections
933 GFLOPS peak performance
4 GB GDDR3 dedicated memory
187.8 W max power consumption

Page 24

Supercomputing Products

Tesla S1070:
Server blade
4.14 TFLOPS peak performance
Contains 4 Tesla GPUs (960 cores)
16 GB GDDR3
408 GB/s bandwidth
800 W max power consumption

Page 25

Conclusion

SIMD execution causes problems when threads diverge
GPU computing is a good choice for fine-grained, data-parallel programs with limited communication
GPU computing is not as good for coarse-grained programs with a lot of communication
The GPU has become a co-processor to the CPU

Page 26

References

D. Kirk. Parallel Computing: What has changed lately? Supercomputing, 2007.
NVIDIA. NVIDIA GeForce GTX 200 GPU Architectural Overview. May 2008.
NVIDIA. NVIDIA CUDA Programming Guide 2.1. 2008.
nvidia.com