GPU and PC System Architecture UC Santa Cruz BSoE – March 2009 John Tynefield / NVIDIA Corporation.

My Goals

Survey history and direction of GPU/PC system architecture

Demonstrate the process of system level architectural problem solving

Motivate some of you to become architects

Disclaimers

I work for NVIDIA

Public Info

All numbers and dates approximate

Rounding is our friend

No bus/processor is 100% efficient, etc, etc

All examples are meant to be illustrative

Not comprehensive

“There were >40 gfx companies in 1995”

About Me

I love games and graphics

I love building things

Structure

Intro to PC and GPU Architecture

A Sampling of Architectures

1996 - Voodoo Graphics / Pentium

2000 - GeForce 256 / P3

2004 - GeForce 6800 / P4

2008 - GeForce GTX280 / Core2

Ideas for the future of the platform

What do architects do?

Impose structure on complex design problems

Make tradeoffs

Validate high risk design bets

Structure verification

Why this is a great time to be an Architect

Radical design mobility

I have contributed to 10 completely new processor designs, 7 of which shipped in millions of units.

Steep competition

Not for everybody

Changing the World…no…really!

Heterogeneous many-core computing is here to stay and it has changed the nature of computing

Design Tension

Fixed Function vs. Programmable

Scalar vs. Vector

Bandwidth vs. Latency

In Order vs. Out of Order

Limited vs. Unlimited ( virtualized ) resources

Technology Trends

CPUs get faster

GPUs get faster

Interconnects get faster

Memory gets faster

Memory gets denser

Latency increases

Feature load increases

Physics intrudes more and more

All at different rates

[Chart: years 1996, 2000, 2004, 2008 against a 0% to 30000% axis. Series: CPU Cores, CPU Interconnect BW, GPU Cores, GPU Interconnect BW, System Memory BW, GPU Memory BW]

The long time horizon

The Awesome ideas of now take 2+ years to reach market

Awesome depreciates rapidly

Predictable

Silicon Process Roadmap

PC Arch Roadmap

3rd Party Component Roadmap

Your capabilities and resources

Unpredictable

Market Shifts ( commodity prices, supply shocks )

3rd Party Strategic Errors ( os/platform/partner slips )

Innovative Competition ( N-way struggle for design initiative )

Ultra Simplified PC Anatomy

[Diagram: CPU, Core Logic, System Memory, GPU, GPU Memory]

Ultra Simplified GPU Anatomy

[Diagram: Host Logic, Processor (x3), DRAM MGMT (x3)]

Ultra Simplified GPU Anatomy (2)

[Diagram: Host Logic, Processor (x3), DRAM MGMT (x3), alongside the pipeline stages Geom Gather, GeomProc, TriangleProc, PixelProc, Z / Blend, Memory]

GPU Prehistory

1960s – 1970s

Single Purpose BIG IRON

E&S, GE, Lockheed, …

1980s – 1990s

General Purpose BIG IRON

Custom ASICs, Workstations

SGI, Sun, Intergraph, ..

1994

Maybe we can fit this on a single consumer add-in card?

Enabling Technologies in 1994

Fast consumer CPUs with floating point

Try 3D rendering in fixed point!

PCI

VGA and VESA

Id Software’s DOOM

Contract fabrication facilities offering 0.6 micron

ASIC design tools

1996 3dfx - Voodoo Graphics

PIO Programming Model

Pure Pipelined Graphics

Partial Triangle Setup – FP32

Fixed Point Integer Texture Mapping and Gouraud Shading

Z Buffer and Full OpenGL Blending

All at 1 PPC, all the time, with no caches

32-bit PCI - 0.09 GB/s

128-bit EDO 50 MHz DRAM - 0.8 GB/s
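As a rough sanity check on the DRAM figure above, the sketch below recomputes peak bandwidth from bus width and clock, then divides it by the pixel rate to get a memory budget per pixel at 1 PPC. It assumes the pixel pipeline is clocked at the same 50 MHz as the DRAM (the slide states only the memory clock), and the function names are illustrative only.

```python
def peak_bandwidth_gb_s(bus_bits: int, clock_mhz: float, transfers_per_clock: int = 1) -> float:
    """Peak bandwidth in GB/s (10**9 bytes/s) for a simple synchronous bus."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


def bytes_per_pixel(bandwidth_gb_s: float, pixel_clock_mhz: float, pixels_per_clock: float = 1.0) -> float:
    """Memory bytes available to each pixel if the pipeline never stalls."""
    pixels_per_second = pixel_clock_mhz * 1e6 * pixels_per_clock
    return bandwidth_gb_s * 1e9 / pixels_per_second


voodoo_dram = peak_bandwidth_gb_s(bus_bits=128, clock_mhz=50)
# Assumption: the pixel pipeline clock equals the 50 MHz DRAM clock.
budget = bytes_per_pixel(voodoo_dram, pixel_clock_mhz=50)
print(f"DRAM peak: {voodoo_dram:.1f} GB/s, budget: {budget:.0f} bytes/pixel")
```

With no caches, that per-pixel budget has to cover all texture, Z and color traffic, which is one way to read "1 PPC, all the time" as a bandwidth commitment.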

Voodoo Graphics System Architecture

[Diagram: pipeline stages Geom Gather, GeomProc, TriangleProc, PixelProc, Z / Blend, labeled CPU and GPU. Blocks: CPU, Core Logic, System Memory, FBI, FB Memory, TMU, TEX Memory]

Arch Decision – Triangle Setup

Target 3D Triangle with texture and Gouraud shading

3 * XYW RGBA ST = 72 bytes/triangle pre setup

32-bit PCI 33 MHz – 90 MB/s

1.25 M triangles / second speed of light ( 1M is magic )

Observe that post setup

3 * XY, WRGBAST start values + screen space derivatives + Area

76 bytes/triangle – 1.18M Tris ( still magic )

Setup can be coded on Pentium in ~100 clocks

1M triangles on P100 ( mktg happy )

Data-limited setup on chip - >10% die cost

Typical game scenes <<1000 triangles/frame
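The arithmetic behind this decision can be replayed in a few lines. The sketch below uses only the figures quoted above; the per-vertex packing (X, Y, W, S, T as 4-byte values and RGBA packed into one 4-byte word) is an assumption, chosen because it reproduces the 72 bytes per triangle.

```python
PCI_BYTES_PER_SEC = 90e6   # 32-bit PCI at 33 MHz, as quoted above
PENTIUM_HZ = 100e6         # P100 clock
SETUP_CLOCKS = 100         # approximate CPU clocks to set up one triangle

# Assumed vertex packing (not stated in the talk): X, Y, W and S, T as
# 4-byte values, RGBA packed into a single 4-byte word.
VERTEX_BYTES = 3 * 4 + 4 + 2 * 4          # 24 bytes
PRE_SETUP_BYTES = 3 * VERTEX_BYTES        # 72 bytes per triangle
POST_SETUP_BYTES = 76                     # taken directly from the talk


def bus_limited_rate(bytes_per_triangle: float) -> float:
    """Triangles per second if the PCI bus is the only bottleneck."""
    return PCI_BYTES_PER_SEC / bytes_per_triangle


pre_setup_rate = bus_limited_rate(PRE_SETUP_BYTES)
post_setup_rate = bus_limited_rate(POST_SETUP_BYTES)
cpu_setup_rate = PENTIUM_HZ / SETUP_CLOCKS

for label, rate in [
    ("PCI, pre-setup data", pre_setup_rate),
    ("PCI, post-setup data", post_setup_rate),
    ("P100 setup in software", cpu_setup_rate),
]:
    print(f"{label:<24}{rate / 1e6:5.2f} M triangles/s")
```

All three rates land at or above 1M triangles per second, so sending post-setup data costs little bus-limited throughput. That is consistent with weighing CPU-side setup against the >10% die cost of doing it on chip.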

2000 Nvidia GeForce 256

Decoupled input queuing

Hardware Transform & Lighting

FP32 FF Transform

FP22 FF Lighting

Complex fixed function pixel shading

4 Pipelines

AGP4X – 1.06 GB/s

256-bit DDR 300 MHz Memory – 19.2 GB/s

GeForce 256 System Architecture

[Diagram: pipeline stages Geom Gather, GeomProc, TriangleProc, PixelProc, Z / Blend, labeled CPU and GPU. Blocks: CPU, Core Logic, System Memory, GPU, GPU Memory]

Architecture Detail – Combiners

Logical fixed function extension of OpenGL Machine

Surface Color = Diffuse * Texture + Specular

Diffuse Color

Texture

Specular

Multi Texture

If one texture is good, more are better

Diff * ( Tex1 + Tex2 ) + Spec, or Diff * Tex1 * Tex2, or …

[Diagram: combiner inputs Diffuse Color, Texture, Texture2 and Specular, with constants 0.0 and 1.0]

Combiners

Cascading Mux / SOP / Mux / SOP pipeline

Very flexible, harder to program with deeper nesting

Everything is full speed!

[Diagram: A MUX and B MUX feed the AB Partial; C MUX and D MUX feed the CD Partial; the results become Inputs for Next Stage of Pipeline. Input labels: Texture, Fog, Light]
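One way to picture the cascading mux / sum-of-products pipeline is as a tiny program in which each stage selects four inputs and computes A*B + C*D. The sketch below is a scalar, single-channel model written for illustration; the register names, the `run_combiners` helper and the program encoding are hypothetical and do not describe the actual hardware interface.

```python
def run_combiners(inputs: dict, program: list) -> float:
    """Run a cascade of combiner stages.

    Each stage is (dest, a, b, c, d): four muxes select registers by name,
    the stage computes A*B + C*D, and the result is written to `dest` so
    that later stages can select it as an input.
    """
    regs = dict(inputs, zero=0.0, one=1.0)  # constants are always selectable
    result = 0.0
    for dest, a, b, c, d in program:
        result = regs[a] * regs[b] + regs[c] * regs[d]
        regs[dest] = result
    return result


# Diff * (Tex1 + Tex2) + Spec, expressed as two stages.
ADD_TEXTURES = [
    ("t", "tex1", "one", "tex2", "one"),    # t = tex1*1 + tex2*1
    ("out", "diff", "t", "spec", "one"),    # out = diff*t + spec*1
]

# Diff * Tex1 * Tex2, also two stages.
MODULATE = [
    ("t", "tex1", "tex2", "zero", "zero"),  # t = tex1*tex2
    ("out", "diff", "t", "zero", "zero"),   # out = diff*t
]

pixel = {"diff": 0.5, "tex1": 0.25, "tex2": 0.5, "spec": 0.125}
added = run_combiners(pixel, ADD_TEXTURES)
modulated = run_combiners(pixel, MODULATE)
print(added, modulated)
```

Both formulas reduce to the same stage shape with different mux selections, which illustrates the flexibility; the growing program length for deeper expressions hints at why nesting gets harder to program.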

Programmable Shading

But the future was obviously Renderman-like shaders

normal surfaceN;
color C = { 1.0, 0.5, 0.0 };
normal lightDirection;
Ci = C * dot ( surfaceN, lightDirection );
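For readers who have not seen a shading language, the shader fragment above can be mirrored in plain Python. This is a sketch that follows the fragment literally: it assumes both vectors are already unit length and, like the fragment, does not clamp negative values. The function names are illustrative.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def shade(surface_n, light_direction, color=(1.0, 0.5, 0.0)):
    """Ci = C * dot(surfaceN, lightDirection), applied per color channel.

    Assumes unit-length inputs; no clamping, matching the fragment above.
    """
    intensity = dot(surface_n, light_direction)
    return tuple(channel * intensity for channel in color)


facing = shade((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))   # light along the normal
grazing = shade((0.0, 0.0, 1.0), (0.0, 0.6, 0.8))  # light tilted away
print(facing, grazing)
```

The same computation runs once per pixel with different inputs, which is the kind of workload a programmable pixel pipeline has to execute.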

2004 Nvidia GeForce 6800

Fully general Vertex and Pixel ISA

6 Geometry Processors

16 Pixel Processors

Deep recirculating pipelines to hide latency

FP32 datapath end to end

AGP8X – 2.11 GB/s

256-bit 700 MHz GDDR3 – 44 GB/s

GeForce 6800 System Architecture

[Diagram: pipeline stages Geom Gather, GeomProc, TriangleProc, PixelProc, Z / Blend, plus Physics and AI and Scene Mgmt, labeled CPU and GPU. Blocks: CPU, Core Logic, System Memory, GPU, GPU Memory]

Architecture Decision – Tex/Shader Structure

Problem: Build a general programmable pipeline

Optimize for common workloads

TEX – BLEND – FOG

Common Game Shaders ( eg. Doom 3 )

Plan A – Uncoupled

Elegant

Small fundamental unit

Many “passes” for common shaders

[Diagram labels: TBF, TEX, MTH, TEX, BLND, BLND, Registers, Texture, Math]

Plan B - Coupled

Less Elegant

Larger Fundamental Unit

Single pass for common shaders

Good scaling for longer shaders

Big perf / area win given workloads

Not forward looking

[Diagram labels: Registers, Math, Texture, Math]

2008 - GeForce GTX280

Fully unified programmable architecture

240 instances of the same processor

IEEE FP32 and FP64

Gen2 PCIe – 8 GB/s

512-bit 1100 MHz GDDR3 – 144 GB/s
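Each of the four snapshots quotes one interconnect bandwidth and one GPU memory bandwidth. A short sketch using only those quoted figures shows how unevenly the two have grown. Normalizing to the 1996 values is an assumption made here for illustration, not necessarily how the talk's trend chart was built.

```python
# Bandwidths in GB/s as quoted in the talk (approximate by its own disclaimer).
generations = [
    # (year, GPU, interconnect GB/s, GPU memory GB/s)
    (1996, "Voodoo Graphics", 0.09, 0.8),
    (2000, "GeForce 256", 1.06, 19.2),
    (2004, "GeForce 6800", 2.11, 44.0),
    (2008, "GeForce GTX280", 8.0, 144.0),
]


def percent_of(value: float, baseline: float) -> float:
    """Value as a percentage of the baseline (the baseline itself is 100%)."""
    return 100.0 * value / baseline


_, _, base_bus, base_mem = generations[0]
rows = [
    (year, gpu, percent_of(bus, base_bus), percent_of(mem, base_mem), mem / bus)
    for year, gpu, bus, mem in generations
]

print(f"{'Year':<6}{'GPU':<18}{'Bus vs 1996':>12}{'Mem vs 1996':>13}{'Mem/Bus':>9}")
for year, gpu, bus_pct, mem_pct, ratio in rows:
    print(f"{year:<6}{gpu:<18}{bus_pct:>11.0f}%{mem_pct:>12.0f}%{ratio:>9.1f}")
```

On these figures local GPU memory stays roughly 9x to 21x faster than the link to the CPU in every generation, which is one concrete reading of "all at different rates".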

GeForce GTX280 System Architecture

[Diagram: pipeline stages Geom Gather, GeomProc, TriangleProc, PixelProc, Z / Blend, plus Physics and AI and Scene Mgmt, labeled CPU and GPU. Blocks: CPU, Core Logic, System Memory, GPU, GPU Memory]

Architecture Decision – Heterogeneous Computing Support

Build a bigger Chip

Radically improve ability of GPU to share work with the CPU

[Diagram: Thread, Register File, Local Memory; Block, Shared Memory; Grid 0, Grid 1, . . . (sequential grids in time); Global Memory]

Heterogeneous Computing Support

Add Efficient Thread Launching

Add General Load / Store Instructions and Datapath

Add Shared Memory

Add computational loads to performance design requirements
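The thread, block and grid labels, together with shared memory and general load/store, can be illustrated with a small emulation. The sketch below is plain sequential Python, not CUDA: `launch` and `block_sum` are hypothetical names, and the threads of a block, which real hardware runs concurrently, are stepped through in a loop.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Emulate one grid: run the kernel once per block, each with its own shared memory."""
    for block_idx in range(grid_dim):
        shared = [0.0] * block_dim  # per-block shared memory
        kernel(block_idx, block_dim, shared, *args)


def block_sum(block_idx, block_dim, shared, src, dst):
    """Each block sums its slice of `src` (global memory) into dst[block_idx]."""
    # Phase 1: every thread loads one element from global into shared memory.
    for thread_idx in range(block_dim):
        i = block_idx * block_dim + thread_idx  # global element index
        shared[thread_idx] = src[i] if i < len(src) else 0.0
    # On real hardware a barrier would separate the two phases.
    # Phase 2: one thread reduces the shared values and stores the result.
    dst[block_idx] = sum(shared)


data = [float(i) for i in range(1, 11)]    # global memory holding 1..10
BLOCK = 4
blocks = (len(data) + BLOCK - 1) // BLOCK  # enough blocks to cover the data

partials = [0.0] * blocks
launch(block_sum, blocks, BLOCK, data, partials)  # grid 0: per-block sums
total = [0.0]
launch(block_sum, 1, BLOCK, partials, total)      # grid 1: sum of the partials
print(partials, total)
```

The second launch consumes what the first one wrote to global memory, mirroring the "sequential grids in time" label: blocks cooperate through shared memory within a grid, and grids communicate through global memory.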

Future Graphics Directions

Higher density

Higher refresh

Higher dynamic range

Ubiquity

Lower Power

Shaving off the last burrs

Global Illumination

Higher quality modeling

Virtualized resources at interactive rates

Future PC Architecture Directions

Highly Integrated – Low Cost

Require a minimum visual feature set

Web/video/run today’s apps

And everyone else

Differentiated PCs

More bandwidth and more parallel horsepower

More mature unified programming models

C on CUDA

DX11

OpenCL

More resource virtualization

Q & A
