GPU and PC System Architecture UC Santa Cruz BSoE – March 2009 John Tynefield / NVIDIA Corporation
Dec 14, 2015
My Goals
Survey history and direction of GPU/PC system architecture
Demonstrate the process of system level architectural problem solving
Motivate some of you to become architects
Disclaimers
I work for NVIDIA
Public Info
All numbers and dates approximate
Rounding is our friend
No bus/processor is 100% efficient, etc., etc.
All examples are meant to be illustrative
Not comprehensive
“There were >40 gfx companies in 1995”
About Me
I love games and graphics
I love building things
Structure
Intro to PC and GPU Architecture
A Sampling of Architectures
1996 - Voodoo Graphics / Pentium
2000 - GeForce 256 / P3
2004 - GeForce 6800 / P4
2008 - GeForce GTX280 / Core2
Ideas for the future of the platform
What do architects do?
Impose structure on complex design problems
Make tradeoffs
Validate high risk design bets
Structure verification
Why this is a great time to be an Architect
Radical design mobility
I have contributed to 10 completely new processor designs,
7 of which shipped in millions of units.
Steep competition
Not for everybody
Changing the World…no…really!
Heterogeneous many-core computing is here to stay and it has changed the nature of computing
Design Tension
Fixed Function vs. Programmable
Scalar vs. Vector
Bandwidth vs. Latency
In Order vs. Out of Order
Limited vs. Unlimited ( virtualized ) resources
Technology Trends
CPUs get faster
GPUs get faster
Interconnects get faster
Memory gets faster
Memory gets denser
Latency increases
Feature load increases
Physics intrudes more and more
All at different rates
[Chart: percentage growth (axis 0% to 30000%) over 1996, 2000, 2004, 2008. Series: CPU Cores, CPU Interconnect BW, GPU Cores, GPU Interconnect BW, System Memory BW, GPU Memory BW]
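The "all at different rates" point can be checked against figures quoted elsewhere in this deck. The sketch below takes the GPU interconnect and GPU memory bandwidths from the per-generation slides and computes each one's percentage increase over 1996. It assumes the chart plots growth relative to the 1996 value; the names `growth_percent`, `GPU_INTERCONNECT_GBPS` and `GPU_MEMORY_GBPS` are illustrative only.

```python
# Bandwidth figures in GB/s, as quoted on the per-generation slides.
YEARS = [1996, 2000, 2004, 2008]
GPU_INTERCONNECT_GBPS = [0.09, 1.06, 2.11, 8.0]  # PCI, AGP4X, AGP8X, PCIe Gen2
GPU_MEMORY_GBPS = [0.8, 19.2, 44.0, 144.0]       # EDO, DDR, GDDR3, GDDR3


def growth_percent(series):
    """Percentage increase of each entry over the first (1996) entry.

    Assumption: growth is measured relative to the 1996 baseline.
    """
    base = series[0]
    return [round((value / base - 1.0) * 100.0) for value in series]


for name, series in [("GPU interconnect BW", GPU_INTERCONNECT_GBPS),
                     ("GPU memory BW", GPU_MEMORY_GBPS)]:
    for year, pct in zip(YEARS, growth_percent(series)):
        print(f"{name:20s} {year}: +{pct}%")
```

On these rounded numbers, GPU memory bandwidth grows about twice as fast as the interconnect that feeds the GPU, which is one reason architects keep as much traffic as possible local to the GPU.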
The long time horizon
The Awesome ideas of now take 2+ years to reach market
Awesome depreciates rapidly
Predictable
Silicon Process Roadmap
PC Arch Roadmap
3rd Party Component Roadmap
Your capabilities and resources
Unpredictable
Market Shifts ( commodity prices, supply shocks )
3rd Party Strategic Errors ( OS/platform/partner slips )
Innovative Competition ( N-way struggle for design initiative )
Ultra Simplified PC Anatomy
[Diagram: CPU, Core Logic, GPU, GPU Memory, System Memory]
Ultra Simplified GPU Anatomy
[Diagram: Host Logic, Processor (x3), DRAM MGMT (x3)]
Ultra Simplified GPU Anatomy (2)
[Diagram: Host Logic, Processor (x3), DRAM MGMT (x3), Memory; pipeline stages Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend]
GPU Prehistory
1960s – 1970s
Single Purpose BIG IRON
E&S, GE, Lockheed, …
1980s – 1990s
General Purpose BIG IRON
Custom ASICs, Workstations
SGI, Sun, Intergraph, …
1994
Maybe we can fit this on a single consumer add-in card?
Enabling Technologies in 1994
Fast consumer CPUs with floating point
Try 3D rendering in fixed point!
PCI
VGA and VESA
Id Software’s DOOM
Contract Fabrication facilities offering 0.6 micron
ASIC design Tools
1996 3dfx - Voodoo Graphics
PIO Programming Model
Pure Pipelined Graphics
Partial Triangle Setup – FP32
Fixed Point Integer Texture Mapping and Gouraud Shading
Z Buffer and Full OpenGL Blending
All at 1 PPC, all the time, with no caches
32-bit PCI - 0.09 GB/s
128-bit EDO 50 MHz DRAM - 0.8 GB/s
Voodoo Graphics System Architecture
[Diagram: blocks CPU, Core Logic, System Memory, FBI, FB Memory, TMU, TEX Memory; pipeline stages Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend, marked GPU or CPU]
Arch Decision – Triangle Setup
Target 3D Triangle with texture and Gouraud shading
3 * XYW RGBA ST = 72 bytes/triangle pre setup
32-bit PCI 33 MHz – 90 MB/s
1.25 M triangles / second speed of light ( 1M is magic )
Observe that post setup
3 * XY WRGBAST start values + screen space derivatives + Area
76 bytes/triangle – 1.18M Tris ( still magic )
Setup can be coded on Pentium in ~100 clocks
1M triangles on P100 ( mktg happy )
Data-limited setup on chip - >10% die cost
Typical game scenes <<1000 triangles/frame
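The arithmetic behind this decision can be reproduced in a few lines. The sketch below is illustrative only: the per-vertex byte breakdown (32-bit floats for XYW and ST, RGBA packed into one 32-bit word) is an assumption chosen because it yields the slide's 72 bytes, and the function names are hypothetical.

```python
PCI_BYTES_PER_SEC = 90e6  # slide figure for 32-bit PCI at 33 MHz

# Assumption: XYW and ST are 32-bit floats, RGBA is one packed 32-bit word.
BYTES_PER_VERTEX = 3 * 4 + 4 + 2 * 4     # XYW + RGBA + ST = 24 bytes
PRE_SETUP_BYTES = 3 * BYTES_PER_VERTEX   # three vertices per triangle
POST_SETUP_BYTES = 76                    # slide figure, taken as given


def triangles_per_second(bus_bytes_per_sec, bytes_per_triangle):
    """Upper bound on triangle rate when the bus is the only limit."""
    return bus_bytes_per_sec / bytes_per_triangle


def cpu_setup_rate(cpu_hz, clocks_per_triangle):
    """Triangle setups per second if the CPU does nothing but setup."""
    return cpu_hz / clocks_per_triangle


pre = triangles_per_second(PCI_BYTES_PER_SEC, PRE_SETUP_BYTES)
post = triangles_per_second(PCI_BYTES_PER_SEC, POST_SETUP_BYTES)
p100 = cpu_setup_rate(100e6, 100)  # 100 MHz Pentium, ~100 clocks per setup

print(f"pre-setup bus limit : {pre / 1e6:.2f} M tris/s")
print(f"post-setup bus limit: {post / 1e6:.2f} M tris/s")
print(f"P100 setup rate     : {p100 / 1e6:.2f} M tris/s")
```

All three rates land at or above one million triangles per second, so sending post-setup data over PCI costs little bus headroom while avoiding the >10% die cost of on-chip setup.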
2000 Nvidia GeForce 256
Decoupled input queuing
Hardware Transform & Lighting
FP32 FF Transform
FP22 FF Lighting
Complex fixed function pixel shading
4 Pipelines
AGP4X – 1.06 GB/s
256 Bit DDR 300 MHz Memory – 19.2 GB/s
GeForce 256 System Architecture
[Diagram: blocks CPU, Core Logic, System Memory, GPU, GPU Memory; pipeline stages Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend, marked GPU or CPU]
Architecture Detail – Combiners
Logical fixed function extension of OpenGL Machine
Surface Color = Diffuse * Texture + Specular
[Diagram inputs: Diffuse Color, Texture, Specular]
Multi Texture
If one texture is good, more are better
Diff * ( Tex1 + Tex2 ) + Spec or Diff * Tex1 * Tex2 or …
[Diagram inputs, first configuration: Diffuse Color, Texture, 0.0, Texture, Specular]
[Diagram inputs, second configuration: Diffuse Color, Texture, Texture2, 1.0, Specular]
Combiners
Cascading Mux / SOP / Mux / SOP pipeline
Very flexible, harder to program with deeper nesting
Everything is full speed!
[Diagram: A MUX and B MUX produce the AB Partial; C MUX and D MUX produce the CD Partial; results are Inputs for Next Stage of Pipeline; sources include Texture, Fog, Light]
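A cascaded mux / sum-of-products stage can be modelled in a few lines. This is a minimal sketch, not the hardware's actual register set or programming interface: the register names (`one`, `stage0`), the per-channel clamp to 1.0, and the function `combiner_stage` are all assumptions made for illustration.

```python
def combiner_stage(registers, a, b, c, d):
    """One stage: four muxes pick inputs by name, then compute A*B + C*D.

    Colours are RGB tuples; the result is clamped to 1.0 per channel
    (the clamp is an assumption of this sketch).
    """
    A, B, C, D = (registers[name] for name in (a, b, c, d))
    return tuple(min(1.0, x * y + z * w) for x, y, z, w in zip(A, B, C, D))


registers = {
    "one": (1.0, 1.0, 1.0),
    "diffuse": (0.5, 0.5, 0.5),
    "texture": (0.5, 0.25, 0.0),
    "texture2": (0.25, 0.25, 0.25),
    "specular": (0.125, 0.125, 0.125),
}

# Single texture: Diffuse * Texture + Specular fits in one stage.
single = combiner_stage(registers, "diffuse", "texture", "specular", "one")

# Multi texture: Diff * (Tex1 + Tex2) + Spec needs two cascaded stages.
registers["stage0"] = combiner_stage(registers, "texture", "one", "texture2", "one")
multi = combiner_stage(registers, "diffuse", "stage0", "specular", "one")

print(single)
print(multi)
```

The second expression shows why deeper nesting is harder to program: the author has to split the formula into stages by hand and route each partial result through a mux into the next stage.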
Programmable Shading
But the future was obviously Renderman-like shaders
normal surfaceN; color C = { 1.0, 0.5, 0.0 }; normal lightDirection;
Ci = C * dot ( surfaceN, lightDirection );
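For comparison with the fixed-function combiners, the shader fragment above can be written as an ordinary function that runs once per shaded point. This is a rough Python rendering, not RenderMan or any GPU shading language; it assumes both vectors are already unit length, and the names `shade` and `dot` are illustrative.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def shade(surface_n, light_direction, color=(1.0, 0.5, 0.0)):
    """Ci = C * dot(surfaceN, lightDirection), applied per colour channel.

    Assumption: both vectors are normalized. No clamping is applied,
    matching the fragment on the slide.
    """
    n_dot_l = dot(surface_n, light_direction)
    return tuple(channel * n_dot_l for channel in color)


facing = shade((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))   # light along the normal
grazing = shade((0.0, 0.0, 1.0), (0.0, 1.0, 0.0))  # light perpendicular to it
print(facing, grazing)
```

The point of the contrast is that the program is written as an expression over named variables, with no manual assignment of terms to mux inputs.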
2004 Nvidia GeForce 6800
Fully general Vertex and Pixel ISA
6 Geometry Processors
16 Pixel Processors
Deep recirculating pipelines to hide latency
FP32 datapath end to end
AGP8X – 2.11 GB/s
256 Bit 700 MHz GDDR3 – 44 GB/s
GeForce 6800 System Architecture
[Diagram: blocks CPU, Core Logic, System Memory, GPU, GPU Memory; pipeline stages Physics and AI, Scene Mgmt, Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend, marked GPU or CPU]
Architecture Decision – Tex/Shader Structure
Problem: Build a general programmable pipeline
Optimize for common workloads
TEX – BLEND – FOG
Common Game Shaders ( eg. Doom 3 )
Plan A – Uncoupled
Elegant
Small fundamental unit
Many “passes” for common shaders
[Diagram: TBF; TEX, MTH, TEX, BLND, BLND; Registers, Texture, Math]
Plan B – Coupled
Less Elegant
Larger Fundamental Unit
Single pass for common shaders
Good scaling for longer shaders
Big perf / area win given workloads
Not forward looking
[Diagram: Registers, Math, Texture, Math]
2008 - GeForce GTX280
Fully unified programmable architecture
240 instances of the same processor
IEEE FP32 and FP64
Gen2 PCIe – 8 GB/s
512-bit 1100 MHz GDDR3 – 144 GB/s
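The memory bandwidth figures quoted for each generation follow from bus width times clock times transfers per clock. The sketch below recomputes them under two assumptions: the quoted clock is the base memory clock, and DDR/GDDR3 move data twice per clock while EDO moves it once. The function name `peak_bandwidth_gbps` is illustrative.

```python
def peak_bandwidth_gbps(bus_bits, clock_mhz, transfers_per_clock=1):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


# (label, bus width in bits, clock in MHz, transfers per clock)
memories = [
    ("1996 Voodoo Graphics, EDO", 128, 50, 1),
    ("2000 GeForce 256, DDR", 256, 300, 2),
    ("2004 GeForce 6800, GDDR3", 256, 700, 2),
    ("2008 GeForce GTX280, GDDR3", 512, 1100, 2),
]

for label, bits, mhz, per_clock in memories:
    gbps = peak_bandwidth_gbps(bits, mhz, per_clock)
    print(f"{label:28s} {gbps:6.1f} GB/s")
```

The computed peaks come out at 0.8, 19.2, 44.8 and 140.8 GB/s, close to the slide figures of 0.8, 19.2, 44 and 144 GB/s, which is consistent with the earlier disclaimer that all numbers are approximate.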
GeForce GTX280 System Architecture
[Diagram: blocks CPU, Core Logic, System Memory, GPU, GPU Memory; pipeline stages Physics and AI, Scene Mgmt, Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend, marked GPU or CPU]
Architecture Decision – Heterogeneous Computing Support
Build a bigger Chip
Radically improve ability of GPU to share work with the CPU
[Diagram: Thread with Register File and Local Memory; Block with Shared Memory; Grid 0, Grid 1, … as Sequential Grids in Time, all using Global Memory]
Computing Support
Add Efficient Thread Launching
Add General Load / Store Instructions and Datapath
Add Shared Memory
Add computational loads to performance design requirements
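The thread / block / grid hierarchy and its memory spaces can be illustrated with a sequential emulation. This is plain Python, not CUDA, and every name in it (`run_grid`, `block_sum_kernel`, the dictionary standing in for global memory) is hypothetical. It assumes a power-of-two block size and sums an array in two grids launched one after the other.

```python
def run_grid(kernel, num_blocks, threads_per_block, global_mem):
    """Launch one grid: each block gets its own private shared memory."""
    for block_id in range(num_blocks):
        shared = [0] * threads_per_block
        kernel(block_id, threads_per_block, shared, global_mem)


def block_sum_kernel(block_id, threads_per_block, shared, global_mem):
    data, partial = global_mem["input"], global_mem["partial"]
    # Each thread loads one element from global into shared memory.
    for tid in range(threads_per_block):
        index = block_id * threads_per_block + tid
        shared[tid] = data[index] if index < len(data) else 0
    # Tree reduction in shared memory. Each pass of the while loop stands
    # in for a barrier-separated step run by all threads of the block.
    stride = threads_per_block // 2
    while stride > 0:
        for tid in range(stride):
            shared[tid] += shared[tid + stride]
        stride //= 2
    partial[block_id] = shared[0]  # one result per block, back to global


THREADS = 4  # must be a power of two in this sketch
data = list(range(1, 17))
blocks = len(data) // THREADS

grid0_mem = {"input": data, "partial": [0] * blocks}
run_grid(block_sum_kernel, blocks, THREADS, grid0_mem)  # grid 0

grid1_mem = {"input": grid0_mem["partial"], "partial": [0]}
run_grid(block_sum_kernel, 1, THREADS, grid1_mem)       # grid 1

total = grid1_mem["partial"][0]
print(grid0_mem["partial"], total)
```

Blocks never read each other's shared memory, so the only way grid 1 sees grid 0's results is through global memory, which is why the grids run sequentially in time.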
Future Graphics Directions
Higher density
Higher refresh
Higher dynamic range
Ubiquity
Lower Power
Shaving off the last burrs
Global Illumination
Higher quality modeling
Virtualized resources at interactive rates
Future PC Architecture Directions
Highly Integrated – Low Cost
Require a minimum visual feature set
Web/video/run today’s apps
And everyone else
Differentiated PCs
More bandwidth and more parallel horsepower
More mature unified programming models
C on CUDA
DX11
OpenCL
More resource virtualization
Q & A