NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011

Jan 02, 2016


NVIDIA Fermi Architecture

Patrick Cozzi
University of Pennsylvania
CIS 565 - Spring 2011

Administrivia

Assignment 4 grades returned
Project checkpoint on Monday
    Post an update on your blog beforehand
Poster session: 04/28
    Three weeks from tomorrow

G80, GT200, and Fermi

November 2006: G80
June 2008: GT200
March 2010: Fermi (GF100)

Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

New GPU Generation

What are the technical goals for a new GPU generation?
    Improve existing application performance. How?
    Advance programmability. In what ways?

Fermi: What’s More?

More total cores (SPs) – not SMs though
More registers: 32K per SM
More shared memory: up to 48 KB per SM
More Special Function Units (SFUs)

Fermi: What’s Faster?

Faster double precision – 8x over GT200
Faster atomic operations. What for?
    5-20x
Faster context switches
    Between applications – 10x
    Between graphics and compute, e.g., OpenGL and CUDA

Fermi: What’s New?

L1 and L2 caches. For compute or graphics?
Dual warp scheduling
Concurrent kernel execution
C++ support
Full IEEE 754-2008 support in hardware
Unified address space
Error Correcting Code (ECC) memory support
Fixed function tessellation for graphics

G80, GT200, and Fermi

Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

GT200 and Fermi

Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi Block Diagram

Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

GF100
    16 SMs
    Each with 32 cores
        512 total cores
    Each SM hosts up to 48 warps, or 1,536 threads
    In flight, up to 24,576 threads
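
The figures on this slide follow from simple multiplication. The sketch below reproduces them, assuming the usual warp size of 32 threads; the function and key names are made up for illustration.

```python
WARP_SIZE = 32  # threads per warp (assumed; the slide gives only the totals)


def gf100_limits(sms=16, cores_per_sm=32, warps_per_sm=48):
    """Derive the per-chip totals quoted on the slide from per-SM figures."""
    threads_per_sm = warps_per_sm * WARP_SIZE
    total_cores = sms * cores_per_sm
    threads_in_flight = sms * threads_per_sm
    return {
        "total_cores": total_cores,
        "threads_per_sm": threads_per_sm,
        "threads_in_flight": threads_in_flight,
        # Resident threads per core: how heavily each core is oversubscribed.
        "threads_per_core": threads_in_flight // total_cores,
    }


limits = gf100_limits()
print(limits)
```

With these inputs each core has 48 resident threads to choose from, which is the slack the hardware can use to hide memory latency.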

Fermi SM

Why 32 cores per SM instead of 8?
Why not more SMs?

G80 – 8 cores
GT200 – 8 cores
GF100 – 32 cores

Fermi SM

Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Dual warp scheduling
    Why?
32K registers
32 cores
    Floating point and integer unit per core
16 load/store units
4 SFUs

Fermi SM

Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

16 SMs * 32 cores/SM = 512 floating point operations per cycle

Why not in practice?
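
One way to think about the question is a toy throughput model. The slide does not give an answer, so the two loss factors below (idle lanes from branch divergence, and cycles with no arithmetic instruction to issue) are assumptions chosen for illustration, and all names are hypothetical.

```python
def peak_ops_per_cycle(sms=16, cores_per_sm=32):
    """Peak figure from the slide: one operation per core per cycle."""
    return sms * cores_per_sm


def effective_ops_per_cycle(active_lanes, issue_fraction, sms=16, cores_per_sm=32):
    """Toy model of achieved throughput (illustrative, not a measurement).

    active_lanes:   threads of a 32-thread warp doing useful work
                    (fewer than 32 when a warp diverges).
    issue_fraction: fraction of cycles in which an SM has an arithmetic
                    instruction ready, e.g. not stalled on memory.
    """
    lane_utilisation = active_lanes / 32
    return peak_ops_per_cycle(sms, cores_per_sm) * lane_utilisation * issue_fraction


print(peak_ops_per_cycle())              # 512
print(effective_ops_per_cycle(16, 0.5))  # 128.0
```

In this model, a kernel whose warps are half divergent and stalled half the time reaches only a quarter of the peak.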

Fermi SM

Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Each SM
    64KB on-chip memory
        48KB shared memory / 16KB L1 cache, or
        16KB shared memory / 48KB L1 cache
    Configurable by CUDA developer
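
In the CUDA runtime this choice is made per kernel with `cudaFuncSetCacheConfig`. The sketch below is not that API. It is a small Python model, with hypothetical names, of how the split affects one resource limit: how many thread blocks fit on an SM by shared memory alone. Register and thread limits are ignored.

```python
# The two splits of the 64 KB per-SM memory, in KB: (shared memory, L1 cache).
# The dictionary keys are illustrative labels, not CUDA identifiers.
SPLITS = {
    "prefer_shared": (48, 16),
    "prefer_l1": (16, 48),
}


def blocks_limited_by_shared(shared_kb_per_block, preference):
    """Blocks that fit on one SM, counting only shared memory.

    shared_kb_per_block must be positive. Other limits (registers,
    resident threads) are deliberately left out of this sketch.
    """
    shared_kb, _l1_kb = SPLITS[preference]
    return int(shared_kb // shared_kb_per_block)


for pref in SPLITS:
    print(pref, blocks_limited_by_shared(12, pref))
```

A kernel using 12 KB of shared memory per block fits four blocks per SM with the larger shared memory split but only one with the larger L1 split, so the better choice depends on whether the kernel leans on shared memory or on cached global loads.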

Fermi Dual Warp Scheduling

Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Slide from: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_luebke_Intro.pdf

Fermi Caches

Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi: Unified Address Space

Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Fermi: Unified Address Space

64-bit virtual addresses
40-bit physical addresses (currently)
CUDA 4: Shared address space with CPU. Why?
    No explicit CPU/GPU copies
    Direct GPU-GPU copies
    Direct I/O device to GPU copies
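
To put the address widths in perspective, the short calculation below converts them to byte counts. It assumes byte-addressable memory; the helper name is made up.

```python
def addressable_bytes(bits):
    """Number of distinct byte addresses for a given address width."""
    return 1 << bits


TIB = 1 << 40  # one tebibyte in bytes

physical = addressable_bytes(40)
virtual = addressable_bytes(64)

print(physical // TIB)      # 1
print(virtual // physical)  # 16777216
```

So 40-bit physical addresses cover 1 TiB of memory, and the 64-bit virtual space is 2^24 times larger than that.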

Fermi ECC

ECC protected
    Register file, L1, L2, DRAM
Uses redundancy to ensure data integrity against cosmic rays flipping bits
    For example, 64 bits is stored as 72 bits
Fixes single-bit errors, detects double-bit errors
What are the applications?
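
The 64-to-72-bit figure matches a textbook extended Hamming code: 7 check bits plus one overall parity bit. The slides do not say which code Fermi's hardware uses, so the sketch below is only a generic illustration of the single-error-correct, double-error-detect idea, with hypothetical function names.

```python
def num_check_bits(k):
    """Smallest r such that r Hamming check bits can cover k data bits."""
    r = 0
    while (1 << r) < k + r + 1:
        r += 1
    return r


def is_power_of_two(pos):
    return (pos & (pos - 1)) == 0


def syndrome(code):
    """XOR of the positions (1..n) of all set bits; 0 for a valid word."""
    s = 0
    for pos in range(1, len(code)):
        if code[pos]:
            s ^= pos
    return s


def encode(data, k=64):
    """Return a list of bits: index 0 is overall parity, 1..n the Hamming word."""
    r = num_check_bits(k)
    n = k + r
    code = [0] * (n + 1)
    bit = 0
    for pos in range(1, n + 1):        # data goes in non-power-of-two slots
        if not is_power_of_two(pos):
            code[pos] = (data >> bit) & 1
            bit += 1
    s = syndrome(code)
    for i in range(r):                 # check bits cancel the syndrome
        code[1 << i] = (s >> i) & 1
    code[0] = sum(code) % 2            # make the total parity even
    return code


def decode(code):
    """Return (status, data); data is None if the word cannot be repaired."""
    code = list(code)
    s = syndrome(code)
    if sum(code) % 2:                  # odd parity: assume a single flipped bit
        if s >= len(code):
            return "uncorrectable", None
        code[s] ^= 1                   # s == 0 means the parity bit itself
        status = "corrected"
    elif s:                            # even parity but nonzero syndrome
        return "double-bit error detected", None
    else:
        status = "ok"
    data, bit = 0, 0
    for pos in range(1, len(code)):
        if not is_power_of_two(pos):
            data |= code[pos] << bit
            bit += 1
    return status, data


word = 0x0123456789ABCDEF
stored = encode(word)
print(len(stored))                     # 72

damaged = list(stored)
damaged[37] ^= 1                       # flip one bit
print(decode(damaged)[0])              # corrected
```

Flipping a second bit in `damaged` makes `decode` report a double-bit error instead of returning wrong data.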

Fermi Tessellation

Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi Tessellation

Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fixed function hardware on each SM for graphics
    Texture filtering
    Texture cache
    Tessellation
    Vertex Fetch / Attribute Setup
    Stream Output
    Viewport Transform
Why?

Observations

Becoming easier to port CPU code to the GPU
    Recursion, fast atomics, L1/L2 caches, faster global memory
In fact… GPUs are starting to look like CPUs
    Beefier SMs, L1 and L2 caches, dual warp scheduling, double precision, fast atomics