Kenneth Hurley Sr. Software Engineer [email protected].

NVIDIA Corporation

What are the problems we are seeing when 3D engines are written?

• Misuse of Vertex Buffers

• Concurrency Limitations

• Frame Rate Limiters

• Non-Optimized surface usage

• Cache misses

• Data Ordering

Misuse of Vertex Buffers

• Bad things can happen unless you know the “right” way to use a vertex buffer

• Dynamic vertex buffers vs. static vertex buffers

• When creating the vertex buffer, use D3DVBCAPS_WRITEONLY

• Use D3DLOCK_DISCARDCONTENTS

• Use D3DLOCK_NOOVERWRITE

• Vertex buffer ordering

• Use ordered vertex buffers because of cache coherency

Using Vertex Buffers Correctly

Example vertex buffer flow

• CreateVB(WRITEONLY, 1000-12000)

• A: I = 0

• B: Space in VB for M vertices?

  • Yes: Lock(NOOVERWRITE)

  • No: GOTO C

• Fill in M vertices at index I

• Unlock(); DIPVB(I); I += M; GOTO B;

• C: Lock(DISCARDCONTENTS) GOTO A

[Flowchart: Create a vertex buffer of 1000-12000 vertices and set Index = 0. Room in the vertex buffer? Yes: Lock(NoOverwrite), store vertices at Index, unlock the vertex buffer, DrawIndexedPrimVB(Index, length), Index += number of vertices. No: Lock(DiscardContents) and set Index = 0.]
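
The flow above can be sketched without any Direct3D dependency by modelling only the index bookkeeping. This is a minimal sketch, not the real API: `StreamingVertexBuffer`, `LockMode`, and `append` are hypothetical names, and the "lock" is reduced to reporting which flag the flow would have requested. As a simplification, when the buffer is full the sketch discards and writes at index 0 in a single step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the NOOVERWRITE and DISCARDCONTENTS lock flags.
enum class LockMode { NoOverwrite, DiscardContents };

struct Vertex { float x, y, z; };

// Models only the index bookkeeping of a write-only dynamic vertex buffer.
class StreamingVertexBuffer {
public:
    explicit StreamingVertexBuffer(std::size_t capacity) : storage_(capacity) {}

    // Copies `count` vertices into the buffer and returns the start index to
    // draw from. `mode` reports which lock the flow would have requested.
    std::size_t append(const Vertex* src, std::size_t count, LockMode& mode) {
        assert(count <= storage_.size());
        if (next_ + count <= storage_.size()) {
            mode = LockMode::NoOverwrite;      // room left: earlier vertices stay intact
        } else {
            mode = LockMode::DiscardContents;  // full: ask for a fresh buffer, restart at 0
            next_ = 0;
        }
        const std::size_t start = next_;
        for (std::size_t i = 0; i < count; ++i) storage_[start + i] = src[i];
        next_ += count;
        return start;
    }

private:
    std::vector<Vertex> storage_;
    std::size_t next_ = 0;
};
```

In a real engine each `append` would be followed by the draw call for the returned start index. The point of the two modes is that neither forces the driver to wait for the hardware to finish with vertices it is still reading.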

Concurrency

• Why do I need it?

  • Concurrency helps parallelism between the CPU and the GPU.

• OK, how do I achieve it?

  • Use NVPAT to see if a “Spin Lock” is happening.

• A “Spin Lock” occurs when the driver has to stall while waiting for the hardware to finish with an object

• These objects can be vertex buffers or texture surfaces

Concurrency (cont.)

• Use the vertex buffer and texture surface flags so the driver can give you another buffer while the hardware is still using the first one.

Frame Rate Limiters

• Can cause concurrency issues

• Better ways to achieve constant frame rates

• They make the effective triangle rate much lower, because the driver has to do some work with the vertex data.

Frame Rate Limiter Problem

[Diagram, two timelines. Serialization of code loop: Physics, Artificial Intelligence, Culling, Submit Triangles, Wait for Desired Frame Rate. Rescheduled for concurrency: Physics, Artificial Intelligence, Culling, and Submit Triangles for Frame 1; then T&L/GPU Rasterization alongside Physics, Artificial Intelligence, and Culling for Frame 2; then Wait for Desired Frame Rate.]
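
One way to read the rescheduled loop in the diagram is that the wait moves from right after the submit to right before the next submit, so the CPU work for the next frame overlaps the GPU work for the current one. The sketch below only records that ordering. `rescheduledLoop` and the step labels are hypothetical and stand in for real engine calls.

```cpp
#include <string>
#include <vector>

// Returns the order of CPU-side steps for `frames` frames in the rescheduled loop.
// The step names are illustrative labels, not real engine calls.
std::vector<std::string> rescheduledLoop(int frames) {
    std::vector<std::string> steps;
    for (int f = 1; f <= frames; ++f) {
        const std::string tag = " (frame " + std::to_string(f) + ")";
        // CPU work for frame f can overlap GPU rasterization of frame f - 1.
        steps.push_back("physics" + tag);
        steps.push_back("ai" + tag);
        steps.push_back("culling" + tag);
        // Pace the loop here, after the CPU work, not right after submitting.
        if (f > 1) steps.push_back("wait for desired frame rate");
        steps.push_back("submit triangles" + tag);
    }
    return steps;
}
```

For two frames this yields the four steps of frame 1, then physics, AI, and culling for frame 2, then the wait, then the frame 2 submit.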

Non-Optimized Surface Usage

• Locking a texture before the GPU is finished with it causes concurrency problems by stalling the CPU inside the driver.

• Typical examples include locking the backbuffer to do 2D operations on it

• The best solution is to use two screen-aligned triangles (a quad) instead and send them directly through the 3D pipeline
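
A minimal sketch of such a quad follows, assuming pre-transformed vertices (screen-space x and y plus a reciprocal homogeneous w) and 16-bit indices. `ScreenVertex`, `ScreenQuad`, and `makeScreenQuad` are hypothetical names, and the vertex layout is an assumption, not a layout taken from the slides.

```cpp
#include <array>
#include <cstdint>

// Pre-transformed vertex: screen-space position, reciprocal homogeneous w,
// and texture coordinates. The layout is an assumption for illustration.
struct ScreenVertex { float x, y, z, rhw, u, v; };

struct ScreenQuad {
    std::array<ScreenVertex, 4> vertices;
    std::array<std::uint16_t, 6> indices;
};

// Builds two triangles covering the pixel rectangle (left, top)-(right, bottom).
ScreenQuad makeScreenQuad(float left, float top, float right, float bottom) {
    ScreenQuad q;
    q.vertices = {{
        {left,  top,    0.0f, 1.0f, 0.0f, 0.0f},  // top-left
        {right, top,    0.0f, 1.0f, 1.0f, 0.0f},  // top-right
        {right, bottom, 0.0f, 1.0f, 1.0f, 1.0f},  // bottom-right
        {left,  bottom, 0.0f, 1.0f, 0.0f, 1.0f},  // bottom-left
    }};
    q.indices = {{0, 1, 2, 0, 2, 3}};  // two triangles sharing the diagonal 0-2
    return q;
}
```

The 2D image goes into a texture that is mapped onto the quad, so the backbuffer never has to be locked.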

Cache Misses

• Big slowdowns can occur here

• CPU cache misses can occur because of ordering of vertex data. Check these carefully with VTune.

• The GPU also has a vertex cache. GeForce has a 16-entry cache, but the optimal number of entries to rely on is 10, because 6 triangles can be “in flight” at any given time.

• GPU vertex cache statistics will be added to NVPAT.
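
Until such statistics are available from a tool, a rough estimate can be made offline by replaying an index list through a simulated cache. The sketch below assumes a simple FIFO replacement policy, which is an assumption and not a statement about how the hardware cache behaves. `countCacheMisses` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Counts how many index lookups miss a FIFO vertex cache of `cacheSize` entries.
// FIFO replacement is an assumption made for this sketch.
std::size_t countCacheMisses(const std::vector<unsigned>& indices,
                             std::size_t cacheSize) {
    std::deque<unsigned> cache;
    std::size_t misses = 0;
    for (unsigned index : indices) {
        if (std::find(cache.begin(), cache.end(), index) != cache.end()) {
            continue;  // hit: the vertex is still cached
        }
        ++misses;
        cache.push_back(index);
        if (cache.size() > cacheSize) cache.pop_front();  // evict the oldest entry
    }
    return misses;
}
```

Calling it with a cache size of 10 instead of 16 follows the guidance above about entries that are effectively usable.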

Vertex Ordering

• For best performance, also arrange vertex data and vertex indices in sequential order. This helps both the CPU and the GPU

• Out-of-order vertices make the CPU miss the cache more often

• The same happens on the GPU
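
One simple way to get sequential ordering is to renumber vertices in the order the index list first uses them. This is a minimal sketch of that idea, not a full mesh optimizer. `Reordered` and `reorderByFirstUse` are hypothetical names, and vertices that no index references are dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Reordered {
    std::vector<unsigned> newToOld;  // newToOld[n] = original index of the vertex now at n
    std::vector<unsigned> indices;   // index list rewritten to the new numbering
};

// Renumbers vertices in order of first use so that reads walk memory forward.
Reordered reorderByFirstUse(const std::vector<unsigned>& indices,
                            std::size_t vertexCount) {
    const unsigned unassigned = static_cast<unsigned>(-1);
    std::vector<unsigned> oldToNew(vertexCount, unassigned);
    Reordered out;
    out.indices.reserve(indices.size());
    for (unsigned oldIndex : indices) {
        assert(oldIndex < vertexCount);
        if (oldToNew[oldIndex] == unassigned) {
            // First time this vertex is referenced: give it the next slot.
            oldToNew[oldIndex] = static_cast<unsigned>(out.newToOld.size());
            out.newToOld.push_back(oldIndex);
        }
        out.indices.push_back(oldToNew[oldIndex]);
    }
    return out;
}
```

The caller would then copy the vertex data into a new array using `newToOld` and draw with the rewritten `indices`.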

How do we solve these problems?

• VTune

• GPT

• NVPAT

VTune 4.5

• Helps you optimize your application for the CPU

• Works well in conjunction with NVPAT

• I personally use the Time-Based Sampling Wizard

• VTune is excellent for application specific analysis

• It doesn’t show where in the driver time is spent, unless you have symbols for the driver. You almost certainly don’t have driver symbols.

VTune 4.5

• Flare Application

GPT 3.5

• Excellent tool to help you achieve maximum performance.

• Works on both D3D and OpenGL

• Helps with application API slowdowns

• Works well in conjunction with VTune and NVPAT. GPT is excellent for application to Direct3D/OpenGL analysis.

• It still can’t tell you what is occurring inside the driver that may be slowing your application down

GPT 3.5 (cont.)

View of alien world in Half-Life*

• Quad view for visual analysis modes

NVPAT 1.07

• Analyze interaction with driver

• Works on NVIDIA hardware only

• Windows 98/Windows 2000 capable

• Hotkey capable

• Online help via F1 function key

• Logging

• Frame Rate Display

• Natural Extension to VTune and GPT

NVPAT 1.07

• Demo – Flare vs. NewFlare

• NVPAT Available free at http://www.nvidia.com/Marketing/Developer/SwDevStaticPages.nsf/pages/StatsDriver

• You must be a registered NVIDIA developer

VTune DLL SDK

• Soon, all these performance tools should be integrated into VTune using the DLL SDK

• NVPAT will be integrated into the VTune DLL SDK

• VTune DLL SDK is available from Intel and gives you the ability to integrate performance tools into VTune.

http://developer.intel.com/vtune/analyzer/vtperfdll

• Common User Interface/API means less to learn for developers

Action Items

• Profile often and early in the process

• Use the tools available to you

• Some are free; the rest are reasonably priced

• Architect engine with concurrency in mind

• Ask for enhancements from your tool vendor

Questions?

• Comments/Suggestions?

• Enhancement requests for NVPAT can be sent to [email protected]
