Page 1: Kenneth Hurley Sr. Software Engineer khurley@nvidia.com.

NVIDIA Corporation

What are the problems we are seeing when 3D engines are written?

• Misuse of Vertex Buffers

• Concurrency Limitations

• Frame Rate Limiters

• Non-Optimized surface usage

• Cache misses

• Data Ordering

Misuse of Vertex Buffers

• Bad things can happen unless you know the “right” way to use a vertex buffer

• Dynamic vertex buffers vs. static vertex buffers

• When creating the vertex buffer, use D3DVBCAPS_WRITEONLY

• Use D3DLOCK_DISCARDCONTENTS

• Use D3DLOCK_NOOVERWRITE

• Vertex buffer ordering

• Use ordered vertex buffers because of cache coherency

Using Vertex Buffers Correctly

Example vertex buffer flow

• CreateVB(WRITEONLY, 1000-12000)

• A: I = 0

• B: Space in VB for M vertices?

• Yes: Lock(NOOVERWRITE)

• No: GOTO C

• Fill in M vertices at index I

• Unlock(); DIPVB(I); I += M; GOTO B;

• C: Lock(DISCARDCONTENTS) GOTO A

[Flowchart: Create VertexBuffer from 1000-12000; Index = 0; Room in VertexBuffer? Yes: Lock(NoOverwrite), store vertices at Index, Unlock VertexBuffer, DrawIndexedPrimVB(Index, length), Index += number of vertices. No: Lock(DiscardContents).]
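
The flow above can be sketched as a small cursor that decides which lock flag to request for each batch. This is a minimal sketch, not Direct3D code: `LockMode`, `Reservation`, and `DynamicVbCursor` are hypothetical names introduced for illustration, and the real Lock, Unlock, and draw calls are left out. It also folds steps C and A into one reservation, so a full buffer is discarded and refilled from index 0 in a single lock.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the lock flags named on the slide; these are
// not the real Direct3D constants.
enum class LockMode { NoOverwrite, DiscardContents };

// Result of reserving room for one batch of vertices.
struct Reservation {
    LockMode mode;       // flag the caller would pass to its Lock call
    std::size_t offset;  // first vertex index to fill and draw from
};

// Tracks the write position inside one dynamic, write-only vertex buffer.
class DynamicVbCursor {
public:
    explicit DynamicVbCursor(std::size_t capacity) : capacity_(capacity) {}

    // Reserve room for `count` vertices; `count` must fit in the buffer.
    Reservation reserve(std::size_t count) {
        assert(count > 0 && count <= capacity_);
        if (next_ + count > capacity_) {
            // No room left: request a fresh buffer and restart at index 0.
            next_ = count;
            return {LockMode::DiscardContents, 0};
        }
        // Room left: append after vertices the GPU may still be reading.
        Reservation r{LockMode::NoOverwrite, next_};
        next_ += count;
        return r;
    }

private:
    std::size_t capacity_;
    std::size_t next_ = 0;
};
```

A caller would pass `mode` to its Lock call, write `count` vertices starting at `offset`, unlock, and then draw from `offset`.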

Concurrency

• Why do I need it?

• Concurrency lets the CPU and the GPU work in parallel.

• OK, how do I achieve it?

• Use NVPAT to see if a “spin lock” is happening.

• A “spin lock” occurs when the driver has to stall, waiting for the hardware to finish with an object

• These objects can be vertex buffers or texture surfaces

Concurrency (cont.)

• Use the vertex buffer and texture surface flags so the driver can give you another buffer while the hardware is using the other one.

Frame Rate Limiters

• Can cause concurrency issues

• Better ways to achieve constant frame rates

• Makes the effective triangle rate much lower, because the driver has to do some work with vertex data.

Frame Rate Limiter Problem

[Diagram: the frame loop runs Physics, Artificial Intelligence, Culling, Submit Triangles, Wait for Desired Frame Rate. Two timelines, labelled "Serialization of code loop" and "Rescheduled for concurrency", show Physics, Artificial Intelligence, Culling and Submit Triangles for Frame 1; T&L/GPU Rasterization; Physics, Artificial Intelligence and Culling for Frame 2; and Wait for Desired Frame Rate.]
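
The difference between a serialized loop and one rescheduled for concurrency can be approximated with simple arithmetic. The sketch below assumes, for illustration only, that a serialized loop pays for CPU and GPU work back to back, while a rescheduled loop overlaps them and is bound by the slower of the two. `FrameCosts` and both function names are hypothetical, and any numbers fed in are made up rather than measured.

```cpp
#include <algorithm>
#include <cassert>

// All times are in milliseconds and are illustrative inputs.
struct FrameCosts {
    double cpu_ms;     // physics + AI + culling + triangle submission
    double gpu_ms;     // T&L and rasterization
    double target_ms;  // frame period the limiter aims for
};

// Serialized loop (assumed model): the CPU sits idle while the GPU renders,
// so CPU and GPU time add up before the limiter can release the frame.
double serialized_frame_ms(const FrameCosts& f) {
    assert(f.cpu_ms >= 0.0 && f.gpu_ms >= 0.0 && f.target_ms >= 0.0);
    return std::max(f.target_ms, f.cpu_ms + f.gpu_ms);
}

// Rescheduled loop (assumed model): the CPU works on frame N+1 while the
// GPU renders frame N, so the frame is bound by the slowest of the three.
double overlapped_frame_ms(const FrameCosts& f) {
    assert(f.cpu_ms >= 0.0 && f.gpu_ms >= 0.0 && f.target_ms >= 0.0);
    return std::max({f.target_ms, f.cpu_ms, f.gpu_ms});
}
```

With 10 ms of CPU work, 12 ms of GPU work, and a 16 ms target, this model gives 22 ms per frame when serialized and 16 ms when overlapped.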

Non Optimized Surface Usage

• Locking a texture before the GPU is finished with it causes concurrency problems by stalling the CPU inside the driver.

• Typical examples include locking the backbuffer to do 2D operations on it

• The best solution is to draw two screen-aligned triangles (a quad) instead and send them through the 3D pipeline
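
A screen-aligned quad is just four corner vertices and six indices. The sketch below builds one for a pixel rectangle; `ScreenVertex`, `ScreenQuad`, and `make_screen_quad` are hypothetical names for illustration, not Direct3D types, and the layout (screen position, depth, reciprocal w, one texture coordinate) is an assumption about what a pre-transformed vertex would carry.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical pre-transformed vertex; not a real Direct3D structure.
struct ScreenVertex {
    float x, y;    // position in pixels
    float z, rhw;  // depth and reciprocal homogeneous w
    float u, v;    // texture coordinate
};

struct ScreenQuad {
    std::array<ScreenVertex, 4> vertices;
    std::array<std::uint16_t, 6> indices;  // two triangles
};

// Builds a quad covering the rectangle [left, right] x [top, bottom].
ScreenQuad make_screen_quad(float left, float top, float right, float bottom) {
    assert(left <= right && top <= bottom);
    const std::array<ScreenVertex, 4> vertices = {{
        {left,  top,    0.0f, 1.0f, 0.0f, 0.0f},  // top-left
        {right, top,    0.0f, 1.0f, 1.0f, 0.0f},  // top-right
        {right, bottom, 0.0f, 1.0f, 1.0f, 1.0f},  // bottom-right
        {left,  bottom, 0.0f, 1.0f, 0.0f, 1.0f},  // bottom-left
    }};
    // Both triangles wind clockwise on screen (y grows downward).
    const std::array<std::uint16_t, 6> indices = {{0, 1, 2, 0, 2, 3}};
    return ScreenQuad{vertices, indices};
}
```

Texturing this quad with the 2D image keeps the work on the GPU, so the back buffer never has to be locked.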

Cache Misses

• Big slowdowns can occur here

• CPU cache misses can occur because of ordering of vertex data. Check these carefully with VTune.

• The GPU also has a vertex cache. GeForce has a 16-entry cache, but the effective size to optimize for is 10 entries, because 6 triangles can be “in flight” at any given time.

• GPU vertex cache statistics will be added to NVPAT.
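
Until such statistics are available, vertex cache behavior can be estimated offline by replaying an index list through a simulated cache. The sketch below counts misses under a FIFO replacement policy; that policy is an assumption of this sketch, not a documented detail of the hardware, and `count_cache_misses` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Counts vertex cache misses for an indexed triangle list, assuming a FIFO
// cache with `cache_size` entries (the FIFO policy is an assumption).
std::size_t count_cache_misses(const std::vector<std::uint32_t>& indices,
                               std::size_t cache_size) {
    assert(cache_size > 0);
    std::deque<std::uint32_t> fifo;
    std::size_t misses = 0;
    for (std::uint32_t index : indices) {
        if (std::find(fifo.begin(), fifo.end(), index) != fifo.end()) {
            continue;  // hit: the vertex is reused from the cache
        }
        ++misses;  // miss: the vertex has to be processed again
        fifo.push_back(index);
        if (fifo.size() > cache_size) {
            fifo.pop_front();  // evict the oldest entry
        }
    }
    return misses;
}
```

Calling it with a cache size of 10 rather than 16 follows the effective size given above; fewer misses per triangle means a better index ordering.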

Vertex Ordering

• For best performance, also order vertex data and vertex indices sequentially. This helps both the CPU and the GPU

• Out-of-order vertices make the CPU miss the cache more often

• They do the same thing to the GPU
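
One common way to get sequential ordering is to rearrange the vertex array into the order in which the index list first references each vertex, then rewrite the indices to match. The sketch below shows that idea; `reorder_by_first_use` is a hypothetical helper, it assumes every index is in range, and it drops vertices that no index references.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Reorders `vertices` into first-use order and rewrites `indices` to match.
// Vertices that are never referenced are dropped.
template <typename Vertex>
void reorder_by_first_use(std::vector<Vertex>& vertices,
                          std::vector<std::uint32_t>& indices) {
    const std::uint32_t kUnassigned = 0xFFFFFFFFu;
    std::vector<std::uint32_t> remap(vertices.size(), kUnassigned);
    std::vector<Vertex> ordered;
    ordered.reserve(vertices.size());
    for (std::uint32_t& index : indices) {
        assert(index < remap.size());
        if (remap[index] == kUnassigned) {
            // First reference: give the vertex the next sequential slot.
            remap[index] = static_cast<std::uint32_t>(ordered.size());
            ordered.push_back(vertices[index]);
        }
        index = remap[index];
    }
    vertices.swap(ordered);
}
```

After this pass the index list walks the vertex array front to back on first use, which is the access pattern both caches favor.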

How do we solve these problems?

• VTune

• GPT

• NVPAT

VTune 4.5

• Helps you optimize your application for the CPU

• Works well in conjunction with NVPAT

• I personally use the Time-Based Sampling Wizard

• VTune is excellent for application specific analysis

• It doesn’t show where in the driver time is spent, unless you have symbols for the driver. You almost certainly don’t have driver symbols.

VTune 4.5

• Flare Application

GPT 3.5

• Excellent tool to help you achieve maximum performance.

• Works on both D3D and OpenGL

• Helps with application API slowdowns

• Works well in conjunction with VTune and NVPAT. GPT is excellent for application to Direct3D/OpenGL analysis.

• It still can’t tell you what is occurring inside the driver that may be slowing your application down

GPT 3.5 (cont.)

View of alien world in Half-Life*

• Quad view for visual analysis modes

NVPAT 1.07

• Analyze interaction with driver

• Works on NVIDIA hardware only

• Windows 98/Windows 2000 capable

• Hotkey capable

• Online help via F1 function key

• Logging

• Frame Rate Display

• Natural Extension to VTune and GPT

NVPAT 1.07

• Demo – Flare VS NewFlare

• NVPAT Available free at http://www.nvidia.com/Marketing/Developer/SwDevStaticPages.nsf/pages/StatsDriver

• You must be a registered NVIDIA developer

VTune DLL SDK

• Soon, all these performance tools should be integrated into VTune using the DLL SDK

• NVPAT will be integrated into the VTune DLL SDK

• VTune DLL SDK is available from Intel and gives you the ability to integrate performance tools into VTune.

http://developer.intel.com/vtune/analyzer/vtperfdll

• Common User Interface/API means less to learn for developers

Action Items

• Profile often and early in the process

• Use the tools available to you

• Some are free; the rest are reasonably priced

• Architect engine with concurrency in mind

• Ask for enhancements from your tool vendor

Questions?

• Comments/Suggestions?

• Enhancement requests for NVPAT can be sent to [email protected]