Performance tips for Windows Store apps using DirectX and C++ Max McMullen Principal Development Lead – Direct3D Microsoft Corporation 4-102

Agenda CPU Threads Flip Queue CPU Queues GPU Hardware Queue.

Dec 17, 2015


Kory Banks
Transcript
Page 1:

Performance tips for Windows Store apps using DirectX and C++
Max McMullen
Principal Development Lead – Direct3D
Microsoft Corporation
4-102

Page 2:

Agenda

Overview

Measuring rendering performance

Power efficient GPU characteristics

Optimizing for power efficient GPUs

Page 3:

Overview

Page 4:

Optimizing for the Windows 8/RT OS

New form factors and platforms require new optimizations

Windows uses DirectX to get every pixel on screen

Direct3D 11.1 provides new APIs to optimize rendering

Page 5:

Use optimized Windows 8/RT platforms

All Windows Store apps use DirectX for rendering

WWA and XAML make optimized use of Direct2D and Direct3D 11.1

Direct2D and Direct2D Effects fully leverage Direct3D 11.1

But sometimes you really need to use Direct3D itself…

Page 6:

What you should know

Basics of building a C++ Windows Store app

Direct3D fundamentals

Page 7:

Measuring rendering performance

Page 8:

How do you measure rendering performance?

Many useful tools for Windows performance optimization: Visual Studio Performance Profiler, Visual Studio Graphics Diagnostics, hardware partner tools…

Two primary tools used to optimize Direct3D usage in the Windows 8/RT OS:
  Basic: FPS/time measurement in app/microbenchmarks
  Advanced: GPUView

Page 9:

Frames per second (FPS)

Quick but sometimes misleading

C++/DirectX Windows Store apps sync to the display refresh

Measure render time, not present
  Call ID3D11DeviceContext::Flush instead of IDXGISwapChain::Present

Infrequent output: file output

Frequent output: look at FPSCounter.cpp in the GeometryRealization sample
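The advice above (time the render work itself and report only occasionally) can be sketched in portable C++. FrameTimeAverager is a hypothetical helper written for this note; it is not the FPSCounter.cpp from the GeometryRealization sample. It assumes the caller measures each frame's render time, for example with timestamps taken before the draw calls and after ID3D11DeviceContext::Flush.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: accumulates per-frame render times and produces an
// average once every `framesPerReport` frames, so output stays infrequent.
class FrameTimeAverager {
public:
    explicit FrameTimeAverager(std::size_t framesPerReport)
        : m_framesPerReport(framesPerReport) {
        assert(framesPerReport > 0);
    }

    // Adds one frame's render time in milliseconds. Returns true when a new
    // average has just been computed and can be read from AverageMs().
    bool AddFrame(double renderMs) {
        m_sumMs += renderMs;
        if (++m_count < m_framesPerReport) return false;
        m_averageMs = m_sumMs / static_cast<double>(m_count);
        m_sumMs = 0.0;
        m_count = 0;
        return true;
    }

    double AverageMs() const { return m_averageMs; }

private:
    std::size_t m_framesPerReport;
    std::size_t m_count = 0;
    double m_sumMs = 0.0;
    double m_averageMs = 0.0;
};
```

A caller would feed AddFrame the elapsed time of each frame (std::chrono::steady_clock is one portable source of timestamps) and write AverageMs() to a file or overlay only when AddFrame returns true.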

Page 10:

Demo: FPS measurement

Page 11:

GPUView

Part of the Windows Performance Toolkit

ETW logging of CPU and GPU work

Measures graphics performance: FPS, startup time, glitching, render time, latency

Enables detailed analysis of CPU and GPU workloads and interdependencies

Page 12:

GPUView – Record and Analyze

Install
  x86: Windows Performance Toolkit
  ARM: Windows Kits\8.0\Windows Performance Toolkit\Redistributables\WPTarm-arm_en-us.msi

Record
  Run log.cmd to start
  Perform action
  Run log.cmd to stop

Analyze
  Data captured in merged.etl, load in GPUView

Page 13:

GPUView - Interface

CPU Threads

Flip Queue

CPU Queues

GPU Hardware Queue

Page 14:

GPUView Interface: GPU Hardware Queue

The GPU Hardware Queue shows command buffers rendering on the GPU. Command buffers move from the CPU Queues to the GPU Hardware Queue when the hardware is ready to receive more commands.

Page 15:

Demo: GPUView

Page 16:

Power efficient GPU characteristics

Page 17:

What to expect with power efficient GPUs

Feature level 9_1 or 9_3

Limited available bandwidth

Both immediate render and tiled render GPUs

Limited shader instruction throughput

Page 18:

Feature Level 9.x (FL9.1, FL9.3)

Real-time render limitations generally occur before reaching these maximums

Feature level                9.1                          9.3
Texture size                 2048x2048                    4096x4096
Pixel shader instructions    64 arithmetic, 32 sample     512 total

Page 19:

GPU Memory Bandwidth

Baseline requirement: 1.9 GB/sec benchmarked

7.5 I/O operations per screen pixel, 1366x768x32bpp@60Hz

I/O cost    Operation
1           Screen fill with solid color
2           Screen fill with texture
3           Screen fill with texture and alpha blend
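The 1.9 GB/sec baseline appears to follow from the numbers on this slide: 1366 x 768 pixels x 4 bytes x 60 Hz x 7.5 I/O operations is about 1.89 GB/sec. The sketch below reproduces that arithmetic. The function name and parameters are illustrative and not part of any Direct3D API.

```cpp
#include <cassert>
#include <cstdint>

// Bytes per second needed to touch every screen pixel `ioPerPixel` times per
// frame. The function and its parameters are illustrative only.
double RequiredBandwidthBytesPerSec(std::uint32_t width, std::uint32_t height,
                                    std::uint32_t bitsPerPixel,
                                    double refreshHz, double ioPerPixel) {
    assert(bitsPerPixel % 8 == 0);
    const double bytesPerFrame =
        static_cast<double>(width) * height * (bitsPerPixel / 8);
    return bytesPerFrame * refreshHz * ioPerPixel;
}
```

With the slide's values, RequiredBandwidthBytesPerSec(1366, 768, 32, 60.0, 7.5) evaluates to 1,888,358,400 bytes per second. Using the I/O costs from the table, a frame made of two textured, alpha-blended full-screen layers (cost 3 each) would already use 6 of the 7.5 budgeted operations per pixel.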

Page 20:

Immediate render

[Diagram: GPU shader cores, memory bus, graphics memory]

Page 21:

Tiled render

[Diagram: GPU shader cores, memory bus, graphics memory]

Page 22:

Tiled render

[Diagram: GPU shader cores, memory bus, graphics memory]

Page 23:

Tiled render

[Diagram: GPU shader cores, memory bus, graphics memory]

Page 24:

Shader instruction throughput

Fill rates on GPUs depend on a number of factors:
  Memory bandwidth
  Blend mode
  Shader cores
  Shader complexity
  Etc.

Power efficient GPUs become shader throughput bound at approximately 4 pixel shader instructions

Page 25:

Optimizing for low power GPUs

Page 26:

Bandwidth optimization: basics

Render opaque objects front-to-back with z-buffering

Disable alpha blending for opaque objects

Use geometry to trim large transparent areas
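One way to get the front-to-back order for opaque objects is to sort the draw list by view depth before submitting it. The sketch below is a minimal illustration; DrawItem and SortForDrawing are hypothetical names, and drawing transparent items afterwards from farthest to nearest is a common convention assumed here, not something stated on the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical draw record; `viewDepth` is the distance from the camera.
struct DrawItem {
    int id;
    float viewDepth;
    bool opaque;
};

// Reorders items so opaque ones come first, nearest to farthest, followed by
// transparent ones, farthest to nearest.
void SortForDrawing(std::vector<DrawItem>& items) {
    auto isOpaque = [](const DrawItem& d) { return d.opaque; };
    auto firstTransparent =
        std::stable_partition(items.begin(), items.end(), isOpaque);
    std::sort(items.begin(), firstTransparent,
              [](const DrawItem& a, const DrawItem& b) {
                  return a.viewDepth < b.viewDepth;
              });
    std::sort(firstTransparent, items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return a.viewDepth > b.viewDepth;
              });
    // Postcondition: every opaque item precedes every transparent item.
    assert(std::is_partitioned(items.begin(), items.end(), isOpaque));
}
```

With opaque items drawn nearest first, the z-buffer can reject hidden pixels of farther objects before they are shaded or written, which is where the bandwidth saving comes from.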

Page 27:

Bandwidth optimization: compress resources

Direct3D supports texture compression at all feature levels
  BC1: 4 bits/pixel for RGB formats, 6x compression ratio
  BC2, BC3: 8 bits/pixel for RGBA formats, 4x compression ratio

Smaller resources also mean faster downloads of your app
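The ratios quoted above can be checked with a little arithmetic. BC1, BC2, and BC3 store 4x4 texel blocks in 8, 16, and 16 bytes respectively. The helper names below are hypothetical, and the sketch covers a single mip level only.

```cpp
#include <cassert>
#include <cstddef>

// Size in bytes of one mip level stored as 4x4 texel blocks.
// bytesPerBlock is 8 for BC1 and 16 for BC2/BC3.
std::size_t BlockCompressedSize(std::size_t width, std::size_t height,
                                std::size_t bytesPerBlock) {
    assert(bytesPerBlock == 8 || bytesPerBlock == 16);
    const std::size_t blocksWide = (width + 3) / 4;  // round up to whole blocks
    const std::size_t blocksHigh = (height + 3) / 4;
    return blocksWide * blocksHigh * bytesPerBlock;
}

// Size in bytes of the same level stored uncompressed.
std::size_t UncompressedSize(std::size_t width, std::size_t height,
                             std::size_t bytesPerTexel) {
    return width * height * bytesPerTexel;
}
```

For a 1024x1024 texture, BC1 takes 524,288 bytes against 3,145,728 bytes of 24-bit RGB (6x), and BC3 takes 1,048,576 bytes against 4,194,304 bytes of 32-bit RGBA (4x). If the uncompressed RGB data were stored at 32 bits per texel instead, the same arithmetic would give 8x for BC1.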

Page 28:

Bandwidth optimization: quantize resources

Use the 16 bit formats added to Direct3D 11.1:

DXGI_FORMAT_B5G6R5_UNORM
DXGI_FORMAT_B5G5R5A1_UNORM
DXGI_FORMAT_B4G4R4A4_UNORM
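As an illustration of what quantizing to one of these formats involves, the sketch below converts 8-bit RGB texels to the DXGI_FORMAT_B5G6R5_UNORM bit layout on the CPU. QuantizeChannel and PackB5G6R5 are hypothetical helpers. Rounding to nearest is an assumption of this sketch; content tools may truncate or dither instead.

```cpp
#include <cassert>
#include <cstdint>

// Scales an 8-bit channel to `bits` bits, rounding to the nearest value.
std::uint16_t QuantizeChannel(std::uint8_t value, unsigned bits) {
    assert(bits >= 1 && bits <= 8);
    const unsigned maxOut = (1u << bits) - 1u;
    return static_cast<std::uint16_t>((value * maxOut + 127u) / 255u);
}

// Packs 8-bit RGB into the DXGI_FORMAT_B5G6R5_UNORM bit layout:
// blue in bits 0-4, green in bits 5-10, red in bits 11-15.
std::uint16_t PackB5G6R5(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return static_cast<std::uint16_t>((QuantizeChannel(r, 5) << 11) |
                                      (QuantizeChannel(g, 6) << 5) |
                                      QuantizeChannel(b, 5));
}
```

Each texel shrinks from 32 bits to 16, halving the bandwidth needed to sample the texture, at the cost of visible banding in smooth gradients.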

Page 29:

Bandwidth optimization: flip present

Must use DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL

OS automatically uses “fullscreen” flips when:
  Swapchain buffer dimensions match the desktop resolution
  Swapchain format is DXGI_FORMAT_B8G8R8A8_UNORM*
  App is the only content onscreen

Buffer dimensions need to be converted correctly from device independent pixels (dips)

Just create the swapchain with zero width and height to get the right size

Page 30:

using namespace Windows::Graphics::Display;

float ConvertDipsToPixels(float dips)
{
    static const float dipsPerInch = 96.0f;
    return floor(dips * DisplayProperties::LogicalDpi / dipsPerInch + 0.5f);
}

Platform::Agile<Windows::UI::Core::CoreWindow> m_window;

float swapchainWidth = ConvertDipsToPixels(m_window->Bounds.Width);
float swapchainHeight = ConvertDipsToPixels(m_window->Bounds.Height);
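To see the rounding in ConvertDipsToPixels at work outside a Windows Store app, the sketch below takes the DPI as a parameter instead of reading DisplayProperties::LogicalDpi. DipsToPixels is a hypothetical stand-in, and 144 DPI is only an illustrative value.

```cpp
#include <cassert>
#include <cmath>

// Same rounding as ConvertDipsToPixels, with the DPI passed in explicitly.
float DipsToPixels(float dips, float logicalDpi) {
    assert(logicalDpi > 0.0f);
    static const float dipsPerInch = 96.0f;
    return std::floor(dips * logicalDpi / dipsPerInch + 0.5f);
}
```

At 96 DPI a DIP is exactly one pixel, so DipsToPixels(1366.0f, 96.0f) is 1366. At 144 DPI, 100 DIPs become 150 pixels and 101 DIPs (151.5 pixels before rounding) become 152. A swapchain sized in unconverted DIPs would not match the desktop resolution at such a DPI, which is one of the conditions listed for fullscreen flips.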

Page 31:

Demo: Optimized flip presents

Page 32:

Bandwidth optimization: tiled render GPUs

Minimize command buffer flushes
  Don’t map resources in use by the GPU; use DISCARD and NO_OVERWRITE

Minimize scene flushes
  Visit RenderTargets only once per frame
  Don’t update resources in use by the GPU from the CPU; use DISCARD and NO_OVERWRITE with ID3D11DeviceContext1::CopySubresourceRegion1

Use scissors when updating small portions of a RenderTarget
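A common way to follow the DISCARD and NO_OVERWRITE advice for a dynamic buffer is to append new data into unused space and start over with a fresh buffer when space runs out. The sketch below models only that decision in portable C++. MapMode, MapRequest, and DynamicBufferCursor are hypothetical names, and the enum values stand in for D3D11_MAP_WRITE_NO_OVERWRITE and D3D11_MAP_WRITE_DISCARD.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for D3D11_MAP_WRITE_NO_OVERWRITE and D3D11_MAP_WRITE_DISCARD.
enum class MapMode { NoOverwrite, Discard };

struct MapRequest {
    MapMode mode;
    std::size_t offset;  // byte offset to write at after mapping
};

// Hypothetical cursor for one dynamic buffer of `capacity` bytes.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacity) : m_capacity(capacity) {}

    // Chooses how to map the buffer for the next `size` bytes. Appending into
    // unused space never touches data the GPU may still be reading, so
    // NO_OVERWRITE is safe. When the data no longer fits (or on first use),
    // DISCARD requests fresh memory and writing restarts at offset 0.
    MapRequest Allocate(std::size_t size) {
        assert(size > 0 && size <= m_capacity);
        if (m_first || m_next + size > m_capacity) {
            m_first = false;
            m_next = size;
            return {MapMode::Discard, 0};
        }
        const std::size_t offset = m_next;
        m_next += size;
        return {MapMode::NoOverwrite, offset};
    }

private:
    std::size_t m_capacity;
    std::size_t m_next = 0;
    bool m_first = true;
};
```

Neither mode requires the driver to wait for the GPU to finish with the buffer, which is the point of the slide's advice. The real Map call, and which resource types accept NO_OVERWRITE at a given feature level, are outside this sketch.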

Page 33:

Bandwidth optimization: tiled render GPUs

New Direct3D APIs provide hints to avoid unnecessary copies

Rendering artifacts if used incorrectly

Page 34:

Bandwidth optimization: Discard* APIs

m_swapChain->Present(1, 0); // present the image on the display

ComPtr<ID3D11View> view; m_renderTargetView.As(&view); // get the view on the RT

m_d3dContext->DiscardView(view.Get()); // discard the view

Use ID3D11DeviceContext1::DiscardView and ID3D11DeviceContext1::DiscardResource to prevent unnecessary tile copies

Artifacts if used incorrectly

Page 35:

Tiled render

[Diagram: GPU shader cores, memory bus, graphics memory]

Page 36:

Tiled render

[Diagram: GPU shader cores, memory bus, graphics memory]

Page 37:

Shader instruction throughput

Power efficient GPUs have limited throughput for full precision

Minimum precision hints increase throughput when precision doesn’t matter

Specifies minimum rather than actual precision
  min16float, min16int, min10float

Don’t change precision often

20-25% improvement in practice with min16float
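A rough way to judge whether precision "doesn't matter" for a value is to see what a 16-bit float would keep of it. The sketch below rounds a float to an 11-bit significand, the precision of a 16-bit float. RoundToHalfPrecision is a hypothetical helper that ignores the reduced exponent range and denormals. Because min16float is only a minimum, hardware may still evaluate at higher precision.

```cpp
#include <cassert>
#include <cmath>

// Rounds x to an 11-bit significand, the precision of a 16-bit float.
// Exponent range limits and denormals are ignored in this sketch.
float RoundToHalfPrecision(float x) {
    assert(std::isfinite(x));
    if (x == 0.0f) return x;
    int exponent = 0;
    const float mantissa = std::frexp(x, &exponent);  // |mantissa| in [0.5, 1)
    const float rounded = std::nearbyint(mantissa * 2048.0f) / 2048.0f;
    return std::ldexp(rounded, exponent);
}
```

Color values in the 0 to 1 range survive well: 0.3 comes back within 0.001 of the original. Large values do not: 2049 comes back as 2048, so quantities such as texel coordinates into large textures are poor candidates for min16float.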

Page 38:

Minimum precision

static const float brightThreshold = 0.5f;

Texture2D sourceTexture : register(t0);

float4 DownScale3x3BrightPass(QuadVertexShaderOutput input) : SV_TARGET
{
    float3 brightColor = 0;

    // Gather 16 adjacent pixels (each bilinear sample reads a 2x2 region)
    brightColor  = sourceTexture.Sample(linearSampler, input.tex, int2(-1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2(-1, 1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1, 1)).rgb;
    brightColor /= 4.0f;

    // Brightness thresholding
    brightColor = max(0, brightColor - brightThreshold);

    return float4(brightColor, 1.0f);
}

Page 39:

Minimum precision

static const min16float brightThreshold = (min16float)0.5;

Texture2D<min16float4> sourceTexture : register(t0);

float4 DownScale3x3BrightPass(QuadVertexShaderOutput input) : SV_TARGET
{
    min16float3 brightColor = 0;

    // Gather 16 adjacent pixels (each bilinear sample reads a 2x2 region)
    brightColor  = sourceTexture.Sample(linearSampler, input.tex, int2(-1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2(-1, 1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1, 1)).rgb;
    brightColor /= (min16float)4.0;

    // Brightness thresholding
    brightColor = max(0, brightColor - brightThreshold);

    return float4(brightColor, 1.0f);
}

Page 40:

Minimum precision – bad usage

static const min16float brightThreshold = (min16float)0.5;

Texture2D<min16float4> sourceTexture : register(t0);

float4 DownScale3x3BrightPass(QuadVertexShaderOutput input) : SV_TARGET
{
    min16float3 brightColor = 0;

    // Gather 16 adjacent pixels (each bilinear sample reads a 2x2 region)
    brightColor  = sourceTexture.Sample(linearSampler, input.tex, int2(-1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2(-1, 1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1, 1)).rgb;

    // Bad: mixes a second minimum-precision type into min16float math
    brightColor /= (min10float)4.0;

    // Brightness thresholding
    brightColor = max(0, brightColor - brightThreshold);

    return float4(brightColor, 1.0f);
}

Page 41:

Wrap-up

Optimize!

Use the right tools and techniques to measure performance

Tune for power efficient GPUs’ unique performance characteristics

Direct3D 11.1 and Windows 8 provide the APIs to fully leverage power efficient GPUs

Page 42:

Resources

Page 43:

Build 2012 Talk: 3-113 Graphics with the Direct3D11.1 API made easy

Build 2012 Talk: 3-109 Developing a Windows Store app using C++ and DirectX

Visual Studio 2012 Remote Debugging: http://blogs.msdn.com/b/dsvc/archive/2012/10/26/windows-rt-windows-store-app-debugging.aspx

FPS Counter in GeometryRealization sample: http://code.msdn.microsoft.com/windowsapps/Geometry-Realization-963be8b7#content

GPUView: http://msdn.microsoft.com/en-us/library/windows/desktop/jj585574(v=vs.85).aspx

Direct3D11.1: http://msdn.microsoft.com/en-us/library/windows/desktop/hh404562(v=vs.85).aspx

Page 45:

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.