Performance tips for Windows Store apps using DirectX and C++ Max McMullen Principal Development Lead – Direct3D Microsoft Corporation 4-102
Dec 17, 2015
Agenda
Overview
Measuring rendering performance
Power efficient GPU characteristics
Optimizing for power efficient GPUs
Optimizing for the Windows 8/RT OS
New form factors and platforms require new optimizations
Windows uses DirectX to get every pixel on screen
Direct3D 11.1 provides new APIs to optimize rendering
Use optimized Windows 8/RT platforms
All Windows Store apps use DirectX for rendering
WWA and XAML make optimized use of Direct2D and Direct3D 11.1
Direct2D and Direct2D Effects fully leverage Direct3D 11.1
But sometimes you really need to use Direct3D itself…
Many useful tools for Windows performance optimization:
Visual Studio Performance Profiler, Visual Studio Graphics Diagnostics, hardware partner tools…
Two primary tools used to optimize Direct3D usage in the Windows 8/RT OS:
Basic: FPS/time measurement in app/microbenchmarks
Advanced: GPUView
How do you measure rendering performance?
Frames per second (FPS)
Quick but sometimes misleading
C++/DirectX Windows Store apps sync to the display refresh
Measure render time, not present
Call ID3D11DeviceContext::Flush instead of IDXGISwapChain::Present
Infrequent output: file output
Frequent output: look at FPSCounter.cpp in the GeometryRealization sample
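A minimal sketch of the bookkeeping behind in-app time measurement, written in portable C++ so it does not depend on the sample code. RenderTimeAccumulator and its members are hypothetical names, not part of the GeometryRealization sample; the caller is assumed to time the span from issuing draw calls to the Flush call described above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not from the talk or the sample): accumulates
// per-frame render durations so the average cost stays visible even when
// presentation is synced to the display refresh.
class RenderTimeAccumulator {
public:
    using Clock = std::chrono::steady_clock;
    using Milliseconds = std::chrono::duration<double, std::milli>;

    // Record the measured render duration of one frame.
    void AddSample(Milliseconds renderTime) {
        assert(renderTime.count() >= 0.0);
        totalMs_ += renderTime.count();
        ++frameCount_;
    }

    // Average render time per frame in milliseconds (0 if no samples yet).
    double AverageMs() const {
        return frameCount_ == 0 ? 0.0
                                : totalMs_ / static_cast<double>(frameCount_);
    }

    // Frame rate the renderer could sustain if it were not refresh-locked.
    double PotentialFps() const {
        const double avg = AverageMs();
        return avg > 0.0 ? 1000.0 / avg : 0.0;
    }

    std::size_t FrameCount() const { return frameCount_; }

private:
    double totalMs_ = 0.0;
    std::size_t frameCount_ = 0;
};
```

In a render loop, one would take Clock::now() before rendering and again after the flush, then pass the difference to AddSample. Reporting the average every few hundred frames keeps the output cost low.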
GPUView
Part of the Windows Performance Toolkit
ETW Logging of CPU and GPU work
Measures graphics performance
FPS, startup time, glitching, render time, latency
Enables detailed analysis of CPU and GPU workloads and interdependencies
GPUView – Record and Analyze
Install
x86: Windows Performance Toolkit
ARM: Windows Kits\8.0\Windows Performance Toolkit\Redistributables\WPTarm-arm_en-us.msi
Record
Run log.cmd to start
Perform action
Run log.cmd to stop
Analyze
Data captured in merged.etl, load in GPUView
GPUView Interface: GPU Hardware Queue
The GPU Hardware Queue shows command buffers rendering on the GPU. Command buffers in the CPU Queue move to the GPU Hardware Queue when the hardware is ready to receive more commands.
What to expect with power efficient GPUs
Feature level 9_1 or 9_3
Limited available bandwidth
Both immediate render and tiled render GPUs
Limited shader instruction throughput
Feature Level 9.x (FL9.1, FL9.3)
Real-time render limitations generally occur before reaching these maximums
Feature Level               9.1                        9.3
Texture size                2048x2048                  4096x4096
Pixel shader instructions   64 arithmetic, 32 sample   512 total
GPU Memory Bandwidth
Baseline requirement: 1.9 GB/sec benchmarked
7.5 I/O operations per screen pixel, 1366x768x32bpp@60Hz
I/O Cost   Operation
1          Screen Fill w/Solid Color
2          Screen Fill w/Texture
3          Screen Fill w/Texture & Alpha Blend
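The 1.9 GB/sec figure appears to follow from the other numbers on the slide: screen pixels times bytes per pixel times refresh rate times 7.5 I/O operations. The sketch below works through that arithmetic; the function name is hypothetical, and the reading that one I/O operation equals one 4-byte pixel read or write is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: estimates the memory bandwidth needed to perform a
// given number of I/O operations per screen pixel on every refresh.
// Assumption: one I/O operation touches one pixel of bytesPerPixel bytes.
inline std::uint64_t RequiredBandwidthBytesPerSec(std::uint64_t width,
                                                  std::uint64_t height,
                                                  std::uint64_t bytesPerPixel,
                                                  std::uint64_t refreshHz,
                                                  double ioOpsPerPixel) {
    assert(ioOpsPerPixel >= 0.0);
    // Bytes touched by a single full-screen operation, per second.
    const std::uint64_t bytesPerSecPerOp =
        width * height * bytesPerPixel * refreshHz;
    return static_cast<std::uint64_t>(
        static_cast<double>(bytesPerSecPerOp) * ioOpsPerPixel);
}
```

For 1366x768 at 32bpp and 60Hz with 7.5 operations per pixel this gives 1,888,358,400 bytes per second, or roughly 1.9 GB/sec. Under the same reading, the budget covers about two and a half full-screen passes of texture plus alpha blend (cost 3 each).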
Shader instruction throughput
Fill rates on GPUs depend on a number of factors:
Memory bandwidth
Blend mode
Shader cores
Shader complexity
Etc.
Power efficient GPUs become shader throughput bound at approximately 4 pixel shader instructions
Bandwidth optimization: basics
Render opaque objects front-to-back with z-buffering
Disable alpha blending for opaque objects
Use geometry to trim large transparent areas
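One way to get front-to-back ordering for opaque objects is to sort the draw list by view-space depth before submitting it. This is only a sketch: DrawItem and OrderOpaqueFrontToBack are hypothetical names, and the depth value is assumed to be distance from the camera. Transparent items are moved after the opaque ones and otherwise left in their original order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical draw-list entry; viewDepth is assumed to be the distance
// from the camera (smaller means closer).
struct DrawItem {
    int id;
    float viewDepth;
    bool opaque;
};

// Moves opaque items to the front of the list and sorts them nearest-first,
// so the z-buffer can reject hidden pixels before they cost bandwidth.
inline void OrderOpaqueFrontToBack(std::vector<DrawItem>& items) {
    auto firstTransparent = std::stable_partition(
        items.begin(), items.end(),
        [](const DrawItem& d) { return d.opaque; });

    std::stable_sort(items.begin(), firstTransparent,
                     [](const DrawItem& a, const DrawItem& b) {
                         return a.viewDepth < b.viewDepth;
                     });

    // Sanity check: every opaque item now precedes every transparent one.
    assert(std::is_partitioned(items.begin(), items.end(),
                               [](const DrawItem& d) { return d.opaque; }));
}
```

The opaque range can then be drawn with alpha blending disabled, which matches the second bullet above.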
Bandwidth optimization: compress resources
Direct3D supports texture compression at all feature levels
BC1: 4 bits/pixel for RGB formats, 6x compression ratio
BC2, BC3: 8 bits/pixel for RGBA formats, 4x compression ratio
Smaller resources also mean faster downloads of your app
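The compression ratios follow from the block layout: BC1 stores each 4x4 pixel block in 8 bytes, BC2 and BC3 in 16 bytes. The sketch below computes the size of a single mip level under those block sizes; the enum and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical enum covering the formats named on the slide.
enum class BlockFormat { BC1, BC2, BC3 };

// Bytes used by one 4x4 pixel block.
inline std::size_t BytesPerBlock(BlockFormat format) {
    return format == BlockFormat::BC1 ? 8 : 16;
}

// Size in bytes of one block-compressed mip level. Dimensions that are not
// multiples of 4 are rounded up to whole blocks.
inline std::size_t CompressedMipSizeBytes(std::size_t width,
                                          std::size_t height,
                                          BlockFormat format) {
    assert(width > 0 && height > 0);
    const std::size_t blocksWide = (width + 3) / 4;
    const std::size_t blocksHigh = (height + 3) / 4;
    return blocksWide * blocksHigh * BytesPerBlock(format);
}

// Size in bytes of the same mip level stored uncompressed.
inline std::size_t UncompressedMipSizeBytes(std::size_t width,
                                            std::size_t height,
                                            std::size_t bytesPerPixel) {
    return width * height * bytesPerPixel;
}
```

A 1024x1024 texture takes 512 KB in BC1 versus 3 MB as 24-bit RGB (6x), and 1 MB in BC3 versus 4 MB as 32-bit RGBA (4x).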
Bandwidth optimization: quantize resources
Use the 16-bit formats added to Direct3D 11.1:
DXGI_FORMAT_B5G6R5_UNORM
DXGI_FORMAT_B5G5R5A1_UNORM
DXGI_FORMAT_B4G4R4A4_UNORM
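Quantizing an 8-bit-per-channel texel to a 16-bit format halves its footprint. The sketch below packs one RGB texel into a 5:6:5 layout on the CPU. PackB5G6R5 is a hypothetical helper, and the bit layout (blue in the low 5 bits, red in the high 5 bits) is an assumption about DXGI_FORMAT_B5G6R5_UNORM that should be checked against the format documentation before use.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: quantizes 8-bit R, G, B to a 16-bit 5:6:5 texel.
// Assumed layout: bits 0-4 blue, bits 5-10 green, bits 11-15 red.
inline std::uint16_t PackB5G6R5(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    // Rescale each channel from [0, 255] to its reduced range with rounding.
    const unsigned r5 = (static_cast<unsigned>(r) * 31u + 127u) / 255u;
    const unsigned g6 = (static_cast<unsigned>(g) * 63u + 127u) / 255u;
    const unsigned b5 = (static_cast<unsigned>(b) * 31u + 127u) / 255u;
    assert(r5 <= 31u && g6 <= 63u && b5 <= 31u);
    return static_cast<std::uint16_t>((r5 << 11) | (g6 << 5) | b5);
}
```

Green keeps one extra bit because the eye is more sensitive to it; content with smooth gradients may show banding after quantization, so it is worth checking assets visually.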
Bandwidth optimization: flip present
Must use DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL
OS automatically uses “fullscreen” flips when:
Swapchain buffer dimensions match the desktop resolution
Swapchain format is DXGI_FORMAT_B8G8R8A8_UNORM*
App is the only content onscreen
Buffer dimensions need to be converted correctly from device independent pixels (dips)
Just create the swapchain with zero width and height to get the right size
using namespace Windows::Graphics::Display;
float ConvertDipsToPixels(float dips)
{
    static const float dipsPerInch = 96.0f;
    return floor(dips * DisplayProperties::LogicalDpi / dipsPerInch + 0.5f);
}
…
Platform::Agile<Windows::UI::Core::CoreWindow> m_window;
float swapchainWidth = ConvertDipsToPixels(m_window->Bounds.Width);
float swapchainHeight = ConvertDipsToPixels(m_window->Bounds.Height);
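The same conversion can be checked off-device by passing the DPI in as a parameter instead of reading DisplayProperties::LogicalDpi. ConvertDipsToPixelsAtDpi is a hypothetical variant written only for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical, platform-independent variant of the conversion above:
// the logical DPI is supplied by the caller rather than read from the OS.
inline float ConvertDipsToPixelsAtDpi(float dips, float logicalDpi) {
    assert(logicalDpi > 0.0f);
    const float dipsPerInch = 96.0f;
    // Adding 0.5 before floor rounds to the nearest whole pixel.
    return std::floor(dips * logicalDpi / dipsPerInch + 0.5f);
}
```

At 96 DPI one dip is one pixel, so a 1366-dip-wide window needs a 1366-pixel buffer. At 192 DPI a 683-dip-wide window needs the same 1366 pixels, which is why using the dip value directly would produce a buffer that does not match the desktop resolution.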
Bandwidth optimization: tiled render GPUs
Minimize command buffer flushes
Don’t map resources in use by the GPU; use D3D11_MAP_WRITE_DISCARD and D3D11_MAP_WRITE_NO_OVERWRITE
Minimize scene flushes
Visit RenderTargets only once per frame
Don’t update resources in use by the GPU from the CPU; use D3D11_COPY_DISCARD and D3D11_COPY_NO_OVERWRITE with ID3D11DeviceContext1::CopySubresourceRegion1
Use scissors when updating small portions of a RenderTarget
Bandwidth optimization: tiled render GPUs
New Direct3D APIs provide hints to avoid unnecessary copies
Rendering artifacts if used incorrectly
Bandwidth optimization: Discard* APIs
m_swapChain->Present(1, 0); // present the image on the display
ComPtr<ID3D11View> view;
m_renderTargetView.As(&view); // get the view on the RT
m_d3dContext->DiscardView(view.Get()); // discard the view
Use ID3D11DeviceContext1::DiscardView and ID3D11DeviceContext1::DiscardResource to prevent unnecessary tile copies
Artifacts if used incorrectly
Shader instruction throughput
Power efficient GPUs have limited throughput for full precision
Minimum precision hints increase throughput when precision doesn’t matter
Specifies minimum rather than actual precision
min16float, min10float, min16int, min12int, min16uint
Don’t change precision often
20-25% improvement in practice with min16float
Minimum precision
static const float brightThreshold = 0.5f;
Texture2D sourceTexture : register(t0);
float4 DownScale3x3BrightPass(QuadVertexShaderOutput input) : SV_TARGET
{
    float3 brightColor = 0;
    // Gather 16 adjacent pixels (each bilinear sample reads a 2x2 region)
    brightColor  = sourceTexture.Sample(linearSampler, input.tex, int2(-1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2(-1, 1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1, 1)).rgb;
    brightColor /= 4.0f;
    // Brightness thresholding
    brightColor = max(0, brightColor - brightThreshold);
    return float4(brightColor, 1.0f);
}
Minimum precision
static const min16float brightThreshold = (min16float)0.5;
Texture2D<min16float4> sourceTexture : register(t0);
float4 DownScale3x3BrightPass(QuadVertexShaderOutput input) : SV_TARGET
{
    min16float3 brightColor = 0;
    // Gather 16 adjacent pixels (each bilinear sample reads a 2x2 region)
    brightColor  = sourceTexture.Sample(linearSampler, input.tex, int2(-1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2(-1, 1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1, 1)).rgb;
    brightColor /= (min16float)4.0;
    // Brightness thresholding
    brightColor = max(0, brightColor - brightThreshold);
    return float4(brightColor, 1.0f);
}
Minimum precision – bad usage
static const min16float brightThreshold = (min16float)0.5;
Texture2D<min16float4> sourceTexture : register(t0);
float4 DownScale3x3BrightPass(QuadVertexShaderOutput input) : SV_TARGET
{
    min16float3 brightColor = 0;
    // Gather 16 adjacent pixels (each bilinear sample reads a 2x2 region)
    brightColor  = sourceTexture.Sample(linearSampler, input.tex, int2(-1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2(-1, 1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1, 1)).rgb;
    // Bad: the divisor uses a different minimum precision type than brightColor,
    // which changes precision mid-computation (the slide's min10int is not a valid HLSL type)
    brightColor /= (min12int)4.0;
    // Brightness thresholding
    brightColor = max(0, brightColor - brightThreshold);
    return float4(brightColor, 1.0f);
}
Wrap-up
Optimize!
Use the right tools and techniques to measure performance
Tune for power efficient GPUs’ unique performance characteristics
Direct3D 11.1 and Windows 8 provide the APIs to fully leverage power efficient GPUs
Build 2012 Talk: 3-113 Graphics with the Direct3D11.1 API made easy
Build 2012 Talk: 3-109 Developing a Windows Store app using C++ and DirectX
Visual Studio 2012 Remote Debugging: http://blogs.msdn.com/b/dsvc/archive/2012/10/26/windows-rt-windows-store-app-debugging.aspx
FPS Counter in GeometryRealization sample: http://code.msdn.microsoft.com/windowsapps/Geometry-Realization-963be8b7#content
GPUView: http://msdn.microsoft.com/en-us/library/windows/desktop/jj585574(v=vs.85).aspx
Direct3D11.1: http://msdn.microsoft.com/en-us/library/windows/desktop/hh404562(v=vs.85).aspx
Resources
• Develop: http://msdn.microsoft.com/en-US/windows/apps/br229512
• Design: http://design.windows.com/
• Samples: http://code.msdn.microsoft.com/windowsapps/Windows-8-Modern-Style-App-Samples
• Videos: http://channel9.msdn.com/Windows
Please submit session evals by using the Build Windows 8 app or at http://aka.ms/BuildSessions
© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.