Page 1: GDC : Mar16.pdf

© Copyright Khronos Group 2016 - Page 142

Swapchains Unchained! (What you need to know about Vulkan WSI)

Alon Or-bach, Chair, Vulkan Window System Integration Sub-Group – March 2016

Page 2: GDC : Mar16.pdf

Intro to Vulkan Window System Integration

• Explicit control for acquisition and presentation of images
- Designed to fit the Vulkan API and today’s compositing window systems

• Not all extensions are supported by every platform
- You MUST check and enable the extensions your app/engine uses!!!

• Today’s presentation should help you get presentation working
- Learn how to present through a swapchain
- Overview of Vulkan objects used by the WSI extensions

WSI Jargon Buster

• Platform: Our terminology for an OS / window system, e.g. Android, Windows, Wayland, X11 via XCB
• Presentation Engine: The platform’s compositor or display engine
• Application: Your app or game engine

Page 3: GDC : Mar16.pdf

How many WSI extensions are there?

• Two cross-platform instance extensions
- VK_KHR_surface
- VK_KHR_display

• Six (platform) instance extensions
- VK_KHR_android_surface
- VK_KHR_mir_surface
- VK_KHR_wayland_surface
- VK_KHR_win32_surface
- VK_KHR_xcb_surface
- VK_KHR_xlib_surface

• Two cross-platform device extensions
- VK_KHR_swapchain
- VK_KHR_display_swapchain

Page 4: GDC : Mar16.pdf

Vulkan Surfaces

• VkSurfaceKHR
- Vulkan’s way to encapsulate a native window / surface

• Platform-independent surface queries
- Find out crucial information about your surface’s properties
- e.g., if presentation is supported by a particular queue on a particular device
- Some platforms provide additional queries

• An implementation may support multiple platforms
- e.g., both xlib and xcb

[Diagram: Physical Devices A, B, and C with their queue families, Platforms X and Y, and a surface from Platform X]

Page 5: GDC : Mar16.pdf

Vulkan Swapchains: VK_KHR_swapchain

• Array of presentable images associated with a surface
- Application requests a minimum number of presentable images
- Implementation creates at least that number
- Implementation may have a limit

• Upfront allocation of presentable images
- No allocation hitching at crucial moment
- Pre-record fixed content command buffers

• Present mode determines behavior
- FIFO support mandatory
- Platforms can offer mailbox, immediate, FIFO relaxed

const VkSwapchainCreateInfoKHR createInfo =
{
    VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR,  // sType
    NULL,                                         // pNext
    0,                                            // flags
    mySurface,                                    // surface
    desiredNumberOfPresentableImages,             // minImageCount
    surfaceFormat,                                // imageFormat
    surfaceColorSpace,                            // imageColorSpace
    myExtent,                                     // imageExtent
    1,                                            // imageArrayLayers
    VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT,          // imageUsage
    VK_SHARING_MODE_EXCLUSIVE,                    // imageSharingMode
    0,                                            // queueFamilyIndexCount
    NULL,                                         // pQueueFamilyIndices
    surfaceProperties.currentTransform,           // preTransform
    VK_COMPOSITE_ALPHA_INHERIT_BIT_KHR,           // compositeAlpha
    swapchainPresentMode,                         // presentMode
    VK_TRUE,                                      // clipped
    VK_NULL_HANDLE                                // oldSwapchain
};
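Two of the values fed into the create info above, the image count and the present mode, have to be reconciled with what the surface reports. The following is a minimal sketch of that selection logic only. The enum and function names are hypothetical stand-ins rather than Vulkan API, and treating a maximum of 0 as "no upper limit" is an assumption modelled on how the surface capabilities are specified.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the present modes named on the slide. */
typedef enum {
    PRESENT_MODE_IMMEDIATE,
    PRESENT_MODE_MAILBOX,
    PRESENT_MODE_FIFO,
    PRESENT_MODE_FIFO_RELAXED
} PresentMode;

/* Clamp the requested image count to the limits the surface reports.
 * max_count == 0 is treated as "no upper limit" (assumption). */
uint32_t choose_min_image_count(uint32_t desired, uint32_t min_count,
                                uint32_t max_count)
{
    uint32_t count = desired < min_count ? min_count : desired;
    if (max_count != 0 && count > max_count)
        count = max_count;
    return count;
}

/* Return the preferred mode if the platform offers it; otherwise fall
 * back to FIFO, which the slide states is always supported. */
PresentMode choose_present_mode(const PresentMode *available, size_t n,
                                PresentMode preferred)
{
    for (size_t i = 0; i < n; ++i)
        if (available[i] == preferred)
            return preferred;
    return PRESENT_MODE_FIFO;
}
```

In a real application the inputs would come from the surface queries, and the results would be passed as minImageCount and presentMode.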

Page 6: GDC : Mar16.pdf

Vulkan Swapchains: They’re good!

• Application knows which image within a swapchain it is presenting
- Content of image preserved between presents

• Application is responsible for explicitly recreating swapchains - no surprises
- Platform informs app if current swapchain is:
- Suboptimal: e.g. after window resize, swapchain still usable for present via image scaling
- Surface Lost: swapchain no longer usable for present
- Application is responsible for creating a new swapchain

Page 7: GDC : Mar16.pdf

Vulkan Swapchains: They’re jolly good!

• Presenting and acquiring are separate operations
- No need to submit a new image to acquire another one, unless presentation engine cannot release it

• Application must only modify presentable images it has acquired

• Presentation engine must only display presentable images that have been presented!

Stalls in frame loop are very bad!

Page 8: GDC : Mar16.pdf

VK_KHR_<platform>_surface

VK_KHR_surface

VK_KHR_swapchain

Platform-specific APIs

Steps to set up your presentable images

1 – Create a native window/surface

2 – Create a Vulkan surface

3 – Query information about your surface

4 – Create a Vulkan swapchain

5 – Get your presentable images

Page 9: GDC : Mar16.pdf

VK_KHR_swapchain

Vulkan Frame Loop – as easy as 1-2-3!

0 – Create your swapchain

1 – Acquire the next presentable image

2 – Submit command buffer(s) for that image

3 – Present the image

Legend: Setup; Steady-state; Response to suboptimal / surface_lost
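The loop and its two exceptional responses can be sketched as plain control flow. This is an illustration under assumptions, not Vulkan code: `WsiResult` and the function names are hypothetical stand-ins for whatever the acquire and present calls report, and the loop is driven by a canned list of results instead of a real presentation engine.

```c
#include <stdbool.h>

/* Hypothetical result codes standing in for what acquire/present report. */
typedef enum { WSI_OK, WSI_SUBOPTIMAL, WSI_SURFACE_LOST } WsiResult;

/* Suboptimal swapchains are still usable for present; lost ones are not. */
bool can_present(WsiResult r) { return r != WSI_SURFACE_LOST; }

/* Both suboptimal and surface-lost send the app back to step 0. */
bool needs_new_swapchain(WsiResult r) { return r != WSI_OK; }

/* Walk a canned sequence of per-frame results. Returns how many times
 * the swapchain would be recreated; *presented counts presented frames. */
int run_frames(const WsiResult *results, int n, int *presented)
{
    int recreations = 0;
    *presented = 0;
    for (int i = 0; i < n; ++i) {
        if (can_present(results[i]))
            ++*presented;          /* steps 1-2-3 completed for this frame */
        if (needs_new_swapchain(results[i]))
            ++recreations;         /* step 0 again before the next frame */
    }
    return recreations;
}
```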

Page 10: GDC : Mar16.pdf

Vulkan Displays: VK_KHR_display

• Vulkan’s way to discover display devices (screens, panels) outside a window system
- Reminder: Not supported on all platforms

• Defines VkDisplayKHR and VkDisplayModeKHR objects
- Represent the display devices and the modes they support connected to a VkPhysicalDevice
- Determine if a display supports multiple planes that are blended together

• Enables creation of a VkSurfaceKHR to represent a display plane

[Diagram: a Physical Device with Display 0 (Planes 0-2, Display Modes 0-1) and Display 1 (Display Modes 0-1), and a Surface]

Page 11: GDC : Mar16.pdf

VK_KHR_display_swapchain

• Extends the information provided at vkQueuePresentKHR
- What region to present from the swapchain image
- What region to present to on the display
- Whether the display should persist the image

• Adds ability to create a shared swapchain
- Swapchain that takes multiple VkSwapchainCreateInfoKHR structs
- Allows multiple displays to be presented to simultaneously
- No guarantee that presents are atomic ...presently!

Page 12: GDC : Mar16.pdf

Any questions?

@alonorbach (disclaimers apply!)

Page 13: GDC : Mar16.pdf

LunarG® SDK for Vulkan®

Karen Ghavam, CEO
Karl Schultz, Principal Engineer
Jon Ashburn, Principal Engineer

Page 14: GDC : Mar16.pdf

Enter the Raffle for your prize!

Congratulations!

You are the recipient of the Vulkan Programming Guide, courtesy of LunarG!

Is your OpenGL Programming Guide getting lonely? Well, it will soon have a companion. In August 2016, when the Vulkan Programming Guide becomes available, LunarG will ship it directly to you!

In the meantime, visit LunarXchange (Vulkan.lunarg.com) for the LunarG SDK for Vulkan, and accept this book bag, anxiously awaiting its Vulkan Programming Guide.

Page 15: GDC : Mar16.pdf

LunarG SDK

• Loader Binary
• Validation Layer Libraries
• Vulkan trace and replay tools
- vktrace
- vkreplay
• SPIR-V Tools
- GLSL Validator
- SPIR-V Disassembler and Assembler
- SPIR-V Remapper
• RenderDoc*
• Sample Programs

*For a detailed demonstration of RenderDoc don’t miss: Practical Development for Vulkan (presented by Valve Software). Thursday. 12:45 – 1:45. Room 3009, West Hall

Page 16: GDC : Mar16.pdf

Download the LunarG SDK for Vulkan at LunarXchange: vulkan.lunarg.com

Version 1.0.5.0 now available!

Page 17: GDC : Mar16.pdf

The Power of a Layered Ecosystem

[Diagram, development path: Vulkan application, Loader, Validation layer, Debug layer, Other layers, Installable Client Driver]

[Diagram, production path: Vulkan application, Loader, Installable Client Driver]

Page 18: GDC : Mar16.pdf

Layers: Fully Integrated
Programmatic Approach

[Diagram: Vulkan application, Loader, Layer, Installable Client Driver, Debug Report Callback]

• Application supplies list of layers
• Layers report “results” as messages
• Application handles messages in callback

Page 19: GDC : Mar16.pdf

Layers: Externally Activated
“Ad-hoc” Approach

[Diagram: Vulkan application, Loader, Layer, Layer Settings File, Installable Client Driver, Debug Report Callback]

• User sets environment variables: VK_INSTANCE_LAYERS=“layer name”
• Layers report “results” as messages
• Default Debug Report writes to output stream

Page 20: GDC : Mar16.pdf

Demo We’ll Be Using

“Hologram” by Chia-I Wu (olv)

• Well-written Vulkan demo
• Simulation of 5000 moving objects
• Demonstrates multi-threaded command buffer recording
• Can be found in: https://github.com/LunarG/VulkanSamples

Page 21: GDC : Mar16.pdf

Demo!

Watch the demo for a minute or so

Page 22: GDC : Mar16.pdf

A Few Hologram Internals – Object Data

5000 ShaderParamBlocks

struct ShaderParamBlock {
    float light_pos[4];
    float light_color[4];
    float model[4 * 4];
    float view_projection[4 * 4];
};

One ShaderParamBlock per Object

For Each Frame and For Each Object:
• Modify ShaderParamBlock
• BindDescriptorSet

Two Frames of Object Data

Page 23: GDC : Mar16.pdf

Modify Demo

Let’s add code to modulate the transparency of each object, independently, as a function of time. To do this, we need to:

1. Add a parameter to the ShaderParamBlock: “per-object” alpha
2. Modify the shader program to apply the per-object alpha
3. Modify the Simulation to change the transparency of each object over time

Start with Step 1!

struct ShaderParamBlock {
    float light_pos[4];
    float light_color[4];
    float model[4 * 4];
    float view_projection[4 * 4];
    float alpha;
};
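One side effect of Step 1 is easy to miss: the block grows from 40 floats to 41, so any buffer sizing or per-object offset that assumed the old size is now wrong. The sketch below derives offsets from `sizeof` instead of a constant. The slides do not say this is what the demo goes on to show; `align_up`, `block_offset`, and the alignment value passed to them are hypothetical, and a real application would query the device's offset alignment limit.

```c
#include <stddef.h>

struct ShaderParamBlock {
    float light_pos[4];
    float light_color[4];
    float model[4 * 4];
    float view_projection[4 * 4];
    float alpha;
};

/* Round size up to the next multiple of alignment.
 * alignment must be a power of two. */
size_t align_up(size_t size, size_t alignment)
{
    return (size + alignment - 1) & ~(alignment - 1);
}

/* Byte offset of object `index` when blocks are stored at an aligned
 * stride derived from the current struct size. */
size_t block_offset(size_t index, size_t alignment)
{
    return index * align_up(sizeof(struct ShaderParamBlock), alignment);
}
```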

Page 24: GDC : Mar16.pdf

Let’s See What Happens

Change the code and re-run demo

Page 25: GDC : Mar16.pdf

More Information

• Layer Documentation
- LunarXchange website (https://vulkan.lunarg.com/app/docs/latest/layers)
- More details on validation and other layers

• Screenshot Layer
- Good for showing someone else what is wrong
- Also can be used for before/after image-compare testing

• Vktrace/Vkreplay
- Useful for sending someone a trace file in lieu of setting up a reproduction scenario

Page 26: GDC : Mar16.pdf

A next gen Engine design on a next gen API

Dan Baker

Graphics Architect, Oxide Games

Page 27: GDC : Mar16.pdf

Nitrous design philosophies

• Job based threading

• Message based systems

• Redundant, shallow state design

• Always evaluate – opposite of Lazy Evaluation

• Efficient memory streaming

• Asynchronous systems

Page 28: GDC : Mar16.pdf

Data driven design

Unit AI System

MessageQueue

Physics Queue

FOW queue

Minimap queue

Message Dispatcher

Page 29: GDC : Mar16.pdf

Relating to Graphics Stack

• Collection of messages and systems extends into graphics

• Dozens of independent systems can operate in parallel

• Big systems internally parallelize (e.g. particles, unit rendering)

Page 30: GDC : Mar16.pdf

A modern API

• Concept of message based, asynchronous design well matched

Exposure of asynchronous nature of a GPU is the key design difference of Vulkan over OpenGL/D3D11

Page 31: GDC : Mar16.pdf

A contract between App and API

• Application will not make conflicting calls on the same objects (e.g. writing one object while another is reading it)

• Driver will generally not lock or serialize any API call
– Context information is embedded on the object being operated on
– With the exception of occasional CPU-side memory allocation (but this should be a rare occurrence on create calls)

Page 32: GDC : Mar16.pdf

Application runs parallel to GPU

[Diagram: Application and GPU, with even command buffers, odd command buffers, a delete queue for each, and a flush queue]

Page 34: GDC : Mar16.pdf

Review

• When we say Vulkan is free threaded, we mean
– Most API function calls are operators. They operate only on the data passed into them as output, and only read the data passed in as input
– API function calls are transparent for thread safety: valid to call so long as there are no read/write or write/write hazards. It is the app’s responsibility to manage them
– GPU/CPU hazards are explicitly exposed. GPUs are read operators on data, therefore read/write hazards between CPU/GPU must also be managed by the application
– In general, API function calls will not have locks in them
• With the exception of calls which must allocate some types of memory

Page 35: GDC : Mar16.pdf

Old way

[Timeline diagram, current frame: Sim, AI, Physics, and Game jobs spread across Cores 1-6; Graphics (opaque, in driver) on its own core; dead time; a point marked “GPU Fence, or CPU wait???”]

Page 36: GDC : Mar16.pdf

Old Way

Driver related cores. Missing time due to thread accounting and system level synchronization primitives

Lots of unused CPU space! Engine is just waiting for driver to be done

Page 37: GDC : Mar16.pdf

Powerful New model

[Timeline diagram, current frame and next frame: Sim, AI, and Game jobs plus Vulkan CMD jobs spread across Cores 1-6; a Vk present job; a GPU fence at end of frame]

Page 38: GDC : Mar16.pdf

New way

Vulkan simulation using a modified Mantle build to simulate infinitely fast GPU

Page 39: GDC : Mar16.pdf

Difficult part of Vulkan

• Need to have a strategy for rendering up front, not lazy eval

• Before you can set up a shader, you need to understand bindings; before bindings, you need to understand descriptors
– Probably need to know these even before a descriptor is created

• The more you can know about a render job at compile time, the easier Vulkan will be

Page 40: GDC : Mar16.pdf

Setting up the Engine

• Pipelines created up front, combination(s) specified in shader language

• No concept of individual shader stages – Vertex/Fragment considered one block

• 64 MB temp buffer created for each frame
– Shader constants
– No buffers are updated directly
– Any updates are dumped into staging buffer and copied
– When 64 MB is exceeded, slow allocation path is used, typically only during initialization

• Internal command format that can be built in parallel
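A per-frame temp buffer with a slow fallback path is commonly built as a bump allocator. The sketch below shows that general technique under assumptions; it is not Nitrous code. `FrameAllocator` and its functions are hypothetical names, the fallback simply uses `malloc`, and alignments are assumed to be powers of two.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *base;            /* start of the per-frame temp buffer */
    size_t   capacity;        /* 64 MB in the talk; any size works here */
    size_t   used;            /* bump pointer, reset once per frame */
    size_t   slow_path_hits;  /* how often the fallback was needed */
} FrameAllocator;

/* Called at the start of each frame: all temp data is discarded at once. */
void frame_reset(FrameAllocator *a) { a->used = 0; }

/* Fast path: bump allocation out of the frame buffer.
 * Slow path (hypothetical): malloc when the buffer is exhausted; the
 * caller then owns that memory. *from_heap reports which path ran. */
void *frame_alloc(FrameAllocator *a, size_t size, size_t align, int *from_heap)
{
    size_t offset = (a->used + align - 1) & ~(align - 1);
    if (offset <= a->capacity && size <= a->capacity - offset) {
        a->used = offset + size;
        *from_heap = 0;
        return a->base + offset;
    }
    a->slow_path_hits++;
    *from_heap = 1;
    return malloc(size);
}
```

Counting slow-path hits makes it easy to confirm that the fallback only triggers during initialization, as the slide suggests it should.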

Page 41: GDC : Mar16.pdf

Shader Combos

• Large, monolithic blocks with much state folded in

– Shaders

– Alpha state

– MSAA state

– Depth State

• Managing combinatorics is major challenge

Page 42: GDC : Mar16.pdf

Shader Combos

• Very unlikely that hardware actually needs to create unique pipeline object
– The problem is that each hardware has a different state that might require a new shader

• Vulkan has bulk shader create
– Give a bunch of shader combinations at once to driver
– Most likely driver only has to create a few actual shaders

• Nitrous does group creates
– 20-40 combinations of a pipeline that might get used. A little bit of pruning for shader author

Page 43: GDC : Mar16.pdf

Pipeline serialization

• Major problem with D3D12

• Serialization context is passed into shader create

– Needed because most pipelines are not unique

• Driver will use this as a database to store compiled pipeline objects

• Can serialize the whole database

Page 44: GDC : Mar16.pdf

Texture Sets

• Nitrous eliminates individual shader bindings

• Textures must be part of groups

• Maps to a descriptor set

Page 45: GDC : Mar16.pdf

Bind Vector

[Diagram: Batch, Shader Set, Primitive (vertices), and two bind vectors, each made up of seven Texture Sets and five Constant Sets]

Page 46: GDC : Mar16.pdf

Bind Vector

[Diagram: a bind vector made up of seven Texture Sets and five Constant Sets]

• Becomes a Layout in Vulkan
• Layouts are specified during the shader creation stage
• Nitrous uses only 1 master layout
• Most engines will use multiple
• Switching layouts has cost
• Can easily filter out redundant changes, only call bind descriptor when something needs changing

Page 47: GDC : Mar16.pdf

Managing hazards

• The trickiest part of Vulkan
• Must manage any time a resource will be used differently
– Cache Flush
– Operator barrier
– Decompression

• USE THE VALIDATOR
– Could get correct results on current hardware only to see problems on future hardware
– No different than multi-threaded coding

• Consider having engine layer automatically partially calculate barriers
– Good design should do a good job
– Nitrous is 100% explicit right now, but will likely switch to a partially automatic system

Page 48: GDC : Mar16.pdf

General performance

• Shader auto recompiling won’t happen automatically
– Constant folding

– But no frame stutters due to recompiles

• Memory barriers can introduce stalls

– Need to plan out

• Changing pipelines, layouts frequently

Page 49: GDC : Mar16.pdf

Threading/Command buffers

• Best idea is to have many command buffers, but 1 allocator per thread per frame queued

• Command buffer allocation can cause memory bloat

• Nitrous sorts command buffers from estimated size, largest first, down to smallest
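The largest-first ordering in the last bullet can be sketched with the standard library sort. The slide does not say why this order is used or how size is estimated, so `CmdBufferJob` and its `estimated_size` field are hypothetical; the sketch shows the ordering only.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    int    id;              /* hypothetical identifier for a command buffer job */
    size_t estimated_size;  /* engine-side estimate, e.g. a command count */
} CmdBufferJob;

/* qsort comparator: larger estimated sizes sort first. */
static int by_size_descending(const void *lhs, const void *rhs)
{
    const CmdBufferJob *a = (const CmdBufferJob *)lhs;
    const CmdBufferJob *b = (const CmdBufferJob *)rhs;
    if (a->estimated_size < b->estimated_size) return 1;
    if (a->estimated_size > b->estimated_size) return -1;
    return 0;
}

/* Order jobs from largest estimated size down to smallest. */
void sort_largest_first(CmdBufferJob *jobs, size_t count)
{
    qsort(jobs, count, sizeof jobs[0], by_size_descending);
}
```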

Page 50: GDC : Mar16.pdf

Questions

twitter: dankbaker, oxidegames

Page 51: GDC : Mar16.pdf

Performance Lessons from Porting Source 2 to Vulkan

Dan Ginsburg

Page 52: GDC : Mar16.pdf

Overview

Dota 2 Vulkan Performance Results

Performance Lessons Learned

Page 54: GDC : Mar16.pdf

Source 2 Overview

OpenGL, Direct3D 9, Direct3D 11, Vulkan

Windows, Linux, Mac

Dota 2 Reborn

Page 56: GDC : Mar16.pdf

Dota 2 Performance Results - Disclaimer

Not an ideal showcase for Vulkan

Source 2 renderer is multithreaded, but…

Dota 2 is only ~1500 draw calls per frame

Allows DX/GL a frame of latency to avoid being

renderthread bound

Does not (yet!) take advantage of:

Baking descriptors

Command buffer resubmission

Still very pleased with results!

Page 59: GDC : Mar16.pdf

Dota 2 Vulkan Performance – DX9 Latency

Frame Start Frame End Present Issued

DX9 Latency: 3.8ms

Page 62: GDC : Mar16.pdf

Dota 2 Vulkan Performance – Vulkan Latency

Frame Start Frame End Present Issued

Vulkan Latency: 0.4ms (!)

Page 63: GDC : Mar16.pdf

Dota 2 Vulkan – Latency Reduction

Renderthread no longer a bottleneck

Reduces “wallclock” time of frame

Time from end of frame to present reduced by 3.4ms

Really important for:

Latency sensitive games (eSports)

VR

Page 64: GDC : Mar16.pdf

Dota 2 Vulkan - Framerate

Two timedemos:

Typical Dota 2 Match

High Drawcall Battle Scene

Test system:

NVIDIA TITAN X 356.45

i7-3770k @ 3.50GHz

Test settings:

Resolution: 640x480 (CPU Perf)

Highest Rendering Quality

Vulkan/GL/DX9/DX11

Page 66: GDC : Mar16.pdf

Dota 2 Timedemo – Typical Dota 2 Match

[Bar chart, FPS. Values as extracted: 182.95, 170.55, 188.5, 128.1. Legend as extracted: Vulkan, OpenGL, DX9, DX11. System: NVIDIA TITAN X, i7 3770k, 640x480, 356.45, HQ]

Page 67: GDC : Mar16.pdf

Dota 2 Timedemo – Battle Scene

Page 68: GDC : Mar16.pdf

Dota 2 – High Drawcall Timedemo

[Bar chart, FPS. Values as extracted: 85.3, 75.15, 75.65, 67.5. Legend as extracted: Vulkan, OpenGL, DX9, DX11. System: NVIDIA TITAN X, i7 3770k, 640x480, 356.45, HQ]

Page 69: GDC : Mar16.pdf

Dota 2 Vulkan Performance - Overall

Significant latency reduction

Improved framerate in heavy scenes

Only going to get better…

Page 71: GDC : Mar16.pdf

Overview

Dota 2 Vulkan Performance Results

Performance Lessons Learned

Command Buffer Recycling

Command Buffer Batching

Redundant Call Filtering

Updating Descriptors

Pipeline Cache Usage

Page 72: GDC : Mar16.pdf

Command Buffer Recycling Overview

At least one VkCommandPool per thread

Recycling options:

vkResetCommandPool – resets all command buffers in

pool

vkResetCommandBuffer – reset single command buffer

Reset can either recycle or release resources

Page 73: GDC : Mar16.pdf

Command Buffer Recycling

Source 2 recycles individual command buffers after completion

vkBeginCommandBuffer costly

Using VK_COMMAND_BUFFER_RESET_RELEASE_RESOURCES_BIT

Driver reallocates resources

Done to reduce memory footprint, but came at perf cost

Page 74: GDC : Mar16.pdf

Fast Command Buffer Recycling

vkCreateCommandPool

Use VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT

vkResetCommandBuffer( pCmdBuffer, 0 )

flags == 0, keeps resources for reuse

Downside: memory growth

Source 2 strategy for handling memory growth:

Destroy command buffers no longer needed

Heuristic to destroy command buffers

Page 75: GDC : Mar16.pdf

Command Buffer Batching

vkQueueSubmit implies a flush

Also has CPU costs – memory residency

Important to batch submits

Page 78: GDC : Mar16.pdf

Command Buffer Batching

Batched submit: ~0.7ms / frame

Unbatched submits: ~4.5ms / frame

Page 79: GDC : Mar16.pdf

Source 2 Command Buffer Batching

Gather command buffers on renderthread

Up to a threshold, needed during load time

Wait for present request

Issue single submit with all batched command buffers
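The gather-then-submit pattern above can be sketched as a small batcher. This is an illustration under assumptions, not Source 2 code: `SubmitBatcher`, `MAX_BATCH`, and the integer handles are hypothetical, and `batcher_flush` stands in for a single queue submit that carries every pending command buffer.

```c
#include <stddef.h>

#define MAX_BATCH 8  /* hypothetical threshold */

typedef struct {
    int    pending[MAX_BATCH];  /* stand-ins for command buffer handles */
    size_t count;               /* command buffers gathered so far */
    size_t submits;             /* number of submits actually issued */
} SubmitBatcher;

/* Stand-in for one queue submit covering all pending command buffers. */
static void batcher_flush(SubmitBatcher *b)
{
    if (b->count == 0)
        return;
    b->submits++;
    b->count = 0;
}

/* Gather a command buffer; flush early only if the threshold is hit,
 * which the slide notes is needed during load time. */
void batcher_add(SubmitBatcher *b, int cmd_buffer)
{
    if (b->count == MAX_BATCH)
        batcher_flush(b);
    b->pending[b->count++] = cmd_buffer;
}

/* On a present request, everything gathered goes out in one submit. */
void batcher_on_present(SubmitBatcher *b) { batcher_flush(b); }
```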

Page 80: GDC : Mar16.pdf

Redundant Call Filtering

Your job now!

Vulkan drivers may not (should not!) filter calls

If we don’t do it, we will force IHVs to

Hurts the good apps at the expense of the bad

Examples from Source 2:

vkCmdBindIndexBuffer

vkCmdBindVertexBuffers

vkCmdBindPipeline

Dynamic render state

vkCmdSet*
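Application-side filtering usually amounts to remembering the last value recorded and skipping identical binds. The sketch below shows that idea for two of the calls listed above. `CommandState` and its functions are hypothetical names, handles are modelled as plain integers with 0 meaning "nothing bound", and the places where a real engine would call the Vulkan function are marked in comments.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t bound_pipeline;      /* last pipeline recorded; 0 = none */
    uint64_t bound_index_buffer;  /* last index buffer recorded; 0 = none */
    unsigned calls_issued;        /* binds actually forwarded to the API */
} CommandState;

/* Cached state must not leak across command buffers, so clear it
 * whenever recording of a new one begins. */
void state_begin_command_buffer(CommandState *s)
{
    s->bound_pipeline = 0;
    s->bound_index_buffer = 0;
}

/* Returns true if the bind was recorded, false if it was filtered. */
bool state_bind_pipeline(CommandState *s, uint64_t pipeline)
{
    if (s->bound_pipeline == pipeline)
        return false;
    s->bound_pipeline = pipeline;
    s->calls_issued++;            /* real code: vkCmdBindPipeline here */
    return true;
}

bool state_bind_index_buffer(CommandState *s, uint64_t buffer)
{
    if (s->bound_index_buffer == buffer)
        return false;
    s->bound_index_buffer = buffer;
    s->calls_issued++;            /* real code: vkCmdBindIndexBuffer here */
    return true;
}
```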

Page 81: GDC : Mar16.pdf

Updating Descriptors

vkUpdateDescriptorSets #1 hotspot

vkCmdBindDescriptorSets #2 hotspot

Source 2 approach:

Single pipeline layout shared across all pipelines

Descriptor sets will have unused entries

Update/bind descriptor set per draw

Not efficient!

Page 83: GDC : Mar16.pdf

Updating Descriptors – The Right Way

In shaders, organize descriptor sets by update

frequency

Bake descriptor sets up front

Use compatible pipeline layouts to simplify descriptor

allocation

…we plan to do this in the future. Will help perf a lot.

Page 84: GDC : Mar16.pdf

Pipeline Creation

vkCreateShaderModule is relatively fast

Loads in the SPIR-V, no heavy compilation

~0.01ms in Dota 2

vkCreateGraphicsPipelines is expensive

Driver performs shader compile here

0.2 – 152ms in Dota 2 before cache is warmed

Page 85: GDC : Mar16.pdf

Vulkan Pipeline Cache

Serialize compiled pipelines to disk

Preload to remove first-time stutters

Header contains VendorID/DeviceID/UUID

Otherwise opaque format

Avoid unnecessary shader compiles

Driver de-duplicates

Only driver knows when recompile is needed based on

state

Pipeline cache should contain only unique pipelines

Allows compilation on multiple threads

Merge later using vkMergePipelineCaches
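Because the header carries VendorID/DeviceID/UUID, an application can check a cache file against the current device before preloading it. The sketch below assumes the version-one header layout as I understand it from the specification: a 32-bit header length, a 32-bit header version equal to 1, vendor ID, device ID, then a 16-byte UUID, with the integers stored least significant byte first. The function name and the way the device values are passed in are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_UUID_SIZE   16
#define CACHE_HEADER_SIZE (16 + CACHE_UUID_SIZE)

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* True if the blob's header matches the device we are running on.
 * Layout is an assumption based on the version-one cache header. */
bool cache_blob_matches_device(const uint8_t *blob, size_t size,
                               uint32_t vendor_id, uint32_t device_id,
                               const uint8_t uuid[CACHE_UUID_SIZE])
{
    if (size < CACHE_HEADER_SIZE) return false;
    if (read_u32_le(blob) < CACHE_HEADER_SIZE) return false;  /* header length */
    if (read_u32_le(blob + 4) != 1) return false;             /* header version */
    if (read_u32_le(blob + 8) != vendor_id) return false;
    if (read_u32_le(blob + 12) != device_id) return false;
    return memcmp(blob + 16, uuid, CACHE_UUID_SIZE) == 0;
}
```

Everything after the header is opaque, as the slide notes, so the check stops there.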

Page 86: GDC : Mar16.pdf

Summary

Dota 2 Vulkan Performance Results

Reduced latency

Improved framerate in expensive scenes

Performance Lessons Learned

Command Buffer Recycling

Command Buffer Batching

Redundant Call Filtering

Updating Descriptors

Pipeline Cache Usage

Page 87: GDC : Mar16.pdf

Questions?

Page 88: GDC : Mar16.pdf

Vulkan Does Retro
A Vulkan Use-Case Study with RetroArch and libretro

Hans-Kristian Arntzen – GDC 2016

Page 89: GDC : Mar16.pdf

Background

• Me

• Multimedia programming since 2009

• Co-founder of RetroArch project in 2010-2011

• Working at ARM hacking on the Mali GPUs since 2014

• Contributed Vulkan backend on launch day

• RetroArch / libretro

• Multi-platform system optimized for enjoying retro content

• Plugin abstraction to support many different systems

• Strong focus on portability and performance

Page 90: GDC : Mar16.pdf

Problem

• Retro content usually needs to render on CPU

• Emulators of classic consoles in particular are a prime example

• Get software rendered images to screen fast and reliably

• Blazing fast texture uploads part of the equation

CPU

GPU magic

Page 91: GDC : Mar16.pdf

Streaming with Vulkan

• Vulkan exposes VK_IMAGE_TILING_LINEAR
• Finally! For some reason, never added to OpenGL

• GPUs can sample from these textures
• At least on the Vulkan drivers I have tested ...
• No reason to copy from linear to optimal layout (used once!)

• Vulkan supports persistently mapped memory
• Finally, us GLES folks can do it right

• Combine this to a dream scenario
• Persistently map a ring buffer of linear textures
• Let libretro core render directly into HOST_VISIBLE memory or use pure memcpy()
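The memcpy() variant of the dream scenario can be sketched as a row-by-row copy into whichever ring slot is next. This is a sketch under assumptions: the destination pointer stands for persistently mapped HOST_VISIBLE memory, `dst_pitch` stands for the row pitch the driver reports for a linear image, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a CPU-rendered frame into a mapped linear image. Rows in the
 * destination may be padded, so each row is copied separately. */
void upload_frame(uint8_t *dst, size_t dst_pitch,
                  const uint8_t *src, size_t src_pitch,
                  size_t row_bytes, size_t height)
{
    for (size_t y = 0; y < height; ++y)
        memcpy(dst + y * dst_pitch, src + y * src_pitch, row_bytes);
}

/* Advance to the next slot in a ring of mapped textures. */
size_t next_ring_slot(size_t current, size_t ring_size)
{
    return (current + 1) % ring_size;
}
```

For the NES case benchmarked later in the talk, `row_bytes` would be 256 pixels at 4 bytes each and `height` would be 240.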

Page 92: GDC : Mar16.pdf

Caveats

• Vulkan doesn’t require support for sampling linear textures
• Might need fallback

• Linear textures might not be DEVICE_LOCAL
• Mostly a desktop thing
• Might need same fallback as before ...

• Memory might not be cached
• Fallback to copy if we want to blend on the surface

• Simple, vendor-neutral fallbacks
• If we hit either case, copy linear texture to DEVICE_LOCAL
• Might as well copy to OPTIMAL tiling layout
• vkCmdCopyImage (or vkCmdCopyBufferToImage)

Page 93: GDC : Mar16.pdf

The various ways to copy ...

• Ring buffered textures with glTexSubImage appear to be best
• We already did the hard part for the driver
• Texture is not in use by GPU, should allow optimal path
• Only way in pure GLES2

• Classic async PBO uploads have extra overhead on all drivers
• After all, have to copy to PBO, then copy to texture
• Doesn’t accomplish anything over plain SubImage in our case

• AZDO-style PBO seems interesting ... but
• Observed bizarre 10x performance dips in TexSubImage
• So much for that ...

• On Raspberry Pi 1, things got weirder ...
• Optimal path was uploading to OpenVG texture
• Share image with GLES via EGL ...

Page 94: GDC : Mar16.pdf

Benchmark

• NES video from Nestopia libretro core

• 256x240 resolution @ 32 bpp

• Ran through RetroArch’s Vulkan and GL backends

• Measurements

• Time to copy texture from CPU to texture

• Time spent overall to submit frame

• Measured on Linux

Page 95: GDC : Mar16.pdf

OpenGL results

• Sure, we’re measuring in microseconds
• We can do so much better!

• * GL calls were blocking mid-frame
• Probably rate-limiting waiting for older frames

CPU              | GPU                  | Copy OpenGL (µs) | Frame OpenGL (µs)
i5-5257U @ 2.70  | Intel HD 6100 (Mesa) | 130              | N/A (*)
i7 920 @ 2.66    | nVidia GTX 760       | 272              | 302
Cortex-A17 @ 1.8 | Mali T-764           | 585              | 806

Page 96: GDC : Mar16.pdf

Vulkan delivers!

• Copy time essentially a memcpy() benchmark

• Overall frame times way better than the GL texture upload!

• Great uplifts across the board

• Still room for improvement

CPU              | GPU                  | Copy Vulkan (µs) | Frame Vulkan (µs) | Copy uplift
i5-5257U @ 2.70  | Intel HD 6100 (Mesa) | 27               | 122               | 352 %
i7 920 @ 2.66    | nVidia GTX 760       | 46               | 69                | 491 %
Cortex-A17 @ 1.8 | Mali T-764           | 80               | 215               | 631 %

Page 97: GDC : Mar16.pdf

Conclusion

• Even humble 2D applications can gain from Vulkan

• Not reserved for the highest-end engine developers

• Vulkan provides a far more direct and simple path to perf

• Fast paths are more obvious than before

• Going from good to great is much simpler in Vulkan

Page 98: GDC : Mar16.pdf

THANKS!

@themaister

github.com/Themaister

github.com/libretro/RetroArch

Page 99: GDC : Mar16.pdf

Porting Cinder to Vulkan
Hai Nguyen, Google

Page 100: GDC : Mar16.pdf

GFXBench 5 - Aztec Ruins
Benchmarking Vulkan

Gergely Juhasz, Lead Gfx Engineer @Kishonti

Page 101: GDC : Mar16.pdf

GFXBench 5 in a nutshell

• Concept
• Working title: Aztec Ruins

• Entirely new rendering engine
• In-house render API for Vulkan, Metal, DX12
• Also on OpenGL 4.3+, ES 3.2, DX11 for comparison
• Algorithmic and workload parity across different backends

• High-end graphics features
• Real time dynamic GI
• Complex shading and advanced post-effects

• State
• Near to Beta
• Gold version expected by Q3

Page 102: GDC : Mar16.pdf

Actual engine footage

Page 103: GDC : Mar16.pdf

Render pipeline – Direct lights

Page 104: GDC : Mar16.pdf

Render pipeline – Dynamic shadows

Page 105: GDC : Mar16.pdf

Render pipeline – Global illumination

Page 106: GDC : Mar16.pdf

Render pipeline – Post-process

Page 107: GDC : Mar16.pdf

Global illumination

• Probes capture the lighting conditions

• SH is generated for every probe

• Final scene is shaded by deferred irradiance lights

• Fits well with Vulkan’s subpass concept

Page 108: GDC : Mar16.pdf

Subpass 1 – Geometry

Page 109: GDC : Mar16.pdf

Subpass 2 – Lighting

Page 110: GDC : Mar16.pdf

Final step – Post effects

Page 111: GDC : Mar16.pdf

Multi-threaded command recording 1

[Diagram: a render job is made up of render targets, render states, and drawcalls; render jobs A-F form a dependency graph]

Pipeline consists of several render jobs

Page 112: GDC : Mar16.pdf

Multi-threaded command recording 2

[Diagram: four command buffers, the main thread, and the command queue]

Main rendering thread submits the command buffers according to the dependency graph

Page 113: GDC : Mar16.pdf

Future development plans

• Planned rendering features

• Indirect specular highlights and shadows by GI

• Deferred decals

• Animated vegetation

• Compute based motion blur

• Atmospheric effects, particles

• VR

Page 116: GDC : Mar16.pdf

Comparing Vulkan to OpenGL (ES)

Barthold Lichtenbelt
March 16, 2016

Page 117: GDC : Mar16.pdf

Beneficial Vulkan Scenarios

[Flowchart: from “start”, each box connects via “yes” toward “Vulkan friendly”]

• Is your graphics work CPU bound?
• Can your graphics creation be parallelized?
• Your graphics platform is fixed
• You’ll do what it takes to squeeze out max perf.
• You put a premium on avoiding hitches
• You can manage your graphics resource allocations

Page 118: GDC : Mar16.pdf

Unlikely to Benefit

Scenarios to reconsider coding to Vulkan

1. Need for compatibility to pre-Vulkan platforms
2. Heavily GPU-bound application
3. Heavily CPU-bound application due to non-graphics work
4. Single-threaded application, unlikely to change
5. App can target middle-ware engine, avoiding 3D graphics API dependencies

• Consider using an engine targeting Vulkan, instead of coding Vulkan yourself

Page 119: GDC : Mar16.pdf

Comparing OpenGL, AZDO, and Vulkan

Issue | Naïve GL | AZDO | Vulkan
Deterministic state validation/pre-compilation | no | no | yes
Improved single thread performance | no | yes | yes
Multi-threaded work creation | no | partial | yes
Multi-threaded work submission (to driver) | no | no | yes
GPU based work creation | no | partial | partial (through MDI)
Ability to re-use created work | no | partial | yes
Multi-threaded resource updates | no | yes | yes
Learning curve | low | high | significant
Effort | low | high | significant

Page 120: GDC : Mar16.pdf

Fish demo

• Vulkan and OpenGL ES 3.1

• Can change

- # of schools of fish

- # of fish per school

- # of fish per drawcall

• Worker threads create command buffers in Vulkan mode

• Reports

- Drawcalls/sec

- FPS

- CPU time per thread

- GPU time

• Android and Windows

• Source code will be available soon

Page 121: GDC : Mar16.pdf

200K Fishies, 100 fish per draw call

[Bar chart: drawcalls / sec (axis 0 to 200,000), OpenGL ES vs. Vulkan, on GeForce GTX 980, SHIELD Android TV, and SHIELD Tablet K1; bars annotated 7x, 1.5x, and 1.2x]

Page 122: GDC : Mar16.pdf

200K Fishies, 1 fish per draw call

[Bar chart: drawcalls / sec (axis 0 to 18,000,000), OpenGL ES vs. Vulkan, on GeForce GTX 980, SHIELD Android TV, and SHIELD Tablet K1; bars annotated 6x, 5x, and 19x]
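To relate the two chart configurations, the arithmetic behind "fish per draw call" can be sketched as below. This is not taken from the demo source; the function names are hypothetical, and the FPS relation assumes every fish is drawn exactly once per frame.

```python
def draws_per_frame(total_fish, fish_per_draw):
    """Draw calls needed to draw every fish once (ceiling division)."""
    return -(-total_fish // fish_per_draw)


def frames_per_second(draws_per_sec, total_fish, fish_per_draw):
    """FPS implied by a draw-call rate, assuming every fish is drawn each frame."""
    return draws_per_sec / draws_per_frame(total_fish, fish_per_draw)


batched = draws_per_frame(200_000, 100)   # 100 fish per draw call
unbatched = draws_per_frame(200_000, 1)   # 1 fish per draw call
```

With 200K fish, the batched case needs 2,000 draw calls per frame and the unbatched case 200,000, which is why the second chart puts far more pressure on CPU-side draw-call submission than the first.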

Page 123: GDC : Mar16.pdf

FISH DEMO

Page 124: GDC : Mar16.pdf

Porting Cinder to Vulkan
Learning to Follow Rules
Hai Nguyen
Creative Technology Lead
Art Copy & Code Project

Page 125: GDC : Mar16.pdf

Vulkan: Lots of rules and no mercy.

~Joseph Campbell (paraphrased)

Page 126: GDC : Mar16.pdf

Introducing Cinder

● What’s creative coding?
○ Programming with aesthetic intent

● What platforms does Cinder run on?
○ Android, Linux, Windows, iOS and OS X

● Open source under Simplified BSD

C++ Creative Coding Framework | https://libcinder.org

Porting Cinder to Vulkan

Page 127: GDC : Mar16.pdf

Cinder: Who/What/Where?

● Who is Cinder’s target audience?
○ Creative coders

● What is Cinder used for?
○ Apps: mobile to desktop to Times Square

● Where has Cinder been used?

Audience and Projects

Page 128: GDC : Mar16.pdf

Projects That Use Cinder

Grove | Simon Geilfus
Planetary | BLOOM.io
SCAD Museum | Pentagram
IBM THINK | Mirada
Samsung CenterStage | TBG
Dia Lights | Kollision
Audi Urban Future | Kollision
Androidify | Red Paper Heart
Taxi, Taxi! | Robert Hodgin

Page 129: GDC : Mar16.pdf

Porting Cinder to Vulkan

● Vulkanizing Cinder

● Crossing Vendor Implementations

● Speed Bumps

The Road To Glory

Page 130: GDC : Mar16.pdf

Vulkanizing Cinder

● Added RendererVk to Cinder
○ Cinder rendering architecture is modular

● Wrapped Vulkan in C++
○ Created idiomatic layer for expression

● Created high level graphics classes
○ Textures, vertex buffers, render targets, etc.

Getting to the First Triangle

Page 131: GDC : Mar16.pdf

Vulkanizing Cinder

● Initial port on Windows: ~3 weeks
○ Included updating GLSL to Vulkan conventions

● Android and Linux port: ~3 hours (each)
○ Added platform WSI calls
○ Added platform swapchain creation

● Everything else stayed the same
○ Including GLSL shader code used in demos and tests

Going Cross Platform

Page 132: GDC : Mar16.pdf

Crossing Vendor Implementations

● Vendor implementations follow the spec
○ Conformance tested

● Slightly different behaviors
○ Image layout transitions in render passes

● Varying GPU limits/features
○ Found in VkPhysicalDeviceLimits

Implementation Details Will Vary

Page 133: GDC : Mar16.pdf

Speed Bump: Image Layout Transitions

● Initial platform allowed image layouts to be VK_IMAGE_LAYOUT_GENERAL
○ Made it easy to get up and going

● Seemed to work on other GPUs - until one didn’t
○ Why? Vendor had stricter adherence to spec

● Checked spec and added logic for transitions
○ Had to rework a good bit of code

Dad Said Yes But Mom Said No
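The "added logic for transitions" step might look something like the sketch below. It is not Cinder's implementation: `LayoutTracker` and `require` are hypothetical names, the image name is made up, and Python stands in for the C++ wrapper. The layout names are the real Vulkan enumerants, and a newly created image is assumed to start in `VK_IMAGE_LAYOUT_UNDEFINED`.

```python
# Layout names are real Vulkan enumerants; everything else is a stand-in.
UNDEFINED = "VK_IMAGE_LAYOUT_UNDEFINED"
COLOR_ATTACHMENT = "VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL"
SHADER_READ_ONLY = "VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL"


class LayoutTracker:
    """Hypothetical helper: remembers each image's layout and records the
    transitions a renderer would have to issue (e.g. as pipeline barriers)."""

    def __init__(self):
        self.current = {}      # image name -> layout it is in right now
        self.transitions = []  # (image, old_layout, new_layout), in issue order

    def require(self, image, layout):
        """Ensure `image` is in `layout` before it is used."""
        old = self.current.get(image, UNDEFINED)
        if old != layout:
            self.transitions.append((image, old, layout))
            self.current[image] = layout


tracker = LayoutTracker()
tracker.require("gbuffer", COLOR_ATTACHMENT)  # about to render into it
tracker.require("gbuffer", COLOR_ATTACHMENT)  # already there: no transition
tracker.require("gbuffer", SHADER_READ_ONLY)  # about to sample from it
```

Tracking the layout in one place means each use site only states what it needs, and a transition is recorded only when the layout actually changes.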

Page 134: GDC : Mar16.pdf

Whooops...

Page 135: GDC : Mar16.pdf

YAY!

Page 136: GDC : Mar16.pdf

Speed Bump: Not Paying Attention to Limits

● Not adhering to limits often results in crashes

● Mishandled vkCmdBindDescriptorSets
○ Exceeded maxBoundDescriptorSets

● Tried to multithread on device with 1 queue
○ Failed to check queue family’s queue count

VkPhysicalDeviceLimits / VkQueueFamilyProperties
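A minimal sketch of checking both limits up front, rather than finding out through a crash, is shown below. `DeviceCaps` and `check_plan` are hypothetical stand-ins written in Python; in a real application the two values would be read from `VkPhysicalDeviceLimits` (`maxBoundDescriptorSets`) and `VkQueueFamilyProperties` (`queueCount`). The numbers are example values only.

```python
from dataclasses import dataclass


@dataclass
class DeviceCaps:
    """Stand-in for values an application would query at run time."""
    max_bound_descriptor_sets: int  # mirrors maxBoundDescriptorSets
    queue_count: int                # mirrors a queue family's queueCount


def check_plan(caps, descriptor_sets_needed, queues_wanted):
    """Return (queues_to_use, problems) before any Vulkan call is made."""
    problems = []
    if descriptor_sets_needed > caps.max_bound_descriptor_sets:
        problems.append(
            f"need {descriptor_sets_needed} bound descriptor sets, "
            f"device allows {caps.max_bound_descriptor_sets}"
        )
    # Never plan on more queues than the family actually exposes.
    queues_to_use = min(queues_wanted, caps.queue_count)
    if queues_to_use < queues_wanted:
        problems.append(
            f"wanted {queues_wanted} queues, family exposes {caps.queue_count}"
        )
    return queues_to_use, problems


caps = DeviceCaps(max_bound_descriptor_sets=4, queue_count=1)  # example values
queues, problems = check_plan(caps, descriptor_sets_needed=6, queues_wanted=4)
```

Here the plan is clamped to one queue and both mismatches are reported as messages, which the application can log or use to pick a fallback path.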

Page 137: GDC : Mar16.pdf

No More Black Box / Fewer Black Screens

● Vulkan Specification
○ Clear about requirements and expectations (mostly)

● Check Device Limits / Features at Run Time
○ Easy to query in Vulkan

● Validation Layers Are Your Friends
○ Turn on from day 1 and leave on until shipped

Help Vulkan Help You

Page 138: GDC : Mar16.pdf

Antoine Labour
E. Greg Daniel
Jesse Hall
Shannon Woods
Daniel Koch
Jeff Bolz
Mathias Heyer
Piers Daniell
Tristan Lorach
John McDonald
Dominik Witczak

Special Thanks

Page 139: GDC : Mar16.pdf

Thank You!
Hai Nguyen

https://libcinder.org

Page 140: GDC : Mar16.pdf

GFXBench 5 - Aztec Ruins
Benchmarking Vulkan

Gergely Juhasz, Lead Gfx Engineer @Kishonti

Page 141: GDC : Mar16.pdf

GFXBench 5 in a nutshell

• Concept

- Working title: Aztec Ruins

• Entirely new rendering engine

- In-house render API for Vulkan, Metal, DX12

- Also on OpenGL 4.3+, ES 3.2, DX11 for comparison

- Algorithmic and workload parity across different backends

• High-end graphics features

- Real-time dynamic GI

- Complex shading and advanced post-effects

• State

- Close to beta

- Gold version expected by Q3

Page 142: GDC : Mar16.pdf

Actual engine footage

Page 143: GDC : Mar16.pdf

Render pipeline – Direct lights

Page 144: GDC : Mar16.pdf

Render pipeline – Dynamic shadows

Page 145: GDC : Mar16.pdf

Render pipeline – Global illumination

Page 146: GDC : Mar16.pdf

Render pipeline – Post-process
