A Hardware Pipeline for Accelerating Ray Traversal Algorithms


A Hardware Pipeline for Accelerating Ray Traversal Algorithms on Streaming Processors



Introduction

Ray tracing

Ray tracing algorithms

Ray traversal hardware pipeline

Streaming processors

GPGPU

Performance degradation of 1.5X-2.5X

Roll No: 7, MTech CSIS, FISAT, January 11



Introduction

2 stage traversal process

1. Hardware implementation

2. User defined algorithm




Introduction

Performance Simulator created

streaming processor architecture

Kd tree as software traversal algorithm

Software traversal steps reduced by 32X

Instructions executed reduced by 2.15X


Previous Work

Accelerated Data Structures

Hierarchical Space Subdivision Schemes

Bounding Volume Hierarchies

GPU implementations

Vector operations

Graphics Hardware

Large programmable multi-core architectures

Graphics computations in parallel

Multiple threads on each processor

Software kernels

Vector operations and vectorized processors

Pipeline Traversal Algorithm

Group Uniform Grid (GrUG)

Axis-aligned subdivision of space

Two hierarchical layers

Top Layer

Lower Layer




Grid Concepts

Spatial Subdivisions



Stepping Between Neighbours

The DDA (digital differential analyzer) method is used

Per-axis values: tmax, delta and step
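The stepping described above can be sketched as follows. This is a minimal illustration in the style of the Amanatides and Woo grid traversal, not the hardware datapath from the slides; the function name `dda_cells`, the `cell_size` parameter and the fixed `max_steps` bound are assumptions made for the example.

```python
import math

def dda_cells(origin, direction, cell_size, max_steps):
    """Return the grid cells a ray visits, stepping one neighbour at a time."""
    cell = [math.floor(o / cell_size) for o in origin]
    step, t_max, delta = [], [], []
    for o, d, c in zip(origin, direction, cell):
        if d > 0:
            step.append(1)
            t_max.append(((c + 1) * cell_size - o) / d)  # distance to next boundary
            delta.append(cell_size / d)                  # distance between boundaries
        elif d < 0:
            step.append(-1)
            t_max.append((c * cell_size - o) / d)
            delta.append(-cell_size / d)
        else:
            step.append(0)            # ray never crosses a boundary on this axis
            t_max.append(math.inf)
            delta.append(math.inf)
    visited = [tuple(cell)]
    for _ in range(max_steps):
        axis = t_max.index(min(t_max))  # axis whose boundary is reached first
        cell[axis] += step[axis]
        t_max[axis] += delta[axis]
        visited.append(tuple(cell))
    return visited
```

For a ray starting at (0.5, 0.5, 0.5) with direction (1, 0.5, 0) in a unit grid, the sketch steps in x, then y, then x again.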




Ray projection from the original GrUG grouping in A to the next GrUG grouping in B. To compute the next point along the ray for the hash function, the ray is projected by the tmin value.



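KD-Tree

[Figure: a scene split by planes X, Y and Z into regions A, B, C and D, shown beside its kd-tree (root X, inner nodes Y and Z, leaves A B C D), with the ray's tmin and tmax marked.]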

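KD-Tree Traversal

[Figure: scene regions A, B, C and D with split planes X, Y and Z, and the kd-tree (root X, inner nodes Y and Z, leaves A B C D) being traversed.]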



Observation

[Figure: scene regions A, B, C and D with split planes X, Y and Z, and the kd-tree (root X, inner nodes Y and Z, leaves A B C D).]

Current leaf's tmax = next leaf's tmin
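The observation can be illustrated with a small restart-style traversal: locate the leaf containing the current point, compute where the ray leaves that leaf, and reuse that exit distance as the entry distance of the next leaf. This is a sketch under stated assumptions, not the presenter's kernel: the tuple-based node encoding, the 2D scene extents, the `eps` nudge and all function names are hypothetical.

```python
import math

# Hypothetical node encoding: ("leaf", name) or
# ("split", axis, position, low_child, high_child).

def find_leaf(node, point, lo, hi):
    """Descend to the leaf containing `point`; return its name and bounds."""
    lo, hi = list(lo), list(hi)
    while node[0] == "split":
        _, axis, pos, low_child, high_child = node
        if point[axis] < pos:
            hi[axis] = pos
            node = low_child
        else:
            lo[axis] = pos
            node = high_child
    return node[1], lo, hi

def exit_distance(origin, direction, lo, hi):
    """Distance along the ray at which it leaves the axis-aligned box [lo, hi]."""
    t_exit = math.inf
    for o, d, a, b in zip(origin, direction, lo, hi):
        if d > 0:
            t_exit = min(t_exit, (b - o) / d)
        elif d < 0:
            t_exit = min(t_exit, (a - o) / d)
    return t_exit

def leaves_along_ray(root, origin, direction, lo, hi, eps=1e-6):
    """List leaf names in visiting order (origin assumed inside the scene)."""
    t_scene = exit_distance(origin, direction, lo, hi)
    t_min, order = 0.0, []
    while t_min < t_scene:
        # Nudge just past t_min so the sample point lies inside the next leaf.
        point = [o + (t_min + eps) * d for o, d in zip(origin, direction)]
        name, leaf_lo, leaf_hi = find_leaf(root, point, lo, hi)
        order.append(name)
        # The current leaf's tmax becomes the next leaf's tmin.
        t_min = exit_distance(origin, direction, leaf_lo, leaf_hi)
    return order

# Illustrative tree using the slide's labels: root X, inner nodes Y and Z.
ROOT = ("split", 0, 2.0,
        ("split", 1, 2.0, ("leaf", "A"), ("leaf", "B")),
        ("split", 1, 2.0, ("leaf", "C"), ("leaf", "D")))
```

With the scene spanning (0, 0) to (4, 4), a ray from (0.5, 1.0) in direction (1.0, 0.5) passes through leaves A, C and D in that order. The geometry is chosen for the example and is not taken from the slide's figure.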



Overview of GrUG

2 spatial separation methods

Uniform Grid

GrUG groups

Traversal of GrUG

Hash Table

Performs 2 mappings

Input: ray location

Output: memory address of GrUG group




Hash function starting with X, Y, Z coordinates and outputting the memory address of a GrUG grouping that can be passed to a software traversal algorithm.
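The two mappings (ray location to grid cell, then grid cell to group address) can be sketched with a flat lookup table in which several neighbouring cells share one group address. The table layout, the inclusive cell-range description of a group, the small 4 x 4 x 4 grid and the names `build_table` and `lookup` are assumptions for illustration only.

```python
def build_table(n, groups):
    """Fill an n*n*n table; `groups` holds (address, lo_cell, hi_cell) inclusive ranges."""
    table = [None] * (n * n * n)
    for address, lo, hi in groups:
        for x in range(lo[0], hi[0] + 1):
            for y in range(lo[1], hi[1] + 1):
                for z in range(lo[2], hi[2] + 1):
                    table[(x * n + y) * n + z] = address
    return table

def lookup(table, n, point):
    """Map a point with coordinates in [-1.0, 1.0] to its group address."""
    # First mapping: location -> per-axis cell index (clamped at the upper edge).
    idx = [min(int((c + 1.0) * 0.5 * n), n - 1) for c in point]
    # Second mapping: cell -> stored group address.
    return table[(idx[0] * n + idx[1]) * n + idx[2]]
```

In this sketch every cell of a group stores the same address, so a lookup needs no search regardless of how large the group is.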




Hash function implementation

3 axes concatenated to form CellID

Allows parallel processing
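Concatenating the three per-axis values can be sketched as simple bit packing. The 9-bit width matches the 512 grid mentioned elsewhere in the deck; the x, y, z ordering of the fields and the function names are assumptions, since the slide does not state them.

```python
BITS = 9  # bits per axis for a 512 grid

def make_cell_id(ix, iy, iz):
    """Concatenate three per-axis indices into one cell ID (x in the high bits)."""
    return (ix << (2 * BITS)) | (iy << BITS) | iz

def split_cell_id(cell_id):
    """Recover the per-axis indices from a packed cell ID."""
    mask = (1 << BITS) - 1
    return (cell_id >> (2 * BITS)) & mask, (cell_id >> BITS) & mask, cell_id & mask
```

Because each axis contributes its own bit field, the three indices can be computed independently and joined without any arithmetic between them.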



Hash Function Implementation




Architecture of Group Uniform Grid




Data Structure Creation

2 memory spaces

Hash table

User defined tree data structure

Starts at GrUG groupings

Kd tree is used

Uniform grid structure

Only leaf nodes need to be present in memory




Pipeline Architecture

Standalone processing block inside processor

Fixed Hardware

Memory address registers

Ray Projection

Ray undergoes GrUG traversal

Read bounding box of the GrUG groups

tmax value is computed




Pipeline architecture

Rays per clock cycle

Pipeline stages can be vectorized

Ideal for streaming processors



Integration of the GrUG pipeline into a multi-core graphics processor, and the fixed hardware stages for the GrUG pipeline.




Hash Function

Determine grid cell of a ray

Grid cell id to memory address

Locate root node for software traversal

Input: Ray location (x,y,z)

Output: 9 bit value from each hash function pipeline

Maximum supported grid size: 512 X 512 X 512

Floating point values from -1.0 to 1.0
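The per-axis mapping from a coordinate in [-1.0, 1.0] to a 9 bit index can be sketched arithmetically. This is a software stand-in, not the hardware hash pipeline from the slides (which presumably operates on the floating point bits directly); the function name `axis_index` and the clamping at the upper edge are assumptions.

```python
def axis_index(coord, grid_size=512):
    """Map a coordinate in [-1.0, 1.0] to a cell index in [0, grid_size - 1]."""
    scaled = int((coord + 1.0) * 0.5 * grid_size)
    # Clamp so that coord == 1.0 lands in the last cell rather than one past it.
    return max(0, min(scaled, grid_size - 1))
```

With the default grid size of 512 the result always fits in 9 bits, and passing a smaller `grid_size` such as 128 gives the coarser grids discussed in the results.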




Architecture of GrUG hash function for one axis using a 512 grid


Implementation

Simulator

GPGPU-Sim simulator

PTX assembly files generated by the NVIDIA NVCC compiler

PTX assembly code modification



Implementation

Kernel Code

Ray generation

Post GrUG traversal operation

Read selected GrUG grouping bounding box

Compute the ray's tmax value

Kd tree algorithm

Radius CUDA

Ray triangle intersection

Wald's algorithm



Kernel Code



Benchmark Scenes

8 scenes

Resolution 512 X 512



Results

a) Performance

[Chart: relative speedup over brute-force intersection for the Box, Bunny, Robots and Kitchen scenes; the only surviving data label is 12.9.]


Performance Results

Reduced the number of tree traversal steps by 32.5X for visible rays

Overall speedup: average 1.6X for visible rays

Performance for a grid size of 128 is improved over the software implementation by 1.9X, compared to 2.15X for a grid size of 512

Conference benchmark scene at resolution 128



Results

b) Memory


Memory Requirements

Overhead of storing hash table in memory

4 bytes / grid cell -> 4,294,967,296 GrUG groups

512 MB hash table

2 bytes / grid cell -> 65536 GrUG groups

256 MB hash table

Smaller grid size -> up to 4 MB hash table

128 grid size -> 1.5 times memory of kd tree

512 grid size -> 27.6 times memory of kd tree
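The hash table sizes follow from one entry per grid cell. The short check below reproduces the figures for a 512 grid; reading the 4 MB figure as a 128 grid at 2 bytes per cell is an assumption, since the slide does not say which configuration it refers to. The function names are hypothetical.

```python
MIB = 1024 * 1024

def hash_table_bytes(grid_size, bytes_per_cell):
    """Size of a table holding one entry per cell of a cubic grid."""
    return grid_size ** 3 * bytes_per_cell

def addressable_groups(bytes_per_cell):
    """Number of distinct GrUG groups an entry of this width can identify."""
    return 2 ** (8 * bytes_per_cell)

print(hash_table_bytes(512, 4) // MIB)  # 512 grid, 4-byte entries, size in MiB
print(hash_table_bytes(512, 2) // MIB)  # 512 grid, 2-byte entries, size in MiB
print(hash_table_bytes(128, 2) // MIB)  # 128 grid, 2-byte entries, size in MiB
```

Halving the entry width halves the table but limits the number of distinct groups to 65536, which is the trade-off the slide lists.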



Memory Requirements

Smaller grid sizes are more efficient

Balance between performance and memory

Stores kd tree structure

bounding dimensions of threshold nodes

Similar memory requirement to storing a full kd tree



Results

c) Bandwidth


Bandwidth requirements

Average memory bandwidth per frame is smaller

Fewer down-tree traversals -> fewer device memory transactions

Bandwidth is used for post-GrUG software traversal

GrUG memory bandwidth + down-tree traversal < down-tree traversals by the full software implementation



Advantages

Maintains user programmability

Increases ray tracing performance

Diverse implementation scope


Conclusion

New graphics hardware architecture

Small fixed hardware pipeline

Offload part of the acceleration traversal computations

Diverse implementation scope of processor architecture

User programmability

Overall run time performance


Future Work



Thank you

