Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)

HETEROGENEOUS PARTICLE BASED SIMULATION

Takahiro Harada, AMD

2 Harada, Heterogeneous Particle-based Simulation

PARTICLE BASED SIMULATION ON THE GPU

Large number of particles

Particles with identical size

– Work granularity is almost the same

– Good for the wide SIMD architecture

Harada et al. 2007


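PARTICLE BASED SIMULATION

Collision

$f_{collide} = \sum_j f_{ij}$

Integration

$\frac{dv}{dt} = \frac{f}{m}$, $\frac{dx}{dt} = v$

$v \mathrel{+}= \frac{f}{m}\,\Delta t$

$x \mathrel{+}= v\,\Delta t$

An acceleration structure is used for efficient collision detection

– Uniform grid → Suited for the GPU

– Less divergence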
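The integration step above can be sketched in a few lines. This is a minimal illustration, not the code from the talk: the `Particle` class and the `integrate` function are hypothetical names, the example is one-dimensional for brevity, and it assumes the position update uses the already updated velocity (the order in which the two updates are listed).

```python
from dataclasses import dataclass


@dataclass
class Particle:
    x: float  # position
    v: float  # velocity
    m: float  # mass


def integrate(p: Particle, f: float, dt: float) -> None:
    """Advance one particle by one time step: velocity first, then position."""
    p.v += f / p.m * dt  # v += (f / m) * dt
    p.x += p.v * dt      # x += v * dt, using the updated velocity


p = Particle(x=0.0, v=1.0, m=2.0)
integrate(p, f=4.0, dt=0.5)
print(p.x, p.v)
```

Every particle runs exactly the same two updates, which is why this stage maps well to wide SIMD hardware.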


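DIVERGENCE ON SIMD

[Figure: SIMD lanes 0 1 2 3 4 5 6 7]

void Kernel()
{
    // Lanes that disagree on A and B make the SIMD unit run the branches one after another
    if (A)
        FuncA();
    else if (B)
        FuncB();
    else
        FuncC();
}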
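A rough cost model makes the effect of divergence concrete. The sketch below is an assumption-laden simplification, not a description of any particular GPU: it supposes that all lanes share one instruction stream, so every distinct branch taken by at least one lane costs one serialized pass while the other lanes are masked off. The function name `simd_passes` is made up for this example.

```python
def simd_passes(lane_branches):
    """Number of serialized passes needed for one branchy kernel.

    lane_branches holds, per lane, the branch that lane takes ("A", "B" or "C").
    Each distinct branch taken by any lane costs one pass.
    """
    return len(set(lane_branches))


coherent = ["A"] * 8                                   # all 8 lanes agree
divergent = ["A", "B", "C", "A", "B", "C", "A", "B"]   # lanes disagree
print(simd_passes(coherent), simd_passes(divergent))
```

Under this model the divergent case takes three passes instead of one, even though each lane only needs the result of a single branch.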


PARTICLE BASED SIMULATION ON THE GPU

Particle collision using a uniform grid

[Figure: SIMD lanes 0 1 2 3 4 5 6 7]

void Kernel()
{
    prepare();
    // Every lane walks the same 3x3 block of cells in the same order
    collide(Cell0);
    collide(Cell1);
    collide(Cell2);
    collide(Cell3);
    collide(Cell4);
    collide(Cell5);
    collide(Cell6);
    collide(Cell7);
    collide(Cell8);
}

Cell0 Cell1 Cell2
Cell3 Cell4 Cell5
Cell6 Cell7 Cell8
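The nine `collide` calls correspond to a fixed loop over the 3x3 block of cells around a particle. The following is a minimal 2D sketch under stated assumptions: the function names are hypothetical, the cell size is set to one particle diameter so that a particle can only touch particles in adjacent cells, and the routine only reports overlapping pairs instead of computing forces.

```python
from collections import defaultdict
from itertools import product


def build_grid(positions, cell_size):
    """Map each cell coordinate to the indices of the particles inside it."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        grid[(int(x // cell_size), int(y // cell_size))].append(i)
    return grid


def colliding_pairs(positions, radius):
    """Return sorted index pairs (i, j), i < j, of overlapping particles."""
    cell_size = 2.0 * radius  # one diameter: contacts are limited to adjacent cells
    grid = build_grid(positions, cell_size)
    pairs = set()
    for i, (x, y) in enumerate(positions):
        cx, cy = int(x // cell_size), int(y // cell_size)
        # The same fixed loop over 9 cells for every particle (Cell0 .. Cell8).
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in grid.get((cx + dx, cy + dy), ()):
                if j > i:
                    qx, qy = positions[j]
                    if (x - qx) ** 2 + (y - qy) ** 2 < (2.0 * radius) ** 2:
                        pairs.add((i, j))
    return sorted(pairs)


points = [(0.0, 0.0), (0.15, 0.0), (0.25, 0.0), (1.0, 1.0)]
print(colliding_pairs(points, radius=0.1))
```

Because every particle has the same radius, every particle does the same amount of lookup work, which is the property the slide relies on.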


MIXED PARTICLE SIMULATION

The simulation contains not only small particles but also large ones

Difficult for GPUs

– Large particles interact with small particles

– Large-large collision


CHALLENGE

Non-uniform work granularity

– Small-small (SS) collision

Uniform, GPU

– Large-large (LL) collision

Non-uniform, CPU

– Large-small (LS) collision

Non-uniform, CPU
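One plausible way to see why the work for large particles is non-uniform is to count how many cells of the small-particle grid a particle overlaps. The sketch below is an illustration only and rests on an assumption the slides do not spell out: that a particle must visit every grid cell its bounding box touches. The function name `cells_overlapped` is hypothetical.

```python
import math


def cells_overlapped(center, radius, cell_size):
    """Number of 3D grid cells overlapped by a particle's bounding box."""
    count = 1
    for c in center:
        lo = math.floor((c - radius) / cell_size)
        hi = math.floor((c + radius) / cell_size)
        count *= hi - lo + 1  # cells spanned along this axis
    return count


center = (0.5, 0.5, 0.5)
print(cells_overlapped(center, radius=0.25, cell_size=1.0))  # a small particle
print(cells_overlapped(center, radius=2.0, cell_size=1.0))   # a large particle
```

The count grows with the cube of the radius, so two large particles of different sizes can differ widely in cost, while all small particles cost about the same.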


FUSION ARCHITECTURE

CPU and GPU are:

– On the same die

– Much closer to each other

– Efficient data sharing

CPU and GPU are good at different kinds of work

– CPU: serial computation, conditional branches

– GPU: parallel computation

Work can be dispatched to the better-suited processor:

– Serial work with varying granularity → CPU

– Parallel work with uniform granularity → GPU


MIXED PARTICLE SIMULATION

Benefits from the Fusion Architecture

– A simulation contains different kinds of work

– CPU & GPU work together

– They share data


METHOD


TWO SIMULATIONS

[Diagram: two pipelines. Small particles: Build Acc. Structure → SS Collision → S Integration, with data Position, Velocity, Force, Grid. Large particles: Build Acc. Structure → LL Collision → L Integration, with data Position, Velocity, Force. LS Collision connects the two pipelines.]


CLASSIFY BY WORK GRANULARITY

[Diagram: the stages of the small-particle and large-particle pipelines (Build Acc. Structure, SS Collision, S Integration, L Integration, LL Collision, LS Collision) and their data (Position, Velocity, Force, Grid), grouped into Uniform Work and Non Uniform Work.]


CLASSIFY BY WORK GRANULARITY, ASSIGN PROCESSOR

[Diagram: the same stages and data as before, with the uniform work assigned to the GPU and the non-uniform work assigned to the CPU.]


DATA SHARING

The grid and the small particle data have to be shared with the CPU for LS collision

– Allocated as a zero copy buffer

[Diagram: GPU and CPU pipelines for small and large particles. LS Collision on the CPU uses the shared Position, Velocity, Grid and Force buffers of the small particles.]


SYNCHRONIZATION

[Diagram: GPU and CPU pipelines for small and large particles (Build Acc. Structure, SS Collision, S Integration, LL Collision, LS Collision, L Integration) with Synchronization points between the two processors around the shared Position, Velocity, Grid and Force buffers.]


VISUALIZING WORKLOADS

[Diagram: timeline of GPU and CPU work. Stages: Build Acc. Structure, SS Collision, S Integration, LL Collision, LS Collision, Synchronization, L Integration. Data: Position, Velocity, Force, Grid for small particles; Position, Velocity, Force for large particles.]

Grid construction can be moved to the end of the pipeline

– Unbalanced workload


LOAD BALANCING

To get better load balancing

– The sync exists to pass the force buffer filled by the CPU to the GPU

– Move the LL collision after the sync

[Diagram: timeline of GPU and CPU work with LS Collision placed before the Synchronization and LL Collision moved after it.]
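The reordering can be sketched with two threads and one event. This is a toy model under stated assumptions, not the OpenCL code from the talk: both processors are played by Python threads, the collision and integration stages are replaced by trivial stand-ins, and all names (`simulate_step`, `force_ready`, and so on) are hypothetical.

```python
import threading


def simulate_step(n_small):
    ss_force = [0.0] * n_small       # filled by the "GPU" worker
    ls_force = [0.0] * n_small       # filled by the "CPU" worker, read by the "GPU" worker
    velocity = [0.0] * n_small
    done = {"ll": False}
    force_ready = threading.Event()  # the synchronization point

    def gpu_worker():
        for i in range(n_small):     # stand-in for SS collision
            ss_force[i] = 1.0
        force_ready.wait()           # block until the CPU has filled ls_force
        for i in range(n_small):     # stand-in for S integration
            velocity[i] += ss_force[i] + ls_force[i]

    def cpu_worker():
        for i in range(n_small):     # stand-in for LS collision
            ls_force[i] = 0.5
        force_ready.set()            # hand the force buffer over to the GPU
        done["ll"] = True            # stand-in for LL collision, now after the sync

    workers = [threading.Thread(target=gpu_worker),
               threading.Thread(target=cpu_worker)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return velocity, done["ll"]


print(simulate_step(4))
```

Since the GPU only needs the LS forces, signalling right after LS collision lets the small-particle integration start while the CPU is still busy with the large particles.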


MULTI THREADING (4 THREADS)

[Figure: GPU Work and CPU Work]


FURTHER OPTIMIZATION

[Diagram: timeline with rows GPU, CPU0, CPU1, CPU2. Stages: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision, Synchronization.]

1. Not optimized for “Llano”, which has a 4-core CPU

– Only 2 CPU cores were used

– 2 more cores can be used for LS collision

2. LL collision was not optimized

– The CPU waits while the GPU is constructing the grid

– Use the CPU to improve SS collision


OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION

The work cannot be split by large particle index

– More than 1 large particle can collide with the same small particle

– Memory would have to be locked on write → Inefficient

Prepare a local buffer for each thread

– A buffer storing the force on small particles

– Lock free

Local buffers are merged into one

[Figure: large particles L0 and L1, small particles S0 and S1, and threads Thread0, Thread1, Thread2]
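The per-thread buffer idea can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation from the talk: contacts are given as a precomputed list of hypothetical `(large_index, small_index, force)` tuples, forces are scalars, and the work is split round-robin over three worker threads.

```python
from concurrent.futures import ThreadPoolExecutor


def ls_forces(contacts, n_small, n_threads=3):
    """Accumulate large-small forces on small particles without locks.

    contacts: list of (large_index, small_index, force_on_small) tuples.
    """
    chunks = [contacts[t::n_threads] for t in range(n_threads)]

    def work(chunk):
        local = [0.0] * n_small  # private buffer, so no lock is needed
        for _large, small, force in chunk:
            local[small] += force
        return local

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        local_buffers = list(pool.map(work, chunks))

    # Merge step: sum the per-thread buffers into one force buffer.
    return [sum(column) for column in zip(*local_buffers)]


# L0 and L1 both touch S0; only L1 touches S1.
contacts = [(0, 0, 1.0), (1, 0, 2.0), (1, 1, 4.0)]
print(ls_forces(contacts, n_small=2))
```

The price of being lock free is one extra buffer per thread and a merge pass over the small particles.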



[Diagram: timeline before the optimization. Rows: GPU (Build Acc. Structure, SS Collision, S Integ.), CPU0 (LL Collision, L Integ.), CPU1, CPU2, with a single LS Collision and one Synchronization.]

[Diagram: timeline after the optimization. LS Collision runs three times in parallel, each followed by a Merge, with Synchronization before and after.]


OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION

Spatially coherent memory layout improves cache utilization

As particles move, spatial locality decreases





SPATIAL SORT

Sort particles by spatial location to improve cache utilization

– Z curve
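A Z curve (Morton order) key is built by interleaving the bits of the cell coordinates, so cells that are close in space tend to get nearby keys. The sketch below is a 2D illustration for 16-bit cell coordinates; the function names are hypothetical, and the simulation in the talk may well use a 3D variant.

```python
def part1by1(n):
    """Spread the low 16 bits of n so that a zero bit follows each of them."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton2d(cx, cy):
    """Interleave cell coordinates: x in the even bits, y in the odd bits."""
    return part1by1(cx) | (part1by1(cy) << 1)


cells = [(1, 1), (0, 0), (1, 0), (0, 1), (2, 0)]
print(sorted(cells, key=lambda c: morton2d(*c)))
```

Sorting particles by the key of the cell they occupy places particles from the same 2x2 block of cells next to each other in memory.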




CHOOSE SORT

Requirements

– A full sort was over budget

– A full sort is not “a must”

– Sorting is an optional computation for performance improvement

– Incremental sort

– Use multiple threads

Solution

– Used a generalized “Odd-even transition sort”


BLOCK TRANSITION SORT

Generalized “Odd-even transition sort”

Instead of sorting 2 adjacent elements, sort 2 adjacent blocks

Iterate until convergence

Use one thread to sort each pair of adjacent blocks

– 6 blocks for 3 threads

– Radix sort

[Figure: Odd-even transition sort compared with Block transition sort]
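The scheme can be sketched serially as below. It is a hedged illustration, not the author's code: the function names are hypothetical, Python's built-in `sorted` stands in for the radix sort named on the slide, and the block pairs of one pass are processed in a plain loop although they are disjoint and could each go to their own thread. The element-wise version of this scheme is more commonly known as odd-even transposition sort.

```python
def block_pass(keys, block_size, phase):
    """Sort every pair of adjacent blocks in place; return True if anything moved.

    phase 0 pairs blocks (0,1), (2,3), ...; phase 1 pairs blocks (1,2), (3,4), ...
    """
    changed = False
    start = phase * block_size
    for lo in range(start, len(keys) - block_size, 2 * block_size):
        hi = min(lo + 2 * block_size, len(keys))
        merged = sorted(keys[lo:hi])  # stand-in for the per-pair radix sort
        if merged != keys[lo:hi]:
            keys[lo:hi] = merged
            changed = True
    return changed


def block_transition_sort(keys, block_size, max_passes=None):
    """Alternate phases until two clean passes in a row, or until max_passes."""
    if len(keys) <= block_size:       # a single block: nothing to pair up
        keys.sort()
        return 0
    passes, clean, phase = 0, 0, 0
    while clean < 2 and (max_passes is None or passes < max_passes):
        clean = 0 if block_pass(keys, block_size, phase) else clean + 1
        phase ^= 1
        passes += 1
    return passes


keys = [9, 3, 7, 1, 8, 2, 6, 0, 5, 4, 11, 10]   # 6 blocks of 2 keys
block_transition_sort(keys, block_size=2, max_passes=1)
print(keys)                                      # partially sorted after one pass
block_transition_sort(keys, block_size=2)
print(keys)                                      # fully sorted at convergence
```

Capping the number of passes per frame gives the incremental behaviour listed in the requirements: the data gets more ordered every frame without ever paying for a full sort at once.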



[Diagram: timeline before the optimization. Rows: GPU (Build Acc. Structure, SS Collision, S Integ.), CPU0 (LL Collision, L Integ.), CPU1, CPU2. LS Collision runs three times in parallel, each followed by a Merge, with Synchronization before and after.]

[Diagram: timeline after the optimization. Rows: GPU (Build Acc. Structure, SS Collision, S Integ.), CPU0, CPU1, CPU2. Stages on the CPU cores: LS Collision and Merge on each core, LL Coll. and L Integ., and S Sorting on each core, separated by Synchronization points.]


DEMO

[Figure: GPU Work and CPU Work]


CONCLUSIONS

Realized a simulation that handles variable sized particles by leveraging the best features of both the CPU and GPU on AMD’s Fusion Architecture

– The CPU is used for work with non-uniform compute granularity

– The GPU is used for highly parallel work

Memory sharing between the CPU and GPU is the key to efficiency

– Avoid wasteful memory copies


REFERENCE

Takahiro Harada, Seiichi Koshizuka, Yoichiro Kawaguchi, Smoothed Particle Hydrodynamics on GPUs, Proc. of Computer Graphics International, 63-70 (2007)

Justin Hensley, Takahiro Harada, Chapter X OpenCL Case Study: Mixed Particle Simulation, Heterogeneous Computing with OpenCL, Morgan Kaufmann (2011)
