HETEROGENEOUS PARTICLE BASED SIMULATION Takahiro Harada, AMD
Jun 11, 2015
2 Harada, Heterogeneous Particle-based Simulation
PARTICLE BASED SIMULATION ON THE GPU
Large number of particles
Particles with identical size
– Work granularity is almost the same
– Good for the wide SIMD architecture
Harada et al. 2007
PARTICLE BASED SIMULATION
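Collision: $f_{\mathrm{collide}} = \sum_j f_{ij}$
Integration: $\frac{dv}{dt} = \frac{f}{m}$, $\frac{dx}{dt} = v$
– Discretized as $v \mathrel{+}= \frac{f}{m}\,\Delta t$, $x \mathrel{+}= v\,\Delta t$
An acceleration structure is used for efficient collision detection
– Uniform grid → Suited for the GPU
– Less divergence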
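The two update rules above can be sketched for a single particle as follows. This is a minimal illustration, not the presenter's implementation: the function names, the list-based vectors, and the choice to update the position with the already-updated velocity (following the order in which the slide lists the updates) are assumptions.

```python
def sum_pair_forces(pair_forces, dim=2):
    """f_collide for one particle: the component-wise sum of its pair forces f_ij."""
    total = [0.0] * dim
    for f_ij in pair_forces:
        for k in range(dim):
            total[k] += f_ij[k]
    return total


def integrate(x, v, f, m, dt):
    """Advance one particle by one time step.

    v += f / m * dt
    x += v * dt   (uses the velocity that was just updated; an assumption)
    """
    v_new = [vi + fi / m * dt for vi, fi in zip(v, f)]
    x_new = [xi + vi * dt for xi, vi in zip(x, v_new)]
    return x_new, v_new


# Hypothetical values, chosen only to make the arithmetic easy to follow
f = sum_pair_forces([(1.0, 0.0), (1.0, 2.0)])
x, v = integrate([0.0, 0.0], [1.0, 0.0], f, m=2.0, dt=0.5)
```

Both steps are independent per particle, which is why they map well to one GPU thread per particle.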
DIVERGENCE ON SIMD
SIMD lanes: 0 1 2 3 4 5 6 7
void Kernel()
{
    // Lanes that take different branches are executed one branch at a time
    if (A)
        FuncA();
    else if (B)
        FuncB();
    else
        FuncC();
}
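The cost of the branching kernel above can be illustrated with a deliberately simplified lock-step model. The model, the function name, and the per-branch costs are assumptions made for illustration; real hardware behavior is more nuanced.

```python
def lockstep_cost(lane_branches, branch_cost):
    """Cost of one SIMD group in a simple lock-step model.

    Every distinct branch taken by at least one lane is executed once
    for the whole group, so the costs of the distinct branches add up.
    """
    return sum(branch_cost[b] for b in set(lane_branches))


cost = {"A": 4, "B": 6, "C": 2}  # hypothetical per-branch costs
uniform = lockstep_cost(["A"] * 8, cost)
divergent = lockstep_cost(["A", "B", "C", "A", "B", "C", "A", "B"], cost)
```

In this model the 8 lanes that all take branch A cost 4 units, while the mixed group pays for all three branches.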
PARTICLE BASED SIMULATION ON THE GPU
Particle collision using a uniform grid
SIMD lanes: 0 1 2 3 4 5 6 7
void Kernel()
{
    prepare();
    // Every lane visits the same 9 neighboring cells, so there is little divergence
    collide(Cell0);
    collide(Cell1);
    collide(Cell2);
    collide(Cell3);
    collide(Cell4);
    collide(Cell5);
    collide(Cell6);
    collide(Cell7);
    collide(Cell8);
}
Neighboring cells:
Cell0 Cell1 Cell2
Cell3 Cell4 Cell5
Cell6 Cell7 Cell8
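A serial sketch of the same idea in 2D, assuming equal-sized particles and a cell size equal to the particle diameter so that a contact can only involve the 3×3 block of cells around a particle. The function names and the dictionary-based grid are illustrative choices, not the GPU data structure used in the talk.

```python
from collections import defaultdict


def build_grid(positions, cell_size):
    """Map each cell coordinate (cx, cy) to the indices of the particles inside it."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        grid[(int(x // cell_size), int(y // cell_size))].append(i)
    return grid


def colliding_pairs(positions, radius):
    """Return the sorted (i, j) pairs, i < j, of overlapping equal-sized particles."""
    cell_size = 2.0 * radius  # particle diameter
    grid = build_grid(positions, cell_size)
    pairs = set()
    for i, (x, y) in enumerate(positions):
        cx, cy = int(x // cell_size), int(y // cell_size)
        # Every particle visits the same 9 cells (Cell0..Cell8 on the slide)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    if j > i:
                        xj, yj = positions[j]
                        if (x - xj) ** 2 + (y - yj) ** 2 < cell_size ** 2:
                            pairs.add((i, j))
    return sorted(pairs)


pairs = colliding_pairs([(0.0, 0.0), (0.9, 0.0), (3.0, 3.0), (1.1, 0.0)], radius=0.5)
```

Because every particle runs the same fixed loop over 9 cells, the control flow is identical across particles; only the number of particles per cell varies.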
MIXED PARTICLE SIMULATION
Not only small particles
Difficulty for GPUs
– Large particles interact with small particles
– Large-large collision
CHALLENGE
Non-uniform work granularity
– Small-small (SS) collision: uniform → GPU
– Large-large (LL) collision: non-uniform → CPU
– Large-small (LS) collision: non-uniform → CPU
FUSION ARCHITECTURE
CPU and GPU are:
– On the same die
– Much closer
– Efficient data sharing
CPU and GPU are good at different kinds of work
– CPU: serial computation, conditional branches
– GPU: parallel computation
Work can be dispatched to the processor that suits it:
– Serial work with varying granularity → CPU
– Parallel work with uniform granularity → GPU
MIXED PARTICLE SIMULATION
Benefits from the Fusion Architecture
– The simulation contains different kinds of work
– The CPU and GPU work together
– They share data
METHOD
TWO SIMULATIONS
[Diagram: two simulation pipelines coupled by LS Collision]
Small particles: Build Acc. Structure → SS Collision → S Integration (data: Position, Velocity, Force, Grid)
Large particles: Build Acc. Structure → LL Collision → L Integration (data: Position, Velocity, Force)
CLASSIFY BY WORK GRANULARITY
[Diagram: the stages of the small-particle and large-particle simulations, each labeled as Uniform Work or Non Uniform Work]
Stages: Build Acc. Structure, SS Collision, S Integration, L Integration, LL Collision, LS Collision, Build Acc. Structure
Data: Position, Velocity, Force, Grid (small particles); Position, Velocity, Force (large particles)
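CLASSIFY BY WORK GRANULARITY, ASSIGN PROCESSOR
[Diagram: the same stages of the small-particle and large-particle simulations, each assigned to the GPU or the CPU]
Stages: Build Acc. Structure, SS Collision, S Integration, L Integration, LL Collision, LS Collision, Build Acc. Structure
Data: Position, Velocity, Force, Grid (small particles); Position, Velocity, Force (large particles)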
DATA SHARING
The grid and the small-particle data have to be shared with the CPU for LS collision
– Allocated as zero-copy buffers
[Diagram: GPU and CPU stages of the small-particle and large-particle simulations; LS Collision uses the shared Position, Velocity, Grid, and Force buffers]
SYNCHRONIZATION
[Diagram: GPU and CPU stages of the small-particle and large-particle simulations with Synchronization points between them; LS Collision uses the shared Position, Velocity, Grid, and Force buffers]
VISUALIZING WORKLOADS
[Diagram: GPU and CPU timelines with the stages Build Acc. Structure, SS Collision, S Integration, LL Collision, LS Collision, Synchronization, L Integration]
Grid construction can be moved to the end of the pipeline
– Unbalanced workload
LOAD BALANCING
To get better load balancing
– The sync exists to pass the force buffer filled by the CPU to the GPU
– Move the LL collision after the sync
[Diagram: GPU and CPU timelines with the stages Build Acc. Structure, SS Collision, S Integration, LS Collision, Synchronization, LL Collision, L Integration]
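One way to read this schedule is sketched below with two worker threads standing in for the GPU queue and the CPU. All five callables are hypothetical placeholders, and the ordering reflects one interpretation of the slide: SS and LS collision overlap, the sync hands the CPU-filled force buffer to the GPU, and LL collision runs on the CPU only after the sync.

```python
from concurrent.futures import ThreadPoolExecutor


def run_frame(ss_collision, ls_collision, ll_collision,
              integrate_small, integrate_large):
    """Run one frame; every argument is a hypothetical zero-argument callable."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both sides work concurrently up to the synchronization point
        gpu_side = pool.submit(ss_collision)  # stand-in for the GPU queue
        cpu_side = pool.submit(ls_collision)  # CPU fills the force buffer
        gpu_side.result()
        cpu_side.result()                     # synchronization

        # After the sync the small particles can be integrated while the
        # CPU performs the LL collision that was moved behind the sync
        gpu_side = pool.submit(integrate_small)
        cpu_side = pool.submit(lambda: (ll_collision(), integrate_large()))
        gpu_side.result()
        cpu_side.result()


events = []
run_frame(lambda: events.append("ss"), lambda: events.append("ls"),
          lambda: events.append("ll"), lambda: events.append("s_int"),
          lambda: events.append("l_int"))
```

Moving LL collision behind the sync shortens the CPU's work before the sync, so the GPU waits less for the force buffer.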
[Figure labels: GPU Work, CPU Work]
MULTI THREADING (4 THREADS)
FURTHER OPTIMIZATION
[Diagram: timeline with rows GPU, CPU0, CPU1, CPU2 and the stages Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision, Synchronization]
1. Not optimized for “Llano”, which has a 4-core CPU
– Only 2 CPU cores were used
– 2 more cores can be used for LS collision
2. LL collision was not optimized
– The CPU waits while the GPU is constructing the grid
– Use the CPU to improve SS collision
OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION
Cannot split the work by large particle indices
– More than one large particle can collide with the same small particle
– The memory would have to be locked on write → Inefficient
Prepare a local buffer for each thread
– A buffer storing the forces on small particles
– Lock free
Local buffers are then merged into one
[Figure: large particles L0, L1 and small particles S0, S1, processed by Thread0, Thread1, Thread2]
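The per-thread buffer scheme can be sketched as follows. The contact list format (tuples of large index, small index, and a scalar force) and the function name are hypothetical simplifications; in CPython the threads mainly illustrate the data layout rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def ls_forces_threaded(contacts, num_small, num_threads=3):
    """Accumulate forces on small particles from large-small contacts.

    contacts: list of (large_index, small_index, force) tuples with scalar
    forces (a hypothetical, simplified input format).
    """
    def worker(chunk):
        local = [0.0] * num_small  # private buffer: no locking needed
        for _large, small, force in chunk:
            local[small] += force
        return local

    # Give each thread a strided share of the contact list
    chunks = [contacts[t::num_threads] for t in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        local_buffers = list(pool.map(worker, chunks))

    # Merge step: sum the per-thread buffers into a single force buffer
    return [sum(column) for column in zip(*local_buffers)]


# L0 and L1 both touch S0 and S1, which would race on a shared buffer
forces = ls_forces_threaded(
    [(0, 0, 1.0), (1, 0, 2.0), (1, 1, 0.5), (0, 1, 0.25)], num_small=3)
```

The price of avoiding locks is one extra buffer per thread plus the merge pass, which is what the Merge boxes in the timeline diagrams represent.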
OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION
[Diagram: timeline with rows GPU, CPU0, CPU1, CPU2 and the stages Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision, Synchronization]
OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION
[Diagram: timeline with rows GPU, CPU0, CPU1, CPU2; LS Collision now appears three times, each with a Merge, together with Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., and two Synchronization points]
OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION
Spatially coherent memory layout improves cache utilization
As particles move, spatial locality decreases
SPATIAL SORT
Sort particles by spatial location to improve cache utilization
– Z curve
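A Z-curve ordering can be obtained by interleaving the bits of the grid cell coordinates (a Morton code) and sorting by the result. The sketch below assumes 2D, non-negative cell coordinates below 65536; the function names are illustrative, and it uses a full sort only to show the ordering, not the incremental sort the talk goes on to describe.

```python
def part1by1(n):
    """Spread the low 16 bits of n so that a zero bit separates each of them."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton2d(cx, cy):
    """Interleave the bits of two cell coordinates into a Z-curve index."""
    return part1by1(cx) | (part1by1(cy) << 1)


def sort_by_z_curve(positions, cell_size):
    """Return particle indices ordered along the Z curve of their grid cells."""
    def key(i):
        x, y = positions[i]
        return morton2d(int(x // cell_size), int(y // cell_size))
    return sorted(range(len(positions)), key=key)


order = sort_by_z_curve([(1.5, 1.5), (0.5, 0.5), (0.5, 1.5), (1.5, 0.5)], 1.0)
```

Particles that are close in space tend to receive nearby Z-curve indices, so reordering the particle arrays by this key places likely collision partners close together in memory.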
CHOOSE SORT
Requirements
– A full sort was over the budget
– A full sort is not “a must”
– Sorting is an optional computation for performance improvement
– Incremental sort
– Use multiple threads
Solution
– Used a generalized “Odd-even transition sort” (commonly called odd-even transposition sort)
BLOCK TRANSITION SORT
Generalized “Odd-even transition sort”
Instead of sorting 2 adjacent elements, sort 2 adjacent blocks
Iterate until convergence
Use a thread to sort 2 adjacent blocks
– 6 blocks for 3 threads
– Radix sort
[Figures: Odd-even transition sort; Block transition sort]
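The block scheme above can be sketched serially as follows. This is an illustration under assumptions: the function names are invented, Python's built-in `sorted` stands in for the per-pair radix sort, and the block pairs that would each go to their own thread are processed in a plain loop.

```python
def block_transition_pass(keys, block_size, phase):
    """Sort each pair of adjacent blocks in place; return True if anything moved.

    phase 0 pairs blocks (0,1), (2,3), ...; phase 1 pairs (1,2), (3,4), ...
    The pairs within a phase do not overlap, so each could be given to its
    own thread; they are processed serially here for clarity.
    """
    changed = False
    start = phase * block_size
    for lo in range(start, len(keys), 2 * block_size):
        hi = min(lo + 2 * block_size, len(keys))
        merged = sorted(keys[lo:hi])  # stand-in for the per-pair radix sort
        if merged != keys[lo:hi]:
            keys[lo:hi] = merged
            changed = True
    return changed


def block_transition_sort(keys, block_size):
    """Alternate the two phases until a full round moves nothing."""
    while True:
        moved_even = block_transition_pass(keys, block_size, 0)
        moved_odd = block_transition_pass(keys, block_size, 1)
        if not (moved_even or moved_odd):
            return keys


data = [5, 2, 9, 1, 7, 3, 8, 0, 6, 4, 11, 10]
result = block_transition_sort(list(data), block_size=2)
```

With `block_size=1` this reduces to the element-wise odd-even scheme. With 12 keys and `block_size=2` there are 6 blocks, so phase 0 has 3 independent pairs, matching the "6 blocks for 3 threads" case. Because each pass only improves the order, an incremental variant can run a fixed number of passes per frame instead of looping to convergence.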
OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION
[Diagram: timeline with rows GPU, CPU0, CPU1, CPU2; Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., three LS Collision stages each with a Merge, and two Synchronization points]
OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION
[Diagram: timeline with rows GPU, CPU0, CPU1, CPU2; Build Acc. Structure, SS Collision, S Integ., three LS Collision stages each with a Merge, LL Coll., L Integ., three S Sorting stages, and three Synchronization points]
DEMO
[Figure labels: GPU Work, CPU Work]
CONCLUSIONS
Realized a simulation that handles variable-sized particles by leveraging the best features of both the CPU and GPU on AMD’s Fusion Architecture
– The CPU is used for work with non-identical compute granularity
– The GPU is used for highly parallel work
Memory sharing between the CPU and GPU is the key to efficiency
– Avoids wasteful memory copies
REFERENCE
Takahiro Harada, Seiichi Koshizuka, Yoichiro Kawaguchi, Smoothed Particle Hydrodynamics on GPUs, Proc. of Computer Graphics International, 63-70 (2007)
Justin Hensley, Takahiro Harada, Chapter X OpenCL Case Study: Mixed Particle Simulation, Heterogeneous Computing with OpenCL, Morgan Kaufmann (2011)