Auto-tuning Memory Intensive Kernels for Multicore
Sam Williams  [email protected]
FUTURE TECHNOLOGIES GROUP
Dec 19, 2015
Outline
1. Challenges arising from Optimizing Single Thread Performance
2. New Challenges Arising when Optimizing Multicore SMP Performance
3. Performance Modeling and Little’s Law
4. Multicore SMPs of Interest
5. Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
6. Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
7. Summary
Challenges Arising from Optimizing Single Thread Performance
Instruction-Level Parallelism
On modern pipelined architectures, operations (like floating-point addition) have a latency of 4-6 cycles (until the result is ready).
However, independent adds can be pipelined one after another.
Although pipelining increases the peak flop rate, one can only achieve peak flops if, on any given cycle, the program has more than 4 independent adds ready to execute.
Failing to do so will result in a more than 4x drop in performance.
The problem is exacerbated by superscalar or VLIW architectures like POWER or Itanium.
One must often reorganize kernels to express more instruction-level parallelism.
ILP Example (1x1 BCSR)

Consider the core of SpMV with a 1x1 register block:

  for (all rows) {
    y0 = 0.0;
    for (all tiles in this row) {
      y0 += V[i] * X[C[i]];
    }
    y[r] = y0;
  }

There is no ILP in the inner loop; out-of-order execution can't accelerate the serial chain of FMAs.
[Figure: the dependent FMAs issue only once every 4 cycles (time = 0, 4, 8, 12, 16).]
ILP Example (1x4 BCSR)

What about 1x4 BCSR (a 1x4 register block)?

  for (all rows) {
    y0 = 0.0;
    for (all tiles in this row) {
      y0 += V[i  ] * X[C[i]  ];
      y0 += V[i+1] * X[C[i]+1];
      y0 += V[i+2] * X[C[i]+2];
      y0 += V[i+3] * X[C[i]+3];
    }
    y[r] = y0;
  }

Still no ILP in the inner loop; the FMAs are still dependent on each other.
[Figure: the dependent FMAs still issue only once every 4 cycles (time = 0, 4, 8, 12).]
ILP Example (4x1 BCSR)

What about 4x1 BCSR (a 4x1 register block)?

  for (all rows) {
    y0 = 0.0; y1 = 0.0; y2 = 0.0; y3 = 0.0;
    for (all tiles in this row) {
      y0 += V[i  ] * X[C[i]];
      y1 += V[i+1] * X[C[i]];
      y2 += V[i+2] * X[C[i]];
      y3 += V[i+3] * X[C[i]];
    }
    y[r+0] = y0; y[r+1] = y1; y[r+2] = y2; y[r+3] = y3;
  }

Updating 4 different rows: the 4 FMAs are independent, and thus they can be pipelined.
[Figure: the independent FMAs issue one per cycle (time = 0 through 7).]
Data-level Parallelism
DLP = apply the same operation to multiple independent operands.
Today, rather than relying on superscalar issue, many architectures have adopted SIMD as an efficient means of boosting peak performance. (SSE, Double Hummer, AltiVec, Cell, GPUs, etc…)
Typically these instructions operate on four single-precision (or two double-precision) numbers at a time. However, some are wider: GPUs (32), Larrabee (16), AVX (8).
Failing to use these instructions may cause a 2-32x drop in performance.
Unfortunately, most compilers utterly fail to generate these instructions.
[Figure: one SIMD instruction performing four independent adds.]
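For illustration, a minimal SSE2 sketch of explicitly SIMDized double-precision addition (the function name, 16-byte alignment, and even n are assumptions made for the sketch):

  #include <emmintrin.h>

  /* c[i] = a[i] + b[i] using 128-bit SSE2 (two doubles per instruction).
     Assumes n is even and the pointers are 16-byte aligned. */
  void vadd(double *c, const double *a, const double *b, int n) {
      for (int i = 0; i < n; i += 2) {
          __m128d va = _mm_load_pd(&a[i]);
          __m128d vb = _mm_load_pd(&b[i]);
          _mm_store_pd(&c[i], _mm_add_pd(va, vb));
      }
  }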
Memory-Level Parallelism (1)
Although caches may filter many memory requests, in HPC many memory references will still go all the way to DRAM.
Memory latency (as measured in core cycles) grew by an order of magnitude in the 90’s
Today, the latency of a memory operation can exceed 200 cycles (1 double every 80ns is unacceptably slow).
Like ILP, we wish to pipeline requests to DRAM. Several solutions exist today (a small sketch of SW line prefetch follows):
  HW stream prefetchers
  HW multithreading (e.g., hyperthreading)
  SW line prefetch
  DMA
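A minimal sketch of SW line prefetch (the prefetch distance PFD and the T0 hint are tuning assumptions):

  #include <xmmintrin.h>

  /* Sum an array while prefetching a tuned distance ahead. */
  double sum_with_prefetch(const double *a, int n) {
      const int PFD = 64;   /* elements ahead (~8 cache lines); a tuning parameter */
      double sum = 0.0;
      for (int i = 0; i < n; i++) {
          if (i + PFD < n)
              _mm_prefetch((const char *)&a[i + PFD], _MM_HINT_T0);
          sum += a[i];
      }
      return sum;
  }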
Memory-Level Parallelism (2)
HW stream prefetchers are by far the easiest to implement and exploit.
They detect a series of consecutive cache misses and speculate that the next addresses in the series will be needed. They then prefetch that data into the cache or a dedicated buffer.
To effectively exploit a HW prefetcher, ensure your array references access hundreds of consecutive addresses,
e.g., read A[i]…A[i+255] without any jumps or discontinuities.
This constraint limits the effectiveness (shape) of the cache blocking you implemented in HW1, as you accessed:
  A[(j+0)*N+i] … A[(j+0)*N+i+B], jump
  A[(j+1)*N+i] … A[(j+1)*N+i+B], jump
  A[(j+2)*N+i] … A[(j+2)*N+i+B], jump
  …
Branch Misprediction
A mispredicted branch can stall subsequent instructions by ~10 cycles.
Select a loop structure that maximizes the loop length
(keeps mispredicted branches per instruction to a minimum)
Some architectures support predication either in hardware or software to eliminate branches (transforms control dependencies into data dependencies)
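A minimal sketch of the predication idea, rewriting a data-dependent branch as a select (the clipping loop is an illustrative assumption, not a kernel from this talk):

  /* Branchy clipping: mispredicts when the sign of x[i] is unpredictable. */
  for (int i = 0; i < n; i++) {
      if (x[i] < 0.0) x[i] = 0.0;
  }

  /* Branchless form: the control dependence becomes a data dependence,
     which compilers typically lower to a conditional move or maxsd. */
  for (int i = 0; i < n; i++) {
      x[i] = (x[i] < 0.0) ? 0.0 : x[i];
  }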
Cache Subtleties
Set-associative caches have a limited number of sets (S) and ways (W), the product of which is the capacity (in cache lines).
As seen in HW1, it can be beneficial to reorganize kernels to reduce the working-set size and eliminate capacity misses.
Conflict misses can severely impair performance and can be very challenging to identify and eliminate.
A given address may only be placed in W different locations in the cache.
Poor access patterns or roughly power-of-two problem sizes can be especially bad: too many addresses map to the same set, not all of them can be kept in the cache, and some will have to be evicted.
Padding arrays (problem sizes) or skewing the access pattern can eliminate conflict misses.
Array Padding Example

Padding changes the data layout.
Consider a large matrix whose leading dimension is roughly a power of two:
  double A[N][M];   // M ~ pow2
A[i][j] and A[i+1][j] will likely be mapped to the same set.
We can pad the leading dimension with a few extra elements:
  double A[N][M+pad];
Such techniques are applicable in many other domains (stencils, lattice-Boltzmann methods, etc.).
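A minimal sketch of padded allocation (the PAD value of 8 doubles is an arbitrary assumption):

  #include <stdlib.h>

  enum { PAD = 8 };   /* a few extra doubles per row to break power-of-two strides */

  double *alloc_padded(int n, int m) {
      /* logical n x m matrix stored with a padded leading dimension of m + PAD */
      return malloc((size_t)n * (m + PAD) * sizeof(double));
  }

  static inline double *elem(double *A, int m, int i, int j) {
      return &A[(size_t)i * (m + PAD) + j];   /* address of logical A[i][j] */
  }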
New Challenges Arising when Optimizing Multicore SMP Performance
What are SMPs ?
SMP = shared-memory parallel.
Multiple chips (typically < 32 threads) can address any location in a large shared memory through a network or bus.
Caches are almost universally coherent.
You can still run MPI on an SMP, but
  you trade "free" (always paid for) cache-coherency traffic for additional memory traffic (for explicit communication), and
  you trade user-level function calls for system calls.
Alternately, you use an SPMD threading model (pthreads, OpenMP, UPC).
If communication between cores or threads is significant, then threaded implementations win out.
As the computation:communication ratio increases, MPI asymptotically approaches threaded implementations.
What is multicore? What are multicore SMPs?
Today, multiple cores are integrated on the same chip, almost universally in an SMP fashion.
For convenience, programming multicore SMPs is indistinguishable from programming multi-socket SMPs (an easy transition).
Multiple cores can share: memory controllers, caches, and occasionally FPUs.
Although there was a graceful transition from multiple sockets to multiple cores from the point of view of correctness, achieving good performance can be incredibly challenging.
Affinity
[Figure: two possible mappings of Linux processor IDs 0-7 onto cores and sockets.]
We may wish one pair of threads to share a cache, but be disjoint from another pair of threads.
We can control the mapping of threads to Linux processors via #include <sched.h> + sched_set/getaffinity().
But the mapping of Linux processors to physical cores/sockets is machine/OS dependent.
Inspect /proc/cpuinfo or use PLPA
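A minimal sketch of pinning the calling thread with sched_setaffinity (error handling omitted; the helper name is an assumption):

  #define _GNU_SOURCE
  #include <sched.h>

  /* Pin the calling thread to Linux processor `cpu`. */
  int pin_to_cpu(int cpu) {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(cpu, &set);
      return sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
  }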
NUMA Challenges
Recent multicore SMPs have integrated the memory controllers on chip.
As a result, memory access is non-uniform (NUMA): the bandwidth to read a given address varies dramatically among cores.
Exploit NUMA (affinity + first touch) when you malloc/init data.
The concept is similar to data decomposition for distributed memory.
Implicit allocation for NUMA
Consider an OpenMP example of implicit NUMA-aware initialization:

  #pragma omp parallel for
  for (j = 0; j < N; j++) {
    a[j] = 1.0;
    b[j] = 2.0;
    c[j] = 0.0;
  }

The first accesses to the arrays (read or write) must be parallelized. DO NOT TOUCH THE DATA BETWEEN MALLOC AND INIT.
When the for loop is parallelized, each thread initializes a range of j.
This exploits the OS's first-touch policy, and relies on the assumption that OpenMP maps threads consistently.
New Cache Challenges
Shared caches + SPMD programming models can exacerbate conflict misses.
Individually, threads may produce significant cache-associativity pressure based on their access pattern (power-of-two problem sizes).
Collectively, threads may produce excessive cache-associativity pressure (power-of-two problem sizes decomposed over a power-of-two number of threads).
This can be much harder to diagnose and correct.
This problem arises whether using MPI or a threaded model.
New Memory Challenges
The number of memory controllers and bandwidth on multicore SMPs is growing much slower than the number of cores.
Codes are becoming increasingly memory-bound, as a fraction of the cores can saturate a socket's memory bandwidth.
Multicore has traded bit- or word-level parallelism for thread-level parallelism.
However, main memory is still built from bit-parallel devices (DIMMs).
Memory-intensive apps must be restructured to match the bit-parallel (sequential-access) nature of DIMMs.
Synchronization
Using multiple concurrent threads can create ordering and race errors.
Locks are one solution. Must balance granularity and frequency
SPMD programming model + barriers are often a better/simpler solution.
spin barriers can be orders of magnitude faster than pthread library barriers. (Rajesh Nishtala, HotPar’09)
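A minimal sketch of a sense-reversing spin barrier using GCC atomic builtins (this is an illustrative assumption, not the barrier from the cited HotPar'09 work; each thread keeps its own local sense flag):

  typedef struct {
      volatile int count;    /* threads that have arrived */
      volatile int sense;    /* flips each barrier episode */
      int nthreads;
  } spin_barrier_t;

  void spin_barrier_wait(spin_barrier_t *b, int *local_sense) {
      *local_sense = !*local_sense;
      if (__sync_add_and_fetch(&b->count, 1) == b->nthreads) {
          b->count = 0;            /* last arrival resets the count ... */
          b->sense = *local_sense; /* ... and releases everyone */
      } else {
          while (b->sense != *local_sense)
              ;                    /* spin until released */
      }
  }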
Performance Modeling and Little’s Law
System Abstraction
Abstractly describe any system (or subsystem) as a combination of black-boxed storage, computational units, and the bandwidth between them.
These can be hierarchically composed.
A volume of data must be transferred from the storage component, processed, and another volume of data must be returned.
Consider the basic parameters governing performance of the channel: bandwidth, latency, and concurrency.
  Bandwidth can be measured in GB/s, Gflop/s, MIPS, etc.
  Latency can be measured in seconds, cycles, etc.
  Concurrency is the volume in flight across the channel, and can be measured in bytes, cache lines, operations, instructions, etc.
[Figure: hierarchy of DRAM, CPU cache, core, register file (RF), and functional units (FUs).]
Little’s Law
Little's Law relates concurrency, bandwidth, and latency. To achieve peak bandwidth, one must satisfy:

  Concurrency = Latency × Bandwidth

For example, a memory controller with 20 GB/s of bandwidth and 100 ns of latency requires the CPU to express 2 KB of concurrency (memory-level parallelism).
Similarly, given an expressed concurrency, one can bound attained performance:

  BW_attained = min( Concurrency_expressed / Latency , BW_max )

That is, as more concurrency is injected, we get progressively better performance, up to the bandwidth limit.
Note, this assumes continual, pipelined accesses.
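A minimal sketch of the arithmetic in the example above (the 64-byte cache-line size is an assumption):

  #include <stdio.h>

  int main(void) {
      double bw_gbs     = 20.0;    /* bandwidth: 20 GB/s  (example above) */
      double latency_ns = 100.0;   /* latency: 100 ns     (example above) */
      /* GB/s x ns = bytes, so the units cancel directly. */
      double concurrency_bytes = bw_gbs * latency_ns;
      printf("required concurrency: %.0f bytes (~%.0f cache lines of 64 B)\n",
             concurrency_bytes, concurrency_bytes / 64.0);
      return 0;
  }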
Where’s the bottleneck?
We've described the bandwidths between DRAM, the CPU cache, the core, the register file, and the functional units.
But in an application, one of these may be a performance-limiting bottleneck.
We can take any pair and compare how quickly data can be transferred with how quickly it can be processed to determine the bottleneck.
Arithmetic Intensity
Consider the first case (DRAM-CPU).
True arithmetic intensity (AI) ~ total flops / total DRAM bytes.
Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality), but on others it remains constant.
Arithmetic intensity is ultimately limited by compulsory traffic, and is diminished by conflict or capacity misses.
[Figure: AI spectrum; O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods.]
Kernel Arithmetic Intensity and Architecture
For a given architecture, one may calculate its flop:byte ratio.
For a 2.3 GHz quad-core Opteron:
  1 SIMD (2-wide DP) add + 1 SIMD (2-wide DP) multiply per cycle per core = 4 flops/cycle/core, so 2.3 GHz × 4 cores × 4 = 36.8 Gflop/s
  12.8 GB/s of DRAM bandwidth
  36.8 / 12.8 ~ 2.9 flops per byte
When a kernel's arithmetic intensity is substantially less than the architecture's flop:byte ratio, transferring data will take longer than computing on it: the kernel is memory-bound.
When a kernel's arithmetic intensity is substantially greater than the architecture's flop:byte ratio, computation will take longer than the data transfers: the kernel is compute-bound.
Memory Traffic Definition
Total bytes to/from DRAM.
Can be categorized into: compulsory misses, capacity misses, conflict misses, write allocations, …
Oblivious of any lack of sub-cache-line spatial locality.
Roofline Model: Basic Concept

Synthesize communication, computation, and locality into a single visually intuitive performance figure using bound and bottleneck analysis.

  AttainablePerformance_ij = min( FLOP/s with Optimizations_1..i , AI × Bandwidth with Optimizations_1..j )

where optimization i can be SIMDize, or unroll, or SW prefetch, …
Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance.
Moreover, it provides insight as to which optimizations will potentially be beneficial.
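A minimal sketch of this bound as code (the peak and bandwidth values are inputs, not asserted numbers):

  /* Roofline bound: attainable Gflop/s for a kernel of arithmetic intensity `ai`
     on a machine with `peak_gflops` in-core peak and `stream_gbs` of bandwidth. */
  double roofline_gflops(double ai, double peak_gflops, double stream_gbs) {
      double bw_bound = ai * stream_gbs;
      return bw_bound < peak_gflops ? bw_bound : peak_gflops;
  }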
Roofline Model: Basic Concept

Plot on a log-log scale. Given AI, we can easily bound performance.
But architectures are much more complicated.
We will bound performance as we eliminate specific forms of in-core parallelism.
[Figure: Opteron 2356 (Barcelona) roofline of attainable GFLOP/s vs. actual flop:byte ratio, bounded by peak DP and Stream bandwidth.]
Roofline Model: Computational Ceilings

Opterons have dedicated multipliers and adders.
If the code is dominated by adds, then attainable performance is half of peak.
We call these ceilings; they act like constraints on performance.
[Figure: the Barcelona roofline with a "mul/add imbalance" ceiling below peak DP.]
Roofline Model: Computational Ceilings

Opterons have 128-bit datapaths.
If instructions aren't SIMDized, attainable performance will be halved.
[Figure: a "w/out SIMD" ceiling added below the "mul/add imbalance" ceiling.]
Roofline Model: Computational Ceilings

On Opterons, floating-point instructions have a 4-cycle latency.
If we don't express 4-way ILP, performance will drop by as much as 4x.
[Figure: a "w/out ILP" ceiling added below the "w/out SIMD" ceiling.]
Roofline Model: Communication Ceilings

We can perform a similar exercise, taking away parallelism from the memory subsystem.
[Figure: the Barcelona roofline bounded by peak DP and Stream bandwidth.]
Roofline Model: Communication Ceilings

Explicit software prefetch instructions are required to achieve peak bandwidth.
[Figure: a "w/out SW prefetch" bandwidth ceiling below the Stream bandwidth diagonal.]
Roofline Model: Communication Ceilings

Opterons are NUMA. As such, memory traffic must be correctly balanced between the two sockets to achieve good Stream bandwidth.
We could continue this by examining strided or random memory access patterns.
[Figure: a "w/out NUMA" bandwidth ceiling added below the "w/out SW prefetch" ceiling.]
Roofline Model: Computation + Communication Ceilings

We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.
[Figure: the Barcelona roofline with both the in-core ceilings (mul/add imbalance, w/out SIMD, w/out ILP) and the bandwidth ceilings (w/out SW prefetch, w/out NUMA).]
Roofline Model: Locality Walls

Remember, memory traffic includes more than just compulsory misses.
As such, actual arithmetic intensity may be substantially lower.
Walls are unique to the architecture-kernel combination.

  AI = FLOPs / Compulsory Misses

[Figure: a vertical "only compulsory miss traffic" wall at the kernel's compulsory arithmetic intensity.]
Roofline Model: Locality Walls

  AI = FLOPs / (Allocations + Compulsory Misses)

[Figure: the "+write allocation traffic" wall sits to the left of the compulsory-only wall.]
Roofline Model: Locality Walls

  AI = FLOPs / (Capacity + Allocations + Compulsory Misses)

[Figure: the "+capacity miss traffic" wall moves the arithmetic intensity further left.]
Roofline Model: Locality Walls

  AI = FLOPs / (Conflict + Capacity + Allocations + Compulsory Misses)

[Figure: the "+conflict miss traffic" wall moves the arithmetic intensity furthest left.]
Optimization Categorization

Maximizing (attained) In-core Performance
  Exploit in-core parallelism (ILP, DLP, etc.)
  Good (enough) floating-point balance
  Techniques: unroll & jam, explicit SIMD, reorder, eliminate branches

Maximizing (attained) Memory Bandwidth
  Exploit NUMA
  Hide memory latency
  Satisfy Little's Law
  Techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking

Minimizing (total) Memory Traffic
  Eliminate capacity misses, conflict misses, compulsory misses, and write-allocate behavior
  Techniques: cache blocking, array padding, compress data, streaming stores
Roofline Model: Locality Walls

Optimizations remove these walls and ceilings, which act to constrain performance.
[Figure: the full Barcelona roofline with all in-core ceilings, bandwidth ceilings, and locality walls in place.]
Roofline Model: Locality Walls

[Figure: traffic-minimizing optimizations remove the conflict, capacity, and write-allocation walls, leaving only compulsory miss traffic.]
Roofline Model: Locality Walls

[Figure: in-core optimizations then remove the mul/add imbalance, SIMD, and ILP ceilings.]
Roofline Model: Locality Walls

[Figure: bandwidth optimizations finally remove the SW prefetch and NUMA ceilings, leaving peak DP, Stream bandwidth, and compulsory miss traffic.]
Optimization Categorization

Each of these optimizations (maximizing in-core performance, maximizing memory bandwidth, minimizing memory traffic) has a large parameter space.
What are the optimal parameters?
Auto-tuning?
Provides performance portability across the existing breadth and evolution of microprocessors.
The one-time, up-front productivity cost is amortized over the number of machines it's used on.
Auto-tuning does not invent new optimizations.
Auto-tuning automates the code generation and the exploration of the optimization and parameter space.
Two components (a sketch of the exploration loop follows):
  a parameterized code generator (we wrote ours in Perl)
  an auto-tuning exploration benchmark (combination of heuristics and exhaustive search)
Can be extended with ISA-specific optimizations (e.g., DMA, SIMD).
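A minimal sketch of the exploration component (matrix_t, convert_to_bcsr, and time_spmv are hypothetical stand-ins for the generated kernels and the timing harness):

  /* Exhaustive search over small register-block shapes; keep the fastest. */
  void tune_block_size(matrix_t *A, const double *x, double *y, int *best_r, int *best_c) {
      double best_time = 1e30;
      for (int r = 1; r <= 4; r++) {
          for (int c = 1; c <= 4; c++) {
              convert_to_bcsr(A, r, c);            /* hypothetical: rebuild A as r x c BCSR */
              double t = time_spmv(A, x, y, 10);   /* hypothetical: best of 10 timed SpMVs */
              if (t < best_time) { best_time = t; *best_r = r; *best_c = c; }
          }
      }
  }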
Multicore SMPs of Interest (used throughout the rest of the talk)
Multicore SMPs Used

AMD Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown), IBM QS20 Cell Blade, Sun T2+ T5140 (Victoria Falls)

Memory hierarchy: conventional cache-based on the x86 and Niagara machines; local store-based on the Cell SPEs.
CMT (Chip-MultiThreading): Victoria Falls.
Threads: Opteron 8, Xeon 8, Cell Blade 16*, Victoria Falls 128
Peak double-precision flops: Xeon 75 GFlop/s, Opteron 74 GFlop/s, Cell Blade 29* GFlop/s, Victoria Falls 19 GFlop/s
Total DRAM bandwidth: Xeon 21 GB/s (read) + 10 GB/s (write), Opteron 21 GB/s, Cell Blade 51 GB/s, Victoria Falls 42 GB/s (read) + 21 GB/s (write)
Non-Uniform Memory Access (NUMA): the Opteron, Cell, and Victoria Falls systems have on-chip memory controllers.
(*SPEs only)
Roofline Model for these Multicore SMPs

[Figure: rooflines for the Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and QS20 Cell Blade (PPEs and SPEs): attainable GFLOP/s vs. actual flop:byte ratio, each with its in-core ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, or 25%/12%/6% FP instruction mix) and its bandwidth ceilings (Stream bandwidth, bandwidth on small vs. large datasets, w/out SW prefetch, w/out NUMA, misaligned DMA).]

Note: the multithreaded Niagara (Victoria Falls) is limited by instruction mix rather than by a lack of expressed in-core parallelism.
Clearly, some architectures are more dependent on bandwidth optimizations while others depend more on in-core optimizations.
Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Sparse Matrix-Vector Multiplication
What's a sparse matrix?
  Most entries are 0.0
  Performance advantage in only storing/operating on the nonzeros
  Requires significant metadata to reconstruct the matrix structure
What's SpMV?
  Evaluate y = Ax, where A is a sparse matrix and x & y are dense vectors
Challenges
  Very low arithmetic intensity (often < 0.166 flops/byte)
  Difficult to exploit ILP (bad for pipelined or superscalar) and DLP (bad for SIMD)

CSR reference code:

  for (r = 0; r < A.rows; r++) {
    double y0 = 0.0;
    for (i = A.rowStart[r]; i < A.rowStart[r+1]; i++) {
      y0 += A.val[i] * x[A.col[i]];
    }
    y[r] = y0;
  }

[Figure: (a) algebraic view of y = Ax; (b) the CSR data structure (A.val[], A.col[], A.rowStart[]).]
The Dataset (matrices)
Unlike dense BLAS, performance is dictated by sparsity.
Suite of 14 matrices, all bigger than the caches of our SMPs.
We'll also include a median performance number.
[Figure: the 14 matrices (Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, webbase, LP), ranging from a 2K x 2K dense matrix stored in sparse format, through well-structured matrices (sorted by nonzeros/row) and a poorly structured hodgepodge, to an extreme-aspect-ratio linear-programming matrix.]
SpMV Performance (simple parallelization)

Out-of-the-box SpMV performance on a suite of 14 matrices.
Scalability isn't great. Is this performance good?
[Figure: Naïve and Naïve Pthreads performance bars on each machine.]
NUMA for SpMV
On NUMA architectures, all large arrays should be partitioned either
  explicitly (multiple malloc()'s + affinity), or
  implicitly (parallelize the initialization and rely on first touch)
You cannot partition on granularities less than the page size:
  512 elements on x86
  2M elements on Niagara
For SpMV, partition the matrix and perform multiple malloc()'s.
Pin submatrices so they are co-located with the cores tasked to process them.
Prefetch for SpMV
SW prefetch injects more MLP into the memory subsystem.
Can try to prefetch the values, the indices, the source vector, or any combination thereof.
In general, should only insert one prefetch per cache line (works best on unrolled code).

  for (all rows) {
    y0 = 0.0; y1 = 0.0; y2 = 0.0; y3 = 0.0;
    for (all tiles in this row) {
      PREFETCH(V + i + PFDistance);
      y0 += V[i  ] * X[C[i]];
      y1 += V[i+1] * X[C[i]];
      y2 += V[i+2] * X[C[i]];
      y3 += V[i+3] * X[C[i]];
    }
    y[r+0] = y0; y[r+1] = y1; y[r+2] = y2; y[r+3] = y3;
  }
SpMV Performance (NUMA and Software Prefetching)

NUMA-aware allocation is essential on memory-bound NUMA SMPs.
Explicit software prefetching can boost bandwidth and change cache replacement policies.
Cell PPEs are likely latency-limited.
(used exhaustive search)
[Figure: performance bars with NUMA/affinity and SW prefetching added.]
ILP/DLP vs Bandwidth
In the multicore era, which is the bigger issue:
  a lack of ILP/DLP (a major advantage of BCSR), or
  insufficient memory bandwidth per core?
On many architectures, when running low-arithmetic-intensity kernels there is so little available memory bandwidth (per core) that you won't notice a complete lack of ILP.
Perhaps we should concentrate on minimizing memory traffic rather than maximizing ILP/DLP.
Rather than benchmarking every combination, just select the register blocking that minimizes the matrix footprint (a sketch follows).
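A minimal sketch of such a footprint heuristic (matrix_t and count_blocks are hypothetical helpers; count_blocks would count the r x c blocks needed to cover all nonzeros, including fill):

  /* BCSR footprint: each block stores r*c doubles plus one column index;
     each block row stores one row pointer. */
  size_t bcsr_bytes(size_t nblocks, size_t nrows, int r, int c) {
      return nblocks * ((size_t)r * c * sizeof(double) + sizeof(int))
           + (nrows / r + 1) * sizeof(int);
  }

  void pick_block(const matrix_t *A, int *best_r, int *best_c) {
      size_t best = (size_t)-1;
      for (int r = 1; r <= 8; r++)
          for (int c = 1; c <= 8; c++) {
              size_t bytes = bcsr_bytes(count_blocks(A, r, c), A->nrows, r, c);
              if (bytes < best) { best = bytes; *best_r = r; *best_c = c; }
          }
  }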
SpMV Performance (Matrix Compression)

After maximizing memory bandwidth, the only hope is to minimize memory traffic.
Exploit register blocking, other formats, and smaller indices.
Use a traffic-minimization heuristic rather than search.
The benefit is clearly matrix-dependent.
Register blocking enables efficient software prefetching (one prefetch per cache line).
Cache Blocking for SpMV

Cache-blocking sparse matrices is very different from cache-blocking dense matrices.
Rather than changing loop bounds, store entire submatrices contiguously.
The columns spanned by each cache block are selected so that all submatrices place the same pressure on the cache, i.e., touch the same number of unique source-vector cache lines.
TLB blocking is a similar concept, but instead of 8-byte granularities it uses 4KB granularities.
[Figure: matrix rows partitioned among threads 0-3, with cache blocks of varying width.]
Auto-tuned SpMV Performance (cache and TLB blocking)

Fully auto-tuned SpMV performance across the suite of matrices.
Why do some optimizations work better on some architectures?
[Figure: stacked bars: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking.]
Auto-tuned SpMV Performance (architecture-specific optimizations)

Fully auto-tuned SpMV performance across the suite of matrices, now including an SPE/local-store optimized version.
Why do some optimizations work better on some architectures?
[Figure: the same stacked bars with the Cell SPE implementation included.]
Auto-tuned SpMV Performance (max speedup)

Fully auto-tuned SpMV performance across the suite of matrices, including the SPE/local-store optimized version.
Why do some optimizations work better on some architectures?
[Figure: the same bars annotated with maximum speedups of 2.7x, 4.0x, 2.9x, and 35x across the four machines.]
Auto-tuned SpMV Performance (architecture-specific optimizations)

Fully auto-tuned SpMV performance across the suite of matrices, including the SPE/local-store optimized version.
Auto-tuning resulted in better performance, but did it result in good performance?
Roofline Model for SpMV

Double-precision roofline models: in-core optimizations 1..i, DRAM optimizations 1..j.
FMA is inherent in SpMV (place it at the bottom of the in-core ceilings).

  GFlops_i,j(AI) = min( InCoreGFlops_i , StreamBW_j × AI )

[Figure: rooflines for the Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade, with their in-core ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, 25%/12% FP) and bandwidth ceilings (Stream bandwidth, dataset fits in snoop filter, w/out SW prefetch, w/out NUMA, bank conflicts).]
Roofline Model for SpMV (overlay arithmetic intensity)

Two unit-stride streams; inherent FMA; no ILP; no DLP; FP is 12-25% of instructions.
Naïve compulsory flop:byte < 0.166.
There is no naïve SPE implementation.
[Figure: the four rooflines with SpMV's arithmetic intensity (< 0.166) overlaid.]
Roofline Model for SpMV (out-of-the-box parallel)

Two unit-stride streams; inherent FMA; no ILP; no DLP; FP is 12-25% of instructions; naïve compulsory flop:byte < 0.166.
For simplicity: a dense matrix stored in sparse format.
There is no naïve SPE implementation.
[Figure: out-of-the-box parallel performance plotted on each roofline.]
Roofline Model for SpMV (NUMA & SW prefetch)

Compulsory flop:byte ~ 0.166; utilize all memory channels.
[Figure: performance after NUMA-aware allocation and software prefetching, plotted on each roofline.]
Roofline Model for SpMV (matrix compression)

Inherent FMA. Register blocking improves ILP, DLP, the flop:byte ratio, and the FP fraction of instructions.
[Figure: performance after matrix compression, plotted on each roofline.]
Roofline Model for SpMV (matrix compression)

Inherent FMA. Register blocking improves ILP, DLP, the flop:byte ratio, and the FP fraction of instructions.
[Figure: the tuned performance reaches the bandwidth diagonal on each machine; performance is bandwidth-limited.]
SpMV Performance (summary)

[Figure: median SpMV performance across the four machines.]
Aside: unlike LBMHD, SSE was unnecessary to achieve performance.
Cell still requires a non-portable, ISA-specific implementation to achieve good performance.
Novel SpMV implementations may require ISA-specific (SSE) code to achieve better performance.
Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track.
LBMHD
Plasma turbulence simulation via the Lattice Boltzmann Method.
Two distributions:
  momentum distribution (27 scalar components)
  magnetic distribution (15 vector components)
Three macroscopic quantities: density, momentum (vector), magnetic field (vector).
Arithmetic intensity:
  Must read 73 doubles and update 79 doubles per lattice update (1216 bytes)
  Requires about 1300 floating-point operations per lattice update
  Just over 1.0 flops/byte (ideal)
Cache capacity requirements are independent of problem size.
Two problem sizes: 64^3 (0.3 GB) and 128^3 (2.5 GB).
[Figure: the 27-direction momentum distribution, 15-direction magnetic distribution, and macroscopic variables on a lattice with periodic boundary conditions.]
LBMHD Performance (reference implementation)

Generally, scalability looks good. But is performance good?
(*collision() only)
[Figure: Naïve+NUMA performance bars on each machine.]
LBMHD Performance (lattice-aware array padding)

LBMHD touches >150 arrays; most caches have limited associativity, so conflict misses are likely.
Apply a heuristic to pad the arrays.
[Figure: performance bars: Naïve+NUMA, +Padding.]
Vectorization
There are two phases within a lattice method's collision() operator:
  reconstruction of the macroscopic variables
  updating the discretized velocities
Normally this is done one point at a time. Change it to do a vector's worth of points at a time (loop interchange + tuning); a generic sketch follows.
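A generic sketch of that loop interchange (VL, npoints, and the two phase functions are hypothetical placeholders, not the actual LBMHD code):

  /* VL points are pushed through phase 1, then through phase 2 (loop interchange). */
  void collision_vectorized(int npoints, int VL) {
      for (int base = 0; base < npoints; base += VL) {
          int end = (base + VL < npoints) ? base + VL : npoints;
          for (int p = base; p < end; p++) reconstruct_macroscopics(p);  /* phase 1 */
          for (int p = base; p < end; p++) update_velocities(p);         /* phase 2 */
      }
  }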
LBMHD Performance (vectorization)

Restructure loops to attain good TLB page locality and streaming accesses.
(*collision() only)
[Figure: performance bars: Naïve+NUMA, +Padding, +Vectorization.]
LBMHD Performance (architecture-specific optimizations)

Add unrolling and reordering of the inner loop.
Additionally, exploit SIMD explicitly where the compiler doesn't.
Include an SPE/local-store optimized version.
(*collision() only)
[Figure: performance bars: Naïve+NUMA (+small pages), +Padding, +Vectorization, +Unrolling, +SW Prefetching, +Explicit SIMDization.]
LBMHD Performance (architecture-specific optimizations)

(*collision() only)
[Figure: the same bars annotated with overall auto-tuning speedups of 1.6x, 4x, 3x, and 130x across the four machines.]
Roofline Model for LBMHD

Far more adds than multiplies (imbalance). Huge data sets.
[Figure: rooflines for the four machines with their in-core ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, 25%/12% FP) and bandwidth ceilings (Stream bandwidth, dataset fits in snoop filter, w/out SW prefetch, w/out NUMA, bank conflicts).]
Roofline Model for LBMHD (overlay arithmetic intensity)

Far more adds than multiplies (imbalance). Essentially random access to memory.
Flop:byte ratio ~0.7; NUMA allocation/access; little ILP; no DLP; high conflict misses.
There is no naïve SPE implementation.
[Figure: the four rooflines with LBMHD's arithmetic intensity (~0.7) overlaid.]
Roofline Model for LBMHD (out-of-the-box parallel performance)

Far more adds than multiplies (imbalance). Essentially random access to memory.
Flop:byte ratio ~0.7; NUMA allocation/access; little ILP; no DLP; high conflict misses.
Peak Victoria Falls performance comes with 64 threads (out of 128) due to high conflict misses.
There is no naïve SPE implementation.
[Figure: out-of-the-box parallel performance plotted on each roofline.]
Roofline Model for LBMHD (padding, vectorization, unrolling, reordering, …)

Vectorize the code to eliminate TLB capacity misses; this ensures unit-stride access (the bottom bandwidth ceiling).
Tune for the optimal vector length (VL).
The Clovertown is pinned to its lower bandwidth ceiling.
There is no naïve SPE implementation.
[Figure: tuned performance plotted on each roofline.]
Roofline Model for LBMHD (SIMDization + cache bypass)

Make SIMDization explicit. (Technically, this swaps the ILP and SIMD ceilings.)
Use the cache-bypass instruction movntpd; this increases the flop:byte ratio to ~1.0 on x86 and Cell.
[Figure: the rooflines with LBMHD's arithmetic intensity increased to ~1.0.]
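A minimal sketch of a cache-bypassing (streaming) store with SSE2 intrinsics (alignment and even length are assumptions; this is illustrative, not the tuned LBMHD kernel):

  #include <emmintrin.h>

  /* Copy doubles using movntpd so the destination is not read into the cache
     (avoids write-allocate traffic). dst must be 16-byte aligned; n must be even. */
  void stream_copy(double *dst, const double *src, long n) {
      for (long i = 0; i < n; i += 2) {
          __m128d v = _mm_load_pd(&src[i]);
          _mm_stream_pd(&dst[i], v);
      }
      _mm_sfence();   /* make the non-temporal stores globally visible */
  }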
Roofline Model for LBMHD (SIMDization + cache bypass)

[Figure: with explicit SIMDization and cache bypass, 3 out of 4 machines hit the Roofline.]
LBMHD Performance (Summary)

The reference code is clearly insufficient.
Portable C code is insufficient on Barcelona and Cell.
Cell gets all its performance from the SPEs, despite having only 2x the area and 2x the peak DP flops.
Summary
Introduced the Roofline Model
  Apply bound and bottleneck analysis.
  Performance and the requisite optimizations are inferred visually.
Extended auto-tuning to multicore
  Fundamentally different from running auto-tuned serial code on multicore SMPs.
  Applied the concept to LBMHD and SpMV.
Auto-tuning LBMHD and SpMV
  Multicore has had a transformative effect on auto-tuning (a move from latency-limited to bandwidth-limited).
  Maximizing memory bandwidth and minimizing memory traffic are key.
  Compilers are reasonably effective at in-core optimizations, but totally ineffective at cache and memory issues.
  A library or framework is a necessity in managing these issues.
Comments on architecture
  Ultimately, machines are bandwidth-limited without new algorithms.
  Architectures with caches required significantly more tuning than the local store-based Cell.
Acknowledgements
Research supported by:
  Microsoft and Intel funding (Award #20080469)
  DOE Office of Science under contract number DE-AC02-05CH11231
  NSF contract CNS-0325873
  Sun Microsystems - Niagara2 / Victoria Falls machines
  AMD - access to quad-core Opteron (Barcelona)
  Forschungszentrum Jülich - access to QS20 Cell blades
  IBM - virtual loaner program for QS20 Cell blades
Questions?

Samuel Williams, Andrew Waterman, David Patterson, "Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures", Communications of the ACM (CACM), April 2009.
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track.