Page 1: Title

Lawrence Berkeley National Laboratory, Future Technologies Group
Auto-tuning Memory Intensive Kernels for Multicore
Sam Williams (SWWilliams@lbl.gov)

Page 2: Outline

1. Challenges Arising from Optimizing Single Thread Performance
2. New Challenges Arising when Optimizing Multicore SMP Performance
3. Performance Modeling and Little's Law
4. Multicore SMPs of Interest
5. Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
6. Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
7. Summary

Page 3: Challenges Arising from Optimizing Single Thread Performance

Page 4: Instruction-Level Parallelism

On modern pipelined architectures, operations like floating-point addition have a latency of 4-6 cycles before the result is ready. However, independent adds can be pipelined one after another. Pipelining raises the peak flop rate, but that peak is only achievable if, on any given cycle, the program has 4 or more independent adds ready to execute; failing to do so can cost up to a 4x drop in performance. The problem is exacerbated on superscalar or VLIW architectures like POWER or Itanium.

One must often reorganize kernels to express more instruction-level parallelism.

Page 5: ILP Example (1x1 BCSR)

  for (all rows) {
    y0 = 0.0;
    for (all tiles in this row) {
      y0 += V[i] * X[C[i]];
    }
    y[r] = y0;
  }

Consider the core of SpMV: there is no ILP in the inner loop, and out-of-order execution cannot accelerate serial FMAs.

[Figure: 1x1 register block; the dependent FMAs complete one every 4 cycles (t = 0, 4, 8, 12, 16)]

Page 6: ILP Example (1x4 BCSR)

  for (all rows) {
    y0 = 0.0;
    for (all tiles in this row) {
      y0 += V[i  ] * X[C[i]  ];
      y0 += V[i+1] * X[C[i]+1];
      y0 += V[i+2] * X[C[i]+2];
      y0 += V[i+3] * X[C[i]+3];
    }
    y[r] = y0;
  }

What about 1x4 BCSR? There is still no ILP in the inner loop: the FMAs all accumulate into y0 and remain dependent on each other.

[Figure: 1x4 register block; the dependent FMAs still complete one every 4 cycles (t = 0, 4, 8, 12)]

Page 7: ILP Example (4x1 BCSR)

  for (all rows) {
    y0 = 0.0; y1 = 0.0; y2 = 0.0; y3 = 0.0;
    for (all tiles in this row) {
      y0 += V[i  ] * X[C[i]];
      y1 += V[i+1] * X[C[i]];
      y2 += V[i+2] * X[C[i]];
      y3 += V[i+3] * X[C[i]];
    }
    y[r+0] = y0; y[r+1] = y1; y[r+2] = y2; y[r+3] = y3;
  }

What about 4x1 BCSR? Here we update four different rows, so the four FMAs are independent and can be pipelined.

[Figure: 4x1 register block; independent FMAs issue every cycle (t = 0, 1, 2, ..., 7)]

Page 8: Data-Level Parallelism

DLP = apply the same operation to multiple independent operands.

Today, rather than relying on superscalar issue, many architectures have adopted SIMD as an efficient means of boosting peak performance (SSE, Double Hummer, AltiVec, Cell, GPUs, etc.).

Typically these instructions operate on four single-precision (or two double-precision) numbers at a time. However, some are wider: GPUs (32), Larrabee (16), and AVX (8). Failing to use these instructions may cause a 2-32x drop in performance. Unfortunately, most compilers utterly fail to generate these instructions.
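For illustration only (not from the slides), a minimal sketch of explicit double-precision SIMD using SSE2 intrinsics: it adds two arrays two doubles at a time, assuming n is a multiple of 2 and the pointers are 16-byte aligned.

  #include <emmintrin.h>   /* SSE2: a 128-bit vector holds two doubles */

  /* c[i] = a[i] + b[i], two elements per instruction */
  void vadd(const double *a, const double *b, double *c, int n)
  {
      for (int i = 0; i < n; i += 2) {
          __m128d va = _mm_load_pd(&a[i]);   /* load two doubles */
          __m128d vb = _mm_load_pd(&b[i]);
          _mm_store_pd(&c[i], _mm_add_pd(va, vb));
      }
  }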

Page 9: Memory-Level Parallelism (1)

Although caches may filter many memory requests, in HPC many memory references still go all the way to DRAM.

Memory latency (measured in core cycles) grew by an order of magnitude in the 90's. Today, the latency of a memory operation can exceed 200 cycles (one double every 80 ns is unacceptably slow).

As with ILP, we wish to pipeline requests to DRAM. Several solutions exist today:
  HW stream prefetchers
  HW multithreading (e.g. hyperthreading)
  SW line prefetch
  DMA

Page 10: Memory-Level Parallelism (2)

HW stream prefetchers are by far the easiest to implement and exploit. They detect a series of consecutive cache misses and speculate that the next addresses in the series will be needed. They then prefetch that data into the cache or a dedicated buffer.

To effectively exploit a HW prefetcher, ensure your array references access hundreds of consecutive addresses, e.g. read A[i]...A[i+255] without any jumps or discontinuities.

This limits the effectiveness (shape) of the cache blocking you implemented in HW1, as you accessed:
  A[(j+0)*N+i]...A[(j+0)*N+i+B], jump
  A[(j+1)*N+i]...A[(j+1)*N+i+B], jump
  A[(j+2)*N+i]...A[(j+2)*N+i+B], jump
  ...

Page 11: Branch Misprediction

A mispredicted branch can stall subsequent instructions by ~10 cycles.

Select a loop structure that maximizes the loop length (keeping mispredicted branches per instruction to a minimum).

Some architectures support predication, either in hardware or software, to eliminate branches (it transforms control dependencies into data dependencies).

Page 12: Cache Subtleties

Set-associative caches have a limited number of sets (S) and ways (W), the product of which is the capacity (in cache lines). As seen in HW1, it can be beneficial to reorganize kernels to reduce the working set size and eliminate capacity misses.

Conflict misses can severely impair performance and can be very challenging to identify and eliminate. A given address may only be placed in W different locations in the cache. Poor access patterns or roughly power-of-two problem sizes can be especially bad: too many addresses map to the same set, not all of them can be kept in the cache, and some must be evicted.

Padding arrays (problem sizes) or skewing the access pattern can eliminate conflict misses.

Page 13: Array Padding Example

Padding changes the data layout. Consider a large matrix whose leading dimension M is roughly a power of two:

  double A[N][M];
  // A[i][j] and A[i+1][j] are M doubles apart and will likely map to the same set

We can pad the leading dimension by a few extra elements:

  double A[N][M+pad];

Such techniques are applicable in many other domains (stencils, lattice Boltzmann methods, etc.).
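As a concrete illustration (my addition, not from the slides), a minimal C sketch of the same idea for a dynamically allocated array; M and PAD are assumed values, with PAD chosen so the padded pitch is no longer a multiple of a large power of two.

  #include <stdlib.h>

  #define M    4096   /* power-of-two logical row length           */
  #define PAD  8      /* assumed padding; breaks the 2^k stride    */

  /* allocate N rows with a padded pitch of M+PAD doubles */
  double *alloc_padded(size_t N)
  {
      return malloc(N * (M + PAD) * sizeof(double));
  }

  /* index with the padded pitch so successive rows fall in different sets */
  static inline double *elem(double *A, size_t i, size_t j)
  {
      return &A[i * (M + PAD) + j];
  }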

Page 14: New Challenges Arising when Optimizing Multicore SMP Performance

Page 15: What are SMPs?

SMP = shared-memory parallel: multiple chips (typically < 32 threads) can address any location in a large shared memory through a network or bus, and the caches are almost universally coherent.

You can still run MPI on an SMP, but:
  you trade cache-coherency traffic (which you always pay for) for additional memory traffic (explicit communication)
  you trade user-level function calls for system calls

Alternately, you can use an SPMD threading model (pthreads, OpenMP, UPC).

If communication between cores or threads is significant, then threaded implementations win out. As the computation:communication ratio increases, MPI asymptotically approaches threaded implementations.

Page 16: What is multicore? What are multicore SMPs?

Today, multiple cores are integrated on the same chip, almost universally in an SMP fashion. For convenience, programming multicore SMPs is indistinguishable from programming multi-socket SMPs (an easy transition).

Multiple cores can share:
  memory controllers
  caches
  occasionally FPUs

Although the transition from multiple sockets to multiple cores was graceful from the point of view of correctness, achieving good performance can be incredibly challenging.

Page 17: Affinity

We may wish one pair of threads to share a cache, but be disjoint from another pair of threads.

We can control the mapping of threads to Linux processors via #include <sched.h> plus sched_setaffinity()/sched_getaffinity(). But the mapping of Linux processors to physical cores/sockets is machine- and OS-dependent: inspect /proc/cpuinfo or use PLPA.

[Figure: Linux processor IDs 0-7 mapped onto physical cores in two different orderings]
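A minimal sketch (my illustration, not from the slides) of pinning the calling thread to one Linux processor with the glibc affinity API:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  /* pin the calling thread to Linux processor `cpu` */
  int pin_to_cpu(int cpu)
  {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(cpu, &set);
      /* pid 0 = the calling thread */
      if (sched_setaffinity(0, sizeof(set), &set) != 0) {
          perror("sched_setaffinity");
          return -1;
      }
      return 0;
  }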

Page 18: NUMA Challenges

Recent multicore SMPs have integrated the memory controllers on chip. As a result, memory access is non-uniform (NUMA): the bandwidth to read a given address varies dramatically between cores.

Exploit NUMA (affinity + first touch) when you malloc/initialize data. The concept is similar to data decomposition for distributed memory.

Page 19: Implicit Allocation for NUMA

Consider an OpenMP example of implicit NUMA-aware initialization:

  #pragma omp parallel for
  for (j = 0; j < N; j++) {
    a[j] = 1.0;
    b[j] = 2.0;
    c[j] = 0.0;
  }

The first accesses to the arrays (read or write) must be parallelized; do not touch the data between malloc and init. When the for loop is parallelized, each thread initializes a range of j. This exploits the OS's first-touch policy, and relies on the assumption that OpenMP maps threads correctly.

Page 20: New Cache Challenges

Shared caches + SPMD programming models can exacerbate conflict misses.

Individually, threads may produce significant cache-associativity pressure based on their access pattern (power-of-two problem sizes). Collectively, threads may produce excessive cache-associativity pressure (power-of-two problem sizes decomposed among a power-of-two number of threads). This can be much harder to diagnose and correct.

This problem arises whether using MPI or a threaded model.

Page 21: New Memory Challenges

The number of memory controllers and the bandwidth on multicore SMPs are growing much more slowly than the number of cores. Codes are becoming increasingly memory-bound, as a fraction of the cores can saturate a socket's memory bandwidth.

Multicore has traded bit- or word-parallelism for thread-parallelism. However, main memory is still built from bit-parallel devices (DIMMs), so memory-intensive apps must be restructured to match the bit-parallel nature of DIMMs (sequential access).

Page 22: Synchronization

Using multiple concurrent threads can create ordering and race errors. Locks are one solution, but one must balance granularity and frequency.

An SPMD programming model + barriers is often a better/simpler solution. Spin barriers can be orders of magnitude faster than pthread library barriers (Rajesh Nishtala, HotPar'09).

Page 23: Performance Modeling and Little's Law

Page 24: System Abstraction

Abstractly describe any system (or subsystem) as a combination of black-boxed storage, computational units, and the bandwidth between them. These can be hierarchically composed.

A volume of data must be transferred from the storage component, processed, and another volume of data must be returned. Consider the basic parameters governing performance of the channel: bandwidth, latency, and concurrency.
  Bandwidth can be measured in GB/s, Gflop/s, MIPS, etc.
  Latency can be measured in seconds, cycles, etc.
  Concurrency is the volume in flight across the channel, and can be measured in bytes, cache lines, operations, instructions, etc.

[Figure: hierarchy of DRAM <-> CPU (cache) <-> core (register file, functional units)]

Page 25: Little's Law

Little's Law relates concurrency, bandwidth, and latency. To achieve peak bandwidth, one must satisfy:

  Concurrency = Latency x Bandwidth

For example, a memory controller with 20 GB/s of bandwidth and 100 ns of latency requires the CPU to express 2 KB of concurrency (memory-level parallelism).

Similarly, given an expressed concurrency, one can bound attained performance:

  BW_attained = min( Concurrency_expressed / Latency, BW_max )

That is, as more concurrency is injected, we get progressively better performance. Note, this assumes continual, pipelined accesses.
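To make the arithmetic concrete, a tiny sketch (my addition) that reproduces the 20 GB/s x 100 ns example and the attained-bandwidth bound; the 512-byte expressed concurrency is an assumed example value.

  #include <stdio.h>

  int main(void)
  {
      double bw_max  = 20e9;    /* bytes/s */
      double latency = 100e-9;  /* seconds */

      /* Little's Law: concurrency needed to sustain peak bandwidth */
      double needed = bw_max * latency;        /* = 2000 bytes (~2 KB) */

      /* bound on attained bandwidth for a given expressed concurrency */
      double expressed = 512.0;                /* e.g. 8 cache lines in flight */
      double attained  = expressed / latency;
      if (attained > bw_max) attained = bw_max;

      printf("need %.0f B in flight; 512 B in flight -> %.1f GB/s\n",
             needed, attained / 1e9);
      return 0;
  }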

Page 26: Where's the bottleneck?

We've described the bandwidths between DRAM, CPU, cache, core, register file, and functional units. In an application, any one of these may be the performance-limiting bottleneck.

We can take any pair and compare how quickly data can be transferred with how quickly it can be processed to determine the bottleneck.

[Figure: hierarchy of DRAM <-> CPU (cache) <-> core (register file, functional units)]

Page 27: Arithmetic Intensity

Consider the first case (DRAM-CPU). True arithmetic intensity (AI) ~ total flops / total DRAM bytes.

Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality); on others it remains constant. Arithmetic intensity is ultimately limited by compulsory traffic, and is diminished by conflict or capacity misses.

[Figure: arithmetic-intensity spectrum]
  O(1):      SpMV, BLAS1/2; stencils (PDEs); lattice methods
  O(log N):  FFTs
  O(N):      dense linear algebra (BLAS3); particle methods

Page 28: Kernel Arithmetic Intensity and Architecture

For a given architecture, one may calculate its flop:byte ratio. For a 2.3 GHz quad-core Opteron:
  1 SIMD add + 1 SIMD multiply per cycle per core = 36.8 Gflop/s
  12.8 GB/s of DRAM bandwidth
  36.8 / 12.8 ~ 2.9 flops per byte

When a kernel's arithmetic intensity is substantially less than the architecture's flop:byte ratio, transferring the data will take longer than computing on it: the kernel is memory-bound. When a kernel's arithmetic intensity is substantially greater than the architecture's flop:byte ratio, computation will take longer than the data transfers: the kernel is compute-bound.
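A small sketch (my addition) of the same machine-balance calculation; the peak assumes 4 cores x 2.3 GHz x 4 flops/cycle (2-wide SIMD add + 2-wide SIMD multiply), as on the quad-core Opteron above, and the 0.166 kernel AI is the SpMV figure used later in the talk.

  #include <stdio.h>

  int main(void)
  {
      double ghz = 2.3, cores = 4, flops_per_cycle = 4;     /* per core */
      double peak_gflops = ghz * cores * flops_per_cycle;   /* = 36.8   */
      double dram_gbs    = 12.8;

      double machine_balance = peak_gflops / dram_gbs;      /* ~2.9 flops/byte */
      double kernel_ai       = 0.166;                       /* e.g. SpMV       */

      printf("machine balance = %.1f flops/byte; kernel is %s-bound\n",
             machine_balance, kernel_ai < machine_balance ? "memory" : "compute");
      return 0;
  }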

Page 29: Memory Traffic Definition

Memory traffic = total bytes to/from DRAM. It can be categorized into:
  compulsory misses
  capacity misses
  conflict misses
  write allocations
  ...

This definition is oblivious to any lack of sub-cache-line spatial locality.

Page 30: Roofline Model (basic concept)

Synthesize communication, computation, and locality into a single visually intuitive performance figure using bound-and-bottleneck analysis:

  AttainablePerformance_i,j = min( FLOP/s with optimizations 1..i,
                                   AI * Bandwidth with optimizations 1..j )

where optimization i can be SIMDize, unroll, SW prefetch, etc. Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance. Moreover, it provides insight into which optimizations will potentially be beneficial.
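A minimal sketch (my addition, not the authors' code) of the roofline bound as a function; the example numbers in the comment use the Barcelona peak and Stream bandwidth quoted elsewhere in the talk.

  /* roofline bound: Gflop/s attainable at arithmetic intensity `ai` (flops/byte),
     given in-core peak `peak_gflops` and memory bandwidth `bw_gbs` (GB/s)       */
  static double roofline(double ai, double peak_gflops, double bw_gbs)
  {
      double mem_bound = ai * bw_gbs;
      return mem_bound < peak_gflops ? mem_bound : peak_gflops;
  }

  /* e.g. roofline(0.166, 74.0, 21.0) ~ 3.5 Gflop/s: firmly memory-bound */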

Page 31: Roofline Model (basic concept)

Plot on a log-log scale. Given AI, we can easily bound performance, but architectures are much more complicated: we will lower the bound as we eliminate specific forms of in-core parallelism.

[Figure: roofline for the Opteron 2356 (Barcelona); attainable Gflop/s (0.5-256, log2 scale) vs. actual flop:byte ratio (1/8-16); peak DP and Stream bandwidth]

Page 32: Roofline Model (computational ceilings)

Opterons have dedicated multipliers and adders. If the code is dominated by adds, then attainable performance is half of peak. We call these ceilings; they act like constraints on performance.

[Figure: Barcelona roofline with a "mul/add imbalance" ceiling below peak DP]

Page 33: Roofline Model (computational ceilings)

Opterons have 128-bit datapaths. If instructions aren't SIMDized, attainable performance will be halved again.

[Figure: Barcelona roofline with a "w/out SIMD" ceiling added]

Page 34: Roofline Model (computational ceilings)

On Opterons, floating-point instructions have a 4-cycle latency. If we don't express 4-way ILP, performance will drop by as much as 4x.

[Figure: Barcelona roofline with a "w/out ILP" ceiling added]

Page 35: Roofline Model (communication ceilings)

We can perform a similar exercise, taking away parallelism from the memory subsystem.

[Figure: Barcelona roofline; the Stream-bandwidth diagonal is the starting point]

Page 36: Roofline Model (communication ceilings)

Explicit software prefetch instructions are required to achieve peak bandwidth.

[Figure: Barcelona roofline with a "w/out SW prefetch" bandwidth ceiling added]

Page 37: Roofline Model (communication ceilings)

Opterons are NUMA, so memory traffic must be correctly balanced between the two sockets to achieve good Stream bandwidth. We could continue this by examining strided or random memory access patterns.

[Figure: Barcelona roofline with "w/out SW prefetch" and "w/out NUMA" bandwidth ceilings]

Page 38: Roofline Model (computation + communication ceilings)

We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.

[Figure: Barcelona roofline with the computational ceilings (mul/add imbalance, w/out SIMD, w/out ILP) and the bandwidth ceilings (w/out SW prefetch, w/out NUMA)]

Page 39: Roofline Model (locality walls)

Remember, memory traffic includes more than just compulsory misses. As such, actual arithmetic intensity may be substantially lower. Walls are unique to the architecture-kernel combination.

  AI = FLOPs / Compulsory Misses

[Figure: Barcelona roofline; the "only compulsory miss traffic" wall marks the best-case arithmetic intensity]

Page 40: Roofline Model (locality walls)

Write-allocate traffic moves the wall left:

  AI = FLOPs / (Write Allocations + Compulsory Misses)

[Figure: Barcelona roofline; "+ write allocation traffic" wall added to the left of the compulsory-miss wall]

Page 41: Roofline Model (locality walls)

Capacity-miss traffic moves the wall further left:

  AI = FLOPs / (Capacity + Allocations + Compulsory)

[Figure: Barcelona roofline; "+ capacity miss traffic" wall added]

Page 42: Roofline Model (locality walls)

Conflict-miss traffic moves the wall further left still:

  AI = FLOPs / (Conflict + Capacity + Allocations + Compulsory)

[Figure: Barcelona roofline; "+ conflict miss traffic" wall added]

Page 43: Optimization Categorization

Three categories of optimization:
  maximizing (attained) in-core performance
  minimizing (total) memory traffic
  maximizing (attained) memory bandwidth

Page 44: Optimization Categorization

Maximizing in-core performance:
  exploit in-core parallelism (ILP, DLP, etc.)
  good (enough) floating-point balance

Page 45: Optimization Categorization

Maximizing in-core performance:
  exploit in-core parallelism (ILP, DLP, etc.)
  good (enough) floating-point balance
  techniques: unroll & jam, explicit SIMD, reorder, eliminate branches

Page 46: Optimization Categorization

Maximizing memory bandwidth:
  exploit NUMA
  hide memory latency
  satisfy Little's Law
  techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking

Page 47: Optimization Categorization

Minimizing memory traffic - eliminate:
  capacity misses
  conflict misses
  compulsory misses
  write-allocate behavior
  techniques: cache blocking, array padding, compress data, streaming stores

Page 48: Optimization Categorization

The full picture:
  maximizing in-core performance: unroll & jam, explicit SIMD, reorder, eliminate branches
  maximizing memory bandwidth: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking
  minimizing memory traffic: cache blocking, array padding, compress data, streaming stores

Page 49: Roofline Model (locality walls)

Optimizations remove these walls and ceilings, which act to constrain performance.

[Figure: Barcelona roofline with all ceilings and locality walls shown]

Page 50: Roofline Model (locality walls)

[Figure: traffic optimizations remove the write-allocation, capacity-miss, and conflict-miss walls, leaving only compulsory traffic]

Page 51: Roofline Model (locality walls)

[Figure: in-core optimizations then remove the mul/add imbalance, w/out SIMD, and w/out ILP ceilings]

Page 52: Roofline Model (locality walls)

[Figure: bandwidth optimizations then remove the w/out SW prefetch and w/out NUMA ceilings; only peak DP, Stream bandwidth, and compulsory traffic remain]

Page 53: Optimization Categorization

Each optimization has a large parameter space. What are the optimal parameters?

Page 54: Auto-tuning?

Auto-tuning provides performance portability across the existing breadth and evolution of microprocessors. The one-time, up-front productivity cost is amortized over the number of machines it is used on.

Auto-tuning does not invent new optimizations; it automates the code generation and the exploration of the optimization and parameter space. Two components:
  a parameterized code generator (we wrote ours in Perl)
  an auto-tuning exploration benchmark (a combination of heuristics and exhaustive search)

It can be extended with ISA-specific optimizations (e.g. DMA, SIMD).
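For flavor, a minimal sketch (my own, not the authors' Perl generator) of the exploration side: time every register-block variant a hypothetical generator emitted and keep the best. spmv_rb() and now_sec() are assumed helpers, not part of the talk.

  #include <float.h>
  #include <stdio.h>

  /* assumed: one generated kernel per (r,c) register blocking */
  extern double now_sec(void);
  extern void   spmv_rb(int r, int c /*, matrix, x, y */);

  void tune_spmv(void)
  {
      double best_t = DBL_MAX;
      int best_r = 1, best_c = 1;

      for (int r = 1; r <= 8; r *= 2)          /* explore the parameter space */
          for (int c = 1; c <= 8; c *= 2) {
              double t0 = now_sec();
              spmv_rb(r, c);                   /* run the generated variant   */
              double t = now_sec() - t0;
              if (t < best_t) { best_t = t; best_r = r; best_c = c; }
          }

      printf("best register block: %dx%d\n", best_r, best_c);
  }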

Page 55: Multicore SMPs of Interest (used throughout the rest of the talk)

Page 56: Multicore SMPs Used

  Intel Xeon E5345 (Clovertown)
  AMD Opteron 2356 (Barcelona)
  Sun UltraSparc T2+ T5140 (Victoria Falls)
  IBM QS20 Cell Blade

Page 57: Multicore SMPs Used (conventional cache-based memory hierarchy) - the same four machines

Page 58: Multicore SMPs Used (local-store-based memory hierarchy)

Page 59: Multicore SMPs Used (CMT = Chip-MultiThreading)

Page 60: Multicore SMPs Used (threads)

  Xeon E5345 (Clovertown): 8 threads
  Opteron 2356 (Barcelona): 8 threads
  T2+ T5140 (Victoria Falls): 128 threads
  QS20 Cell Blade: 16 threads (*SPEs only)

Page 61: Multicore SMPs Used (peak double-precision flops)

  The two x86 SMPs (Clovertown and Barcelona): 75 and 74 GFlop/s
  T2+ T5140 (Victoria Falls): 19 GFlop/s
  QS20 Cell Blade: 29 GFlop/s (*SPEs only)

Page 62: Multicore SMPs Used (total DRAM bandwidth)

  Xeon E5345 (Clovertown): 21 GB/s (read), 10 GB/s (write)
  Opteron 2356 (Barcelona): 21 GB/s
  T2+ T5140 (Victoria Falls): 42 GB/s (read), 21 GB/s (write)
  QS20 Cell Blade: 51 GB/s

Page 63: Multicore SMPs Used (Non-Uniform Memory Access - NUMA)

Page 64: Roofline Model for these Multicore SMPs

[Figure: double-precision rooflines for the four SMPs]
  Xeon E5345 (Clovertown): ceilings for mul/add imbalance, w/out SIMD, w/out ILP; bandwidth differs for small vs. large data sets
  Opteron 2356 (Barcelona): ceilings for mul/add imbalance, w/out SIMD, w/out ILP; Stream bandwidth, w/out SW prefetch, w/out NUMA
  QS20 Cell Blade (PPEs): ceilings for w/out FMA, w/out ILP; Stream bandwidth, w/out SW prefetch, w/out NUMA
  QS20 Cell Blade (SPEs): ceilings for w/out SIMD, w/out ILP, w/out FMA; Stream bandwidth, misaligned DMA, w/out NUMA
  UltraSparc T2+ T5140 (Victoria Falls): ceilings at 25%, 12%, and 6% FP instruction mix; Stream bandwidth, w/out SW prefetch, w/out NUMA

Note, the multithreaded Niagara is limited by the instruction mix rather than by a lack of expressed in-core parallelism. Clearly some architectures are more dependent on bandwidth optimizations, while others depend more on in-core optimizations.

Page 65: Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

Page 66: Sparse Matrix-Vector Multiplication

What's a sparse matrix? Most entries are 0.0; there is a performance advantage in only storing and operating on the nonzeros, but it requires significant metadata to reconstruct the matrix structure.

What's SpMV? Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors.

Challenges:
  very low arithmetic intensity (often < 0.166 flops/byte)
  difficult to exploit ILP (bad for pipelined or superscalar cores)
  difficult to exploit DLP (bad for SIMD)

(a) algebra conceptualization: y = Ax
(b) CSR data structure: A.val[], A.rowStart[], A.col[]
(c) CSR reference code:

  for (r = 0; r < A.rows; r++) {
    double y0 = 0.0;
    for (i = A.rowStart[r]; i < A.rowStart[r+1]; i++) {
      y0 += A.val[i] * x[A.col[i]];
    }
    y[r] = y0;
  }

Page 67: The Dataset (matrices)

Unlike dense BLAS, performance is dictated by sparsity. We use a suite of 14 matrices, all bigger than the caches of our SMPs, and we also include a median performance number.

  Dense: a 2K x 2K dense matrix stored in sparse format
  Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship
  Poorly structured (hodgepodge): Economics, Epidemiology, FEM/Accelerator, Circuit, webbase
  Extreme aspect ratio (linear programming): LP

Page 68: SpMV Performance (simple parallelization)

Out-of-the-box SpMV performance on the suite of 14 matrices (Naïve serial and Naïve Pthreads versions). Scalability isn't great. Is this performance good?

[Figure: per-matrix performance bars for the four SMPs, Naïve vs. Naïve Pthreads]

Page 69: NUMA for SpMV

On NUMA architectures, all large arrays should be partitioned either:
  explicitly (multiple malloc()'s + affinity), or
  implicitly (parallelize the initialization and rely on first touch)

You cannot partition on granularities smaller than the page size:
  512 elements on x86 (4 KB pages)
  2M elements on Niagara

For SpMV, partition the matrix and perform multiple malloc()'s. Pin the submatrices so they are co-located with the cores tasked to process them.
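A minimal sketch (my illustration) of the explicit approach: give each thread its own malloc'd slice of the nonzeros and first-touch it from the thread that will later run SpMV on it. The row partitioning is left abstract, and the function is assumed to be called from a thread already pinned to its core (e.g. with the pin_to_cpu() helper sketched earlier).

  #include <stdlib.h>

  typedef struct {
      int     nnz;     /* nonzeros owned by this thread         */
      double *val;     /* values, allocated and touched locally */
      int    *col;     /* column indices                        */
  } submatrix_t;

  /* called from each pinned thread: allocate + first-touch its own slice */
  void init_submatrix(submatrix_t *s, int nnz)
  {
      s->nnz = nnz;
      s->val = malloc(nnz * sizeof(double));
      s->col = malloc(nnz * sizeof(int));
      for (int i = 0; i < nnz; i++) {  /* first touch maps the pages locally */
          s->val[i] = 0.0;
          s->col[i] = 0;
      }
  }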

Page 70: Prefetch for SpMV

SW prefetch injects more MLP into the memory subsystem. One can try to prefetch the values, the indices, the source vector, or any combination thereof. In general, one should insert only one prefetch per cache line (this works best on unrolled code).

  for (all rows) {
    y0 = 0.0; y1 = 0.0; y2 = 0.0; y3 = 0.0;
    for (all tiles in this row) {
      PREFETCH(V + i + PFDistance);   /* e.g. __builtin_prefetch on GCC */
      y0 += V[i  ] * X[C[i]];
      y1 += V[i+1] * X[C[i]];
      y2 += V[i+2] * X[C[i]];
      y3 += V[i+3] * X[C[i]];
    }
    y[r+0] = y0; y[r+1] = y1; y[r+2] = y2; y[r+3] = y3;
  }

Page 71: SpMV Performance (NUMA and Software Prefetching)

NUMA-aware allocation is essential on memory-bound NUMA SMPs. Explicit software prefetching can boost bandwidth and change cache replacement policies. The Cell PPEs are likely latency-limited. (An exhaustive search was used.)

[Figure: per-matrix performance bars with +NUMA/Affinity and +SW Prefetching stacked on the naive versions]

Page 72: ILP/DLP vs. Bandwidth

In the multicore era, which is the bigger issue: a lack of ILP/DLP (a major advantage of BCSR), or insufficient memory bandwidth per core?

On many architectures, when running low-arithmetic-intensity kernels, there is so little available memory bandwidth per core that you won't notice a complete lack of ILP. Perhaps we should concentrate on minimizing memory traffic rather than maximizing ILP/DLP: rather than benchmarking every combination, just select the register blocking that minimizes the matrix footprint.
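A minimal sketch (my own, under simplifying assumptions) of that heuristic: for each candidate r x c block size, estimate the BCSR footprint as 8 bytes per stored (possibly zero-padded) value plus a 4-byte column index per block, and keep the smallest. nnz_blocks(r,c) is an assumed helper that counts how many r x c blocks are needed to cover the nonzeros; row-pointer storage is ignored for brevity.

  #include <limits.h>
  #include <stdio.h>

  extern long nnz_blocks(int r, int c);   /* assumed: blocks needed to cover A */

  void pick_register_block(int *best_r, int *best_c)
  {
      long best_bytes = LONG_MAX;
      for (int r = 1; r <= 4; r++)
          for (int c = 1; c <= 4; c++) {
              long blocks = nnz_blocks(r, c);
              /* 8 bytes per stored value (r*c per block) + 4-byte index per block */
              long bytes = blocks * (8L * r * c + 4L);
              if (bytes < best_bytes) { best_bytes = bytes; *best_r = r; *best_c = c; }
          }
      printf("minimum-footprint blocking: %dx%d\n", *best_r, *best_c);
  }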

Page 73: SpMV Performance (Matrix Compression)

After maximizing memory bandwidth, the only hope is to minimize memory traffic: exploit register blocking, other formats, and smaller indices, using a traffic-minimization heuristic rather than search. The benefit is clearly matrix-dependent. Register blocking also enables efficient software prefetching (one prefetch per cache line).

[Figure: per-matrix performance bars with +Matrix Compression added]

Page 74: Cache Blocking for SpMV

Cache-blocking sparse matrices is very different from cache-blocking dense matrices. Rather than changing loop bounds, store entire submatrices contiguously.

The columns spanned by each cache block are selected so that all submatrices place the same pressure on the cache, i.e. they touch the same number of unique source-vector cache lines.

TLB blocking is a similar concept, but instead of 8-byte granularities it uses 4 KB granularities.

[Figure: the matrix rows partitioned among threads 0-3, with column-wise cache blocks within each thread's rows]

Page 75: Auto-tuned SpMV Performance (cache and TLB blocking)

Fully auto-tuned SpMV performance across the suite of matrices. Why do some optimizations work better on some architectures?

[Figure: stacked per-matrix bars: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking]

Page 76: Auto-tuned SpMV Performance (architecture-specific optimizations)

Fully auto-tuned SpMV performance across the suite of matrices, now including an SPE/local-store-optimized version for Cell. Why do some optimizations work better on some architectures?

[Figure: same stacked bars as the previous slide, with the Cell SPE implementation added]

Page 77: Auto-tuned SpMV Performance (max speedup)

Maximum speedups from auto-tuning: 2.7x and 4.0x on the two x86 SMPs, 2.9x on Victoria Falls, and 35x on the Cell blade.

[Figure: same stacked bars, annotated with the per-machine maximum speedup]

Page 78: Auto-tuned SpMV Performance (architecture-specific optimizations)

Auto-tuning resulted in better performance, but did it result in good performance?

[Figure: same stacked bars as the previous slides]

Page 79: Roofline Model for SpMV

Double-precision roofline models for the four SMPs, with in-core optimizations 1..i and DRAM optimizations 1..j. FMA is inherent in SpMV, so the FMA ceiling is placed at the bottom.

  GFlops_i,j(AI) = min( InCoreGFlops_i, StreamBW_j * AI )

[Figure: rooflines for Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade; in-core ceilings (mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, or 25%/12% FP instruction mix) and bandwidth ceilings (w/out SW prefetch, w/out NUMA, bank conflicts, dataset fits in snoop filter)]

Page 80: Roofline Model for SpMV (overlay arithmetic intensity)

SpMV characteristics: two unit-stride streams, inherent FMA, no ILP, no DLP, FP is 12-25% of instructions, and a naive compulsory flop:byte ratio < 0.166. There is no naive SPE implementation on Cell.

[Figure: the four rooflines with the SpMV arithmetic intensity overlaid]

Page 81: Roofline Model for SpMV (out-of-the-box parallel)

Same characteristics as above; for simplicity, the overlay uses the dense matrix stored in sparse format.

[Figure: the four rooflines with the out-of-the-box parallel performance marked at flop:byte < 0.166]

Page 82: Roofline Model for SpMV (NUMA and SW prefetch)

With NUMA-aware allocation and software prefetch, all memory channels are utilized; the compulsory flop:byte ratio is ~0.166.

[Figure: the four rooflines; the measured points move up toward the Stream-bandwidth roof]

Page 83: Roofline Model for SpMV (matrix compression)

FMA remains inherent. Register blocking improves ILP, DLP, the flop:byte ratio, and the FP percentage of instructions.

[Figure: the four rooflines; compression raises the arithmetic intensity]

Page 84: Roofline Model for SpMV (matrix compression)

Performance is now bandwidth-limited on all four machines.

[Figure: the four rooflines; the measured points sit on the bandwidth diagonal]

Page 85: SpMV Performance (summary)

Median SpMV performance aside: unlike LBMHD, SSE was unnecessary to achieve good performance. Cell still requires a non-portable, ISA-specific implementation to achieve good performance. Novel SpMV implementations may require ISA-specific (SSE) code to achieve better performance.

Page 86: Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track.

Page 87: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP 1 Auto-tuning Memory Intensive Kernels for Multicore Sam Williams SWWilliams@lbl.gov.

FUTURE TECHNOLOGIES GROUP

87

LBMHD

Plasma turbulence simulation via the Lattice Boltzmann Method.
Two distributions: a momentum distribution (27 scalar components) and a magnetic distribution (15 vector components).
Three macroscopic quantities: density, momentum (vector), and magnetic field (vector).
Arithmetic intensity: each lattice update must read 73 doubles and update 79 doubles (1216 bytes) and requires about 1300 floating-point operations, i.e. just over 1.0 flops/byte (ideal); a quick check follows.
Cache capacity requirements are independent of problem size.
Two problem sizes: 64^3 (0.3 GB) and 128^3 (2.5 GB).
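Checking the quoted ideal intensity against the byte and flop counts above:

\frac{1300\ \text{flops}}{(73 + 79)\times 8\ \text{bytes}} \;=\; \frac{1300}{1216} \;\approx\; 1.07\ \text{flops/byte}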

[Figure: the lattice with periodic boundary conditions, the 27-component momentum distribution stencil, the 15-component magnetic distribution stencil, and the macroscopic variables, drawn on +X/+Y/+Z axes.]

Page 88

LBMHD Performance (reference implementation)

Generally, scalability looks good. Scalability is good, but is the performance itself good?

[Chart: Naïve+NUMA; *collision() only]

Page 89

LBMHD Performance (lattice-aware array padding)

LBMHD touches >150 arrays, and most caches have limited associativity, so conflict misses are likely. Apply a heuristic to pad the arrays (a sketch of the idea follows).

[Chart: Naïve+NUMA, +Padding]
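A minimal sketch of the kind of padding heuristic meant here; the cache-line size, the modulus, and the array count are illustrative assumptions, not the tuned values from the study:

/* hypothetical padding sketch: stagger the start of each array so the
   >150 lattice arrays do not all map to the same cache sets */
#include <stdlib.h>

#define CACHE_LINE 64      /* bytes (illustrative)      */
#define N_ARRAYS   152     /* LBMHD touches >150 arrays */

double *grids[N_ARRAYS];

void allocate_padded(size_t doubles_per_array) {
  for (int a = 0; a < N_ARRAYS; a++) {
    size_t pad = ((size_t)a % 16) * CACHE_LINE;    /* different offset per array     */
    char *raw = malloc(doubles_per_array * sizeof(double) + pad);
    grids[a] = (double *)(raw + pad);              /* keep raw if free() is needed   */
  }
}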

Page 90

Vectorization

Two phases within a lattice method's collision() operator: reconstruction of the macroscopic variables, and updating the discretized velocities.
Normally this is done one lattice point at a time; instead, restructure the code to process a vector's worth of points at a time (loop interchange + tuning), as in the sketch below.
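A minimal sketch of the loop-interchange idea, assuming hypothetical helpers reconstruct_macroscopics() and update_velocities() standing in for the two phases; VL is the tunable vector length:

/* hypothetical sketch: process VL lattice points per phase instead of one */
#define VL 128                                    /* tuning parameter */

double reconstruct_macroscopics(int point);       /* stand-in for phase 1 */
void   update_velocities(int point, double rho);  /* stand-in for phase 2 */

void collision_vectorized(int npoints) {
  double rho[VL];                                 /* temporaries shared between phases */
  for (int base = 0; base < npoints; base += VL) {
    int n = (npoints - base < VL) ? (npoints - base) : VL;
    for (int p = 0; p < n; p++)                   /* phase 1: macroscopic variables */
      rho[p] = reconstruct_macroscopics(base + p);
    for (int p = 0; p < n; p++)                   /* phase 2: discretized velocities */
      update_velocities(base + p, rho[p]);
  }
}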


Page 91

LBMHD Performance (vectorization)

Restructure loops to attain good TLB page locality and streaming accesses.

[Chart: Naïve+NUMA, +Padding, +Vectorization; *collision() only]

Page 92

LBMHD Performance (architecture-specific optimizations)

Add unrolling and reordering of the inner loop.
Additionally, exploit SIMD explicitly where the compiler doesn't.
Include an SPE/local-store optimized version for Cell.

[Chart: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +Explicit SIMDization, +small pages; *collision() only]

Page 93

LBMHD Performance (architecture-specific optimizations)

Same chart as the previous slide, annotated with the overall speedups over the reference implementation: 1.6x, 4x, 3x, and 130x.

[Chart: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +Explicit SIMDization, +small pages; *collision() only]

Page 94

Roofline model for LBMHD

Far more adds than multiplies (imbalance)

Huge data sets
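For reference, the bound plotted in these figures is the standard Roofline statement (from the CACM paper cited at the end of this talk):

\text{Attainable GFlop/s} \;=\; \min\bigl(\text{Peak GFlop/s},\; \text{Peak GB/s} \times \text{flop:byte ratio}\bigr)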

[Figure: Roofline models for the same four machines, with the usual in-core ceilings (peak DP, mul/add imbalance, w/out SIMD/ILP/FMA, FP instruction mix) and bandwidth ceilings (w/out SW prefetch, w/out NUMA, bank conflicts, snoop filter).]

Page 95

Roofline model for LBMHD (overlay arithmetic intensity)

Far more adds than multiplies (imbalance).
Essentially random access to memory.
Flop:byte ratio ~0.7 (see the arithmetic below); NUMA allocation/access; little ILP; no DLP; high conflict misses.
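The ~0.7 figure follows from adding write-allocate traffic to the ideal count: with a write-allocate cache, each of the 79 updated doubles is also read in before being written. This is standard write-allocate accounting, not a number stated on the slide:

\frac{1300\ \text{flops}}{(73 + 79 + 79)\times 8\ \text{bytes}} \;=\; \frac{1300}{1848} \;\approx\; 0.70\ \text{flops/byte}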

[Figure: the same Roofline plots with LBMHD's arithmetic intensity overlaid; no naïve SPE implementation on the Cell blade.]

Page 96

Roofline model for LBMHD (out-of-the-box parallel performance)

Far more adds than multiplies (imbalance).
Essentially random access to memory.
Flop:byte ratio ~0.7; NUMA allocation/access; little ILP; no DLP; high conflict misses.
Peak Victoria Falls performance comes with only 64 of the 128 threads, due to high conflict misses.

[Figure: the same Roofline plots with the out-of-the-box performance points overlaid; no naïve SPE implementation on the Cell blade.]

Page 97

Roofline model for LBMHD (Padding, Vectorization, Unrolling, Reordering, ...)

Vectorize the code to eliminate TLB capacity misses.
Ensures unit-stride access (the bottom bandwidth ceiling).
Tune for the optimal vector length (VL).
Clovertown remains pinned to the lower bandwidth ceiling.

[Figure: the same Roofline plots after padding, vectorization, unrolling, and reordering; no naïve SPE implementation on the Cell blade.]

Page 98

Roofline model for LBMHD (SIMDization + cache bypass)

Make SIMDization explicit; technically, this swaps the ILP and SIMD ceilings.
Use the cache-bypass (non-temporal store) instruction movntpd.
Increases the flop:byte ratio to ~1.0 on x86/Cell (a sketch follows).
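A minimal sketch of what explicit SIMDization plus cache bypass looks like on x86; this is a generic streamed kernel, not the actual LBMHD collision code. _mm_stream_pd compiles to movntpd, so the output is written to DRAM without first reading its cache lines in, which removes the write-allocate traffic and pushes the flop:byte ratio back toward the ideal ~1.07 computed earlier:

#include <emmintrin.h>

/* hypothetical sketch: y[i] = a*x[i] with SSE2 and non-temporal stores;
   assumes n is even and the arrays are 16-byte aligned */
void scale_streamed(int n, double a, const double *x, double *y) {
  __m128d va = _mm_set1_pd(a);
  for (int i = 0; i < n; i += 2) {
    __m128d vx = _mm_load_pd(&x[i]);
    __m128d vr = _mm_mul_pd(va, vx);
    _mm_stream_pd(&y[i], vr);        /* movntpd: bypass the cache on the store  */
  }
  _mm_sfence();                      /* order/complete the streaming stores     */
}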

[Figure: the same Roofline plots after explicit SIMDization and cache bypass.]

Page 99

Roofline model for LBMHD (SIMDization + cache bypass)

Same optimizations as the previous slide: with explicit SIMDization and cache bypass, 3 out of 4 machines hit the Roofline.

[Figure: the same Roofline plots, annotated "3 out of 4 machines hit the Roofline".]

Page 100

LBMHD Performance (summary)

The reference code is clearly insufficient. Portable C code is insufficient on Barcelona and Cell.
Cell gets all of its performance from the SPEs, despite only 2x the area and 2x the peak DP FLOPs.

Page 101

Summary

Page 102

Summary


Introduced the Roofline Model: apply bound-and-bottleneck analysis; performance and the requisite optimizations can be inferred visually.
Extended auto-tuning to multicore: fundamentally different from running auto-tuned serial code on multicore SMPs; applied the concept to LBMHD and SpMV.
Auto-tuning LBMHD and SpMV: multicore has had a transformative effect on auto-tuning (a move from latency-limited to bandwidth-limited); maximizing memory bandwidth and minimizing memory traffic is key; compilers are reasonably effective at in-core optimizations but totally ineffective at cache and memory issues, so a library or framework is a necessity in managing these issues.
Comments on architecture: ultimately, machines are bandwidth-limited without new algorithms; architectures with caches required significantly more tuning than the local-store-based Cell.

Page 103

Acknowledgements

Research supported by: Microsoft and Intel funding (Award #20080469); DOE Office of Science under contract number DE-AC02-05CH11231; NSF contract CNS-0325873.
Sun Microsystems: Niagara2 / Victoria Falls machines. AMD: access to the quad-core Opteron (Barcelona). Forschungszentrum Jülich: access to QS20 Cell blades. IBM: virtual loaner program for QS20 Cell blades.

Page 104

Questions?

Samuel Williams, Andrew Waterman, David Patterson, "Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures", Communications of the ACM (CACM), April 2009.
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. (Best Paper, Application Track)