Auto-tuning Memory Intensive Kernels for Multicore
Sam Williams  [email protected]
FUTURE TECHNOLOGIES GROUP
Dec 19, 2015
Outline
1. Challenges arising from Optimizing Single Thread Performance
2. New Challenges Arising when Optimizing Multicore SMP Performance
3. Performance Modeling and Little’s Law
4. Multicore SMPs of Interest
5. Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
6. Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
7. Summary
Challenges Arising from Optimizing Single Thread Performance
Instruction-Level Parallelism
On modern pipelined architectures, operations (like floating-point addition) have a latency of 4-6 cycles (until the result is ready).
However, independent adds can be pipelined one after another.
Although pipelining increases the peak flop rate, one can only achieve peak flops if, on any given cycle, the program has more than 4 independent adds ready to execute.
Failing to do so will result in a more than 4x drop in performance.
The problem is exacerbated by superscalar or VLIW architectures like POWER or Itanium.
One must often reorganize kernels to express more instruction-level parallelism.
ILP Example (1x1 BCSR)

Consider the core of SpMV with a 1x1 register block:

  for (all rows) {
    y0 = 0.0;
    for (all tiles in this row) {
      y0 += V[i] * X[C[i]];
    }
    y[r] = y0;
  }

There is no ILP in the inner loop; out-of-order execution can't accelerate the serial chain of FMAs.
[Figure: the dependent FMAs issue only once every 4 cycles (time = 0, 4, 8, 12, 16).]
ILP Example (1x4 BCSR)

What about 1x4 BCSR (a 1x4 register block)?

  for (all rows) {
    y0 = 0.0;
    for (all tiles in this row) {
      y0 += V[i  ] * X[C[i]  ];
      y0 += V[i+1] * X[C[i]+1];
      y0 += V[i+2] * X[C[i]+2];
      y0 += V[i+3] * X[C[i]+3];
    }
    y[r] = y0;
  }

Still no ILP in the inner loop; the FMAs are still dependent on each other.
[Figure: the dependent FMAs still issue only once every 4 cycles (time = 0, 4, 8, 12).]
ILP Example (4x1 BCSR)

What about 4x1 BCSR (a 4x1 register block)?

  for (all rows) {
    y0 = 0.0; y1 = 0.0; y2 = 0.0; y3 = 0.0;
    for (all tiles in this row) {
      y0 += V[i  ] * X[C[i]];
      y1 += V[i+1] * X[C[i]];
      y2 += V[i+2] * X[C[i]];
      y3 += V[i+3] * X[C[i]];
    }
    y[r+0] = y0; y[r+1] = y1; y[r+2] = y2; y[r+3] = y3;
  }

Updating 4 different rows: the 4 FMAs are independent, and thus they can be pipelined.
[Figure: the independent FMAs issue one per cycle (time = 0 through 7).]
Data-level Parallelism
DLP = apply the same operation to multiple independent operands.
Today, rather than relying on superscalar issue, many architectures have adopted SIMD as an efficient means of boosting peak performance. (SSE, Double Hummer, AltiVec, Cell, GPUs, etc…)
Typically these instructions operate on four single-precision (or two double-precision) numbers at a time. However, some are wider: GPUs (32), Larrabee (16), AVX (8).
Failing to use these instructions may cause a 2-32x drop in performance.
Unfortunately, most compilers utterly fail to generate these instructions.
[Figure: one SIMD instruction performing four independent adds.]
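For illustration, a minimal SSE2 sketch of explicitly SIMDized double-precision addition (the function name, 16-byte alignment, and even n are assumptions made for the sketch):

  #include <emmintrin.h>

  /* c[i] = a[i] + b[i] using 128-bit SSE2 (two doubles per instruction).
     Assumes n is even and the pointers are 16-byte aligned. */
  void vadd(double *c, const double *a, const double *b, int n) {
      for (int i = 0; i < n; i += 2) {
          __m128d va = _mm_load_pd(&a[i]);
          __m128d vb = _mm_load_pd(&b[i]);
          _mm_store_pd(&c[i], _mm_add_pd(va, vb));
      }
  }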
Memory-Level Parallelism (1)
Although caches may filter many memory requests, in HPC many memory references will still go all the way to DRAM.
Memory latency (as measured in core cycles) grew by an order of magnitude in the 90’s
Today, the latency of a memory operation can exceed 200 cycles (1 double every 80ns is unacceptably slow).
Like ILP, we wish to pipeline requests to DRAM. Several solutions exist today (a small sketch of SW line prefetch follows):
  HW stream prefetchers
  HW multithreading (e.g., hyperthreading)
  SW line prefetch
  DMA
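A minimal sketch of SW line prefetch (the prefetch distance PFD and the T0 hint are tuning assumptions):

  #include <xmmintrin.h>

  /* Sum an array while prefetching a tuned distance ahead. */
  double sum_with_prefetch(const double *a, int n) {
      const int PFD = 64;   /* elements ahead (~8 cache lines); a tuning parameter */
      double sum = 0.0;
      for (int i = 0; i < n; i++) {
          if (i + PFD < n)
              _mm_prefetch((const char *)&a[i + PFD], _MM_HINT_T0);
          sum += a[i];
      }
      return sum;
  }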
Memory-Level Parallelism (2)
HW stream prefetchers are by far the easiest to implement and exploit.
They detect a series of consecutive cache misses and speculate that the next addresses in the series will be needed. They then prefetch that data into the cache or a dedicated buffer.
To effectively exploit a HW prefetcher, ensure your array references access hundreds of consecutive addresses,
e.g., read A[i]…A[i+255] without any jumps or discontinuities.
This constraint limits the effectiveness (shape) of the cache blocking you implemented in HW1, as you accessed:
  A[(j+0)*N+i] … A[(j+0)*N+i+B], jump
  A[(j+1)*N+i] … A[(j+1)*N+i+B], jump
  A[(j+2)*N+i] … A[(j+2)*N+i+B], jump
  …
Branch Misprediction
A mispredicted branch can stall subsequent instructions by ~10 cycles.
Select a loop structure that maximizes the loop length
(keeps mispredicted branches per instruction to a minimum)
Some architectures support predication either in hardware or software to eliminate branches (transforms control dependencies into data dependencies)
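A minimal sketch of the predication idea, rewriting a data-dependent branch as a select (the clipping loop is an illustrative assumption, not a kernel from this talk):

  /* Branchy clipping: mispredicts when the sign of x[i] is unpredictable. */
  for (int i = 0; i < n; i++) {
      if (x[i] < 0.0) x[i] = 0.0;
  }

  /* Branchless form: the control dependence becomes a data dependence,
     which compilers typically lower to a conditional move or maxsd. */
  for (int i = 0; i < n; i++) {
      x[i] = (x[i] < 0.0) ? 0.0 : x[i];
  }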
Cache Subtleties
Set-associative caches have a limited number of sets (S) and ways (W), the product of which is the capacity (in cache lines).
As seen in HW1, it can be beneficial to reorganize kernels to reduce the working-set size and eliminate capacity misses.
Conflict misses can severely impair performance and can be very challenging to identify and eliminate.
A given address may only be placed in W different locations in the cache.
Poor access patterns or roughly power-of-two problem sizes can be especially bad: too many addresses map to the same set, not all of them can be kept in the cache, and some will have to be evicted.
Padding arrays (problem sizes) or skewing the access pattern can eliminate conflict misses.
Array Padding Example

Padding changes the data layout.
Consider a large matrix whose leading dimension is roughly a power of two:
  double A[N][M];   // M ~ pow2
A[i][j] and A[i+1][j] will likely be mapped to the same set.
We can pad the leading dimension with a few extra elements:
  double A[N][M+pad];
Such techniques are applicable in many other domains (stencils, lattice-Boltzmann methods, etc.).
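A minimal sketch of padded allocation (the PAD value of 8 doubles is an arbitrary assumption):

  #include <stdlib.h>

  enum { PAD = 8 };   /* a few extra doubles per row to break power-of-two strides */

  double *alloc_padded(int n, int m) {
      /* logical n x m matrix stored with a padded leading dimension of m + PAD */
      return malloc((size_t)n * (m + PAD) * sizeof(double));
  }

  static inline double *elem(double *A, int m, int i, int j) {
      return &A[(size_t)i * (m + PAD) + j];   /* address of logical A[i][j] */
  }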
New Challenges Arising when Optimizing Multicore SMP Performance
What are SMPs ?
SMP = shared-memory parallel.
Multiple chips (typically < 32 threads) can address any location in a large shared memory through a network or bus.
Caches are almost universally coherent.
You can still run MPI on an SMP, but
  you trade "free" (always paid for) cache-coherency traffic for additional memory traffic (for explicit communication), and
  you trade user-level function calls for system calls.
Alternately, you use an SPMD threading model (pthreads, OpenMP, UPC).
If communication between cores or threads is significant, then threaded implementations win out.
As the computation:communication ratio increases, MPI asymptotically approaches threaded implementations.
What is multicore? What are multicore SMPs?
Today, multiple cores are integrated on the same chip, almost universally in an SMP fashion.
For convenience, programming multicore SMPs is indistinguishable from programming multi-socket SMPs (an easy transition).
Multiple cores can share: memory controllers, caches, and occasionally FPUs.
Although there was a graceful transition from multiple sockets to multiple cores from the point of view of correctness, achieving good performance can be incredibly challenging.
Affinity
[Figure: two possible mappings of Linux processor IDs 0-7 onto cores and sockets.]
We may wish one pair of threads to share a cache, but be disjoint from another pair of threads.
We can control the mapping of threads to Linux processors via #include <sched.h> + sched_set/getaffinity().
But the mapping of Linux processors to physical cores/sockets is machine/OS dependent.
Inspect /proc/cpuinfo or use PLPA
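A minimal sketch of pinning the calling thread with sched_setaffinity (error handling omitted; the helper name is an assumption):

  #define _GNU_SOURCE
  #include <sched.h>

  /* Pin the calling thread to Linux processor `cpu`. */
  int pin_to_cpu(int cpu) {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(cpu, &set);
      return sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
  }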
NUMA Challenges
Recent multicore SMPs have integrated the memory controllers on chip.
As a result, memory access is non-uniform (NUMA): the bandwidth to read a given address varies dramatically among cores.
Exploit NUMA (affinity + first touch) when you malloc/init data.
The concept is similar to data decomposition for distributed memory.
Implicit allocation for NUMA
Consider an OpenMP example of implicit NUMA-aware initialization:

  #pragma omp parallel for
  for (j = 0; j < N; j++) {
    a[j] = 1.0;
    b[j] = 2.0;
    c[j] = 0.0;
  }

The first accesses to the arrays (read or write) must be parallelized. DO NOT TOUCH THE DATA BETWEEN MALLOC AND INIT.
When the for loop is parallelized, each thread initializes a range of j.
This exploits the OS's first-touch policy, and relies on the assumption that OpenMP maps threads consistently.
New Cache Challenges
Shared caches + SPMD programming models can exacerbate conflict misses.
Individually, threads may produce significant cache-associativity pressure based on their access pattern (power-of-two problem sizes).
Collectively, threads may produce excessive cache-associativity pressure (power-of-two problem sizes decomposed over a power-of-two number of threads).
This can be much harder to diagnose and correct.
This problem arises whether using MPI or a threaded model.
New Memory Challenges
The number of memory controllers and bandwidth on multicore SMPs is growing much slower than the number of cores.
Codes are becoming increasingly memory-bound, as a fraction of the cores can saturate a socket's memory bandwidth.
Multicore has traded bit- or word-level parallelism for thread-level parallelism.
However, main memory is still built from bit-parallel devices (DIMMs).
Memory-intensive apps must be restructured to match the bit-parallel (sequential-access) nature of DIMMs.
Synchronization
Using multiple concurrent threads can create ordering and race errors.
Locks are one solution. Must balance granularity and frequency
SPMD programming model + barriers are often a better/simpler solution.
spin barriers can be orders of magnitude faster than pthread library barriers. (Rajesh Nishtala, HotPar’09)
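A minimal sketch of a sense-reversing spin barrier using GCC atomic builtins (this is an illustrative assumption, not the barrier from the cited HotPar'09 work; each thread keeps its own local sense flag):

  typedef struct {
      volatile int count;    /* threads that have arrived */
      volatile int sense;    /* flips each barrier episode */
      int nthreads;
  } spin_barrier_t;

  void spin_barrier_wait(spin_barrier_t *b, int *local_sense) {
      *local_sense = !*local_sense;
      if (__sync_add_and_fetch(&b->count, 1) == b->nthreads) {
          b->count = 0;            /* last arrival resets the count ... */
          b->sense = *local_sense; /* ... and releases everyone */
      } else {
          while (b->sense != *local_sense)
              ;                    /* spin until released */
      }
  }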
Performance Modeling and Little’s Law
System Abstraction
Abstractly describe any system (or subsystem) as a combination of black-boxed storage, computational units, and the bandwidth between them.
These can be hierarchically composed.
A volume of data must be transferred from the storage component, processed, and another volume of data must be returned.
Consider the basic parameters governing performance of the channel: bandwidth, latency, and concurrency.
  Bandwidth can be measured in GB/s, Gflop/s, MIPS, etc.
  Latency can be measured in seconds, cycles, etc.
  Concurrency is the volume in flight across the channel, and can be measured in bytes, cache lines, operations, instructions, etc.
[Figure: hierarchy of DRAM, CPU cache, core, register file (RF), and functional units (FUs).]
Little’s Law
Little's Law relates concurrency, bandwidth, and latency. To achieve peak bandwidth, one must satisfy:

  Concurrency = Latency × Bandwidth

For example, a memory controller with 20 GB/s of bandwidth and 100 ns of latency requires the CPU to express 2 KB of concurrency (memory-level parallelism).
Similarly, given an expressed concurrency, one can bound attained performance:

  BW_attained = min( Concurrency_expressed / Latency , BW_max )

That is, as more concurrency is injected, we get progressively better performance, up to the bandwidth limit.
Note, this assumes continual, pipelined accesses.
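A minimal sketch of the arithmetic in the example above (the 64-byte cache-line size is an assumption):

  #include <stdio.h>

  int main(void) {
      double bw_gbs     = 20.0;    /* bandwidth: 20 GB/s  (example above) */
      double latency_ns = 100.0;   /* latency: 100 ns     (example above) */
      /* GB/s x ns = bytes, so the units cancel directly. */
      double concurrency_bytes = bw_gbs * latency_ns;
      printf("required concurrency: %.0f bytes (~%.0f cache lines of 64 B)\n",
             concurrency_bytes, concurrency_bytes / 64.0);
      return 0;
  }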
Where’s the bottleneck?
We've described the bandwidths between DRAM, the CPU cache, the core, the register file, and the functional units.
But in an application, one of these may be a performance-limiting bottleneck.
We can take any pair and compare how quickly data can be transferred with how quickly it can be processed to determine the bottleneck.
Arithmetic Intensity
Consider the first case (DRAM-CPU).
True arithmetic intensity (AI) ~ total flops / total DRAM bytes.
Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality), but on others it remains constant.
Arithmetic intensity is ultimately limited by compulsory traffic, and is diminished by conflict or capacity misses.
[Figure: AI spectrum; O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods.]
Kernel Arithmetic Intensity and Architecture
For a given architecture, one may calculate its flop:byte ratio.
For a 2.3 GHz quad-core Opteron:
  1 SIMD (2-wide DP) add + 1 SIMD (2-wide DP) multiply per cycle per core = 4 flops/cycle/core, so 2.3 GHz × 4 cores × 4 = 36.8 Gflop/s
  12.8 GB/s of DRAM bandwidth
  36.8 / 12.8 ~ 2.9 flops per byte
When a kernel's arithmetic intensity is substantially less than the architecture's flop:byte ratio, transferring data will take longer than computing on it: the kernel is memory-bound.
When a kernel's arithmetic intensity is substantially greater than the architecture's flop:byte ratio, computation will take longer than the data transfers: the kernel is compute-bound.
Memory Traffic Definition
Total bytes to/from DRAM.
Can be categorized into: compulsory misses, capacity misses, conflict misses, write allocations, …
Oblivious of any lack of sub-cache-line spatial locality.
Roofline Model: Basic Concept

Synthesize communication, computation, and locality into a single visually intuitive performance figure using bound and bottleneck analysis.

  AttainablePerformance_ij = min( FLOP/s with Optimizations_1..i , AI × Bandwidth with Optimizations_1..j )

where optimization i can be SIMDize, or unroll, or SW prefetch, …
Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance.
Moreover, it provides insight as to which optimizations will potentially be beneficial.
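A minimal sketch of this bound as code (the peak and bandwidth values are inputs, not asserted numbers):

  /* Roofline bound: attainable Gflop/s for a kernel of arithmetic intensity `ai`
     on a machine with `peak_gflops` in-core peak and `stream_gbs` of bandwidth. */
  double roofline_gflops(double ai, double peak_gflops, double stream_gbs) {
      double bw_bound = ai * stream_gbs;
      return bw_bound < peak_gflops ? bw_bound : peak_gflops;
  }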
Roofline Model: Basic Concept

Plot on a log-log scale. Given AI, we can easily bound performance.
But architectures are much more complicated.
We will bound performance as we eliminate specific forms of in-core parallelism.
[Figure: Opteron 2356 (Barcelona) roofline of attainable GFLOP/s vs. actual flop:byte ratio, bounded by peak DP and Stream bandwidth.]
Roofline Model: Computational Ceilings

Opterons have dedicated multipliers and adders.
If the code is dominated by adds, then attainable performance is half of peak.
We call these ceilings; they act like constraints on performance.
[Figure: the Barcelona roofline with a "mul/add imbalance" ceiling below peak DP.]
Roofline Model: Computational Ceilings

Opterons have 128-bit datapaths.
If instructions aren't SIMDized, attainable performance will be halved.
[Figure: a "w/out SIMD" ceiling added below the "mul/add imbalance" ceiling.]
Roofline Model: Computational Ceilings

On Opterons, floating-point instructions have a 4-cycle latency.
If we don't express 4-way ILP, performance will drop by as much as 4x.
[Figure: a "w/out ILP" ceiling added below the "w/out SIMD" ceiling.]
Roofline Model: Communication Ceilings

We can perform a similar exercise, taking away parallelism from the memory subsystem.
[Figure: the Barcelona roofline bounded by peak DP and Stream bandwidth.]
Roofline Model: Communication Ceilings

Explicit software prefetch instructions are required to achieve peak bandwidth.
[Figure: a "w/out SW prefetch" bandwidth ceiling below the Stream bandwidth diagonal.]
Roofline Model: Communication Ceilings

Opterons are NUMA. As such, memory traffic must be correctly balanced between the two sockets to achieve good Stream bandwidth.
We could continue this by examining strided or random memory access patterns.
[Figure: a "w/out NUMA" bandwidth ceiling added below the "w/out SW prefetch" ceiling.]
Roofline Model: Computation + Communication Ceilings

We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.
[Figure: the Barcelona roofline with both the in-core ceilings (mul/add imbalance, w/out SIMD, w/out ILP) and the bandwidth ceilings (w/out SW prefetch, w/out NUMA).]
Roofline Model: Locality Walls

Remember, memory traffic includes more than just compulsory misses.
As such, actual arithmetic intensity may be substantially lower.
Walls are unique to the architecture-kernel combination.

  AI = FLOPs / Compulsory Misses

[Figure: a vertical "only compulsory miss traffic" wall at the kernel's compulsory arithmetic intensity.]
Roofline Model: Locality Walls

  AI = FLOPs / (Allocations + Compulsory Misses)

[Figure: the "+write allocation traffic" wall sits to the left of the compulsory-only wall.]
Roofline Model: Locality Walls

  AI = FLOPs / (Capacity + Allocations + Compulsory Misses)

[Figure: the "+capacity miss traffic" wall moves the arithmetic intensity further left.]
Roofline Model: Locality Walls

  AI = FLOPs / (Conflict + Capacity + Allocations + Compulsory Misses)

[Figure: the "+conflict miss traffic" wall moves the arithmetic intensity furthest left.]
Optimization Categorization

Maximizing (attained) In-core Performance
  Exploit in-core parallelism (ILP, DLP, etc.)
  Good (enough) floating-point balance
  Techniques: unroll & jam, explicit SIMD, reorder, eliminate branches

Maximizing (attained) Memory Bandwidth
  Exploit NUMA
  Hide memory latency
  Satisfy Little's Law
  Techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking

Minimizing (total) Memory Traffic
  Eliminate capacity misses, conflict misses, compulsory misses, and write-allocate behavior
  Techniques: cache blocking, array padding, compress data, streaming stores
Roofline Model: Locality Walls

Optimizations remove these walls and ceilings, which act to constrain performance.
[Figure: the full Barcelona roofline with all in-core ceilings, bandwidth ceilings, and locality walls in place.]
Roofline Model: Locality Walls

[Figure: traffic-minimizing optimizations remove the conflict, capacity, and write-allocation walls, leaving only compulsory miss traffic.]
Roofline Model: Locality Walls

[Figure: in-core optimizations then remove the mul/add imbalance, SIMD, and ILP ceilings.]
Roofline Model: Locality Walls

[Figure: bandwidth optimizations finally remove the SW prefetch and NUMA ceilings, leaving peak DP, Stream bandwidth, and compulsory miss traffic.]
Optimization Categorization

Each of these optimizations (maximizing in-core performance, maximizing memory bandwidth, minimizing memory traffic) has a large parameter space.
What are the optimal parameters?
Auto-tuning?
Provides performance portability across the existing breadth and evolution of microprocessors.
The one-time, up-front productivity cost is amortized over the number of machines it's used on.
Auto-tuning does not invent new optimizations.
Auto-tuning automates the code generation and the exploration of the optimization and parameter space.
Two components (a sketch of the exploration loop follows):
  a parameterized code generator (we wrote ours in Perl)
  an auto-tuning exploration benchmark (combination of heuristics and exhaustive search)
Can be extended with ISA-specific optimizations (e.g., DMA, SIMD).
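A minimal sketch of the exploration component (matrix_t, convert_to_bcsr, and time_spmv are hypothetical stand-ins for the generated kernels and the timing harness):

  /* Exhaustive search over small register-block shapes; keep the fastest. */
  void tune_block_size(matrix_t *A, const double *x, double *y, int *best_r, int *best_c) {
      double best_time = 1e30;
      for (int r = 1; r <= 4; r++) {
          for (int c = 1; c <= 4; c++) {
              convert_to_bcsr(A, r, c);            /* hypothetical: rebuild A as r x c BCSR */
              double t = time_spmv(A, x, y, 10);   /* hypothetical: best of 10 timed SpMVs */
              if (t < best_time) { best_time = t; *best_r = r; *best_c = c; }
          }
      }
  }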
Multicore SMPs of Interest (used throughout the rest of the talk)
Multicore SMPs Used

AMD Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown), IBM QS20 Cell Blade, Sun T2+ T5140 (Victoria Falls)

Memory hierarchy: conventional cache-based on the x86 and Niagara machines; local store-based on the Cell SPEs.
CMT (Chip-MultiThreading): Victoria Falls.
Threads: Opteron 8, Xeon 8, Cell Blade 16*, Victoria Falls 128
Peak double-precision flops: Xeon 75 GFlop/s, Opteron 74 GFlop/s, Cell Blade 29* GFlop/s, Victoria Falls 19 GFlop/s
Total DRAM bandwidth: Xeon 21 GB/s (read) + 10 GB/s (write), Opteron 21 GB/s, Cell Blade 51 GB/s, Victoria Falls 42 GB/s (read) + 21 GB/s (write)
Non-Uniform Memory Access (NUMA): the Opteron, Cell, and Victoria Falls systems have on-chip memory controllers.
(*SPEs only)
Roofline Model for these Multicore SMPs

[Figure: rooflines for the Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and QS20 Cell Blade (PPEs and SPEs): attainable GFLOP/s vs. actual flop:byte ratio, each with its in-core ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, or 25%/12%/6% FP instruction mix) and its bandwidth ceilings (Stream bandwidth, bandwidth on small vs. large datasets, w/out SW prefetch, w/out NUMA, misaligned DMA).]

Note: the multithreaded Niagara (Victoria Falls) is limited by instruction mix rather than by a lack of expressed in-core parallelism.
Clearly, some architectures are more dependent on bandwidth optimizations while others depend more on in-core optimizations.
Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Sparse Matrix-Vector Multiplication
What's a sparse matrix?
  Most entries are 0.0
  Performance advantage in only storing/operating on the nonzeros
  Requires significant metadata to reconstruct the matrix structure
What's SpMV?
  Evaluate y = Ax, where A is a sparse matrix and x & y are dense vectors
Challenges
  Very low arithmetic intensity (often < 0.166 flops/byte)
  Difficult to exploit ILP (bad for pipelined or superscalar) and DLP (bad for SIMD)

CSR reference code:

  for (r = 0; r < A.rows; r++) {
    double y0 = 0.0;
    for (i = A.rowStart[r]; i < A.rowStart[r+1]; i++) {
      y0 += A.val[i] * x[A.col[i]];
    }
    y[r] = y0;
  }

[Figure: (a) algebraic view of y = Ax; (b) the CSR data structure (A.val[], A.col[], A.rowStart[]).]
The Dataset (matrices)
Unlike dense BLAS, performance is dictated by sparsity.
Suite of 14 matrices, all bigger than the caches of our SMPs.
We'll also include a median performance number.
[Figure: the 14 matrices (Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, webbase, LP), ranging from a 2K x 2K dense matrix stored in sparse format, through well-structured matrices (sorted by nonzeros/row) and a poorly structured hodgepodge, to an extreme-aspect-ratio linear-programming matrix.]
SpMV Performance (simple parallelization)

Out-of-the-box SpMV performance on a suite of 14 matrices.
Scalability isn't great. Is this performance good?
[Figure: Naïve and Naïve Pthreads performance bars on each machine.]
NUMA for SpMV
On NUMA architectures, all large arrays should be partitioned either
  explicitly (multiple malloc()'s + affinity), or
  implicitly (parallelize the initialization and rely on first touch)
You cannot partition on granularities less than the page size:
  512 elements on x86
  2M elements on Niagara
For SpMV, partition the matrix and perform multiple malloc()'s.
Pin submatrices so they are co-located with the cores tasked to process them.
Prefetch for SpMV
SW prefetch injects more MLP into the memory subsystem.
Can try to prefetch the values, the indices, the source vector, or any combination thereof.
In general, should only insert one prefetch per cache line (works best on unrolled code).

  for (all rows) {
    y0 = 0.0; y1 = 0.0; y2 = 0.0; y3 = 0.0;
    for (all tiles in this row) {
      PREFETCH(V + i + PFDistance);
      y0 += V[i  ] * X[C[i]];
      y1 += V[i+1] * X[C[i]];
      y2 += V[i+2] * X[C[i]];
      y3 += V[i+3] * X[C[i]];
    }
    y[r+0] = y0; y[r+1] = y1; y[r+2] = y2; y[r+3] = y3;
  }
SpMV Performance (NUMA and Software Prefetching)

NUMA-aware allocation is essential on memory-bound NUMA SMPs.
Explicit software prefetching can boost bandwidth and change cache replacement policies.
Cell PPEs are likely latency-limited.
(used exhaustive search)
[Figure: performance bars with NUMA/affinity and SW prefetching added.]
ILP/DLP vs Bandwidth
In the multicore era, which is the bigger issue:
  a lack of ILP/DLP (a major advantage of BCSR), or
  insufficient memory bandwidth per core?
On many architectures, when running low-arithmetic-intensity kernels there is so little available memory bandwidth (per core) that you won't notice a complete lack of ILP.
Perhaps we should concentrate on minimizing memory traffic rather than maximizing ILP/DLP.
Rather than benchmarking every combination, just select the register blocking that minimizes the matrix footprint (a sketch follows).
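A minimal sketch of such a footprint heuristic (matrix_t and count_blocks are hypothetical helpers; count_blocks would count the r x c blocks needed to cover all nonzeros, including fill):

  /* BCSR footprint: each block stores r*c doubles plus one column index;
     each block row stores one row pointer. */
  size_t bcsr_bytes(size_t nblocks, size_t nrows, int r, int c) {
      return nblocks * ((size_t)r * c * sizeof(double) + sizeof(int))
           + (nrows / r + 1) * sizeof(int);
  }

  void pick_block(const matrix_t *A, int *best_r, int *best_c) {
      size_t best = (size_t)-1;
      for (int r = 1; r <= 8; r++)
          for (int c = 1; c <= 8; c++) {
              size_t bytes = bcsr_bytes(count_blocks(A, r, c), A->nrows, r, c);
              if (bytes < best) { best = bytes; *best_r = r; *best_c = c; }
          }
  }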
SpMV Performance (Matrix Compression)

After maximizing memory bandwidth, the only hope is to minimize memory traffic.
Exploit register blocking, other formats, and smaller indices.
Use a traffic-minimization heuristic rather than search.
The benefit is clearly matrix-dependent.
Register blocking enables efficient software prefetching (one prefetch per cache line).
Cache Blocking for SpMV

Cache-blocking sparse matrices is very different from cache-blocking dense matrices.
Rather than changing loop bounds, store entire submatrices contiguously.
The columns spanned by each cache block are selected so that all submatrices place the same pressure on the cache, i.e., touch the same number of unique source-vector cache lines.
TLB blocking is a similar concept, but instead of 8-byte granularities it uses 4KB granularities.
[Figure: matrix rows partitioned among threads 0-3, with cache blocks of varying width.]
Auto-tuned SpMV Performance (cache and TLB blocking)

Fully auto-tuned SpMV performance across the suite of matrices.
Why do some optimizations work better on some architectures?
[Figure: stacked bars: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking.]
Auto-tuned SpMV Performance (architecture-specific optimizations)

Fully auto-tuned SpMV performance across the suite of matrices, now including an SPE/local-store optimized version.
Why do some optimizations work better on some architectures?
[Figure: the same stacked bars with the Cell SPE implementation included.]
Auto-tuned SpMV Performance (max speedup)

Fully auto-tuned SpMV performance across the suite of matrices, including the SPE/local-store optimized version.
Why do some optimizations work better on some architectures?
[Figure: the same bars annotated with maximum speedups of 2.7x, 4.0x, 2.9x, and 35x across the four machines.]
Auto-tuned SpMV Performance (architecture-specific optimizations)

Fully auto-tuned SpMV performance across the suite of matrices, including the SPE/local-store optimized version.
Auto-tuning resulted in better performance, but did it result in good performance?
Roofline Model for SpMV

Double-precision roofline models: in-core optimizations 1..i, DRAM optimizations 1..j.
FMA is inherent in SpMV (place it at the bottom of the in-core ceilings).

  GFlops_i,j(AI) = min( InCoreGFlops_i , StreamBW_j × AI )

[Figure: rooflines for the Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade, with their in-core ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, 25%/12% FP) and bandwidth ceilings (Stream bandwidth, dataset fits in snoop filter, w/out SW prefetch, w/out NUMA, bank conflicts).]
Roofline Model for SpMV (overlay arithmetic intensity)

Two unit-stride streams; inherent FMA; no ILP; no DLP; FP is 12-25% of instructions.
Naïve compulsory flop:byte < 0.166.
There is no naïve SPE implementation.
[Figure: the four rooflines with SpMV's arithmetic intensity (< 0.166) overlaid.]
Roofline Model for SpMV (out-of-the-box parallel)

Two unit-stride streams; inherent FMA; no ILP; no DLP; FP is 12-25% of instructions; naïve compulsory flop:byte < 0.166.
For simplicity: a dense matrix stored in sparse format.
There is no naïve SPE implementation.
[Figure: out-of-the-box parallel performance plotted on each roofline.]
Roofline Model for SpMV (NUMA & SW prefetch)

Compulsory flop:byte ~ 0.166; utilize all memory channels.
[Figure: performance after NUMA-aware allocation and software prefetching, plotted on each roofline.]
Roofline Model for SpMV (matrix compression)

Inherent FMA. Register blocking improves ILP, DLP, the flop:byte ratio, and the FP fraction of instructions.
[Figure: performance after matrix compression, plotted on each roofline.]
Roofline Model for SpMV (matrix compression)

Inherent FMA. Register blocking improves ILP, DLP, the flop:byte ratio, and the FP fraction of instructions.
[Figure: the tuned performance reaches the bandwidth diagonal on each machine; performance is bandwidth-limited.]
SpMV Performance (summary)

[Figure: median SpMV performance across the four machines.]
Aside: unlike LBMHD, SSE was unnecessary to achieve performance.
Cell still requires a non-portable, ISA-specific implementation to achieve good performance.
Novel SpMV implementations may require ISA-specific (SSE) code to achieve better performance.
Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track.
LBMHD
Plasma turbulence simulation via the Lattice Boltzmann Method.
Two distributions:
  momentum distribution (27 scalar components)
  magnetic distribution (15 vector components)
Three macroscopic quantities: density, momentum (vector), magnetic field (vector).
Arithmetic intensity:
  Must read 73 doubles and update 79 doubles per lattice update (1216 bytes)
  Requires about 1300 floating-point operations per lattice update
  Just over 1.0 flops/byte (ideal)
Cache capacity requirements are independent of problem size.
Two problem sizes: 64^3 (0.3 GB) and 128^3 (2.5 GB).
[Figure: the 27-direction momentum distribution, 15-direction magnetic distribution, and macroscopic variables on a lattice with periodic boundary conditions.]
LBMHD Performance (reference implementation)

Generally, scalability looks good. But is performance good?
(*collision() only)
[Figure: Naïve+NUMA performance bars on each machine.]
LBMHD Performance (lattice-aware array padding)

LBMHD touches >150 arrays; most caches have limited associativity, so conflict misses are likely.
Apply a heuristic to pad the arrays.
[Figure: performance bars: Naïve+NUMA, +Padding.]
Vectorization
There are two phases within a lattice method's collision() operator:
  reconstruction of the macroscopic variables
  updating the discretized velocities
Normally this is done one point at a time. Change it to do a vector's worth of points at a time (loop interchange + tuning); a generic sketch follows.
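A generic sketch of that loop interchange (VL, npoints, and the two phase functions are hypothetical placeholders, not the actual LBMHD code):

  /* VL points are pushed through phase 1, then through phase 2 (loop interchange). */
  void collision_vectorized(int npoints, int VL) {
      for (int base = 0; base < npoints; base += VL) {
          int end = (base + VL < npoints) ? base + VL : npoints;
          for (int p = base; p < end; p++) reconstruct_macroscopics(p);  /* phase 1 */
          for (int p = base; p < end; p++) update_velocities(p);         /* phase 2 */
      }
  }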
LBMHD Performance (vectorization)

Restructure loops to attain good TLB page locality and streaming accesses.
(*collision() only)
[Figure: performance bars: Naïve+NUMA, +Padding, +Vectorization.]
LBMHD Performance (architecture-specific optimizations)

Add unrolling and reordering of the inner loop.
Additionally, exploit SIMD explicitly where the compiler doesn't.
Include an SPE/local-store optimized version.
(*collision() only)
[Figure: performance bars: Naïve+NUMA (+small pages), +Padding, +Vectorization, +Unrolling, +SW Prefetching, +Explicit SIMDization.]
LBMHD Performance (architecture-specific optimizations)

(*collision() only)
[Figure: the same bars annotated with overall auto-tuning speedups of 1.6x, 4x, 3x, and 130x across the four machines.]
Roofline Model for LBMHD

Far more adds than multiplies (imbalance). Huge data sets.
[Figure: rooflines for the four machines with their in-core ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, 25%/12% FP) and bandwidth ceilings (Stream bandwidth, dataset fits in snoop filter, w/out SW prefetch, w/out NUMA, bank conflicts).]
Roofline Model for LBMHD (overlay arithmetic intensity)

Far more adds than multiplies (imbalance). Essentially random access to memory.
Flop:byte ratio ~0.7; NUMA allocation/access; little ILP; no DLP; high conflict misses.
There is no naïve SPE implementation.
[Figure: the four rooflines with LBMHD's arithmetic intensity (~0.7) overlaid.]
Roofline Model for LBMHD (out-of-the-box parallel performance)

Far more adds than multiplies (imbalance). Essentially random access to memory.
Flop:byte ratio ~0.7; NUMA allocation/access; little ILP; no DLP; high conflict misses.
Peak Victoria Falls performance comes with 64 threads (out of 128) due to high conflict misses.
There is no naïve SPE implementation.
[Figure: out-of-the-box parallel performance plotted on each roofline.]
Roofline Model for LBMHD (padding, vectorization, unrolling, reordering, …)

Vectorize the code to eliminate TLB capacity misses; this ensures unit-stride access (the bottom bandwidth ceiling).
Tune for the optimal vector length (VL).
The Clovertown is pinned to its lower bandwidth ceiling.
There is no naïve SPE implementation.
[Figure: tuned performance plotted on each roofline.]
Roofline Model for LBMHD (SIMDization + cache bypass)

Make SIMDization explicit. (Technically, this swaps the ILP and SIMD ceilings.)
Use the cache-bypass instruction movntpd; this increases the flop:byte ratio to ~1.0 on x86 and Cell.
[Figure: the rooflines with LBMHD's arithmetic intensity increased to ~1.0.]
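A minimal sketch of a cache-bypassing (streaming) store with SSE2 intrinsics (alignment and even length are assumptions; this is illustrative, not the tuned LBMHD kernel):

  #include <emmintrin.h>

  /* Copy doubles using movntpd so the destination is not read into the cache
     (avoids write-allocate traffic). dst must be 16-byte aligned; n must be even. */
  void stream_copy(double *dst, const double *src, long n) {
      for (long i = 0; i < n; i += 2) {
          __m128d v = _mm_load_pd(&src[i]);
          _mm_stream_pd(&dst[i], v);
      }
      _mm_sfence();   /* make the non-temporal stores globally visible */
  }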
Roofline Model for LBMHD (SIMDization + cache bypass)

[Figure: with explicit SIMDization and cache bypass, 3 out of 4 machines hit the Roofline.]
LBMHD Performance (Summary)

The reference code is clearly insufficient.
Portable C code is insufficient on Barcelona and Cell.
Cell gets all its performance from the SPEs, despite having only 2x the area and 2x the peak DP flops.
Summary
Introduced the Roofline Model
  Apply bound and bottleneck analysis.
  Performance and the requisite optimizations are inferred visually.
Extended auto-tuning to multicore
  Fundamentally different from running auto-tuned serial code on multicore SMPs.
  Applied the concept to LBMHD and SpMV.
Auto-tuning LBMHD and SpMV
  Multicore has had a transformative effect on auto-tuning (a move from latency-limited to bandwidth-limited).
  Maximizing memory bandwidth and minimizing memory traffic are key.
  Compilers are reasonably effective at in-core optimizations, but totally ineffective at cache and memory issues.
  A library or framework is a necessity in managing these issues.
Comments on architecture
  Ultimately, machines are bandwidth-limited without new algorithms.
  Architectures with caches required significantly more tuning than the local store-based Cell.
Acknowledgements
Research supported by:
  Microsoft and Intel funding (Award #20080469)
  DOE Office of Science under contract number DE-AC02-05CH11231
  NSF contract CNS-0325873
  Sun Microsystems - Niagara2 / Victoria Falls machines
  AMD - access to quad-core Opteron (Barcelona)
  Forschungszentrum Jülich - access to QS20 Cell blades
  IBM - virtual loaner program for QS20 Cell blades
Questions?

Samuel Williams, Andrew Waterman, David Patterson, "Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures", Communications of the ACM (CACM), April 2009.
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track.