Page 1: Title

Auto-tuning Performance on Multicore Computers
Samuel Williams ([email protected])
Ph.D. Dissertation Talk
Committee: David Patterson (advisor and chair), Kathy Yelick, Sara McMains
Parallel Computing Laboratory, EECS (Electrical Engineering and Computer Sciences), Berkeley Par Lab

Page 2: Multicore Processors (section divider)

Page 3: Superscalar Era

- Single-thread performance scaled at ~50% per year.
- Memory bandwidth increases much more slowly, but we could add additional bits or channels.
- Lack of diversity in architecture = lack of individual tuning.
- The power wall has capped single-thread performance.

[Figure: log(performance) vs. year (1990-2005) for processor performance, memory bandwidth, and memory latency. The processor curve grows at ~50%/year while the memory curves improve far more slowly (annotations of ~25%, ~12%, 7%, and 0% per year). Takeaway: multicore is the agreed solution.]

Page 4: Superscalar Era (continued)

(Same bullets and figure as Page 3, with a new annotation: but there is no agreement on what multicore should look like.)

Page 5: Multicore Era (evolutionary approach)

- Take existing power-limited superscalar processors and add cores.
- Serial code shows no improvement.
- Parallel performance might improve at 25% per year, i.e., the improvement in power efficiency from process technology.
- DRAM bandwidth is currently improving by only ~12% per year.
- This is the Intel / AMD model.

[Figure: log(performance) vs. year (2003-2009): +0%/year (serial), +25%/year (parallel), ~12%/year (DRAM bandwidth).]

Page 6: Multicore Era (radical approach)

- A radically different approach: give up superscalar out-of-order cores in favor of small, efficient, in-order cores.
- Huge, abrupt, shocking drop in single-thread performance.
- May eliminate the power wall, allowing many cores and performance scaling of up to 40% per year.
- Troublingly, the number of DRAM channels may need to double every three years.
- Examples: Niagara, Cell, GPUs.

[Figure: log(performance) vs. year (2003-2009): +0%/year (serial), +40%/year (parallel), +12%/year (DRAM bandwidth).]

Page 7: Multicore Era (scaling down cores)

- Another option would be to reduce per-core performance (and power) every year.
- Graceful degradation in serial performance.
- Bandwidth is still an issue.

[Figure: log(performance) vs. year (2003-2009): -12%/year (serial), +40%/year (parallel), +12%/year (DRAM bandwidth).]

Page 8: Multicore Era (design space)

There are still many other architectural knobs:
- Superscalar? Dual issue? VLIW?
- Pipeline depth
- SIMD width?
- Multithreading (vertical, simultaneous)?
- Shared caches vs. private caches?
- FMA vs. MUL+ADD vs. MUL or ADD
- Clustering of cores
- Crossbars vs. rings vs. meshes

Currently there is no consensus on the optimal configuration. As a result, there is a plethora of multicore architectures.

Page 9: Computational Motifs (section divider)

Page 10: Computational Motifs

- Evolved from Phil Colella's "Seven Dwarfs" of parallel computing: numerical methods common throughout scientific computing.
  - Dense and sparse linear algebra
  - Computations on structured and unstructured grids
  - Spectral methods
  - N-body / particle methods
  - Monte Carlo
- Within each dwarf, there are a number of computational kernels.
- The Berkeley View, and subsequently the Par Lab, expanded these to many other domains in computing (embedded, SPEC, databases, games): graph algorithms, combinational logic, finite state machines, etc. They were rechristened "computational motifs."
- Each could be black-boxed into libraries or frameworks by domain experts. But how do we get good performance given the diversity of architectures?

Page 11: Auto-tuning (section divider)

Page 12: Auto-tuning (motivation)

- Given the huge diversity of processor architectures, code hand-optimized for one architecture will likely deliver poor performance on another. Moreover, code optimized for one input data set may deliver poor performance on another.
- We want a single code base that delivers performance portability across the breadth of architectures today and into the future.
- Auto-tuners are composed of two principal components: a code generator based on high-level functionality rather than parsing C, and the auto-tuner proper, which searches for the optimal parameters for each optimization.
- Auto-tuners don't invent or discover optimizations; they search through the parameter space of a variety of known optimizations.
- Auto-tuning has proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI).

Page 13: Auto-tuning (code generation)

- The code generator produces many code variants of a numerical kernel using known optimizations and transformations. For example:
  - cache blocking adds several parameterized loop nests;
  - prefetching adds parameterized intrinsics to the code;
  - loop unrolling and reordering explicitly unroll the code, and each unrolling is a unique code variant.
- Kernels can have dozens of different optimizations, some of which can produce hundreds of code variants.
- The code generators used in this work are kernel-specific and were written in Perl.

Page 14: Auto-tuning (search)

In this work, we use two search techniques:
- Exhaustive: examine every combination of parameters for every optimization (often intractable).
- Heuristics: use knowledge of the architecture or algorithm to restrict the search space.

[Figure: two plots of the parameter space for optimization A vs. the parameter space for optimization B, contrasting an exhaustive sweep of the full grid with a heuristically restricted search.]
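A minimal sketch of the exhaustive strategy over two parameters; benchmark() is an assumed helper that runs the generated variant for a parameter pair and returns its runtime:

```c
#include <float.h>

/* Assumed: times the code variant built with parameters (a, b). */
extern double benchmark(int a, int b);

void exhaustive_search(const int *as, int na,
                       const int *bs, int nb,
                       int *best_a, int *best_b)
{
    double best = DBL_MAX;
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++) {
            double t = benchmark(as[i], bs[j]); /* time one variant */
            if (t < best) {
                best = t;
                *best_a = as[i];
                *best_b = bs[j];
            }
        }
}
```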

Page 15: Outline

- Multicore SMPs
- The Roofline Model
- Auto-tuning LBMHD
- Auto-tuning SpMV
- Summary
- Future Work

Page 16: Thesis Contributions

- Introduced the Roofline Model.
- Extended auto-tuning to the structured grid motif (specifically LBMHD).
- Extended auto-tuning to multicore. This is fundamentally different from running auto-tuned serial code on multicore SMPs; the concept is applied to LBMHD and SpMV.
- Analyzed the breadth of multicore architectures in the context of auto-tuned SpMV and LBMHD.
- Discussed future directions in auto-tuning.

Page 17: Multicore SMPs (Chapter 3, section divider)

Page 18: Multicore SMP Systems

[Figure: block diagrams of the four evaluated SMPs.
- Intel Xeon E5345 (Clovertown): four dual-core chips, each pair of cores sharing a 4MB L2; two front-side buses at 10.66 GB/s into an MCH (4x64b controllers) driving 667MHz FBDIMMs at 21.33 GB/s (read) / 10.66 GB/s (write).
- AMD Opteron 2356 (Barcelona): two sockets of four Opteron cores (512KB L2 each, 2MB shared victim cache, SRI / crossbar); per socket, 2x64b controllers to 667MHz DDR2 DIMMs at 10.6 GB/s; sockets linked by HyperTransport at 4 GB/s each direction.
- Sun UltraSPARC T2+ T5140 (Victoria Falls): two sockets of eight multithreaded SPARC cores behind a crossbar (179 GB/s / 90 GB/s) and a 4MB shared L2 (16-way, 64B interleaved); per socket, 4 coherency hubs and 2x128b controllers to 667MHz FBDIMMs at 21.33 GB/s / 10.66 GB/s; sockets linked by 8 x 6.4 GB/s (one per hub per direction).
- IBM QS20 (Cell Blade): two Cell chips, each with a VMT PPE (512KB L2) and eight SPEs (256KB local store + MFC each) on the EIB ring network; XDR memory controllers to 512MB XDR DRAM at 25.6 GB/s per chip; chips linked by the BIF at <20 GB/s each direction.]

Page 19: Multicore SMP Systems (conventional cache-based memory hierarchy)

[Figure: the Page 18 diagrams, highlighting the three machines with conventional cache hierarchies: Clovertown, Barcelona, and Victoria Falls.]

Page 20: Multicore SMP Systems (local store-based memory hierarchy)

[Figure: the Page 18 diagrams, highlighting the Cell Blade, whose SPEs use 256KB software-managed local stores with MFCs (DMA engines) rather than caches.]

Page 21: Multicore SMP Systems (CMT = Chip Multithreading)

[Figure: the Page 18 diagrams, highlighting Victoria Falls: two sockets of eight multithreaded SPARC cores each.]

Page 22: Multicore SMP Systems (peak double-precision FLOP rates)

[Figure: the Page 18 diagrams annotated with peak double-precision rates: Xeon E5345 (Clovertown) 74.66 GFLOP/s, Opteron 2356 (Barcelona) 73.60 GFLOP/s, UltraSPARC T2+ T5140 (Victoria Falls) 18.66 GFLOP/s, QS20 Cell Blade 29.25 GFLOP/s.]

Page 23: Multicore SMP Systems (DRAM pin bandwidth)

[Figure: the Page 18 diagrams annotated with aggregate DRAM pin bandwidth: Clovertown 21 GB/s (read) / 10 GB/s (write), Barcelona 21 GB/s, Victoria Falls 42 GB/s (read) / 21 GB/s (write), Cell Blade 51 GB/s.]

Page 24: Multicore SMP Systems (Non-Uniform Memory Access)

[Figure: the Page 18 diagrams, highlighting the NUMA machines. Barcelona, Victoria Falls, and the Cell Blade pair each socket/chip with its own memory controllers and DRAM (linked by HyperTransport, inter-socket hub links, and the BIF, respectively); Clovertown places all memory behind a single MCH.]

Page 25: The Roofline Model (Chapter 4, section divider)

Page 26: Memory Traffic

- Total bytes to/from DRAM.
- Can be categorized into: compulsory misses, capacity misses, conflict misses, write allocations, ...
- Oblivious to any lack of sub-cache-line spatial locality.

Page 27: Arithmetic Intensity

- For the purposes of this talk, we deal with floating-point kernels.
- Arithmetic Intensity (AI) ~ total FLOPs / total DRAM bytes; this includes cache effects.
- Many interesting problems have constant AI with respect to problem size. That is bad given slowly increasing DRAM bandwidth, so bandwidth and traffic are the key optimizations.

[Figure: kernels ordered by arithmetic intensity. O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods. O(log N): FFTs. O(N): dense linear algebra (BLAS3), particle methods.]

Page 28: Basic Idea

- Synthesize communication, computation, and locality into a single visually intuitive performance figure using bound-and-bottleneck analysis.
- Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance.
- Moreover, it provides insight into which optimizations will potentially be beneficial.

Attainable Performance(i,j) = min( FLOP/s with optimizations 1..i, AI × bandwidth with optimizations 1..j )
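Since the model is just the minimum of two bounds, it is trivial to compute. Here is a small illustrative helper (names and example numbers are mine, not from the dissertation's tooling):

```c
#include <stdio.h>

/* Roofline bound: peak_gflops and stream_gbs are measured machine
 * ceilings; ai is the kernel's arithmetic intensity in FLOPs/byte. */
static double roofline_gflops(double peak_gflops,
                              double stream_gbs,
                              double ai)
{
    double bw_bound = ai * stream_gbs;        /* memory-bound ceiling  */
    return bw_bound < peak_gflops ? bw_bound  /* min(compute, AI x BW) */
                                  : peak_gflops;
}

int main(void)
{
    /* Example using numbers from this talk: Barcelona's ~73.6 GFLOP/s
     * peak and ~21 GB/s DRAM pin bandwidth, with SpMV at AI = 0.166. */
    printf("%.2f GFLOP/s\n", roofline_gflops(73.6, 21.0, 0.166));
    return 0;
}
```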

Page 29: Constructing a Roofline Model (computational ceilings)

- Plot on a log-log scale.
- Given AI, we can easily bound performance.
- But architectures are much more complicated: we will bound performance as we eliminate specific forms of in-core parallelism.

[Figure: roofline for the Opteron 2356 (Barcelona): attainable GFLOP/s (0.5 to 256) vs. actual FLOP:byte ratio (1/8 to 16), with the "peak DP" ceiling and the "Stream Bandwidth" diagonal.]

Page 30: Constructing a Roofline Model (computational ceilings)

- Opterons have dedicated multipliers and adders. If the code is dominated by adds, then attainable performance is half of peak.
- We call these ceilings; they act like constraints on performance.

[Figure: the Barcelona roofline with a "mul / add imbalance" ceiling below peak DP.]

Page 31: Constructing a Roofline Model (computational ceilings)

- Opterons have 128-bit datapaths. If instructions aren't SIMDized, attainable performance will be halved.

[Figure: the Barcelona roofline, adding a "w/out SIMD" ceiling.]

Page 32: Constructing a Roofline Model (computational ceilings)

- On Opterons, floating-point instructions have a 4-cycle latency. If we don't express 4-way ILP, performance will drop by as much as 4x.

[Figure: the Barcelona roofline, adding a "w/out ILP" ceiling.]

Page 33: Constructing a Roofline Model (communication ceilings)

- We can perform a similar exercise, taking away parallelism from the memory subsystem.

[Figure: the Barcelona roofline with the bare "Stream Bandwidth" diagonal.]

Page 34: Constructing a Roofline Model (communication ceilings)

- Explicit software prefetch instructions are required to achieve peak bandwidth.

[Figure: the Barcelona roofline, adding a "w/out SW prefetch" bandwidth ceiling.]

Page 35: Constructing a Roofline Model (communication ceilings)

- Opterons are NUMA. As such, memory traffic must be correctly balanced between the two sockets to achieve good Stream bandwidth.
- We could continue this by examining strided or random memory access patterns.

[Figure: the Barcelona roofline, adding a "w/out NUMA" bandwidth ceiling.]

Page 36: Constructing a Roofline Model (computation + communication)

- We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.

[Figure: the full Barcelona roofline with all computational ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP) and all bandwidth ceilings (Stream Bandwidth, w/out SW prefetch, w/out NUMA).]

Page 37: Constructing a Roofline Model (locality walls)

- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.

AI = FLOPs / (compulsory miss traffic)

[Figure: the full Barcelona roofline with a vertical "only compulsory miss traffic" wall.]

Page 38: Constructing a Roofline Model (locality walls)

- (Same caveats as Page 37.)

AI = FLOPs / (write allocation + compulsory miss traffic)

[Figure: the same roofline, with the wall moved left by write allocation traffic.]

Page 39: Constructing a Roofline Model (locality walls)

- (Same caveats as Page 37.)

AI = FLOPs / (capacity + write allocation + compulsory miss traffic)

[Figure: the same roofline, with the wall moved further left by capacity miss traffic.]

Page 40: Constructing a Roofline Model (locality walls)

- (Same caveats as Page 37.)

AI = FLOPs / (conflict + capacity + write allocation + compulsory miss traffic)

[Figure: the same roofline, with the wall moved further left by conflict miss traffic.]

Page 41: Roofline Models for SMPs

- Note: the multithreaded Niagara is limited by instruction mix (ceilings at 25%, 12%, and 6% FP instructions) rather than by a lack of expressed in-core parallelism.
- Clearly, some architectures are more dependent on bandwidth optimizations, while others depend on in-core optimizations.

[Figure: rooflines for the Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), UltraSPARC T2+ T5140 (Victoria Falls), QS20 Cell Blade (PPEs), and QS20 Cell Blade (SPEs). Each shows peak DP plus its relevant in-core ceilings (mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, or % FP instruction mix) and bandwidth ceilings (Stream Bandwidth, w/out SW prefetch, w/out NUMA, misaligned DMA; Clovertown instead shows bandwidth on small vs. large datasets).]

Page 42: Auto-tuning Lattice-Boltzmann Magnetohydrodynamics (LBMHD) (Chapter 6, section divider)

Page 43: Introduction to Lattice Methods

- Structured grid code with a series of time steps; popular in CFD.
- Allows for complex boundary conditions.
- No temporal locality between points in space within one time step.
- Higher-dimensional phase space: a simplified kinetic model that maintains the macroscopic quantities.
- Distribution functions (e.g., 5-27 velocities per point in space) are used to reconstruct macroscopic quantities.
- Significant memory capacity requirements.

[Figure: a 27-velocity lattice (velocities numbered 0-26) drawn on +X/+Y/+Z axes.]

Page 44: LBMHD (general characteristics)

- Plasma turbulence simulation.
- Two distributions: a momentum distribution (27 scalar components) and a magnetic distribution (15 vector components).
- Three macroscopic quantities: density, momentum (vector), and magnetic field (vector).
- Must read 73 doubles and update 79 doubles per point in space.
- Requires about 1300 floating-point operations per point in space.
- Just over 1.0 FLOPs/byte (ideal).

[Figure: the momentum distribution (27 components), the magnetic distribution (15 vector components), and the macroscopic variables, each drawn on +X/+Y/+Z axes.]
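As a quick arithmetic check of the "just over 1.0" figure, using only the numbers on this slide (and assuming no write-allocate or other extra traffic):

```latex
% 73 doubles read + 79 doubles written, 8 bytes per double:
\[
  \mathrm{AI}_{\mathrm{ideal}}
    = \frac{1300\ \text{FLOPs}}{(73 + 79) \times 8\ \text{bytes}}
    = \frac{1300}{1216}
    \approx 1.07\ \text{FLOPs/byte}
\]
```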

Page 45: LBMHD (implementation details)

- Data structure choices (see the sketch below):
  - Array of Structures: no spatial locality, strided access.
  - Structure of Arrays: a huge number of memory streams per thread, but guarantees spatial locality and unit stride, and vectorizes well.
- Parallelization: the Fortran version used MPI to communicate between nodes, a bad match for multicore. The version in this work uses pthreads for multicore (this thesis is not about innovation in the threading model or programming language); MPI is not used when auto-tuning.
- Two problem sizes: 64^3 (~330 MB) and 128^3 (~2.5 GB).
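A minimal layout sketch (illustrative sizes and names, not the dissertation's code) contrasting the two data-structure choices for a lattice with NPTS points and NV components per point:

```c
#define NV   27
#define NPTS (16 * 16 * 16)

struct point_aos { double f[NV]; };      /* Array of Structures */
static struct point_aos grid_aos[NPTS];

static double grid_soa[NV][NPTS];        /* Structure of Arrays */

/* Sweeping one component over all points: */
void scale_soa(int v, double a)
{
    for (int i = 0; i < NPTS; i++)
        grid_soa[v][i] *= a;             /* unit stride: vectorizes well */
}

void scale_aos(int v, double a)
{
    for (int i = 0; i < NPTS; i++)
        grid_aos[i].f[v] *= a;           /* stride of NV doubles */
}
```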

Page 46: SoA Memory Access Pattern

- Consider a simple D2Q9 lattice method using SoA: there are 9 read arrays and 9 write arrays, but all accesses are unit-stride.
- LBMHD has 73 read and 79 write streams per thread.

[Figure: the nine D2Q9 velocities (0,0), (+1,0), (-1,0), (0,+1), (0,-1), (+1,+1), (+1,-1), (-1,+1), (-1,-1), with the corresponding read_array[][] and write_array[][] streams along the x dimension.]

Page 47: Roofline Models for SMPs (LBMHD expectations)

- LBMHD has an AI of 0.7 on write-allocate architectures, and 1.0 on those with cache bypass or no write allocation.
- There is MUL/ADD imbalance.
- Some architectures will be bandwidth-bound, while others will be compute-bound.

[Figure: the five rooflines of Page 41 with LBMHD's arithmetic intensity marked on each.]

Page 48: LBMHD Performance (reference implementation)

- The standard cache-based implementation can easily be parallelized with pthreads.
- NUMA is implicitly exploited.
- Although scalability looks good, is performance?

[Figure: performance vs. concurrency and problem size.]

Page 49: LBMHD Performance (reference implementation, continued; repeats Page 48)

Page 50: LBMHD Performance (reference implementation vs. Roofline)

- Superscalar performance is surprisingly good given the complexity of the memory access pattern.
- Cell PPE performance is abysmal.

[Figure: the five rooflines with the reference implementation's attained performance marked.]

Page 51: LBMHD Performance (array padding)

- LBMHD touches more than 150 arrays, and most caches have limited associativity, so conflict misses are likely.
- Apply a heuristic to pad the arrays.
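A minimal sketch of the padding idea (the stagger rule here is invented for illustration; the dissertation applies its own heuristic): offset the start of each array by a different number of cache lines so the ~150 arrays do not all map onto the same cache sets.

```c
#include <stdlib.h>

#define CACHE_LINE 64

/* Hypothetical helper: allocate array `array_id` with a per-array
 * stagger.  A real implementation would also retain the base pointer
 * so the allocation can later be freed.                              */
double *alloc_staggered(size_t n_doubles, int array_id)
{
    size_t pad = (size_t)(array_id % 16) * CACHE_LINE;
    char *base = malloc(n_doubles * sizeof(double) + pad);
    return base ? (double *)(base + pad) : NULL;
}
```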

Page 52: LBMHD Performance (vectorization)

- LBMHD touches more than 150 arrays, but most TLBs have far fewer than 128 entries.
- The vectorization technique creates a vector of points that are being updated: loops are interchanged and strip-mined.
- Exhaustively search for the optimal "vector length" that balances page locality against L1 cache misses.
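A minimal strip-mining sketch (assumed shapes and a stand-in update, not the actual LBMHD loops): process VL points per component so only one component array's pages are touched at a time, with VL as the tuned vector length.

```c
#define NV 27

void update_vectorized(double *restrict dst[NV],
                       const double *restrict src[NV],
                       int npts, int VL)
{
    for (int base = 0; base < npts; base += VL) {
        int end = base + VL < npts ? base + VL : npts;
        for (int v = 0; v < NV; v++)          /* one stream at a time */
            for (int i = base; i < end; i++)  /* VL-point strip       */
                dst[v][i] = 0.5 * src[v][i];  /* stand-in for the real
                                                 collision operator   */
    }
}
```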

Page 53: LBMHD Performance (other optimizations)

- Heuristic-based software prefetching.
- Exhaustive search over low-level optimizations: loop unrolling/reordering and SIMDization.
- Cache bypass increases arithmetic intensity by 50%.
- Small TLB pages on Victoria Falls.
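A minimal sketch of the cache-bypass optimization on x86 (the function is illustrative, not the dissertation's code): non-temporal stores write results to DRAM without first reading the destination line into cache, eliminating write-allocate traffic, which is what lifts LBMHD's effective AI from ~0.7 toward 1.0. The SSE2 intrinsics used here all exist in emmintrin.h; `dst` is assumed 16-byte aligned and `n` even.

```c
#include <emmintrin.h>  /* SSE2 */

void stream_copy_scale(double *dst, const double *src, long n, double a)
{
    __m128d va = _mm_set1_pd(a);
    for (long i = 0; i < n; i += 2) {
        __m128d v = _mm_mul_pd(_mm_loadu_pd(&src[i]), va);
        _mm_stream_pd(&dst[i], v);   /* bypass the cache on the store */
    }
    _mm_sfence();                    /* order the non-temporal stores */
}
```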

Page 54: LBMHD Performance (Cell SPE implementation)

- We can write a local-store implementation and run it on the Cell SPEs.
- Ultimately, Cell's weak double-precision throughput hampers performance.

Page 55: LBMHD Performance (speedup for the largest problem)

- Auto-tuned speedups over the reference implementation: 1.6x, 4x, 3x, and 130x across the four systems, the 130x coming on the Cell Blade, where the SPE implementation replaces the abysmal PPE baseline.

Page 56: LBMHD Performance (vs. Roofline)

- Most architectures reach their roofline-bound performance.
- Clovertown's snoop filter is ineffective.
- Niagara suffers from instruction mix issues.
- Cell PPEs are latency-limited; Cell SPEs are compute-bound.

[Figure: the five rooflines with auto-tuned LBMHD performance marked.]

Page 57: LBMHD Performance (summary)

- The reference code is clearly insufficient.
- Portable C code is insufficient on Barcelona and Cell.
- Cell gets all of its performance from the SPEs, despite only 2x the area and 2x the peak DP FLOPs.

Page 58: Auto-tuning Sparse Matrix-Vector Multiplication (Chapter 8, section divider)

Page 59: Sparse Matrix-Vector Multiplication

- Sparse matrix: most entries are 0.0, so there is a performance advantage in storing and operating only on the nonzeros, at the cost of significant metadata.
- Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors.
- Challenges: difficult to exploit ILP (bad for superscalar); difficult to exploit DLP (bad for SIMD); irregular memory access to the source vector; difficult to load balance.

[Figure: y = Ax with a sparse A.]
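For reference, a minimal CSR (compressed sparse row) SpMV kernel, the formulation this chapter starts from (the reference implementation on Page 62 is CSR): row_ptr has nrows+1 entries, and col_idx/vals hold the nonzeros row by row.

```c
void spmv_csr(int nrows,
              const int *restrict row_ptr,
              const int *restrict col_idx,
              const double *restrict vals,
              const double *restrict x,
              double *restrict y)
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
            sum += vals[k] * x[col_idx[k]];  /* irregular access to x */
        y[r] = sum;
    }
}
```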

Page 60: Dataset (Matrices)

- Pruned the original SPARSITY suite down to 14 matrices; none should fit in cache.
- Subdivided them into 4 categories; rank ranges from 2K to 1M.

[Figure: the 14 matrices. Dense: a 2K x 2K dense matrix stored in sparse format. Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship. Poorly structured (hodgepodge): Economics, Epidemiology, FEM/Accelerator, Circuit, webbase. Extreme aspect ratio (linear programming): LP.]

Page 61: Roofline Models for SMPs (SpMV expectations)

- The reference SpMV implementation has an AI of 0.166, but can readily exploit FMA.
- The best we can hope for is an AI of 0.25 for non-symmetric matrices.
- All architectures are memory-bound, but some may also need in-core optimizations.

[Figure: the five rooflines of Page 41 with SpMV's arithmetic intensity marked.]

Page 62: SpMV Performance (reference implementation)

- The reference implementation is CSR.
- Simple parallelization by rows, balancing nonzeros (see the sketch below).
- No implicit NUMA exploitation.
- Despite the superscalars' use of 8 cores, they see little speedup.
- Niagara and the PPEs show near-linear speedups.

[Figure: performance per matrix.]
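A minimal sketch of the row-based partitioning described above (helper and names are mine): assign each thread a contiguous range of rows whose nonzero counts are roughly equal, by walking row_ptr toward each thread's share of the nonzeros.

```c
void partition_rows(int nrows, const int *row_ptr, int nthreads,
                    int *row_start /* nthreads+1 entries */)
{
    int nnz = row_ptr[nrows];
    row_start[0] = 0;
    for (int t = 1; t < nthreads; t++) {
        long target = (long)nnz * t / nthreads; /* thread t's nnz goal */
        int r = row_start[t - 1];
        while (r < nrows && row_ptr[r] < target)
            r++;
        row_start[t] = r;
    }
    row_start[nthreads] = nrows;
}
```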

Page 63: SpMV Performance (reference implementation, continued; repeats Page 62)


SpMV Performance (Reference Implementation)

Roofline for a dense matrix in sparse format.
The superscalars achieve bandwidth-limited performance.
Niagara comes very close to its bandwidth limit.
Clearly, NUMA and prefetching will be essential.

[Figure: the same five Roofline plots (axes and ceilings as above), with the reference SpMV result for the dense matrix placed on each.]


SpMV Performance (NUMA and Software Prefetching)

NUMA-aware allocation is essential on memory-bound NUMA SMPs.
Explicit software prefetching can boost bandwidth and change cache replacement policies; a sketch of both follows.
The Cell PPEs are likely latency-limited.
Used exhaustive search.
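A sketch of both optimizations, assuming a first-touch page-placement policy and GCC's __builtin_prefetch; the prefetch distance of 64 nonzeros is a made-up tuning knob, exactly the kind of parameter the exhaustive search above would sweep:

    #include <omp.h>

    /* First-touch initialization: the thread that will later process a row
       block touches its pages first, so the OS places those pages in that
       thread's NUMA domain. */
    void numa_first_touch(int n, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int r = 0; r < n; r++)
            y[r] = 0.0;
    }

    /* CSR row with explicit software prefetch of upcoming nonzeros.
       Locality hint 0 requests a non-temporal fill; prefetch hints that run
       past the end of the arrays are harmless in practice. */
    double row_dot_prefetch(int lo, int hi, const int *col_idx,
                            const double *val, const double *x)
    {
        double sum = 0.0;
        for (int k = lo; k < hi; k++) {
            __builtin_prefetch(&val[k + 64], 0, 0);
            __builtin_prefetch(&col_idx[k + 64], 0, 0);
            sum += val[k] * x[col_idx[k]];
        }
        return sum;
    }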


SpMV Performance (Matrix Compression)

After maximizing memory bandwidth, the only hope is to minimize memory traffic.
Exploit: register blocking, other formats, smaller indices.
Use a traffic-minimization heuristic rather than search.
The benefit is clearly matrix-dependent.
Register blocking enables efficient software prefetching (one prefetch per cache line); a sketch follows.
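A minimal sketch of a 2x2 register-blocked (BCSR) kernel: one column index is amortized over four stored values, and the dense inner block unrolls into registers. The 2x2 shape is illustrative; the auto-tuner would pick the block size per matrix and machine:

    /* y = A*x for 2x2 BCSR. brow_ptr indexes block rows; bcol_idx holds one
       column index per 2x2 block; val stores each block row-major. */
    void spmv_bcsr_2x2(int n_block_rows,
                       const int *brow_ptr, const int *bcol_idx,
                       const double *val, const double *x, double *y)
    {
        for (int br = 0; br < n_block_rows; br++) {
            double y0 = 0.0, y1 = 0.0;
            for (int k = brow_ptr[br]; k < brow_ptr[br+1]; k++) {
                const double *b  = &val[4*k];          /* the 2x2 block    */
                const double *xx = &x[2*bcol_idx[k]];  /* matching x pair  */
                y0 += b[0]*xx[0] + b[1]*xx[1];
                y1 += b[2]*xx[0] + b[3]*xx[1];
            }
            y[2*br]     = y0;
            y[2*br + 1] = y1;
        }
    }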


SpMV Performance (Cache and TLB Blocking)

Based on limited architectural knowledge, create a heuristic to choose a good cache and TLB block size; a sketch of the flavor follows.
Hierarchically store the resultant blocked matrix.
The benefit can be significant on the most challenging matrix.
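One way such a heuristic might look: cap the column span of a cache block so the touched portion of the source vector fits in part of the cache and within a TLB-page budget. The knobs and constants here are placeholders, not the dissertation's values:

    /* Pick a column-block width (in elements of x touched) bounded by both
       a cache budget and a TLB-page budget; illustrative heuristic only. */
    int choose_col_block(long cache_bytes, long page_bytes, int max_pages)
    {
        long by_cache = (cache_bytes / 2) / (long)sizeof(double); /* half for x */
        long by_tlb   = (long)max_pages * page_bytes / (long)sizeof(double);
        return (int)(by_cache < by_tlb ? by_cache : by_tlb);
    }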


SpMV Performance (Cell SPE Implementation)

Cache blocking can be easily transformed into local-store blocking.
With a few small tweaks for DMA, we can run a simplified version of the auto-tuner on Cell: BCOO only, 2x1 and larger, always blocked (a sketch of a BCOO record follows).
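One hypothetical shape for a BCOO record; storing explicit block coordinates keeps every DMA-fetched local-store tile self-describing. The field names and widths are mine, not the dissertation's:

    /* One 2x1 register block in blocked-coordinate (BCOO) storage: explicit
       block coordinates within the current local-store tile plus its values. */
    typedef struct {
        unsigned short brow;  /* block-row within the tile    */
        unsigned short bcol;  /* block-column within the tile */
        double v[2];          /* the 2x1 block's values       */
    } bcoo_2x1_t;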


SpMV Performance (Median Speedup)


[Figure: median speedup from auto-tuning on each machine — 2.8x, 2.6x, 1.4x, and 15x.]


SpMV Performance (vs. Roofline)

Roofline for a dense matrix in sparse format.
Compression improves AI.
Auto-tuning can allow us to slightly exceed Stream bandwidth (but not pin bandwidth).
The Cell PPEs perennially deliver poor performance.

[Figure: the same five Roofline plots (axes and ceilings as above), with the auto-tuned SpMV result for the dense matrix placed on each.]


SpMV Performance (Summary)

Unlike LBMHD, SSE was unnecessary to achieve median SpMV performance.
Cell still requires a non-portable, ISA-specific implementation to achieve good performance.
Novel SpMV implementations may require ISA-specific (SSE) code to achieve better performance.


Summary

Outline: Overview, Multicore SMPs, The Roofline Model, Auto-tuning LBMHD, Auto-tuning SpMV, Summary, Future Work


Summary

Introduced the Roofline Model:
- Apply bound and bottleneck analysis.
- Performance and the requisite optimizations are inferred visually.

Extended auto-tuning to multicore:
- Fundamentally different from running auto-tuned serial code on multicore SMPs.
- Applied the concept to LBMHD and SpMV.

Auto-tuning LBMHD and SpMV:
- Multicore has had a transformative effect on auto-tuning (a move from latency-limited to bandwidth-limited).
- Maximizing memory bandwidth and minimizing memory traffic is key.
- Compilers are reasonably effective at in-core optimizations, but totally ineffective at cache and memory issues.
- A library or framework is a necessity in managing these issues.

Comments on architecture:
- Ultimately, machines are bandwidth-limited without new algorithms.
- Architectures with caches required significantly more tuning than the local store-based Cell.


Future Directions in Auto-tuning
Chapter 9

Outline: Overview, Multicore SMPs, The Roofline Model, Auto-tuning LBMHD, Auto-tuning SpMV, Summary, Future Work


Future Work (Roofline)

Automatic generation of Roofline figures: kernel-oblivious; select the computational metric of interest; select the communication channel of interest; designate common "optimizations"; requires a benchmark.
Using performance counters to generate runtime Roofline figures: given a real kernel, we wish to understand the bottlenecks to performance; a much friendlier visualization of performance counter data (a sketch of the arithmetic follows).
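A sketch of the arithmetic such a runtime figure needs, assuming the counters supply total flops, DRAM bytes, and wall-clock time; the counter plumbing is machine-specific and omitted:

    /* Place a measured kernel on the Roofline:
       AI = flops / DRAM bytes; achieved rate = flops / seconds. */
    typedef struct { double ai; double gflops; } roofline_point_t;

    roofline_point_t roofline_point(double flops, double dram_bytes, double secs)
    {
        roofline_point_t p;
        p.ai     = flops / dram_bytes;
        p.gflops = flops / secs * 1e-9;
        return p;
    }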


Future Work (Making Search Tractable)

[Figure: two plots over the parameter space for optimization A vs. the parameter space for optimization B.]

Given the explosion in optimizations, exhaustive search is clearly not tractable. Moreover, heuristics require extensive architectural knowledge.
In our SC08 work, we tried a greedy approach (one optimization at a time); a sketch follows.
We could make it iterative, or we could make it look like steepest descent (with some local search).
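A minimal sketch of that greedy loop; the benchmark() hook and the shape of the parameter spaces are placeholders:

    /* Greedy auto-tuning: sweep one optimization's parameter values at a
       time, lock in the winner, then move on. Re-running the outer loop
       until nothing improves gives the iterative/steepest-descent variant. */
    #define NOPT 4
    extern double benchmark(const int cfg[NOPT]);  /* placeholder: times the kernel */

    void greedy_tune(const int nvals[NOPT], int best[NOPT])
    {
        for (int o = 0; o < NOPT; o++) best[o] = 0;
        for (int o = 0; o < NOPT; o++) {
            double best_perf = -1.0;
            int best_v = 0;
            for (int v = 0; v < nvals[o]; v++) {
                best[o] = v;
                double perf = benchmark(best);
                if (perf > best_perf) { best_perf = perf; best_v = v; }
            }
            best[o] = best_v;  /* lock in this dimension's winner */
        }
    }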


Future Work (Auto-tuning Motifs)

We could certainly auto-tune other individual kernels in any motif, but this requires building a kernel-specific auto-tuner. However, we should strive for motif-wide auto-tuning. Moreover, we want to decouple the data type (e.g. double precision) from the parallelization structure.
1. A motif description or pattern language for each motif: e.g. a taxonomy of structured grids plus a code snippet for the stencil; write an auto-tuner that parses these and produces optimized code.
2. A series of DAG rewrite rules for each motif. Rules allow: insertion of additional nodes, duplication of nodes, and reordering.


Rewrite Rules

[Figure: SpMV DAGs for y = Ax, showing the original nonzeros and the explicit zeros that BCSR padding adds.]

Consider SpMV. In FP, each node in the DAG is a MAC. The DAG makes locality explicit (e.g. local-store blocking). BCSR adds zeros to the DAG. We can cut edges and reconnect them to enable parallelization, and we can reorder operations. Any other data type/node type conforming to these rules can reuse all our auto-tuning efforts (a sketch of such a node follows).

auto-tuning effortsx

yA0

0

0

0

0

0

0

0
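One hypothetical representation of a node under these rules; this is illustrative only, not an implemented system:

    /* A MAC node in the SpMV DAG: y[row] += a * x[col]. Rewrite rules may
       insert zero-valued nodes (BCSR padding), duplicate nodes, cut and
       reconnect edges (parallelization), or reorder a partition's chain. */
    typedef struct dag_node {
        int row, col;           /* which y and x elements this MAC touches */
        double a;               /* matrix value; 0.0 for padded nodes      */
        struct dag_node *next;  /* order within one sequential partition   */
    } dag_node_t;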


Acknowledgments


Acknowledgments

Berkeley ParLab:
- Thesis committee: David Patterson, Kathy Yelick, Sara McMains.
- BeBOP group: Jim Demmel, Kaushik Datta, Shoaib Kamil, Rich Vuduc, Rajesh Nishtala, etc.
- The rest of ParLab.

Lawrence Berkeley National Laboratory:
- FTG group: Lenny Oliker, John Shalf, Jonathan Carter, ...

Hardware donations and remote access: Sun Microsystems, IBM, AMD, FZ Julich, Intel.

This research was supported by the ASCR Office in the DOE Office of Science under contract number DE-AC02-05CH11231, by Microsoft and Intel funding through award #20080469, and by matching funding by U.C. Discovery through award #DIG07-10227.


Questions?